Data Quality - Data Profiling

Dataquality Metrics

Description

Data profiling is a set of algorithms for statistical analysis and assessment of the quality of data values within a data set, as well as exploring relationships that exists between value collections within and across data sets.

On this page, you can see a demo of such tool in OWB

For each column in a table, a data profiling tool will provide a frequency distribution of the different values, providing insight into the type and use of each column. Cross-column analysis can expose embedded value dependencies, while inter-table analysis explores overlapping values sets that may represent foreign key relationships between entities, and it is in this way that profiling can be used for anomaly analysis and assessment, which feeds the process of defining data quality metrics.

The analysis performed by data profiling tools exposes :

  • data rule (For instance, identifies dependencies that represent business rules embedded within the data)
  • anomalies that exist within the data sets

The data profiling give a lot of insight about data in a set of data but the most important goal is to derive data rule and to achieve it, the set of algorithm is normally classified by data rule type to find :

  • Domain List
  • Domain Pattern List
  • Domain Range

Refer to data rule type for more details on data rule type.

Data profiling analysis type

The data rules may be discover or classify through three type of data profiling analysis:

  • Attribute Analysis. Analysis between values of the same column.
  • Functional dependency. Analysis between columns of the same row.
  • Referential Analysis. Analysis between columns from different tables to discover Jo0ins, Redundant Attributes, Orphans, Childless.

Request

  • Candidate Key: Determines whether a column can be used as a table key
  • Column Length Distribution reports the range of lengths for string values in a column.
  • Column Null Ratio reports the percentage of null values in a column.
  • Column Pattern identifies regular expressions that are applicable to the values in a column.
  • Column Statistics reports statistics such as minimum, maximum, and average values for a column.
  • Column Value Distribution reports the groupings of distinct values in a column.
  • Functional Dependency determines whether the value of a column is dependent on the value of other columns in the same table.
  • Value Inclusion reports the percentage of time that a column value in one table matches a column in another table.

Performance

A lot of the profiling resources are spent on working on column to find relationship. So don’t just profile every column and try to see if there are relationships between all of them, because you are not efficiently using the resources.

Documentation / Reference





Discover More
Etl Basic Functionnality
Data Integration - ETL evolves into Data Integration

Data integration suffers from an image problem. It has become synonymous with extract, transform and load. Likewise, ETL has been regarded as a data warehousing technology. Both of these viewpoints...
Division
Data Mining - (Descriptive|Discovery) (Analysis|Statistics)

Descriptive analysis is also known as Descriptive statistics They are procedures used to summarize, organize, and simplify data. Descriptive function are always unsupervised See also . Visual...
Dataquality Metrics
Data Quality

measures the quality of data through metrics and try to improve them. You will find it in two main domains : The management of attribute data with the Master Data Management (MDM) The management...
Dataquality Metrics
Data Quality - Data Rules

Data rules are rule that can have various designations such as: business rules (in the data modeling), data test, quality screen. They follow the same concept than the rules from an event driven...
Data System Architecture
Data Warehousing - 34 Kimball Subsytems

This page takes back the Kimball Datawarehouse 34 Subsystem as a table of content and links them to a page on this website....
Mdm Sap
Master Data Management (MDM)

solutions are considered to hold the master for any given entity. In computing, master data management (MDM) comprises a set of processes and tools that consistently defines and manages the non-transactional_data...
Data Profiling Step
OWB - Data Profiling

If you don't know anything data profiling, you can have a look to the Data Profiling Page for a description. Nothing is better than a video : Most organizations build a data warehouse to provide an...
Card Puncher Data Processing
OWB - Data Profiling Demo

This data profiling demo of the data profiling product of OWB was made with the help of this Oracle by Example series : Examining...
Ssis Data Profiling Task
SSIS - Data Profiling

Data profiling capabilities The Data Profiling task gathers the requested profile statistics and writes them to an XML document. This can be: saved as a file for later analysis, or written...



Share this page:
Follow us:
Task Runner