Data Quality - Data Profiling
Table of Contents
1 - Description
Data profiling is a set of algorithms for statistical analysis and assessment of the quality of data values within a data set, as well as exploring relationships that exists between value collections within and across data sets.
For each column in a table, a data profiling tool will provide a frequency distribution of the different values, providing insight into the type and use of each column. Cross-column analysis can expose embedded value dependencies, while inter-table analysis explores overlapping values sets that may represent foreign key relationships between entities, and it is in this way that profiling can be used for anomaly analysis and assessment, which feeds the process of defining data quality metrics.
The analysis performed by data profiling tools exposes :
- data rule (For instance, identifies dependencies that represent business rules embedded within the data)
- anomalies that exist within the data sets
The data profiling give a lot of insight about data in a set of data but the most important goal is to derive data rule and to achieve it, the set of algorithm is normally classified by data rule type to find :
- Domain List
- Domain Pattern List
- Domain Range
Refer to data rule type for more details on data rule type.
2 - Articles Related
3 - Data profiling analysis type
The data rules may be discover or classify through three type of data profiling analysis:
- Attribute Analysis. Analysis between values of the same column.
- Functional dependency. Analysis between columns of the same row.
- Referential Analysis. Analysis between columns from different tables to discover Jo0ins, Redundant Attributes, Orphans, Childless.
4 - Request
- Candidate Key: Determines whether a column can be used as a table key
- Column Length Distribution reports the range of lengths for string values in a column.
- Column Null Ratio reports the percentage of null values in a column.
- Column Pattern identifies regular expressions that are applicable to the values in a column.
- Column Statistics reports statistics such as minimum, maximum, and average values for a column.
- Column Value Distribution reports the groupings of distinct values in a column.
- Functional Dependency determines whether the value of a column is dependent on the value of other columns in the same table.
- Value Inclusion reports the percentage of time that a column value in one table matches a column in another table.
5 - Performance
A lot of the profiling resources are spent on working on column to find relationship. So don’t just profile every column and try to see if there are relationships between all of them, because you are not efficiently using the resources.