About

Data preparation is the step that prepares your data for further analysis.

It's a key factor in any data project:

  • data mining, AI
  • analytics

Data preparation involves several steps, which are explained below.

Output

Data Mining

For data mining, the data must reside in a single table or view where each case (record) is a row.

If you want to mine data stored in a star schema, you still need to create a single table with one record per case and the variables of interest.
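As an illustration, the sketch below flattens a small star schema into one row per case. The table names, columns, and the build_case_table helper are hypothetical, and the choice of the customer as the case is an assumption made for the example.

```python
# Hypothetical star schema: one fact table and one dimension table.
sales_facts = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 10.0},
]
customer_dim = {
    1: {"country": "NL", "segment": "retail"},
    2: {"country": "FR", "segment": "wholesale"},
}


def build_case_table(facts, dim):
    """Return one row per customer (the case) with the variables of interest."""
    # Aggregate the fact rows so that each customer appears only once.
    totals = {}
    for fact in facts:
        key = fact["customer_id"]
        count, total = totals.get(key, (0, 0.0))
        totals[key] = (count + 1, total + fact["amount"])

    # Attach the dimension attributes to each aggregated row.
    rows = []
    for key in sorted(totals):
        count, total = totals[key]
        row = {"customer_id": key, "order_count": count, "total_amount": total}
        row.update(dim.get(key, {}))
        rows.append(row)
    return rows


case_table = build_case_table(sales_facts, customer_dim)
for row in case_table:
    print(row)
```

In a database the same result would typically be a view that joins the fact table to its dimensions and groups by the case key.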

The data mining development process may require several data sets.

A data set may be:

Type Preparation / Data Transformation

Some algorithms require the data to be transformed before they can use it.

Data Cleansing

The data must be properly cleansed to eliminate inconsistencies and support the needs of the mining application.
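A minimal sketch of what eliminating inconsistencies can look like, assuming the inconsistencies are alternate spellings of a category and duplicate records. The COUNTRY_ALIASES mapping and the cleanse helper are hypothetical names used only for this example.

```python
# Hypothetical mapping from inconsistent spellings to one canonical value.
COUNTRY_ALIASES = {
    "nl": "NL",
    "netherlands": "NL",
    "holland": "NL",
    "fr": "FR",
    "france": "FR",
}


def cleanse(records):
    """Standardize the country value and drop records that become duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        raw = (record.get("country") or "").strip().lower()
        country = COUNTRY_ALIASES.get(raw)  # None when missing or unknown
        row = (record["id"], country)
        if row in seen:
            continue  # same case already kept after standardization
        seen.add(row)
        cleaned.append({"id": record["id"], "country": country})
    return cleaned


raw_records = [
    {"id": 1, "country": "Netherlands"},
    {"id": 1, "country": " NL "},
    {"id": 2, "country": "france"},
    {"id": 3, "country": None},
]
clean_records = cleanse(raw_records)
print(clean_records)
```

Unknown values are kept as None rather than guessed, so that a later step can decide how to treat missing data.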

Data Discretization

Binning (discretization) groups values into a finite number of bins.
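The sketch below shows one common binning strategy, equal-width binning, where the value range is split into intervals of the same size. The equal_width_bins function and the sample ages are assumptions for illustration; other strategies (equal-frequency, domain-defined boundaries) are equally valid.

```python
def equal_width_bins(values, n_bins):
    """Return, for each value, the index of the equal-width bin it falls into."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: everything lands in the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        index = int((v - lo) / width)
        # The maximum value would fall just past the last bin, so clamp it.
        bins.append(min(index, n_bins - 1))
    return bins


ages = [18, 22, 35, 41, 58, 63, 80]
age_bins = equal_width_bins(ages, 3)
print(list(zip(ages, age_bins)))
```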

Data Correction

There is always some bad input that needs to be corrected.

Data Normalization

Normalize the data to be able to have:

Example:

  • Email addresses and URLs should be all lowercase
  • Metrics should be expressed as a rate rather than as a raw count
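The two examples above can be sketched as follows. The function names are hypothetical, and the click and impression counts are an assumed metric chosen only to show the conversion from a count to a rate.

```python
def normalize_email(email):
    """Trim surrounding whitespace and lowercase the address."""
    return email.strip().lower()


def click_rate(clicks, impressions):
    """Express clicks as a rate so that campaigns of different sizes compare."""
    if impressions == 0:
        return None  # a rate is undefined without any impressions
    return clicks / impressions


print(normalize_email("  Jane.Doe@Example.COM "))
print(click_rate(5, 200))
```

Two campaigns with 5 and 50 clicks look very different as counts, but if they had 200 and 2000 impressions they have the same rate.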

Outlier Suppression

Outlier suppression is required so that extreme values do not skew the result.
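One way to do this, shown here as a sketch rather than a prescribed method, is the interquartile range (IQR) rule: values further than k times the IQR from the quartiles are dropped. The suppress_outliers function, the k=1.5 default, and the sample response times are assumptions for illustration.

```python
import statistics


def suppress_outliers(values, k=1.5):
    """Keep only the values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


response_times = [10, 12, 11, 13, 12, 11, 95]
cleaned = suppress_outliers(response_times)

# The single extreme value pulls the mean far away from the typical value.
print(statistics.mean(response_times))
print(statistics.mean(cleaned))
```

Whether to drop, cap, or keep an outlier depends on the application: an extreme value may be a measurement error or the very event you are trying to detect.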

Tools

  • Google Cloud Dataprep, an intelligent, fully-managed cloud service (built in collaboration with Trifacta) that visually explores, cleans and prepares structured and unstructured data for analysis or training machine-learning models.
  • In Oracle, DBMS_DATA_MINING_TRANSFORM is a data transformation package that includes a variety of missing value and outlier treatments, as well as binning and normalization capabilities.
