Data Mining - Data (Preparation | Wrangling | Munging)

About

Data preparation is the step that gets your data ready for further analysis.

It's a key factor in any data project:

  • data mining and AI
  • analytics

Preparation consists of several steps, which are explained below.

Output

Data Mining

For data mining, the data must sit in a single table or view in which each case (record) is a row.

If you want to mine data stored in a star schema, you still need to build a single table with one record per case and the variables of interest.
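
A minimal sketch of that flattening step, assuming a hypothetical star schema with a `sales` fact table and a `customers` dimension (the table names, columns and aggregates are invented for illustration, not taken from any particular tool):

```python
from collections import defaultdict

# Hypothetical dimension table: one row per customer (the case).
customers = {
    1: {"country": "NL", "segment": "retail"},
    2: {"country": "FR", "segment": "business"},
}

# Hypothetical fact table: many rows per customer.
sales = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 100.0},
]


def build_case_table(dimension, facts):
    """Return one flat record per case, joining the dimension attributes
    with aggregates computed from the fact rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in facts:
        totals[row["customer_id"]] += row["amount"]
        counts[row["customer_id"]] += 1

    case_table = []
    for customer_id, attributes in sorted(dimension.items()):
        record = {"customer_id": customer_id, **attributes}
        record["order_count"] = counts[customer_id]
        record["total_amount"] = totals[customer_id]
        case_table.append(record)
    return case_table


case_table = build_case_table(customers, sales)
```

Each entry of `case_table` is one case with its variables of interest, which is the shape a mining algorithm expects. In a database, the same result would typically come from a join and a GROUP BY exposed as a view.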

The data mining development process may require several data sets.

A data set may be:

Type Preparation / Data Transformation

Some algorithms require the data to be transformed before they can use it.

Data Cleansing

The data must be properly cleansed to eliminate inconsistencies and support the needs of the mining application.

Data Discretization

Binning (discretization) puts continuous values into a limited number of bins.
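
As an illustration, here is a small sketch of equal-width binning, which is only one of several possible binning strategies. The function name `equal_width_bins` and the sample ages are assumptions made for this example.

```python
def equal_width_bins(values, bin_count):
    """Assign each value to one of bin_count equal-width bins (0-based index)."""
    if bin_count < 1:
        raise ValueError("bin_count must be at least 1")
    low, high = min(values), max(values)
    if low == high:
        # All values are identical: everything falls into the first bin.
        return [0 for _ in values]
    width = (high - low) / bin_count
    # The maximum value would land one past the last bin, so clamp it.
    return [min(int((v - low) / width), bin_count - 1) for v in values]


ages = [18, 22, 35, 41, 58, 78]
age_bins = equal_width_bins(ages, 3)  # bins of width 20 starting at 18
```

With these sample ages the bin edges are 18, 38, 58 and 78, so the first three ages fall into bin 0, 41 into bin 1, and 58 and 78 into bin 2.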

Data Correction

There is always some bad input that must be corrected.

Data Normalization

Normalize the data so that equivalent values share a single representation and can be compared.

Example:

  • Email addresses and URLs should be all lowercase
  • Metrics should be expressed as a rate rather than as an absolute number
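
The two examples above can be sketched as follows. The helper names, the sample records and the extra whitespace trimming are assumptions added for illustration.

```python
def normalize_email(email):
    """Lowercase an email address; trimming whitespace is an added assumption."""
    return email.strip().lower()


def to_rate(count, total):
    """Express a count as a rate of its total; None when the total is zero."""
    if total == 0:
        return None
    return count / total


emails = [" Alice@Example.COM", "bob@example.com "]
normalized_emails = [normalize_email(e) for e in emails]

# Hypothetical metric: clicks per page, compared as click-through rates.
pages = [{"clicks": 50, "views": 1000}, {"clicks": 30, "views": 200}]
click_rates = [to_rate(p["clicks"], p["views"]) for p in pages]
```

Expressed as raw numbers, the first page looks better (50 clicks against 30). Expressed as rates, the second page clearly performs better, which is why the rate is the more useful variable.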

Outlier suppression

Outliers must be suppressed so that they do not skew the results.
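
One common way to do this, shown here as a sketch and not as the only valid rule, is to drop values that lie more than 1.5 times the interquartile range outside the quartiles. The choice of this rule, the factor 1.5 and the sample data are assumptions for this example; `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics


def suppress_outliers(values, factor=1.5):
    """Split values into (kept, removed) using the interquartile range rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    kept = [v for v in values if lower <= v <= upper]
    removed = [v for v in values if v < lower or v > upper]
    return kept, removed


response_times = [10, 12, 11, 13, 12, 95, 11, 12]
kept, removed = suppress_outliers(response_times)

# The mean illustrates the skew caused by a single extreme value.
mean_before = statistics.mean(response_times)
mean_after = statistics.mean(kept)
```

Here the single value 95 is removed, and the mean drops from 22 to roughly 11.6, much closer to the typical value. Whether an outlier should be dropped, capped or kept depends on the mining application.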

Tools

  • Google Cloud Dataprep, an intelligent, fully-managed cloud service (built in collaboration with Trifacta) that visually explores, cleans and prepares structured and unstructured data for analysis or training machine-learning models.
  • In Oracle, DBMS_DATA_MINING_TRANSFORM is a data transformation package that includes a variety of missing value and outlier treatments, as well as binning and normalization capabilities.
