(Statistics|Machine Learning|Data Mining) - (Unit|Individual|Case|Subject|Observation|Instance|Input)


About

A unit is the statistical counterpart of an entity (row, tuple, member, instance) in logical data modeling.

Each member of a sample is also known as:

  • a unit
  • an individual
  • a case
  • a subject
  • an instance
  • an observation
  • input data

Data contains values grouped into variables and observations.

Observations are formally composed of attributes (features), which together constitute a description of their characteristics.

An observation is the piece of input data for which an algorithm must generate an output value.
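As a minimal sketch of this vocabulary, the Python snippet below represents a small, made-up data set as a list of observations, each holding one value per variable. The column names (case_id, age, income) and the two helper functions are hypothetical and serve only as an illustration.

```python
# Each observation (unit, case, instance) is one row:
# a case ID plus one value per variable (attribute, feature).
observations = [
    {"case_id": 1, "age": 34, "income": 52000},
    {"case_id": 2, "age": 51, "income": 78000},
    {"case_id": 3, "age": 27, "income": 39000},
]


def variables(rows):
    """Return the variable names of the data set, excluding the case ID."""
    return [name for name in rows[0] if name != "case_id"]


def column(rows, name):
    """Return all values of one variable, one per observation."""
    return [row[name] for row in rows]


print(variables(observations))      # the variables (columns)
print(column(observations, "age"))  # one variable across all observations
```

Reading the data row by row gives the observations; reading it column by column gives the variables.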

Oracle Data Mining

The data that you wish to mine must be defined within a single table or view. The information for each record must be stored in a separate row. The data records are commonly called cases. Each case can be identified by a unique case ID. The table or view itself is referred to as a case table.

Oracle Data Mining requires that the data be presented as a case table in single-record case format. All the data for each record (case) must be contained within a row.

Most typically, the case table is a view that presents the data in the required format for mining.
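The idea of a case table exposed as a view can be sketched as follows. This example uses Python's built-in sqlite3 module as a stand-in for an Oracle database, so the SQL dialect is SQLite's, not Oracle's. The table names (customers, responses), the view name (case_table) and the column names are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, age INTEGER);
    CREATE TABLE responses (cust_id INTEGER, responded INTEGER);
    INSERT INTO customers VALUES (1, 34), (2, 51);
    INSERT INTO responses VALUES (1, 1), (2, 0);

    -- The case table: a view with one row per case and a unique case ID.
    CREATE VIEW case_table AS
        SELECT c.cust_id AS case_id, c.age, r.responded AS target
        FROM customers c
        JOIN responses r ON r.cust_id = c.cust_id;
""")

# Every case comes back as exactly one row (single-record case format).
case_rows = conn.execute(
    "SELECT case_id, age, target FROM case_table ORDER BY case_id"
).fetchall()
print(case_rows)
```

The source tables stay untouched; the view only presents their data in the one-row-per-case shape.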

Model Details

Model details reveal information about model attributes and their treatment by the algorithm. There is a separate GET_MODEL_DETAILS routine for each algorithm.

Nested Data

Oracle Data Mining requires a case table in single-record case format, with each record in a separate row.

Oracle Data Mining supports dimensioned/transactional data through nested columns. Each row in the nested column consists of an attribute name/value pair. Oracle Data Mining internally processes each nested row as a separate attribute.
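The statement that each nested row is treated as a separate attribute can be illustrated with a short Python sketch. This is not Oracle's implementation: the function expand_nested, the sample case and the "column.name" naming of the resulting attributes are assumptions made for the example.

```python
def expand_nested(case, nested_column):
    """Return a flat copy of `case` in which every (name, value) pair
    of the nested column becomes an attribute of its own."""
    # Keep the ordinary (non-nested) attributes as they are.
    flat = {key: value for key, value in case.items() if key != nested_column}
    # Turn each nested row (attribute name/value pair) into one attribute.
    for attr_name, attr_value in case[nested_column]:
        flat[f"{nested_column}.{attr_name}"] = attr_value
    return flat


# Hypothetical case with one nested column holding name/value pairs.
case = {"case_id": 7, "age": 40, "purchases": [("apples", 3), ("milk", 1)]}
print(expand_nested(case, "purchases"))
```

The case still occupies a single row, yet it can carry an arbitrary number of attributes through the nested column.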

Algorithms that support nested data:

Algorithm      Mining Function
Apriori        association rules
GLM            classification and regression
k-Means        clustering
MDL            attribute importance
Naive Bayes    classification
NMF            feature extraction
SVM            classification, regression, and anomaly detection

Note on data format

Previous versions of Oracle Data Mining allowed two distinct data formats:

  • Single Row per Record, in which all the information about an individual resides in a single row of the table/view.
  • Multiple Rows per Record (sometimes called “Transactional” format), in which information for a given individual may be found in several rows (for example, if each row represents an item purchased).

In ODM 10g Release 2 and ODM 11g Release 1, only Single Row per Record format is acceptable (except in the case of Association Rules).

The database feature called Nested Column is used to accommodate the use case previously handled by Transactional format.

The possibilities for gathering data are:

  • The case table or view contains all the data to be mined.
  • Other tables or views contain additional simple attributes of an individual, such as FIRST_NAME, LAST_NAME, etc.
  • Other tables or views contain complex attributes of an individual such as a list of products purchased or a list of telephone calls for a given period (sometimes called “transactional” data).
  • The data to be mined consists of transactional data only; in this case, the case table must be constructed from the transactional data, and might consist only of a column containing the unique identifiers for the individuals and a target column.

Transactional Data Only

In special situations such as Life Sciences problems, where each individual may have a very high number (perhaps thousands) of attributes, all the data is contained in a transactional-format table. This table must contain at least three columns: the unique case ID, the attribute name, and the attribute value.

For example, the attribute names may be gene expression names and the attribute values gene expression values. Typically, the attribute values have been normalized and binned to obtain binary values of 0 and 1 (representing, for example, that the gene expression for a particular case is above (1) or below (0) the average value for that gene). For each case, there is one attribute name and value pair representing the target value: for example, Target=1 means “responds to treatment” and Target=0 means “does not respond to treatment”.

Suppose that we have a transactional table LYMPH_OUTCOME_BINNED with 5591 gene expressions for each of 58 patients and the binary target OUTCOME (0/1) indicating the success in treating Lymphoma patients. The business problem consists of predicting the likely success in treating a particular patient based only on the values of gene expressions for that patient. The first step is to separate the case table information (ID, OUTCOME) from the gene information to be joined in as a nested column.
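The separation step can be sketched in Python on a tiny, made-up sample. The rows below are not taken from LYMPH_OUTCOME_BINNED; the gene names, the function split_case_table and the output field names are hypothetical, and the sketch only mirrors the logic, not the SQL one would write against the database.

```python
# Hypothetical transactional rows: (case ID, attribute name, attribute value).
transactions = [
    (1, "GENE_A", 1), (1, "GENE_B", 0), (1, "OUTCOME", 1),
    (2, "GENE_A", 0), (2, "GENE_B", 0), (2, "OUTCOME", 0),
]


def split_case_table(rows, target_name="OUTCOME"):
    """Separate the target from the other attributes and return one row
    per case, with the remaining name/value pairs kept as a nested list."""
    targets = {}  # case ID -> target value
    nested = {}   # case ID -> list of (attribute name, value) pairs
    for case_id, name, value in rows:
        if name == target_name:
            targets[case_id] = value
        else:
            nested.setdefault(case_id, []).append((name, value))
    # Join the nested attributes back onto the (ID, OUTCOME) case rows.
    return [
        {"case_id": cid, "outcome": targets[cid], "genes": nested.get(cid, [])}
        for cid in sorted(targets)
    ]


case_table = split_case_table(transactions)
for row in case_table:
    print(row)
```

Each patient ends up in exactly one row, which is the single-record case format described above, while the gene values travel along as name/value pairs.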
