(Statistics|Machine Learning|Data Mining) - (Unit|Individual|Case|Subject|Observation|Instance|Input)

1 - About

Each member of a sample is also known as:

  • a unit
  • an individual
  • a case
  • a subject
  • an instance
  • an observation
  • input data

Data contains values grouped into variables and observations.

Formally, each unit is composed of attributes (features) that together describe its characteristics. They can be stored:

For any algorithm, a unit is the piece of input data for which an output value must be generated.
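As a minimal sketch of this vocabulary, a sample can be modelled as a list of units, each described by its attributes, with an algorithm producing one output value per unit. The attribute names and the toy rule below are invented for illustration and do not come from any particular library.

```python
# Hypothetical sample: each unit (case, observation) is one dict of attributes.
sample = [
    {"age": 34, "income": 52000},
    {"age": 61, "income": 18000},
    {"age": 25, "income": 31000},
]


def toy_algorithm(unit):
    """Illustrative rule only: returns one output value for one input unit."""
    return "high" if unit["income"] > 30000 else "low"


# One output value is generated per unit of the sample.
outputs = [toy_algorithm(unit) for unit in sample]
print(outputs)
```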

3 - Oracle Data Mining

The data that you wish to mine must be defined within a single table or view. The information for each record must be stored in a separate row. The data records are commonly called cases. Each case can be identified by a unique case ID. The table or view itself is referred to as a case table.

Oracle Data Mining requires that the data be presented as a case table in single-record case format. All the data for each record (case) must be contained within a row.

Most typically, the case table is a view that presents the data in the required format for mining.
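The idea of a view acting as the case table can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in for an Oracle database, and the table and column names (customers, demographics, cust_id, responded) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE demographics (cust_id INTEGER PRIMARY KEY, age INTEGER, responded INTEGER);
    INSERT INTO customers VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO demographics VALUES (1, 34, 1), (2, 61, 0);

    -- The view plays the role of the case table: one row per case,
    -- identified by the unique case ID cust_id.
    CREATE VIEW case_table AS
    SELECT c.cust_id, c.last_name, d.age, d.responded
    FROM customers c
    JOIN demographics d ON d.cust_id = c.cust_id;
""")

rows = conn.execute("SELECT * FROM case_table ORDER BY cust_id").fetchall()
case_ids = [row[0] for row in rows]

# Single-record case format: every case ID appears exactly once.
print(rows, len(case_ids) == len(set(case_ids)))
```

The source tables stay untouched; only the view has to satisfy the one-row-per-case requirement.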

3.1 - Model Signature

The model signature is the set of data attributes used to build a model. Some or all of the attributes in the signature should be present for scoring.

If some columns are not present, they are disregarded. If columns with the same names but different data types are present, the model attempts to convert the data type.

The model signature does not necessarily include all the columns in the build data. Algorithm-specific criteria may cause the model to ignore certain columns. Other columns may be eliminated by transformations. Only the data attributes actually used to build the model are included in the signature.

The target and case ID columns are not included in the signature.
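The rules above can be illustrated with a small sketch. It is not Oracle's implementation: model_signature and prepare_for_scoring are hypothetical helpers, and the column names are invented. The sketch only shows the target, the case ID and ignored columns being left out of the signature, and scoring data being matched against it by name.

```python
def model_signature(build_columns, target, case_id, ignored=()):
    """Columns actually used to build the model (hypothetical helper).

    The target, the case ID and any ignored columns are left out.
    """
    excluded = {target, case_id, *ignored}
    return {name: dtype for name, dtype in build_columns.items()
            if name not in excluded}


def prepare_for_scoring(record, signature):
    """Keep the signature attributes found in the record, converting types."""
    prepared = {}
    for name, dtype in signature.items():
        if name in record:  # attributes absent from the record are skipped
            prepared[name] = dtype(record[name])
    return prepared


build_columns = {"cust_id": int, "age": int, "income": float,
                 "comments": str, "responded": int}
signature = model_signature(build_columns, target="responded",
                            case_id="cust_id", ignored=["comments"])

# "age" arrives as text and is converted; "income" is missing; "extra" is unknown.
scoring_row = {"cust_id": 7, "age": "42", "extra": "x"}
prepared = prepare_for_scoring(scoring_row, signature)
print(sorted(signature), prepared)
```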

3.2 - Model Details

Model details reveal information about model attributes and their treatment by the algorithm. There is a separate GET_MODEL_DETAILS routine for each algorithm.

3.3 - Nested Data

Oracle Data Mining requires a case table in single-record case format, with each record in a separate row.

Oracle Data Mining supports dimensioned/transactional data through nested columns. Each row in the nested column consists of an attribute name/value pair. Oracle Data Mining internally processes each nested row as a separate attribute.
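A rough way to picture this internal processing is to expand a nested column into one attribute per name/value pair. The helper expand_nested, the sample row and the "column.name" naming scheme below are assumptions made for the illustration, not Oracle's actual mechanism.

```python
def expand_nested(case_row, nested_column):
    """Turn each nested (name, value) pair into its own attribute (sketch)."""
    flat = {key: value for key, value in case_row.items()
            if key != nested_column}
    for attr_name, attr_value in case_row[nested_column]:
        # Prefix with the column name so attributes from different
        # nested columns cannot collide.
        flat[f"{nested_column}.{attr_name}"] = attr_value
    return flat


# One case with two ordinary attributes and one nested column.
case_row = {"cust_id": 1, "age": 34,
            "purchases": [("milk", 2), ("bread", 1)]}
flat = expand_nested(case_row, "purchases")
print(flat)
```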

Algorithms that support nested data:

Algorithm      Mining Function
Apriori        association rules
GLM            classification and regression
k-Means        clustering
MDL            attribute importance
Naive Bayes    classification
NMF            feature extraction
SVM            classification, regression, and anomaly detection

4 - Note on data format

Previous versions of Oracle Data Mining allowed two distinct data formats:

  • Single Row per Record, in which all the information about an individual resides in a single row of the table or view;
  • Multiple Rows per Record (sometimes called “Transactional” format), in which the information for a given individual may be spread over several rows (for example, if each row represents an item purchased).

In ODM 10g Release 2 and ODM 11g Release 1, only Single Row per Record format is acceptable (except in the case of Association Rules).

The database feature called Nested Column is used to accommodate the use case previously handled by Transactional format.

The possibilities for gathering data are:

  • The case table or view contains all the data to be mined.
  • Other tables or views contain additional simple attributes of an individual, such as FIRST_NAME, LAST_NAME, etc.
  • Other tables or views contain complex attributes of an individual such as a list of products purchased or a list of telephone calls for a given period (sometimes called “transactional” data).
  • The data to be mined consists of transactional data only; in this case, the case table must be constructed from the transactional data, and might consist only of a column containing the unique identifiers for the individuals and a target column.

5 - Transactional Data Only

In special situations, such as Life Sciences problems where each individual may have a very high number (perhaps thousands) of attributes, all the data is contained in a transactional-format table. This table must contain at least three columns: the unique case ID, the attribute name, and the attribute value.

For example, the attribute names may be gene expression names and the attribute values gene expression values. Typically, the attribute values have been normalized and binned to obtain binary values of 0 and 1 (representing, for example, that the gene expression for a particular case is above (1) or below (0) the average value for that gene). For each case, there is one attribute name and value pair representing the target value: for example, Target=1 means “responds to treatment” and Target=0 means “does not respond to treatment”.

Suppose that we have a transactional table LYMPH_OUTCOME_BINNED with 5591 gene expressions for each of 58 patients and the binary target OUTCOME (0/1) indicating the success in treating Lymphoma patients. The business problem is to predict the likely success in treating a particular patient based only on the values of the gene expressions for that patient. The first step is to separate the case table information (ID, OUTCOME) from the gene information, which is to be joined in as a nested column.
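The separation step can be sketched in plain Python, assuming the transactional rows are available as (case ID, attribute name, attribute value) tuples. The helper split_transactional and the patient and gene names are hypothetical, and the result only mimics the shape of a case table plus a nested column.

```python
from collections import defaultdict


def split_transactional(rows, target_name):
    """Split (case_id, attr_name, attr_value) rows into two structures.

    Returns the case table (case_id -> target value) and the remaining
    attributes grouped per case, ready to be attached as a nested column.
    """
    case_table = {}
    nested = defaultdict(list)
    for case_id, attr_name, attr_value in rows:
        if attr_name == target_name:
            case_table[case_id] = attr_value
        else:
            nested[case_id].append((attr_name, attr_value))
    return case_table, dict(nested)


# Tiny invented extract in the same shape as the transactional table.
rows = [
    ("P1", "GENE_A", 1), ("P1", "GENE_B", 0), ("P1", "OUTCOME", 1),
    ("P2", "GENE_A", 0), ("P2", "GENE_B", 1), ("P2", "OUTCOME", 0),
]
case_table, genes = split_transactional(rows, "OUTCOME")
print(case_table)
print(genes)
```

Keeping the target in the case table and the gene values in a per-case list mirrors the split between (ID, OUTCOME) and the nested gene information.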

data_mining/case.txt · Last modified: 2017/09/13 21:21 by gerardnico