Data Mining - Algorithms


About

An algorithm is a mathematical procedure for solving a specific kind of problem.

For some data mining functions, you can choose among several algorithms.

Data Mining Algorithm

List

Algorithm | Function | Type | Description
Decision Tree (DT) | Classification | Supervised | Decision trees extract predictive information in the form of human-understandable rules. The rules are if-then-else expressions; they explain the decisions that lead to the prediction.
Generalized Linear Models (GLM) | Classification and Regression | Supervised | GLM implements logistic regression for classification of binary targets and linear regression for continuous targets. GLM classification supports confidence bounds for prediction probabilities; GLM regression supports confidence bounds for predictions.
Minimum Description Length (MDL) | Attribute Importance | Supervised | MDL is an information-theoretic model selection principle. MDL assumes that the simplest, most compact representation of the data is its best and most probable explanation.
Naive Bayes (NB) | Classification | Supervised | Naive Bayes makes predictions using Bayes' theorem, which derives the probability of a prediction from the underlying evidence, as observed in the data.
Support Vector Machine (SVM) | Classification and Regression | Supervised | Distinct versions of SVM use different kernel functions to handle different types of data sets; linear and Gaussian (nonlinear) kernels are supported. SVM classification attempts to separate the target classes with the widest possible margin. SVM regression tries to find a continuous function such that the maximum number of data points lie within an epsilon-wide tube around it.
Apriori (AP) | Association | Unsupervised | Apriori performs market basket analysis by discovering co-occurring items (frequent itemsets) within a set. Apriori finds rules with support greater than a specified minimum support and confidence greater than a specified minimum confidence.
k-Means (KM) | Clustering | Unsupervised | k-Means is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters. Each cluster has a centroid (center of gravity), and the cases (individuals within the population) assigned to a cluster are close to its centroid. Oracle Data Mining supports an enhanced version of k-Means that goes beyond the classical implementation by defining a hierarchical parent-child relationship of clusters.
Non-Negative Matrix Factorization (NMF) | Feature Extraction | Unsupervised | NMF generates new attributes using linear combinations of the original attributes. The coefficients of the linear combinations are non-negative. During model apply, an NMF model maps the original data into the new set of attributes (features) discovered by the model.
One-Class Support Vector Machine (One-Class SVM) | Anomaly Detection | Unsupervised | One-class SVM builds a profile of one class and, when applied, flags cases that are somehow different from that profile. This allows for the detection of rare cases (such as outliers) that are not necessarily related to each other.
Orthogonal Partitioning Clustering (O-Cluster or OC) | Clustering | Unsupervised | O-Cluster creates a hierarchical, grid-based clustering model. The algorithm creates clusters that define dense areas in the attribute space. A sensitivity parameter defines the baseline density level.
Maximum Entropy (MaxEnt) | Classification | Supervised | MaxEnt chooses, among all probability distributions consistent with the observed data, the one with the highest entropy; multinomial logistic regression is a widely used MaxEnt classifier.
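
To illustrate the supervised/unsupervised distinction in the table, here is a minimal sketch that trains a supervised classifier (Naive Bayes) and an unsupervised clusterer (k-Means) on the same data with the Weka Java API. It assumes weka.jar is on the classpath and an ARFF file named iris.arff is available (both assumptions, not part of the original text); the class attribute is removed before clustering because an unsupervised algorithm does not use it.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class SupervisedVsUnsupervised {
    public static void main(String[] args) throws Exception {
        // Load a data set in ARFF format (the file name is an assumption).
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class

        // Supervised: Naive Bayes uses the class label during training.
        NaiveBayes nb = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));
        System.out.println(eval.toSummaryString("=== Naive Bayes (supervised) ===", false));

        // Unsupervised: k-Means ignores the class label, so remove it first.
        Remove removeClass = new Remove();
        removeClass.setAttributeIndices("last");
        removeClass.setInputFormat(data);
        Instances unlabeled = Filter.useFilter(data, removeClass);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);          // the number of clusters is chosen up front
        km.buildClusterer(unlabeled);
        System.out.println(km);        // prints the cluster centroids
    }
}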

Machine learning techniques:

Group method of data handling (GMDH) is a family of inductive algorithms for computer-based mathematical modeling of multi-parametric datasets that features fully automatic structural and parametric optimization of models.

Comparison

Weka

In the Experimenter, the result shows whether each scheme performs statistically better (marked v), equivalently (no mark), or worse (marked *) than the baseline scheme at a given significance level (5% and 1% are common). The "null hypothesis" is that the two classifiers perform the same.
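
As a minimal sketch of the test behind those markers, the code below computes a plain paired t statistic from per-run accuracies of two classifiers (the accuracy values are made up for illustration). Weka's Experimenter typically uses a corrected variant (the corrected resampled t-test) by default, but the null hypothesis and the overall logic are the same.

public class PairedTTest {
    // Paired t statistic for the null hypothesis "both classifiers perform the same".
    static double pairedT(double[] a, double[] b) {
        int n = a.length;
        double mean = 0;
        double[] d = new double[n];
        for (int i = 0; i < n; i++) { d[i] = a[i] - b[i]; mean += d[i]; }
        mean /= n;
        double var = 0;
        for (double di : d) var += (di - mean) * (di - mean);
        var /= (n - 1);                       // sample variance of the differences
        return mean / Math.sqrt(var / n);     // t with n-1 degrees of freedom
    }

    public static void main(String[] args) {
        // Hypothetical per-run accuracies of two classifiers on the same folds.
        double[] schemeA = {94.0, 96.0, 93.3, 95.3, 94.7};
        double[] schemeB = {33.3, 33.3, 33.3, 33.3, 33.3};
        double t = pairedT(schemeA, schemeB);
        // Compare |t| against the critical value for n-1 degrees of freedom at the
        // chosen significance level (5% or 1%); if larger, reject the null hypothesis.
        System.out.printf("t = %.3f%n", t);
    }
}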

Dataset                   (1) trees.J4 | (2) rules (3) rules (4) bayes (5) lazy. (6) funct (7) funct (8) meta.
--------------------------------------------------------------------------------------------------------------
iris                     (100)   94.73 |   92.53     33.33 *   95.53     95.40     97.07     96.27     95.40  
breast-cancer            (100)   74.28 |   66.91 *   70.30     72.70     72.85     67.77 *   69.52 *   71.62  
german_credit            (100)   71.25 |   65.91 *   70.00     75.16 v   71.88     75.24 v   75.09 v   71.27  
pima_diabetes            (100)   74.49 |   71.52     65.11 *   75.75     70.62     77.47     76.80     74.92  
Glass                    (100)   67.63 |   57.40 *   35.51 *   49.45 *   69.95     62.84     57.36 *   44.89 *
ionosphere               (100)   89.74 |   82.28 *   64.10 *   82.17 *   87.10     87.72     88.07     90.89  
--------------------------------------------------------------------------------------------------------------
                               (v/ /*) |   (0/2/4)   (0/2/4)   (1/3/2)   (0/6/0)   (1/4/1)   (1/3/2)   (0/5/1)
                               
Key:
(1) trees.J48 '-C 0.25 -M 2' -217733168393644444
(2) rules.OneR '-B 6' -3459427003147861443
(3) rules.ZeroR '' 48055541465867954
(4) bayes.NaiveBayes '' 5995231201785697655
(5) lazy.IBk '-K 1 -W 0 -A \"weka.core.neighboursearch.LinearNNSearch -A \\\"weka.core.EuclideanDistance -R first-last\\\"\"' -3080186098777067172
(6) functions.Logistic '-R 1.0E-8 -M -1' 3932117032546553727
(7) functions.SMO '-C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K \"functions.supportVector.PolyKernel -C 250007 -E 1.0\"' -6585883636378691736
(8) meta.AdaBoostM1 '-P 100 -S 1 -I 10 -W trees.DecisionStump' -7378107808933117974
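
A single cell of the table above can be approximated programmatically. The sketch below cross-validates scheme (1), trees.J48 with the options from the key, on the iris data using the Weka Java API (the file name iris.arff and the random seed are assumptions). The Experimenter aggregates many runs per dataset (the "(100)" column), so a single run will not match the tabulated average exactly.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ReproduceOneCell {
    public static void main(String[] args) throws Exception {
        // The data set path is an assumption; the UCI iris ARFF file ships with Weka.
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Same scheme and options as entry (1) in the key: trees.J48 '-C 0.25 -M 2'.
        J48 j48 = new J48();
        j48.setOptions(new String[] {"-C", "0.25", "-M", "2"});

        // One 10-fold cross-validation run with an arbitrary seed.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(j48, data, 10, new Random(1));
        System.out.printf("Percent correct: %.2f%n", eval.pctCorrect());
    }
}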
