Data Mining - (Classifier|Classification Function)
1 - About
Rows are classified into buckets. For instance, if data has feature x, it goes into bucket one; if not, it goes into bucket two.
The target attribute takes one of k possible class values.
To summarize the results of a classifier, a confusion matrix may be used.
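As a minimal sketch of how a confusion matrix summarizes a classifier's results, the Python snippet below counts (actual, predicted) label pairs. The function name `confusion_matrix` and the six labels are made up for illustration and do not come from any particular library or data set.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))

# Hypothetical labels for six rows (made-up data, for illustration only).
actual    = ["yes", "yes", "no", "no",  "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]

matrix = confusion_matrix(actual, predicted)
labels = sorted(set(actual) | set(predicted))

# Rows are actual classes, columns are predicted classes.
for a in labels:
    print(a, [matrix[(a, p)] for p in labels])

# Accuracy is the share of pairs on the diagonal (actual == predicted).
accuracy = sum(matrix[(c, c)] for c in labels) / len(actual)
```

The off-diagonal cells show which kind of mistake the classifier makes, which a single accuracy figure hides.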
2 - Articles Related
3 - Type
In classification, there are three kinds of problems:
3.1 - One-class
One-class classification: see Data Mining - (Anomaly|outlier) Detection
3.2 - Binary
A two-class problem (binary problem) has only two possible outcomes:
- “yes” or “no”
- “success” or “failure”
3.3 - Multi-class
A multi-class problem has more than two possible outcomes.
4 - Example
|Example||Prediction||Problem Type|
|Filter Spam||Yes or No||Binary Classification|
|Purchasing Product X||Yes or No||Binary Classification|
|Defaulting on a loan||Yes or No||Binary Classification|
|Failing in the manufacturing process||Yes or No||Binary Classification|
|Producing revenue||Low, Medium, High||Multi-class Classification|
|Differing from known cases||Yes or No||One-class Classification|
5 - Mathematical Notation
<math> C(X) \in C </math> where:
- X is a feature vector
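- Y is a qualitative response taking values in the set C of classes
The classification function C(X) takes a feature vector X and returns a predicted value for Y, which is always a member of the set C.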
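To make the notation concrete, here is a small sketch of a classification function in Python. The `CLASSES` set, the `spend` feature, and the cut-off values are all hypothetical and chosen only to show a mapping from a feature vector to one member of a fixed set of classes.

```python
# The set of classes the response Y can take (hypothetical labels).
CLASSES = {"low", "medium", "high"}

def classify(x):
    """A toy classifier C(X): maps a feature vector to one class in CLASSES.

    x is assumed to be a dict with a numeric "spend" feature; the cut-offs
    are arbitrary and only there to make the mapping concrete.
    """
    if x["spend"] < 100:
        return "low"
    if x["spend"] < 1000:
        return "medium"
    return "high"

customers = [{"spend": 20}, {"spend": 450}, {"spend": 5000}]
predictions = [classify(x) for x in customers]

# Every prediction is a member of the set of classes.
assert all(p in CLASSES for p in predictions)
```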
6 - Probabilities
Often we are more interested in estimating the probabilities (confidence) that X belongs to each category in C.
For example, an estimate of the probability that an insurance claim is fraudulent is more valuable than a bare classification of the claim as fraudulent or not.
Suppose one claim has an estimated probability of 0.9 of being fraudulent and another has 0.98. Both may be above the threshold for flagging a claim as fraudulent. But if investigating a claim costs several hours, you will probably look into the 0.98 claim before the 0.9 one. Estimating the probabilities is therefore also key.
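The insurance example can be sketched in a few lines of Python. The claim identifiers, the third probability, and the 0.5 threshold are assumptions added for illustration; only the 0.9 and 0.98 values come from the text.

```python
THRESHOLD = 0.5  # assumed cut-off for flagging a claim

# Hypothetical estimated fraud probabilities per claim id.
fraud_probability = {"claim_a": 0.90, "claim_b": 0.98, "claim_c": 0.12}

# A hard classification only says which claims are flagged.
flagged = {cid for cid, p in fraud_probability.items() if p >= THRESHOLD}

# The probabilities also give an order in which to investigate them.
investigation_order = sorted(flagged, key=fraud_probability.get, reverse=True)
print(investigation_order)
```

Both flagged claims receive the same label, but only the probabilities say which one to investigate first.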
7 - Data Analysis
8 - Data structure
Most of the algorithms are based on one of these data structures (knowledge representations):
- Training set: the training instances themselves are stored, and the distance from a new instance to them is calculated: Machine Learning - Rote Classifier and Machine Learning - K-Nearest Neighbors (KNN) algorithm - Instance based learning
- Regression: the result depends on a linear combination of the attributes: Machine Learning - Linear (Regression|Model)
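The two representations listed above can be contrasted with a short Python sketch. The training points, the value of k, and the hand-picked weights and bias are hypothetical; a real linear model would estimate its weights from the data.

```python
import math

# Hypothetical training set: (feature vector, class label) pairs.
training = [((1.0, 1.0), "no"), ((1.5, 2.0), "no"),
            ((5.0, 5.0), "yes"), ((6.0, 5.5), "yes")]

def knn_classify(x, k=3):
    """Instance-based: vote among the k training instances closest to x."""
    nearest = sorted(training, key=lambda item: math.dist(x, item[0]))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def linear_classify(x, weights=(1.0, 1.0), bias=-6.0):
    """Linear model: the class depends on a weighted sum of the attributes.

    The weights and bias are hand-picked assumptions, not fitted values.
    """
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return "yes" if score > 0 else "no"

print(knn_classify((5.5, 5.0)), linear_classify((5.5, 5.0)))
```

The instance-based classifier has to keep the whole training set around at prediction time, whereas the linear one only needs its weights and bias.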
9 - Logistic Regression versus LDA
- Logistic regression uses the conditional likelihood based on Pr(Y|X) (known as discriminative learning).
- LDA uses the full likelihood based on the joint distribution Pr(X, Y) (known as generative learning).
- Despite these differences, in practice the results are often very similar.
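Written out for n training observations (x_i, y_i), the two criteria can be sketched as follows. The parameter symbol \theta and the index i are notation introduced here for illustration, and Pr stands for a probability or a density as appropriate.

<math> L_{\text{conditional}}(\theta) = \prod_{i=1}^{n} \Pr(Y = y_i \mid X = x_i; \theta) </math>

<math> L_{\text{full}}(\theta) = \prod_{i=1}^{n} \Pr(X = x_i, Y = y_i; \theta) = \prod_{i=1}^{n} \Pr(X = x_i \mid Y = y_i; \theta) \, \Pr(Y = y_i; \theta) </math>

The discriminative criterion models only how Y depends on X, while the generative one also models the distribution of X within each class together with the class priors.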
10 - Summary
- Logistic regression is very popular for classification, especially when the number of classes K = 2.
- LDA is useful when the number of observations n is small, or the classes are well separated, and Gaussian assumptions are reasonable. It is also useful when K > 2.
- Naive Bayes is useful when the number of features p is very large.
11 - Model Accuracy
12 - Others
13 - Example
Given demographic data about a set of customers, predict each customer's response to an affinity card program.