Classification in Data Mining

Businesses collect enormous amounts of data every day: transactions, customer behavior, sensor readings, website activity, and more.

However, raw data alone doesn't create value. Data mining methods extract value from data by answering questions that involve prediction, labeling, or both. One of these methods is classification.

 

What is Classification?

Classification is a data processing technique that sorts data into groups (classes) according to the characteristics that items share [1], [2]. These groupings are not random guesses. Instead, a model learns from past labeled examples and uses what it has learned to predict labels for new data.

For example:

Problem            | Classification Output
Email filtering    | Spam / Not Spam
Loan approval      | Approve / Reject
Customer churn     | Stay / Leave
Product quality    | Defective / Non-Defective
Medical diagnosis  | [Disease] / Healthy

Classification outputs are discrete classes rather than continuous numeric values. This is what distinguishes classification from regression, which predicts a number.
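To make the contrast concrete, here is a minimal sketch of the two kinds of output. Both functions are hypothetical and hand-written rather than learned from data; the cutoff and the linear formula are assumptions chosen only for illustration.

```python
def classify_churn(monthly_hours: float) -> str:
    """Classification-style output: one label from a fixed set of classes."""
    # Hypothetical cutoff, chosen only for illustration.
    return "Leave" if monthly_hours < 5.0 else "Stay"


def estimate_revenue(monthly_hours: float) -> float:
    """Regression-style output: a continuous number."""
    # Hypothetical linear formula, chosen only for illustration.
    return 2.5 * monthly_hours + 10.0


print(classify_churn(3.0))    # a class label
print(estimate_revenue(3.0))  # a numeric value
```

The first function can only ever return "Stay" or "Leave", while the second can return any number.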

 

How Classification Works

A classification model follows three main stages:

1. Training

At this stage, the model studies historical labeled data.

Example dataset:

Customer Age | Monthly Usage | Churn
21           | Low           | Yes
45           | High          | No
30           | Medium        | No

The algorithm then learns patterns that connect the features (age and monthly usage) to the outcome (churn).

2. Learning the Pattern

The algorithm builds decision rules from the training data. From the example dataset, it may learn that young customers with low usage are more likely to churn. This step is also known as the "intelligence building" phase.
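As a rough illustration of how such a rule could be found, the sketch below searches for an age threshold that best separates churners from non-churners in the example dataset. It is a simplified single-split search, not a full decision tree algorithm, and it assumes the rule has the form "age at or below the threshold means churn". The function name `learn_age_rule` is hypothetical.

```python
# Toy training data from the example table: (age, churn label).
rows = [(21, "Yes"), (45, "No"), (30, "No")]


def learn_age_rule(rows):
    """Find the threshold t for which the rule
    'age <= t -> Yes, otherwise No' makes the fewest training errors."""
    ages = sorted(age for age, _ in rows)
    # Candidate thresholds: midpoints between consecutive sorted ages.
    candidates = [(a + b) / 2 for a, b in zip(ages, ages[1:])]
    best_threshold, best_errors = None, len(rows) + 1
    for t in candidates:
        errors = sum(
            1 for age, label in rows
            if ("Yes" if age <= t else "No") != label
        )
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


threshold, errors = learn_age_rule(rows)
print(f"IF age <= {threshold} THEN churn = Yes ({errors} training errors)")
```

With only three rows, the learned threshold fits the training data perfectly but should not be trusted to generalize; real datasets are much larger.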

3. Prediction

Once the model has learned these patterns, it is given new data and predicts the outcome.

New data example:

Customer Age | Monthly Usage
24           | Low

The model will most likely predict that this customer will churn (churn = Yes), because the customer is young and has low monthly usage. In the historical data, customers with this profile churned.
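The three stages above can be sketched end to end in a few lines. This is a deliberately simple illustration, not a production method: it learns the most common churn label for each usage level and ignores age entirely. The helper names `fit` and `predict`, and the fallback label for unseen usage levels, are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Stage 1 (training): the labeled example table from above.
training_rows = [
    {"age": 21, "usage": "Low", "churn": "Yes"},
    {"age": 45, "usage": "High", "churn": "No"},
    {"age": 30, "usage": "Medium", "churn": "No"},
]


def fit(rows, feature, target):
    """Stage 2: learn the most common target label for each feature value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def predict(rules, row, feature, default="No"):
    """Stage 3: look up the label for a new row.
    The default for unseen feature values is an assumption."""
    return rules.get(row[feature], default)


rules = fit(training_rows, feature="usage", target="churn")
new_customer = {"age": 24, "usage": "Low"}
prediction = predict(rules, new_customer, feature="usage")

print(rules)
print(prediction)
```

For the new customer with low usage, the lookup returns "Yes", which matches the prediction described above.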

 

Common Classification Algorithms

  • Decision Tree: Uses branching logic to create human-readable rules that are easy to interpret. An example rule: "IF usage is low AND complaints are high, THEN the customer will churn."
  • k-Nearest Neighbor (kNN): Classifies a data point based on the labels of the most similar (nearest) observations.
  • Logistic Regression: Estimates the probability that an outcome will occur. For example, the model may report an 82% chance of customer churn.
  • Naïve Bayes: A probability-based classifier that assumes the features are statistically independent given the class. It is fast and effective for text classification, such as spam detection or sentiment analysis.
  • Support Vector Machine (SVM): Separates classes using the boundary that leaves the widest margin between them. It works well for complex patterns and high-dimensional data.
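Three of these ideas are small enough to sketch in plain Python: a decision-tree-style rule, a kNN majority vote, and the logistic function that turns a score into a probability. Everything below is illustrative only. The rule is hand-written rather than learned, the 2-D training points are made up, and the weight and bias are hypothetical values picked so the result lands near the 82% figure mentioned above. Naïve Bayes and SVM are left out for brevity.

```python
import math
from collections import Counter


def tree_rule_predict(usage: str, complaints: str) -> str:
    """Decision-tree-style rule, hand-written here rather than learned.
    The 'stay' branch is an assumption for every other case."""
    if usage == "low" and complaints == "high":
        return "churn"
    return "stay"


def knn_predict(train, query, k=3):
    """kNN: majority vote among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def logistic_probability(features, weights, bias):
    """Logistic regression scoring: weighted sum passed through a sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Made-up 2-D feature points with labels, for illustration only.
train = [
    ((1, 1), "churn"), ((1, 2), "churn"), ((2, 1), "churn"),
    ((8, 8), "stay"), ((9, 8), "stay"), ((8, 9), "stay"),
]

print(tree_rule_predict("low", "high"))
print(knn_predict(train, (2, 2)))

# Hypothetical weight and bias; a trained model would estimate these from data.
p = logistic_probability([1.0], weights=[2.0], bias=-0.5)
print(f"{p:.0%}")
```

In practice, these models are usually trained with an established library rather than written by hand; the sketch only shows the core idea behind each prediction.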

 

References

[1] D. Vaya and T. Hadpawat, “Classification in Data Mining: A Survey,” International Journal of Advanced Science and Technology, vol. 29, no. 3, pp. 13061–13071, Jan. 2020.

[2] N. a’yuni Ramadhani and H. A. Rosyid, “Review: Algoritma Data Mining untuk Klasifikasi Data,” Jurnal Inovasi Teknologi Dan Edukasi Teknik, vol. 2, no. 12, pp. 550–556, Dec. 2022, doi: 10.17977/um068v2i122022p550-556.