Advanced Re-Sampling Techniques for Multi-Class Imbalanced Classification
Abstract
Imbalanced classification is a common problem in machine learning, where one class significantly outnumbers the others. This imbalance leads to biased model performance, where the classifier favors the majority class, resulting in poor detection of the minority class. Traditional machine learning algorithms assume a balanced distribution, making them ineffective in such scenarios. Various techniques, including resampling methods (such as oversampling and undersampling), cost-sensitive learning, and synthetic data generation, have been proposed to address this challenge. Effective handling of imbalanced data is crucial in applications like fraud detection, medical diagnosis, and anomaly detection, where minority class predictions hold high significance. This study explores different approaches to mitigate class imbalance and improve classification performance, ensuring better generalization and robustness in real-world scenarios.
Introduction: Class imbalance skews model predictions and renders plain accuracy metrics misleading, as is often the case in healthcare and fraud detection. Resampling techniques such as SMOTE and its derivatives create synthetic data to balance the classes and support better learning. Variants including Borderline-SMOTE, ADASYN, SMOTEENN, and SMOTETomek have helped to improve decision boundaries and reduce noise. By addressing the unfair under-representation of the minority class in feature space, these techniques facilitate the creation of better models.
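The core mechanism shared by SMOTE and its variants is interpolation between a minority sample and one of its nearest minority neighbours. As a minimal sketch of that idea (a simplified illustration, not the full SMOTE algorithm or any particular library's implementation; the function name and parameters are hypothetical):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest minority
    neighbours (the core idea behind SMOTE, sketched in pure NumPy)."""
    rng = np.random.default_rng(rng)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # exclude each point itself
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per sample
    base = rng.integers(0, len(X_min), n_new)  # random base samples
    nbr = nn[base, rng.integers(0, k, n_new)]  # random neighbour of each base
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])
```

Each synthetic point lies on the segment between a minority sample and a minority neighbour, so new samples stay inside the minority region rather than duplicating existing points.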
Methodology: The GMM-SMOTE method addresses imbalanced datasets by utilizing Gaussian Mixture Model (GMM) for clustering and applying SMOTE to oversample minority data in high-density areas. This approach involves clustering data, selecting clusters with significant minority presence, and generating synthetic samples to ensure better balance. GMM enhances clustering by assigning probabilities to data points, while SMOTE focuses on producing samples in less populated regions, effectively reducing noise and improving class representation and model performance in imbalanced situations.
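The clustering-then-oversampling pipeline described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the function name, the `min_frac` cluster-selection threshold, and the pairwise-interpolation step are all assumptions made for the sketch, using scikit-learn's `GaussianMixture` for the clustering stage.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_smote(X, y, minority=1, n_components=3, min_frac=0.2, seed=0):
    """Sketch of the GMM-SMOTE idea: cluster the data with a Gaussian
    Mixture Model, keep clusters whose minority fraction is at least
    min_frac (an illustrative threshold), and interpolate new minority
    samples inside those clusters until the classes are balanced."""
    rng = np.random.default_rng(seed)
    labels = GaussianMixture(n_components=n_components,
                             random_state=seed).fit_predict(X)
    n_new = int(np.sum(y != minority) - np.sum(y == minority))
    # clusters with enough minority presence to interpolate within
    eligible = [c for c in range(n_components)
                if np.mean(y[labels == c] == minority) >= min_frac
                and np.sum((labels == c) & (y == minority)) >= 2]
    if not eligible or n_new <= 0:
        return X, y
    new_samples = []
    for _ in range(n_new):
        c = eligible[rng.integers(len(eligible))]
        pts = X[(labels == c) & (y == minority)]
        i, j = rng.choice(len(pts), 2, replace=False)
        new_samples.append(pts[i] + rng.random() * (pts[j] - pts[i]))
    X_new = np.vstack([X, np.array(new_samples)])
    y_new = np.concatenate([y, np.full(n_new, minority)])
    return X_new, y_new
```

Restricting interpolation to clusters with a significant minority presence is what keeps synthetic samples in high-density minority regions and away from noisy boundary points.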
Results: The study evaluates GMM-SMOTE against several oversampling techniques, including KMeans-SMOTE, KMeans-ADASYN, and GMM-ADASYN, on datasets such as Breast Cancer, Crx, and Churn BigML. Performance is measured by accuracy, AUC-ROC score, and computational efficiency across classifiers including Random Forest, SVM, Logistic Regression, and Neural Networks. The results show that GMM-SMOTE improves classification through balanced decision boundaries while remaining efficient in training time, making it advantageous for managing imbalanced datasets.
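An evaluation harness of the kind used here can be sketched with scikit-learn. This is a generic illustration, not the paper's experimental setup: it uses scikit-learn's built-in Breast Cancer dataset as a stand-in and two of the classifiers named above, reporting the same accuracy and AUC-ROC metrics.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# stratified split keeps the class ratio intact in train and test sets
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

results = {}
for clf in (RandomForestClassifier(random_state=0),
            LogisticRegression(max_iter=5000)):
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    # AUC-ROC uses the predicted probability of the positive class
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    results[type(clf).__name__] = (acc, auc)
    print(f"{type(clf).__name__}: acc={acc:.3f}, auc={auc:.3f}")
```

The same loop extends naturally to comparing resamplers: apply each oversampling method to `X_tr, y_tr` only (never to the test split) before fitting.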
Conclusions: The study assesses the effectiveness of GMM-SMOTE in enhancing minority class representation and maintaining balanced decision boundaries compared to traditional oversampling methods such as SMOTE and ADASYN. GMM-SMOTE generates more meaningful synthetic samples and mitigates overfitting. Future research will focus on adaptive parameter tuning, integration with deep learning, and real-time applications, with additional exploration of its effects on multi-class imbalance and computational efficiency. Overall, GMM-SMOTE stands out as a valuable resampling method for improving classification performance in imbalanced datasets.