Hybrid random forest–catboost ensemble for heart disease prediction on imbalanced datasets: Toward applications in military health systems

Mahyus Ihsan; Zahnur; Iftahul Fadlan; Ikhsan Maulidi

doi:10.58524/app.sci.def.v4i1.1148

Authors

Mahyus Ihsan, Universitas Syiah Kuala
Zahnur, Universitas Syiah Kuala
Iftahul Fadlan, Universitas Syiah Kuala
Ikhsan Maulidi, Universitas Syiah Kuala

Keywords:

CatBoost, Ensemble learning, Heart disease prediction, Machine learning, Random Forest

Abstract

Background: Heart disease is one of the main causes of death worldwide, with cases increasing every year. This situation highlights the urgent need for early detection systems that are not only fast but also accurate and reliable. In recent years, machine learning has emerged as a promising alternative approach for analyzing medical data, particularly for disease classification and risk prediction tasks.

Aims: This study aims to develop a heart disease prediction model by integrating Random Forest and CatBoost in a hybrid ensemble framework and evaluating its performance on an imbalanced medical dataset.

Method: This study used a quantitative, supervised learning approach on the Behavioral Risk Factor Surveillance System (BRFSS) 2021 dataset, which consists of more than 300,000 observations. Data preprocessing included duplicate removal, BMI categorization, encoding of categorical variables, and exploratory analysis. To address class imbalance, the Borderline-SMOTE technique was applied before the dataset was divided with an 80:20 train-test split. Random Forest and CatBoost models were then trained and combined using a soft voting ensemble.
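The soft voting step named in the Method can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function name `soft_vote`, the equal default weights, and the probability values are all hypothetical, chosen only to show how averaging class probabilities from two models yields the ensemble prediction.

```python
def soft_vote(prob_sets, weights=None):
    """Average per-class probabilities across models, then pick the argmax.

    prob_sets: one list per model; each holds a [P(class 0), P(class 1), ...]
    row per sample. Equal weights are an assumption, not taken from the study.
    """
    if weights is None:
        weights = [1.0] * len(prob_sets)
    total = sum(weights)
    n_samples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])

    averaged = []
    for i in range(n_samples):
        row = [
            sum(w * probs[i][c] for w, probs in zip(weights, prob_sets)) / total
            for c in range(n_classes)
        ]
        averaged.append(row)

    # Predicted label = index of the largest averaged probability.
    labels = [max(range(n_classes), key=row.__getitem__) for row in averaged]
    return averaged, labels


# Hypothetical [P(no disease), P(disease)] outputs for three patients.
rf_probs = [[0.90, 0.10], [0.40, 0.60], [0.55, 0.45]]  # stand-in for Random Forest
cb_probs = [[0.80, 0.20], [0.30, 0.70], [0.35, 0.65]]  # stand-in for CatBoost

avg, labels = soft_vote([rf_probs, cb_probs])
print(labels)
```

For the third hypothetical patient the two models disagree, and the averaged probability favors the positive class, which is how a soft voting ensemble can differ from either base model on borderline cases.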

Result: The evaluation results indicate that Random Forest achieved the highest accuracy, 0.94, with well-balanced precision and recall across all classes. CatBoost demonstrated relatively stable performance, with an accuracy of around 0.84. The ensemble approach achieved an accuracy of 0.91 with strong metric stability and good sensitivity to positive cases.
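Because the results are reported as accuracy alongside precision and recall, the following sketch shows how these metrics are computed and why accuracy alone can mislead on imbalanced data. The labels and predictions are invented for illustration and do not come from the BRFSS data; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when no positives are predicted/present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical imbalanced sample: 90 negatives followed by 10 positives.
y_true = [0] * 90 + [1] * 10

# A classifier that never predicts the positive class.
always_negative = [0] * 100
baseline = binary_metrics(y_true, always_negative)

# A classifier that flags 12 negatives by mistake but finds 8 of 10 positives.
screening = [1] * 12 + [0] * 78 + [1] * 8 + [0] * 2
model = binary_metrics(y_true, screening)

print(baseline)
print(model)
```

In this invented sample the always-negative baseline scores higher on accuracy (0.90 against 0.86) while detecting no positive cases at all, which is why recall (sensitivity) is reported alongside accuracy for screening tasks.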

Conclusion: The results indicate that Random Forest performs best for the dataset used in this study, while the ensemble model provides a balanced compromise between predictive performance and robustness. The analysis also shows that Age Category, General Health, and BMI are the most influential predictors of heart disease risk. This model can support early cardiovascular risk detection in military personnel, contributing to maintaining operational readiness in defense systems. Furthermore, the proposed approach provides a reliable decision-support tool for large-scale medical screening in resource-constrained healthcare environments.
