Machine Learning Approach for Bottom 40 Percent Households (B40) Poverty Classification

Nor Samsiah Sani, Mariah Abdul Rahman, Azuraliza Abu Bakar, Shahnurbanon Sahran, Hafiz Mohd Sarim

Abstract


Malaysian citizens are categorised into three income groups: the Top 20 Percent (T20), Middle 40 Percent (M40), and Bottom 40 Percent (B40). One of the focus areas in the Eleventh Malaysia Plan (11MP) is to elevate the B40 household group towards a middle-income society. Based on recent studies by the World Bank, Malaysia is expected to attain high-income economy status no later than 2024. It is therefore essential to characterise the B40 population through predictive classification as a prerequisite for developing a comprehensive government action plan. This paper aims to identify the best machine learning model among the Naive Bayes, Decision Tree and k-Nearest Neighbors algorithms for classifying the B40 population. Several data pre-processing tasks, namely data cleaning, feature engineering, normalisation, feature selection (Correlation Attribute, Information Gain Attribute and Symmetrical Uncertainty Attribute evaluation) and sampling using SMOTE, were applied to the raw dataset to ensure the quality of the training data. Each classifier was then optimised with different tuning parameters under 10-Fold Cross-Validation to obtain its optimal settings before the performance of the three classifiers was compared. For the experiments, a dataset from the National Poverty Data Bank (eKasih), obtained from the Society Wellbeing Department, Implementation Coordination Unit of the Prime Minister's Department (ICU JPM) and consisting of 99,546 households from three states (Johor, Terengganu and Pahang), was used to train each machine learning model. The experimental results using 10-Fold Cross-Validation show that the Decision Tree model outperformed the other models, and the significance test confirmed that the result is statistically significant.
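
The workflow summarised above, normalisation, filter-based feature selection, SMOTE resampling, and 10-fold cross-validated parameter tuning of Naive Bayes, Decision Tree and k-Nearest Neighbors, can be sketched as follows. This is a minimal illustration using scikit-learn and imbalanced-learn, not the authors' implementation: the CSV path, the "b40" label column, the number of selected features (k=20), and the tuning grids are assumptions, mutual information stands in for the Information Gain evaluator, and the eKasih data are not publicly available.

```python
# Illustrative sketch of the B40 classification workflow (assumptions noted above).
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical cleaned household data: numeric features plus a binary B40 label.
data = pd.read_csv("ekasih_households_clean.csv")
X = data.drop(columns=["b40"])
y = data["b40"]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Candidate classifiers with example tuning grids (grids are assumptions,
# not the paper's exact settings).
candidates = {
    "naive_bayes": (GaussianNB(), {"clf__var_smoothing": [1e-9, 1e-8, 1e-7]}),
    "decision_tree": (
        DecisionTreeClassifier(random_state=42),
        {"clf__max_depth": [5, 10, 20, None], "clf__min_samples_leaf": [1, 5, 10]},
    ),
    "knn": (
        KNeighborsClassifier(),
        {"clf__n_neighbors": [3, 5, 7, 11], "clf__weights": ["uniform", "distance"]},
    ),
}

for name, (clf, grid) in candidates.items():
    # Normalisation, filter-based feature selection, and SMOTE sit inside the
    # pipeline so they are re-fitted on each training fold only (no test-fold leakage).
    pipe = Pipeline(
        steps=[
            ("scale", MinMaxScaler()),
            ("select", SelectKBest(score_func=mutual_info_classif, k=20)),
            ("smote", SMOTE(random_state=42)),
            ("clf", clf),
        ]
    )
    search = GridSearchCV(pipe, grid, cv=cv, scoring="f1", n_jobs=-1)
    search.fit(X, y)
    print(f"{name}: best F1 = {search.best_score_:.3f}, params = {search.best_params_}")
```

Placing SMOTE inside the cross-validation pipeline, rather than resampling the whole dataset beforehand, keeps the held-out folds free of synthetic samples; the per-fold scores collected this way can then feed a paired significance test between classifiers.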


Keywords


Bottom 40, B40 Classification, Poverty, Decision Tree, k-Nearest Neighbors, Naïve Bayes, Parameter Tuning



DOI: http://dx.doi.org/10.18517/ijaseit.8.4-2.6829
