Development of Rule-Based Feature Extraction in Multi-label Text Classification

Gugun Mediamer, Adiwijaya, Said Al Faraby


Hadith is the second main guideline in Islam after the Holy Quran and was revealed through the Messenger of Allah. A Hadith can belong to more than one class, such as advice, prohibition, and information, and labeling it this way helps readers filter the Hadith of Rasulullah SAW by the classes they need. Text classification research involves many kinds of data, so each dataset requires handling suited to its characteristics. This study investigates the handling of multi-label data, namely the Indonesian translation of Hadith Bukhari, focusing on feature extraction, feature weighting, and preprocessing methods. It combines a rule-based feature extraction with several preprocessing configurations and three feature weighting methods: TF-IDF, Word2vec, and Word2vec weighted with TF-IDF. The five preprocessing stages are case folding, tokenization, punctuation removal, stopword removal, and stemming. Across the 13 experiments conducted on 2000 hadiths, the best multi-label classification performance was produced by the combination of the proposed rule-based feature extraction, the Word2vec feature weighting method, and preprocessing without stemming and stopword removal. The Hamming Loss obtained from this combination was 0.0623. The results show that the proposed rule-based feature extraction method performs better than the baseline method.
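For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how Hamming Loss is commonly computed for multi-label predictions: the fraction of individual label assignments that are wrong, taken over all documents and all labels. The function name, the label order (advice, prohibition, information), and the toy data are illustrative assumptions and are not taken from the study.

```python
def hamming_loss(y_true, y_pred):
    """Return the fraction of label slots where prediction and truth differ.

    y_true and y_pred are lists of equal-length binary label vectors,
    one vector per document.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0
    wrong = 0
    for truth, pred in zip(y_true, y_pred):
        if len(truth) != len(pred):
            raise ValueError("label vectors must have the same length")
        total += len(truth)
        wrong += sum(t != p for t, p in zip(truth, pred))
    return wrong / total


# Hypothetical label order: [advice, prohibition, information].
# Toy data only; these are not results from the paper.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

# Two of the twelve label slots disagree.
loss = hamming_loss(y_true, y_pred)
print(round(loss, 4))
```

Lower values are better. Under this definition, the reported value of 0.0623 would mean that about 6.2% of the individual label assignments were incorrect.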


multi-label classification; Bukhari Hadith; feature-weighted; tf-idf; word2vec; hamming loss.





Published by INSIGHT - Indonesian Society for Knowledge and Human Development