Categorization of Malay Social Media Text and Normalization of Spelling Variations and Vowel-less Words

Ruhaila Maskat, Nurazzah Abdul Rahman


As more data are introduced, they bring missing values, inconsistencies, and heterogeneities, the so-called unclean aspects of data. Text analytics relies on clean data to produce reliable results, which makes pre-processing, specifically language detection and normalization, an essential phase. The difficulty with conducting text analytics on Malay social media text is that it has departed substantially from formal Malay in spelling and sentence construction, which makes it hard to process. Recent work has normalized only selected types of Malay social media text, described through simple and narrow categorizations. A formal categorization is needed to describe the different patterns of Malay social media text and so allow suitable methods to be selected for handling each of them. In this paper, we propose a non-exhaustive formal categorization of Malay social media text based on its inherent nature. We refer to this text as Social Media Malay Language (SMML) to differentiate it from standard Malay. The categories are spelling variations, Malay-English mixed sentences, loan words/phrases, slang-based words, and vowel-less words. We also normalized two of the SMML categories, spelling variations and vowel-less words, using two similarity-matching techniques: n-gram Tversky Index and Levenshtein. Our results show that similarity-matching techniques can detect both categories, but a more sophisticated technique is needed to improve the precision score. Normalizing the remaining categories will require extensive further research.
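To make the two similarity-matching techniques concrete, the following is a minimal sketch rather than the authors' implementation. It assumes character bigrams for the n-gram Tversky Index, equal weights alpha = beta = 0.5, a simple length-normalized Levenshtein similarity, and a tiny hypothetical lexicon of standard Malay words; none of these choices are stated in the abstract.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word (assumed: bigrams)."""
    if len(word) < n:
        return {word} if word else set()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def tversky_index(a, b, alpha=0.5, beta=0.5, n=2):
    """Tversky Index over character n-gram sets.

    S(X, Y) = |X & Y| / (|X & Y| + alpha * |X - Y| + beta * |Y - X|)
    The weights alpha and beta are illustrative assumptions.
    """
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    common = len(x & y)
    denom = common + alpha * len(x - y) + beta * len(y - x)
    if denom == 0:
        return 1.0  # both n-gram sets are empty
    return common / denom


def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Map edit distance to [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


def best_match(token, lexicon, similarity=levenshtein_similarity):
    """Return the (word, score) pair from the lexicon most similar to token."""
    return max(((word, similarity(token, word)) for word in lexicon),
               key=lambda pair: pair[1])


# Hypothetical lexicon of standard Malay words, for illustration only.
LEXICON = ["makan", "minum", "tidur"]
candidate, score = best_match("mkn", LEXICON)
```

In this sketch the vowel-less token "mkn" is matched to "makan" by edit distance, since restoring the two missing vowels costs only two insertions. A character-bigram Tversky score for the same pair is zero because the two words share no bigram, which illustrates why the choice of technique and its parameters matters for each SMML category.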


Keywords: text analytics; social media; data pre-processing; normalization; Malay language.








Published by INSIGHT - Indonesian Society for Knowledge and Human Development