Translated vs Non-Translated Method for Multilingual Hate Speech Identification in Twitter

Muhammad Okky Ibrohim; Indra Budi

doi:10.18517/ijaseit.9.4.8123

Abstract

Social media is often misused to spread hate speech. Hate speech needs to be handled with particular care because it can undermine or discriminate against other people and cause conflict that leads to both material and immaterial losses. Building a hate speech identification system poses several challenges; one of them is identifying hate speech in a multilingual setting. In this paper, we adapt and compare two methods for multilingual text classification, the translated method (with and without language identification) and the non-translated method, for multilingual hate speech identification (covering Hindi, English, and Indonesian) using a machine learning approach. We use several classification algorithms (classifiers), namely Support Vector Machine (SVM), Naive Bayes (NB), and Random Forest Decision Tree (RFDT), with word n-grams and character n-grams (char n-grams) as features. Our experimental results show that the non-translated method gives the best result. However, its use needs to be weighed carefully because it requires more cost for data collection and annotation. Meanwhile, the translated method without language identification gives a poor result. To address this problem, we combine the translated method with monolingual hate speech identification, and the experimental results show that this approach improves multilingual hate speech identification performance compared to the translated method without language identification. This paper discusses the advantages and disadvantages of all methods and the future work needed to enhance performance in multilingual hate speech identification.
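The word n-gram and char n-gram features named in the abstract can be illustrated with a minimal sketch. The abstract does not state the tokenization, the n-gram sizes, or the weighting used in the paper, so the whitespace tokenizer, the default sizes, and the function names below are assumptions chosen only for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return contiguous word n-grams from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return contiguous character n-grams (spaces included) from lowercased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_counts(text, word_n=(1, 2), char_n=(3,)):
    """Count word and char n-grams in one bag-of-features Counter.

    Keys are ("w", gram) or ("c", gram) so that a word n-gram and a
    char n-gram with the same surface string stay distinct.
    The default sizes are illustrative assumptions, not the paper's settings.
    """
    feats = Counter()
    for n in word_n:
        feats.update(("w", g) for g in word_ngrams(text, n))
    for n in char_n:
        feats.update(("c", g) for g in char_ngrams(text, n))
    return feats


# Example: features for a single short message.
features = ngram_counts("a b a")
```

Counts like these would typically be turned into a fixed-length vector and passed to a classifier such as SVM, NB, or a random forest. Char n-grams need no language-specific tokenizer, which makes them a common choice when a single model sees several languages.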

Keywords

social media; multilingual hate speech identification; machine learning.

DOI: http://dx.doi.org/10.18517/ijaseit.9.4.8123

Published by INSIGHT - Indonesian Society for Knowledge and Human Development

International Journal on Advanced Science, Engineering and Information Technology
