Abstract: The continuous and mounting flow of data generated by social media platforms every day raises questions about the validity of such data and the extent to which it can be processed by machines. One of the most challenging problems with social media data is the inconsistency of its texts, which is influenced by many factors. The result is text mingled with ill-formed data that is problematic for machines and for natural language processing tasks. The primary aim of this paper is to identify and discuss the different errors found in San’ani Arabic social media text and, accordingly, to develop a text normalizer for correcting and standardizing such errors. The identification process is performed manually on a corpus of 158,279 tokens and 20,000 types extracted from the Facebook and Telegram platforms. As a result, 64,040 tokens and 1,741 types with errors are identified. These errors are classified into two broad categories (tokenization and typographical) and about 15 sub-types based on their frequency and typology. A further classification is made based on the regularities of these errors. A rule-based and dictionary-lookup normalizer is developed in Python with a Django API, achieving 99% accuracy on tokens and 98% on types.
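The combination of rule-based correction and dictionary lookup described in the abstract can be illustrated with a minimal Python sketch. The specific rules and dictionary entries below are hypothetical examples of common Arabic social media normalizations (alef-variant unification, character-elongation collapse, a tokenization fix), not the paper's actual rule set or lexicon:

```python
import re

# Hypothetical rule cascade: each rule is a compiled pattern and a replacement.
# These are illustrative; the paper's own rules are derived from its error typology.
RULES = [
    (re.compile("[\u0622\u0623\u0625]"), "\u0627"),   # unify alef variants to bare alef
    (re.compile(r"(.)\1{2,}"), r"\1"),                # collapse 3+ repeated chars (elongation)
]

# Hypothetical dictionary-lookup entries mapping ill-formed tokens
# to their standardized forms (e.g., a tokenization error).
LEXICON = {
    "انشاءالله": "ان شاء الله",
}

def normalize(token: str) -> str:
    # Dictionary lookup takes priority; otherwise apply the rule cascade in order.
    if token in LEXICON:
        return LEXICON[token]
    for pattern, replacement in RULES:
        token = pattern.sub(replacement, token)
    return token
```

For example, `normalize("جمييييل")` collapses the elongated ي, while a token found in the lexicon is replaced wholesale. A real system would order rules carefully, since an early rule can feed or block a later one.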
Keywords: Social Media Texts, San’ani Arabic, Normalizer, Error Types, Error Identification.
Title: Developing a Normalizer for San’ani Arabic Social Media Texts
Author: Mohammed Sharaf Addin
International Journal of Interdisciplinary Research and Innovations
ISSN 2348-1218 (print), ISSN 2348-1226 (online)
Research Publish Journals