Improving Stemming Algorithm Using Morphological Rules

Stemming, the removal of affixes from words, has applications in text search, machine translation, document summarization, and text classification. For example, Indonesian stemming reduces the words “kebaikan”, “perbaikan”, “memperbaiki” and “sebaik-baiknya” to their common morphological root “baik”. In text search, this permits a search for “player” to find documents containing all words with the stem “play”. In the Indonesian language, stemming is of crucial importance: words have prefixes, suffixes, infixes, and confixes that make it difficult to match related words. This research proposes a stemmer that gives more accurate results by employing an algorithm that produces more than one candidate word and more than one affix combination. The new algorithm is called the CAT stemming algorithm. Here, the results do not depend on the order of the morphological rules. All rules are checked, and the resulting words are kept in a candidate list. To make the stemmer efficient, two kinds of word lists (vocabularies) are used: a list of words that have more than one candidate and a list of root words used as a candidate reference. The final result is selected with several rules. This strategy proved to give better results than the two best-known Indonesian stemmers: in the experiments, the proposed approach achieved higher accuracy than the systems it was compared with.


I. INTRODUCTION
Stemming is a core natural language processing technique for efficient and effective Information Retrieval [1], and one that is widely accepted by users. It is used to transform word variants to their common root word by applying, in most cases, morphological rules [2]. For example, in text searching, it should permit a user searching with the query term "stemming" to find documents that contain the terms "stemmer" and "stems", because all share the common root word "stem". It also has applications in machine translation [4], document summarization [9], and text classification [8].
For English, stemming is well understood, with techniques such as those of Lovins and Porter [10] in widespread use. However, stemming for other languages is less well understood, although several approaches are available for languages such as French [11], Malaysian [7], and Indonesian [5].
Several techniques have been proposed for stemming Indonesian. We evaluate these techniques through a user study, where we compare the performance of each scheme to the results of manual stemming by four native speakers. Our results show that an existing technique, proposed by Nazief and Adriani (1996) in an unpublished technical report, correctly stems around 93% of all word occurrences (or 92% of unique words). After classifying the failure cases, and adding our own rules to address these limitations, we show this can be improved to 95% for both unique words and all word occurrences. We believe that adding a complete dictionary of root words would improve these results even further. We conclude that our modified Nazief and Adriani stemmer should be used in practice for stemming Indonesian.
All previous work solved morphological problems in a way that allowed only one rule path to be executed for a single input. The rules checked particles, suffixes, and prefixes in a fixed order [6]. This is not suitable for morphologically ambiguous words such as "mereka", "menggulai" or "kemeja", for which there can be more than one correct result, depending on the context. This paper proposes an algorithm that can return more than one candidate for such ambiguous words.
The rule order of prefixes and suffixes causes another kind of stemming error [12]. The error occurs when the input word ends in letters that are lexically identical to a suffix, particle, or possessive pronoun. For example, the word "penemu" is stemmed into "pene" + "mu" because the possessive pronoun rule is executed before the prefix rule. Although Adriani [5] tried to fix the problem by adding several exception rules that check the prefix before the suffix, this approach still cannot handle exceptions that are not listed in the exception rules. In Adriani [5], the exception rules are "ber" + word + "lah" (such as the word "bersekolah"), "ber" + word + "an" (such as the word "berlainan"), "di"-"i", "ter"-"i", "me"-"i" and "pe"-"i". These rules still cannot handle other patterns, such as the words "penemu" and "penanya". The algorithm proposed in this paper generates all candidates and then selects the best one.
Another weakness of Adriani [5] is that no dictionary of stemmed words is involved, which causes inaccurate results: for example, "perbaikan" is stemmed into the root word "bai", with "per" as the prefix and "kan" as the suffix. This is because the "kan" rule is positioned before the "an" rule. One could argue that the solution is to change the rule order so that the "an" rule comes before the "kan" rule, but this would cause similar problems for words with the "kan" suffix, such as "menarikan", which would then be morphologically analysed into "me", "tarik" and "an" [12].
Meanwhile, in Adriani [5], a complete root word dictionary is used to handle this problem. For the word "perbaikan", because the root word "bai" is not in the dictionary, the system will not choose "per"-"bai"-"kan". Instead, the stemmer will choose "per"-"baik"-"an", because the root word "baik" is in the dictionary.
Although the Adriani [5] stemmer gives better results, it is time-consuming because, for each rule, the resulting word is looked up in the root word dictionary. As an alternative, this paper proposes using a restricted list of root words, called a vocabulary, whose size is smaller than the complete dictionary used by the Adriani [5] stemmer. With this vocabulary, the proposed stemmer achieves accuracy comparable to that of the Adriani [5] stemmer with less complexity.
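The idea of keeping every rule outcome in a candidate list and then filtering it against a small vocabulary can be sketched as follows. This is a minimal illustration under stated assumptions, not the CAT implementation: the affix lists, the vocabulary contents, and the function names `candidates` and `stem` are hypothetical, and recoding is not handled.

```python
# Hypothetical affix lists and vocabulary, chosen only for this illustration.
PREFIXES = ["", "per", "pe", "ke"]
SUFFIXES = ["", "kan", "an"]
VOCABULARY = {"baik", "meja", "kemeja"}


def candidates(word):
    """Try every prefix/suffix pair instead of stopping at the first rule that fires."""
    found = []
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        for suffix in SUFFIXES:
            if suffix and not word.endswith(suffix):
                continue
            end = len(word) - len(suffix)
            stem_part = word[len(prefix):end]
            if stem_part:  # skip splits that leave nothing between the affixes
                found.append((prefix, stem_part, suffix))
    return found


def stem(word):
    """Return every candidate stem that appears in the vocabulary, sorted."""
    return sorted({s for _, s, _ in candidates(word) if s in VOCABULARY})


print(stem("perbaikan"))  # the "per"-"bai"-"kan" split is generated but filtered out
print(stem("kemeja"))     # an ambiguous word keeps more than one candidate
```

Because all splits are generated first, the outcome does not depend on whether the "kan" rule or the "an" rule is listed first; only the vocabulary filter decides which candidates survive.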

A. Nazief Adriani Algorithm
The stemming scheme of Nazief and Adriani is described in an unpublished technical report from the University of Indonesia (1996). In this section, we describe the steps of the algorithm, and illustrate each with examples; however, for compactness, we omit the detail of selected rule tables. We refer to this approach as nazief.
The algorithm is based on comprehensive morphological rules that group together and encapsulate allowed and disallowed affixes, including prefixes, suffixes, infixes (insertions) and confixes (combinations of prefixes and suffixes). The algorithm also supports recoding, an approach to restore an initial letter that was removed from the root word prior to prepending a prefix. In addition, the algorithm makes use of an auxiliary dictionary of root words that is used in most steps to check whether the stemming has arrived at a root word.
Before considering how the scheme works, we consider the basic groupings of affixes used as a basis for the approach, and how these definitions are combined to form a framework to implement the rules. The scheme groups affixes into the following categories:
1) Inflectional Suffixes: the set of suffixes that do not alter the root word. For example, "duduk" (sit) may be suffixed with "-lah" to give "duduklah" (please sit). The inflections are further divided into:
• Particles (P): including "-lah" and "-kah", as used in words such as "duduklah" (please sit).
• Possessive pronouns (PP): including "-ku", "-mu", and "-nya", as used in "ibunya" (a third person possessive form of "mother").
Particle and possessive pronoun inflections can appear together and, if they do, possessive pronouns appear before particles. A word can have at most one particle and one possessive pronoun, and these may be applied directly to root words or to words that have a derivational suffix. For example, "makan" (to eat) may be appended with the derivational suffix "-an" to give "makanan" (food). This can be suffixed with "-nya" to give "makanannya" (a possessive form of "food").
2) Derivational Suffixes: the set of suffixes that are directly applied to root words. There can be only one derivational suffix per word. For example, the word "lapor" (to report) can be suffixed by the derivational suffix "-kan" to become "laporkan" (go to report). In turn, this can be suffixed with, for example, the inflectional suffix "-lah" to become "laporkanlah" (please go to report).
3) Derivational Prefixes: the set of prefixes that are applied either directly to root words, or to words that have up to two other derivational prefixes. For example, the derivational prefixes "mem-" and "per-" may be prepended to "indahkannya" to give "memperindahkannya" (the act of beautifying).
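The inflectional suffix rules above (at most one particle and one possessive pronoun, with the possessive pronoun placed before the particle) can be sketched in Python. This is a minimal illustration, not the nazief implementation: the function names are hypothetical, and no dictionary lookup is performed.

```python
PARTICLES = ("lah", "kah")
POSSESSIVE_PRONOUNS = ("ku", "mu", "nya")


def strip_suffix(word, suffixes):
    """Remove the first matching suffix, keeping at least two characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)], suffix
    return word, ""


def strip_inflections(word):
    """Strip an optional particle, then an optional possessive pronoun.

    The particle is the outermost suffix, so it is removed first.
    """
    word, particle = strip_suffix(word, PARTICLES)
    word, pronoun = strip_suffix(word, POSSESSIVE_PRONOUNS)
    return word, pronoun, particle


for example in ("duduklah", "ibunya", "makanannya", "laporkanlah"):
    print(example, strip_inflections(example))
```

Without a dictionary check, this sketch would also cut a root such as "masalah" down to "masa", which is why the full algorithm consults the dictionary after each step.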
The classification of affixes as inflections and derivations leads to an order of use:
[DP + [DP + [DP +]]] root word [[+ DS][+ PP][+ P]]
Here, DP is a derivational prefix, DS a derivational suffix, PP a possessive pronoun, and P a particle. The square brackets indicate that an affix is optional. The previous definition forms the basis of the rules used in the approach. However, there are exceptions and limitations that are incorporated in the rules:
1) Not All Combinations are Possible: after a word is prefixed with "di-", the word is not allowed to be suffixed with "-an". A complete list is shown in Table 1.
2) The Same Affix Cannot be Repeatedly Applied: after a word is prefixed with "te-" or one of its variations, it is not possible to repeat the prefix "te-" or any of those variations.
3) If a Word Has One or Two Characters: then stemming is not attempted.

4) Adding a Prefix May Change the Root Word or a Previously Applied Prefix: we discuss this further in our description of the rules. To illustrate, consider "meng-", which has the variations "mem-", "meng-", "meny-", and "men-". Some of these may change the initial letter of the root word: for example, for the root word "sapu" (broom), the variation applied is "meny-", producing the word "menyapu" (to sweep), in which the "s" is removed.
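To make the "menyapu" example concrete, the following sketch strips a "meng-" variant and, if the remainder is not a known root, restores the letter that the prefix may have replaced. It is an illustration only: the letter pairs are the standard Indonesian nasal-assimilation pairs and are listed here as an assumption rather than taken from the rule tables of the algorithm, and `ROOTS` and `strip_meng` are hypothetical names.

```python
# Assumed letter restored for each "meng-" variant (standard nasal assimilation).
RECODING = {"meny": "s", "meng": "k", "mem": "p", "men": "t"}

# Hypothetical toy dictionary of root words.
ROOTS = {"sapu", "tulis", "pukul", "kirim", "ambil"}


def strip_meng(word):
    """Remove a "meng-" variant, recoding the first letter when needed."""
    # Longer variants are tried first so that "meny"/"meng" win over "men".
    for variant in sorted(RECODING, key=len, reverse=True):
        if not word.startswith(variant):
            continue
        rest = word[len(variant):]
        if rest in ROOTS:
            return rest  # no letter was dropped, e.g. "mengambil"
        recoded = RECODING[variant] + rest
        if recoded in ROOTS:
            return recoded  # the dropped letter is restored, e.g. "menyapu"
    return word  # nothing matched, so the word is left unchanged


for example in ("menyapu", "menulis", "memukul", "mengirim", "mengambil"):
    print(example, "->", strip_meng(example))
```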
The latter complication requires that an effective Indonesian stemming algorithm be able to restore deleted letters through the recoding process. The algorithm itself employs three components: the affix groupings, the order of using rules (and their exceptions), and a dictionary. The dictionary is checked after any stemming rule succeeds: if the resultant word is found in the dictionary, then stemming has succeeded in finding a root word, the algorithm returns the dictionary word, and then stops; we omit this lookup from each step in our rule listing. In addition, each step checks if the resultant word is less than two characters in length and, if so, no further stemming is attempted.
For each word to be stemmed, the following steps are followed:
1) The unstemmed word is searched for in the dictionary. If it is found, it is assumed that the word is a root word, and so the word is returned, and the algorithm stops.
2) Inflectional suffix removal is attempted: particles ("-lah", "-kah") and possessive pronouns ("-ku", "-mu", "-nya") are removed.
3) Derivational suffix ("-i" or "-an") removal is attempted. If this succeeds, Step 4 is attempted. If Step 4 does not succeed:
a) If "-an" was removed, and the final letter of the word is "-k", then the "-k" is also removed and Step 4 is reattempted. If that fails, Step 3b is performed.
b) The removed suffix ("-i", "-an", or "-kan") is restored.
Table 2: Determining the prefix type for words prefixed with "te-". If the prefix "te-" does not match one of the rules in the table, then "none" is returned. Similar rules are used for "be-", "me-", and "pe-".

4) Derivational Prefix Removal is Attempted: this has several sub-steps:
• If a suffix was removed in Step 3, then disallowed prefix-suffix combinations are checked using the list in Table 1. If a match is found, then the algorithm returns.
• If the current prefix matches any previous prefix, then the algorithm returns.
• If three prefixes have previously been removed, the algorithm returns.
• The prefix type is determined by one of the following steps:
ο If the prefix of the word is "di-", "ke-", or "se-", then the prefix type is "di", "ke", or "se" respectively.
ο If the prefix is "te-", "be-", "me-", or "pe-", then an additional process of extracting character sets to determine the prefix type is required. As an example, the rules for the prefix "te-" are shown in Table 2. Suppose the word being stemmed is "terlambat" (late). After removing "te-" to give "-rlambat", the first set of characters is extracted according to the "Set 1" rules. In this case, the letter following the prefix "te-" is "r", and this matches the first five rows of the table. Following "-r-" is "l-" (Set 2), and so the third to fifth rows match. Following "-l-" is "-ambat", eliminating the third and fourth rows for Set 3 and determining that the prefix type is "ter-", as shown in the rightmost column.
ο If the first two characters do not match "di-", "ke-", "se-", "te-", "be-", "me-", or "pe-", then the algorithm returns.
• If the prefix type is "none", then the algorithm returns.
• If the prefix type is not "none", then the prefix type is found in Table 3, the prefix to be removed is found, and the prefix is removed from the word; for compactness, Table 3 shows only the simple cases and those matching Table 2.
• If the root word has not been found, Step 4 is recursively attempted for further prefix removal. If a root word is found, the algorithm returns.
Only simple entries and those for the "te-" prefix type are shown in Tables 2 and 3. In the recoding case for this prefix type, after removing "ter-", an "r-" is prepended to the word. If this new word is not in the dictionary, Step 4 is repeated for the new word. If a root word is not found, then "r-" is removed and "ter-" restored, the prefix type is set to "none", and the algorithm returns.
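The interplay between Step 3 and Step 4 for the "-an" and "-kan" suffixes can be sketched as below. This is a simplified illustration under stated assumptions: `remove_prefix` is a toy stand-in for Step 4 that strips one plain prefix without prefix-type tables or recoding, and the dictionary contents and function names are hypothetical.

```python
ROOTS = {"baik", "lapor", "makan"}  # hypothetical toy dictionary
PREFIXES = ("per", "di")            # plain prefixes only, for illustration


def remove_prefix(word):
    """Toy stand-in for Step 4: return a root if at most one prefix strip reaches it."""
    if word in ROOTS:
        return word
    for prefix in PREFIXES:
        if word.startswith(prefix) and word[len(prefix):] in ROOTS:
            return word[len(prefix):]
    return None


def remove_derivational_suffix(word):
    """Step 3: try "-i" and "-an", fall back to "-kan", otherwise restore the suffix."""
    for suffix in ("i", "an"):
        if not word.endswith(suffix):
            continue
        stem = word[:-len(suffix)]
        root = remove_prefix(stem)
        if root is None and suffix == "an" and stem.endswith("k"):
            # Step 3a: the suffix may have been "-kan", so drop the "k" and retry.
            root = remove_prefix(stem[:-1])
        if root is not None:
            return root
    # Step 3b: no root was reached, so the word keeps its suffix.
    return word


for example in ("perbaikan", "laporkan", "dilaporkan", "makanan", "lari"):
    print(example, "->", remove_derivational_suffix(example))
```

Trying "-an" before "-kan" and consulting the dictionary at each attempt is what lets "perbaikan" resolve to "baik" while "laporkan" still resolves to "lapor".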

B. Improving Nazief Algorithm
Table 4 shows affix removal based on the morphological rules of Indonesian. In this section, we discuss the reasons why the nazief scheme works well, and which aspects can be improved. We present a detailed analysis of the failure cases, and propose solutions to these problems. We then present the results, including the improvements, and describe our modified nazief approach.
The performance of the nazief approach is perhaps unsurprising: it is by far the most complex approach, being based closely on the detailed morphological rules of the Indonesian language. In addition, it supports dictionary lookup and progressive stemming, allowing it to evaluate each step to test whether a root word has been found and to recover from errors by restoring affixes to attempt different combinations. However, despite these features, the algorithm can still be improved.
In summary, three opportunities exist to improve stemming with nazief. First, a more complete and accurate root word dictionary may reduce errors. Second, features can be added to support stemming of hyphenated words. Last, new rules and adjustments to rule precedence may reduce overstemming and understemming, as well as support affixes not currently catered for in the algorithm. We discuss the improvements we propose next.
To address the limitations of the nazief scheme, we propose the following improvements:
1) Using a More Complete Dictionary: we have experimented with two other dictionaries, and present our results later.
2) Adding Rules to Deal With Plurals: when plurals such as "bola-bola" (balls) are encountered, we propose stemming these to "bola" (ball). However, care must be taken with other hyphenated words such as "bolak-balik" (to and fro), "berbalas-balasan" (mutual action or interaction) and "seolah-olah" (as though). For these latter examples, we propose stemming the words preceding and following the hyphen separately and then, if the words have the same root word, returning the singular form. For example, in the case of "berbalas-balasan", both "berbalas" and "balasan" stem to "balas" (response or answer), and this is returned. In contrast, the words "bolak" and "balik" do not have the same stem, and so "bolak-balik" is returned as the stem; in this case, this is the correct action, and this works for many hyphenated non-plurals.

3) Adding Prefixes and Suffixes, and Additional Rules:
• Adding the particle (inflectional suffix) "-pun". This is used in words such as "siapapun", where the root word is "siapa" (who).
• For the prefix type "ter", we have modified the conditions so that row 4 in Table 2 sets the type to "ter" instead of "none". This supports cases such as "terpercaya" (the most trusted), which has the root word "percaya" (believe).
• For the prefix type "pe", we have modified the conditions (similar to those listed in Table 2) so that words such as "pekerja" (worker) and "peserta" (member) have prefix type "pe", instead of the erroneous "none".
• For the prefix type "mem", we have modified the conditions so that words beginning with the prefix "memp-" are of type "mem".
• For the prefix type "meng", we have modified the conditions so that words beginning with the prefix "mengk-" are of type "meng".

4) Adjusting Rule Precedence:
• If a word is prefixed with "ber-" and suffixed with the inflectional suffix "-lah", try to remove the prefix before the suffix. This addresses problems with words such as "bermasalah" (having a problem), where the root word is "masalah" (problem), and "bersekolah" (be at school), where the root word is "sekolah" (school).
• If a word is prefixed with "ber-" and suffixed with the derivational suffix "-an", try to remove the prefix before the suffix. This solves problems with, for example, "berbadan" (having the body of), where the root word is "badan" (body).
• If a word is prefixed with "men-" and suffixed with the derivational suffix "-i", try to remove the prefix before the suffix. This solves problems with, for example, "menilai" (to mark), where the root word is "nilai" (mark).
• If a word is prefixed with "di-" and suffixed with the derivational suffix "-i", try to remove the prefix before the suffix. This solves problems with, for example, "dimulai" (to be started), where the root word is "mulai" (start).
• If a word is prefixed with "pe-" and suffixed with the derivational suffix "-i", try to remove the prefix before the suffix. This solves problems with, for example, "petani" (farmer), where the root word is "tani" (farm).
• If a word is prefixed with "ter-" and suffixed with the derivational suffix "-i", try to remove the prefix before the suffix. This solves problems with, for example, "terkendali" (can be controlled), where the root word is "kendali" (control).
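The proposed handling of hyphenated words can be sketched as follows: both halves are stemmed separately, and the shared root is returned only when the two stems agree. This is a minimal illustration; `stem_word` is a toy stand-in for the full stemmer, and the dictionary contents and function names are hypothetical.

```python
ROOTS = {"bola", "balas"}  # hypothetical toy dictionary


def stem_word(word):
    """Toy stand-in for the full stemmer: strips "ber-" or "-an" if a root results."""
    if word in ROOTS:
        return word
    if word.startswith("ber") and word[3:] in ROOTS:
        return word[3:]
    if word.endswith("an") and word[:-2] in ROOTS:
        return word[:-2]
    return word


def stem_hyphenated(word):
    """Stem both halves of a hyphenated word; keep the word if the stems differ."""
    if word.count("-") != 1:
        return stem_word(word)
    left, right = word.split("-")
    left_stem, right_stem = stem_word(left), stem_word(right)
    if left_stem == right_stem:
        return left_stem  # plural or reduplicated form, e.g. "bola-bola"
    return word           # non-plural compound, e.g. "bolak-balik"


for example in ("bola-bola", "berbalas-balasan", "bolak-balik"):
    print(example, "->", stem_hyphenated(example))
```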

III. RESULTS AND DISCUSSION
In this section, we compare the results before and after stemming with the Nazief and Adriani algorithm and with CAT. The CAT approach provides an easy way of stemming Indonesian through its flexible affix classification, so additional affixes can be added easily. The test data are student journal articles from the Department of Information Technology, Faculty of Communication and Information Technology, Semarang University. Ten articles were used for testing.
Based on the test results (see Table 6), there is a clear increase in the document commonality measurements, the size of which depends on how similar each tested document is to one of the documents in the database. On the data tested, the average increase is 14% when stemming is done with the CAT algorithm.
Stemming is an important information retrieval technique. In this paper, we have investigated Indonesian stemming and presented an experimental evaluation of Indonesian stemmers. The results show that a successful stemmer is complex, and requires the careful combination of several features: support for complex morphological rules, progressive stemming of words, a dictionary check after each step, trial-and-error combinations of affixes, and recoding support after prefix removal. Our results show that the new stemmer is the most effective scheme: the measured increase is about 14.741% when the new stemmer is used.
We intend to continue this work. We will improve the dictionaries by curating them to remove non-root words and add root words. We also plan to extend the nazief stemmer further to deal with cases where the root word is ambiguous.