Automatic Rule Generator via FP-Growth for Eye Diseases Diagnosis

Abstract: The conventional approach to developing a rule-based expert system relies on a tedious, lengthy, and costly knowledge acquisition process, which is widely regarded as the bottleneck of expert system development. Manual knowledge acquisition can also introduce errors into decision-making and make the resulting expert system ineffective. A further difficulty for knowledge engineers is handling conflicts of interest and the high inter- and intrapersonal variance in decisions among domain experts during the knowledge elicitation stage. The aim of this research is to improve knowledge acquisition using a data mining technique. This paper investigates the effectiveness of an association rule mining technique in generating new rules for an expert system. FP-Growth was used to acquire rules from the eye disease diagnosis records collected from Sumatera Eye Center (SMEC) Hospital in Pekanbaru, Riau, Indonesia. The developed system was tested with 17 cases, and ophthalmologists inspected the results of the automatic rule generator for eye disease diagnosis. We found that introducing FP-Growth association rules into the eye disease knowledge-based system produced acceptable and promising diagnostic results, with an average accuracy of approximately 88%. Based on the test results, we conclude that Conjunctivitis and Presbyopia are the most dominant eye diseases in Indonesia. In conclusion, FP-Growth association rules show strong potential as an automatic rule generator, although there is still plenty of room for improvement in the context of eye disease diagnosis.


I. INTRODUCTION
Rule-based expert systems have been under continuous development, as reported in [1]-[3]. The strength of a rule-based expert system is that it can be developed quickly. However, one study identified knowledge acquisition as a "barrier in the process of development of expert systems" [4]. Mistakes made while eliciting knowledge from various sources may eventually lead to inaccurate decision-making and render the expert system ineffective. In addition, time constraints and ambiguous knowledge sources may introduce rule errors that degrade the performance of the expert system. Mistakes may also arise from conflicts of interest and from insufficient time spent interviewing the domain expert [2].
The two most important people in the knowledge acquisition process are the programmer and the domain expert. With the proper skill set, a programmer can obtain extensive information and knowledge from experts during interview sessions. The programmer should also be well trained in determining the quality and effectiveness of an expert system [5]. Even though a standard framework for knowledge acquisition, such as the guided interview, has been established [6], it relies solely on the manual transfer of knowledge from the expert to the programmer. Hence, the success of knowledge elicitation depends entirely on the programmer's ability to obtain information.
Furthermore, personal issues and conflicts of interest are common problems in rule-based expert system development. Poor communication and insufficient preparation between programmers and experts may result in inadequate and vague knowledge during the knowledge acquisition phase. Domain-specific terminology, e.g., medical terms, can be an obstacle to understanding and noting significant knowledge as it comes up. Apart from confusing terms, personality issues such as misunderstandings and low self-esteem may create uneasy moments between programmers and experts, which can severely hinder the knowledge acquisition phase [4]. In some cases, experts are not familiar with information technology and refuse to take part in expert system development efforts. Sometimes they reject the use of an automated system to support either easy or complicated problem-solving. Moreover, certain experts think that the expert system can weaken their role and make their work no longer useful, so knowledge acquisition becomes impossible to conduct [2].
Because of the issues mentioned above, the development of a rule-based expert system faces a higher risk of inadvertently creating 'noise rules'. Therefore, this study aims to replace the traditional knowledge acquisition method with automatic knowledge acquisition based on the FP-Growth method, using historical medical records.
Kurniawan built a case-based reasoning expert system for diagnosing eye diseases using the Naïve Bayes algorithm, which was capable of handling 140 cases [7]. Expert systems based on case-based reasoning are claimed to be faster in the knowledge acquisition process. Developing the knowledge model of a rule-based expert system is the easiest approach, but it may be tiresome and costly because experts tend to convey valuable knowledge and information to the knowledge engineer inconsistently and piece by piece [8]. Indeed, knowledge acquisition is becoming an increasingly challenging part of rule-based expert system development.
Millette addressed the knowledge acquisition problem described above in rule-based expert system development by using data mining and data warehouse techniques. Millette obtained the rule base essentially by applying the association rule technique to a set of data. Accordingly, association rule techniques have become more popular as a way to overcome the weaknesses of rule-based expert systems [3], [9].
Association rules have also been used to generate rules directly [10], [11]. Karabatak combined association rule mining, namely the Apriori algorithm, with supervised learning, namely Neural Networks, to produce rules automatically for an expert system in the field of breast cancer diagnosis. They obtained better results via association rules by reducing the dimensions of the rules. The hybrid method showed that the breast cancer expert system was more effective than one applying Neural Networks alone. Another study by Weng, Liu and Wu used data mining techniques and a Bayesian Network to make the best product recommendations [12]. Similarly, the study of Ikram used association rules in an expert system for earthquake prediction [13].
Association rule mining is a data mining technique for finding associative rules between combinations of items in a large database. It is an alternative technique when the results of a statistical approach are unsatisfactory on real cases [14]. An association rule is defined as an implication of the form A → B, where A and B are items. The rule A → B can be read as: if item A is present, then item B is also present. Two parameters, support and confidence, govern the formation of an association rule. Support measures how often a combination of items occurs in the database, while confidence is the probability that an item is purchased given that the others are purchased.
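The two measures can be illustrated with a minimal Python sketch. The basket contents and the function names `support` and `confidence` are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and B) / support(A)."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0


# Hypothetical market-basket data (not from the paper).
transactions = [
    {"coffee", "sugar"},
    {"coffee", "sugar", "milk"},
    {"coffee"},
    {"milk"},
]

coffee_sugar_support = support(transactions, {"coffee", "sugar"})
coffee_to_sugar_conf = confidence(transactions, {"coffee"}, {"sugar"})
```

Here two of the four baskets contain both coffee and sugar, and three contain coffee, so the rule coffee → sugar has a support of 2/4 and a confidence of 2/3.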
Association rule mining was first used to analyze supermarket data [15] in order to identify potential customers. For example, who buys coffee and sugar together? With this knowledge, supermarket owners can place correlated products near each other. Supermarket owners can also determine the best way to market their products or give discount coupons to attract customers. Data mining has shown good potential for generating rules in a knowledge base. However, the effectiveness of the FP-Growth algorithm as an automatic rule generator has not yet been explored.
Based on the literature review, many researchers [2], [3], [9]-[11], [13] have used data mining techniques to obtain essential or significant rules. However, most of them used basic association rule techniques [2], [3], [9], [11], [13] or the Apriori algorithm [10]. Apriori scans the dataset repeatedly to generate candidate frequent item sets [16]. To overcome the weaknesses of previous automatic rule generators, we continue to explore a better design for fully automatic knowledge acquisition by using better-performing algorithms. Therefore, this study seeks a better technique for generating rules.
In this research, we employ FP-Growth, an improvement on the Apriori algorithm. Notably, the FP-Growth algorithm addresses the shortcomings of the Apriori algorithm. The two algorithms differ in how they search for frequent item sets [17]. FP-Growth is currently one of the fastest association rule algorithms [18]. Frequent Pattern Growth (FP-Growth) is an alternative method for determining the most frequent item sets in a data set. Its main feature is the FP-Tree, which it uses as its data structure. The rest of this paper is organized as follows. Section II gives an overview of the materials and methods of this study. Section III presents the implementation of the FP-Growth algorithm, the experimental results, and the discussion. We present the conclusions in the final section.

II. MATERIALS AND METHOD
We obtained eye disease data from the medical records of Sumatera Eye Center (SMEC) Hospital, Pekanbaru. The medical records were collected only for the year 2016 and cover 1600 patients with six attributes (Medical Record No., Date, Gender, Age, Symptoms, and Diagnosis). This study combines expert review (opinion) with extensive experiments, in accordance with standard research methodology [19]. The opinions and assessments of medical experts were needed as part of the research strategy and were gathered through expert interviews, the Delphi method, and brainstorming. Our experiments were conducted through simulation and analysis of the experimental results. Analysis is the stage of understanding a problem before taking action and making decisions. System testing and validation were based on the support and confidence of the rules and on an expert assessment of the rules.
There are four steps in designing experiments with the FP-Growth algorithm. The first step is selection: we perform attribute selection by removing unnecessary attributes. The second step is data pre-processing, in which we delete duplicate and incomplete eye disease records. The third step is data transformation, in which we convert and store the eye disease data in a specialized format, i.e., the attribute-relation file format (ARFF). At this stage, we initialize the attributes and modify the data types, such as converting numeric data types into binominal. The last step is the experiment itself, which derives rules using the FP-Growth algorithm.
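The binominal conversion in the third step can be pictured as turning each record into one true/false flag per item code. The following Python sketch assumes that each record has already been reduced to a set of item codes; the function name `to_binominal` and the two sample records are hypothetical, and writing the result to an ARFF file is not shown.

```python
def to_binominal(transactions):
    """Return (columns, rows): one True/False flag per distinct item code."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[item in t for item in columns] for t in transactions]
    return columns, rows


# Hypothetical records coded as age (U*), gender (L or P),
# symptom (G*) and disease (P<number>) items.
transactions = [
    {"U5", "L", "G1", "P4"},
    {"U5", "P", "G2", "P3"},
]

columns, rows = to_binominal(transactions)
```

Each row then has the same length as `columns`, which is the shape that frequent item set miners working on binominal attributes typically expect.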
Figure 1 illustrates the cycle of the knowledge acquisition process. It starts with generating rules using the FP-Growth association rule algorithm and ends with comparing the experts' decisions against our validated eye disease rules. We carry out the analysis procedure for eye disease rule validation as follows. The complete attributes of the eye disease data cover age, gender, symptoms, and disease. Their data representation, age ∈ {U1, ..., U8}, gender ∈ {L, P}, symptoms ∈ {G1, ..., G18}, and diseases ∈ {P1, ..., P9}, is described in Table I.

A. Selection
We found that the medical record data in the hospital are kept in manual form. We therefore digitised the manual eye disease medical records using data processing tools. We excluded the medical record number, name, and date from the patient's medical report form and kept the gender, age, symptoms, and diagnosis of the disease. Our next task is to find the relationships among age, gender, symptoms, and disease.

B. Pre-processing
The next process is data pre-processing, in which we eliminate noise by cleaning missing values, inconsistent data, and outliers. Data extraction is challenging because irrelevant, redundant, noisy, or unreliable attributes, or a method unsuited to uncovering unidentified patterns in the data, may lead to inaccurate results [20]. The purpose of the data pre-processing stage is to enhance machine-learning performance. We use several techniques to address missing values, such as replacing missing values with the attribute mean and deleting rows with more than one missing value. In the case of the age attribute, we assume that missing values are distributed similarly to the existing values. This assumption is known as Missing Completely at Random (MCAR). Because of the missing values in the eye disease dataset, we removed the records that contained them. Approximately 1214 records remain in the dataset and are used in the data mining process for generating rules.
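A minimal Python sketch of the selection and cleaning steps is given below. It follows the record-removal route reported above rather than mean imputation. The field names, the sample values, and the choice to treat only exact copies of a raw record as duplicates are assumptions made for illustration.

```python
def clean_records(records, keep=("gender", "age", "symptoms", "diagnosis")):
    """Drop incomplete and duplicated records, then keep selected attributes."""
    seen = set()
    cleaned = []
    for rec in records:
        # Skip records where any selected attribute is missing or empty.
        if any(rec.get(field) in (None, "", ()) for field in keep):
            continue
        # Treat exact copies of a raw record as duplicates (assumption).
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({field: rec[field] for field in keep})
    return cleaned


# Hypothetical raw records; symptoms are stored as tuples of item codes.
raw = [
    {"record_no": 1, "date": "2016-01-05", "gender": "L", "age": 45,
     "symptoms": ("G1", "G2"), "diagnosis": "P4"},
    {"record_no": 1, "date": "2016-01-05", "gender": "L", "age": 45,
     "symptoms": ("G1", "G2"), "diagnosis": "P4"},   # exact duplicate
    {"record_no": 2, "date": "2016-01-06", "gender": "P", "age": None,
     "symptoms": ("G3",), "diagnosis": "P3"},        # missing age
    {"record_no": 3, "date": "2016-01-07", "gender": "P", "age": 52,
     "symptoms": ("G3",), "diagnosis": "P3"},
]

cleaned = clean_records(raw)
```

Duplicates are detected on the full raw record, before the record number and date are dropped, so that two different patients with the same age, gender, symptoms, and diagnosis are both retained.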

C. Example to Generate Rules Using FP-Growth
Here, we illustrate the procedure with 20 sample medical records on eye disease to find the relationship between age, gender, symptoms, and kinds of eye disease. In this sample, we assume a minimum support of 10%. This support threshold affects the FP-tree construction. It is used to filter the item sets, i.e., only item sets whose support is equal to or greater than 10% are retained. We illustrate the pattern search process in Table III. We can observe that items U3, U6, U2, G9, G12, P2 and U4 do not appear in the table. The procedure for selecting the best items based on minimum support works as follows:

Input: Eye-diseases-data
Output: The best item based on minimum support
Procedure best-itemset
    Scan eye-diseases-data to find support
    For each support do
        Sort (support) order by support desc;
        If item set support < 10% then
            discard item set

The next step is to scan the data and re-arrange it according to the minimum support. The confidence and support values are the parameters for measuring the best rules. Data that do not meet the minimum support or do not contain an antecedent and a consequent are removed from the transactions. The antecedent attributes are age, gender, and symptoms, while the consequent is the disease.
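The filtering and re-arranging steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' implementation. The five sample transactions and the 40% threshold are hypothetical and were chosen so that the effect is visible on a small example.

```python
from collections import Counter


def frequent_items(transactions, min_support=0.10):
    """Rank items by descending count, discarding those below min_support."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    kept = [(item, c) for item, c in counts.items() if c / n >= min_support]
    # Ties are broken alphabetically so that the ordering is deterministic.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


def reorder(transactions, ranked):
    """Drop infrequent items and sort each transaction by item rank."""
    rank = {item: pos for pos, (item, _) in enumerate(ranked)}
    return [sorted((i for i in set(t) if i in rank), key=rank.get)
            for t in transactions]


# Hypothetical coded records (age U*, gender L/P, symptom G*, disease P<n>).
transactions = [
    ["U5", "L", "G1", "P4"],
    ["U5", "P", "G1", "P4"],
    ["U5", "L", "G2", "P3"],
    ["U4", "L", "G1", "P4"],
    ["U5", "P", "G2", "P3"],
]

ranked = frequent_items(transactions, min_support=0.40)
ordered = reorder(transactions, ranked)
```

In this example `U4` occurs in only one of the five transactions, so it is discarded, and every remaining transaction is rewritten in the same global item order before the tree is built.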
The next step is FP-Tree construction based on the scanned transactions. In the sample data provided, P1, P2, P3, P4, P5, P6, P7, P8 and P9 are the consequents, but only P3 and P4 meet the minimum support. Frequent item sets are then obtained by generating conditional pattern bases, and rules are created by calculating the confidence of each rule combination.
After the rules have been obtained from the frequent item sets, the next step is to evaluate the strength of the association rules using the lift ratio. The lift ratio is the ratio between the confidence of a rule and the benchmark confidence value. The benchmark (expected) confidence is the ratio of the number of transactions containing the consequent to the total number of transactions. We consider rules with a lift ratio greater than 1 to be viable rules. The expected confidence and the lift ratio are given by

\[ \text{Benchmark Confidence} = \frac{N_c}{N}, \qquad \text{Lift Ratio} = \frac{\text{Confidence}(A \rightarrow B)}{\text{Benchmark Confidence}}, \]

where $N_c$ is the number of transactions containing the consequent $B$ and $N$ is the total number of transactions.
The implementation was run over a range of minimum support and minimum confidence parameters. The minimum support values used were 3%, 10% and 20%, while the minimum confidence values used were 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100%. The minimum support and minimum confidence affect the rules generated. We start with high values of minimum support and confidence and then decrease them until we find values that generate enough patterns.
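For readers unfamiliar with the data structure, the following Python sketch shows a generic textbook-style FP-Tree construction and the extraction of a conditional pattern base. It is not the listing used in this study. The class and function names are hypothetical, and the input is assumed to be transactions that have already been filtered and sorted by descending item frequency.

```python
class FPNode:
    """One node of the FP-Tree: an item, its count, and links to kin."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(ordered_transactions):
    """Insert each ordered transaction as a path, sharing common prefixes."""
    root = FPNode(None)
    header = {}  # item -> list of tree nodes that carry the item
    for transaction in ordered_transactions:
        node = root
        for item in transaction:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header


def conditional_pattern_base(header, item):
    """Collect the prefix path (with its count) above every node of `item`."""
    base = []
    for node in header.get(item, []):
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((path[::-1], node.count))
    return base


# Hypothetical transactions, already sorted in one global frequency order.
ordered = [
    ["U5", "G1", "L", "P4"],
    ["U5", "G1", "P4", "P"],
    ["U5", "L", "G2", "P3"],
    ["G1", "L", "P4"],
    ["U5", "G2", "P", "P3"],
]

root, header = build_fp_tree(ordered)
p4_base = conditional_pattern_base(header, "P4")
```

Because four of the five transactions start with `U5`, they share a single `U5` node with a count of 4, which is where the compression of the tree comes from.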
When the minimum confidence and minimum support are high, fewer rules are generated, as shown in Table V. Regarding performance, a higher minimum support yields fewer patterns and a faster algorithm. A higher minimum confidence also yields fewer patterns, but the run may not be faster, because many algorithms do not use minimum confidence to prune the search space. Naturally, the choice of these parameters also depends on how many rules are considered justifiable. We obtained 2208 rules from the FP-Growth association rule mining test. All of them satisfy the minimum support and the minimum confidence and have a lift ratio > 1, so the resulting rules are feasible to use. The lift ratio indicates how much the probability of the consequent increases given the antecedent.
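The lift ratio can be computed with a few lines of Python. In the sketch below, the function names are hypothetical. The counts of 167 presbyopia cases and roughly 1214 cleaned records are taken from elsewhere in this paper, while the rule confidence of 100% is an assumption made purely for illustration, since the confidence of individual rules is not listed here.

```python
def benchmark_confidence(consequent_count, total_transactions):
    """Share of all transactions that contain the consequent."""
    return consequent_count / total_transactions


def lift_ratio(rule_confidence, consequent_count, total_transactions):
    """Confidence of the rule divided by the benchmark confidence."""
    return rule_confidence / benchmark_confidence(
        consequent_count, total_transactions
    )


# Assumed rule confidence of 1.0; 167 presbyopia cases in about 1214 records.
lift = lift_ratio(1.0, 167, 1214)
print(round(lift, 2))  # 7.27
```

Under these assumptions the result agrees with the highest lift reported in the conclusion, and any value above 1 means the antecedent makes the disease more likely than its base rate alone would suggest.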
Furthermore, 17 rules have the required form, in which the antecedent consists of gender, age, and symptoms and the consequent is an eye disease. There are nine kinds of eye disease in the data, but only two, Presbyopia and Conjunctivitis, meet the frequent item set requirement. We ran the test with a minimum support of 3% and a minimum confidence of 90%.
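Selecting the rules of this form amounts to keeping only those whose consequent is a disease code and whose antecedent contains none. The Python sketch below assumes the coding scheme of this paper, in which diseases are written as P followed by a number and the single letter P is a gender code; the function name and the sample rules are hypothetical.

```python
import re

# Disease codes are "P" plus digits; a bare "P" is a gender code.
DISEASE = re.compile(r"P\d+")


def is_diagnostic_rule(antecedent, consequent):
    """True when only the consequent side of the rule holds disease codes."""
    return (
        len(antecedent) > 0
        and len(consequent) > 0
        and all(DISEASE.fullmatch(c) for c in consequent)
        and not any(DISEASE.fullmatch(a) for a in antecedent)
    )


# Hypothetical mined rules as (antecedent, consequent) pairs.
rules = [
    (("P", "U5", "G1"), ("P4",)),  # gender, age, symptom -> disease
    (("G1",), ("U5",)),            # symptom -> age, not diagnostic
    (("P4",), ("G1",)),            # disease -> symptom, wrong direction
]

diagnostic = [rule for rule in rules if is_diagnostic_rule(*rule)]
```

Matching the whole code rather than only its first letter keeps the gender item `P` from being mistaken for a disease.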
Association rule testing was carried out in two ways. The first test is based on minimum support, minimum confidence, and lift ratio; we used the lift ratio to evaluate the association rules.
The second test was performed by ophthalmologists. We showed them the rules generated by FP-Growth so that they could assess the feasibility of the rules, testing the automatically obtained rules against expert knowledge. Based on the results of the ophthalmologists' acceptance test, the average accuracy of the FP-Growth association rule algorithm as an automatic rule generator is 88%.
We also conducted experiments examining the agreement between the Lift Ratio and the ophthalmologists' Confidence Level. The Confidence Level is usually used as a reference for assessing the strength of a rule. Eye disease diagnosis should use accurate and acceptable rules. The ophthalmologists inspected the results of the automatic rule generator. In this study, rules 6, 9, 16 and 17 showed agreement between the ophthalmologists' Confidence Level and the Lift Ratio values. Based on the ophthalmologists' assessment, we obtained a strong relationship between symptoms and diseases. Accurate rules can therefore be used to diagnose eye diseases. Fig. 4 shows the relationship between symptoms and diseases generated automatically using FP-Growth.
Fig. 4 The relationship between symptoms and eye diseases using FP-Growth algorithm

IV. CONCLUSION
We have successfully conducted research using association rules as the knowledge base of an expert system for eye disease. We used medical record data from Sumatera Eye Center (SMEC) Hospital in Pekanbaru, Riau, Indonesia. Based on the experiments conducted, the minimum support and minimum confidence in FP-Growth must be set appropriately for the desired rule outcome. With a minimum support of 3% and a minimum confidence of 90%, we obtained 2208 rules.
We have successfully employed the FP-Growth algorithm as a rule generator with a good accuracy of 88%. The FP-Growth algorithm creates rules from medical record data quickly because it requires only two database scans. FP-Growth constructs a highly compressed FP-tree, which is substantially smaller than the original database.
Based on the test results, Conjunctivitis and Presbyopia were the most dominant diseases, with 225 and 167 cases respectively. Males and females aged 31-59 years predominantly suffer from Conjunctivitis and Presbyopia. Based on the minimum support score, the rule "if female, aged 31-59 years, dizzy, and near blurred vision, then presbyopia" has a support of 4.9%, and its items appear together more often than expected, with the highest lift of 7.27. This means that the occurrence of these symptoms has a positive effect on the occurrence of the disease, i.e., the symptoms are positively correlated with the disease.
There are three critical symptoms for presbyopia: blurred vision (near), dizziness, and eye discomfort. For conjunctivitis there are four influencing symptoms: watery eyes, itchy eyes, sticky eyes, and red eyes, of which sticky eyes is the most important. FP-Growth can generate rules for the knowledge base of a rule-based expert system. The algorithm is considered a fast technique for knowledge acquisition and may even replace traditional knowledge acquisition.
However, a larger study is needed to produce completely automatic rules for the knowledge base of an expert system. In future work, we plan to collect more data on other kinds of eye diseases and to develop an actual expert system.