Breast Tissue Classification via Interval Type 2 Fuzzy Logic Based Rough Set

BIRADS (Breast Imaging Reporting and Data System) is a tool that standardizes mammogram reports and minimizes ambiguity during mammogram image evaluation. Classifying BIRADS is one of the most challenging tasks for radiologists. Once sufficient information is available at the BIRADS stage, the oncologist can administer an appropriate treatment to the patient. This study aims to build a model that classifies BIRADS using mammogram images and reports. Type-2 fuzzy logic is used as the classifier, and automatically generated rules are applied to the model. To evaluate the proposed model, its accuracy, specificity and sensitivity are calculated and compared against those obtained with rules given by experts. The study comprises several steps, beginning with the collection of data from the Radiology Department, Hospital of the National University of Malaysia (UKM). The data were first processed to remove noise and gaps. An algorithm was then developed using type-2 fuzzy logic with the Mamdani model. Three types of membership functions were employed in the study. The rules used by the model were obtained from experts and also generated automatically by the system using rough set theory. Finally, the model was trained and tested to obtain the best result. The study shows that the triangular membership function with rough set rules obtains an accuracy of 89%, whereas expert rules achieve 78%. The sensitivity is 98.24% with expert rules and 93.94% with rough set rules. The specificity is 73.33% with expert rules and 84.34% with rough set rules. Conclusion: based on statistical analysis, the model that employs rules generated automatically by rough set theory performs better than the model using rules given by experts.


I. INTRODUCTION
Cancer is among the major causes of death in Malaysia. According to the National Cancer Registry, around 18,219 new cancer cases were diagnosed in 2007 [1]. Most of these cases (55.4%) were in females. Breast cancer was the most frequently occurring cancer in the Malaysian population in 2007. Fig. 1(a) shows the ten most common cancers in Malaysia in 2007.
Cancer occurs when cells in the human body divide uncontrollably, creating a mass, calcification or distortion known as a tumor. A malignant tumor in breast cells is known as breast cancer [2]. In Malaysia, this disease is predominant among women between the ages of 20 and 75 per 100,000 population (Fig. 1(b)). Most of them were diagnosed at stage II (Fig. 1(c)).
All Malaysians, especially women, should be well educated on the issue of breast cancer. Berita Harian, a renowned newspaper in Malaysia, reported that over 700,000 women in Malaysia had been afflicted by this disease [3]. It also reported that mammography is one of the best screening methods for limiting the spread of the cancer to other parts of the body by way of early detection. Timely diagnosis of the cancer translates into more effective treatment for patients [4]. Mammography uses an X-ray system to diagnose the disease [5].
At present, there are two types of mammography: digital mammography and film mammography. The drawback of film mammography is that the breast image cannot be altered once it has been obtained. As a result, breast information lost because of poor contrast (the difference between the lightest and darkest areas on the display screen) or because of underexposure of the film cannot be restored. To overcome this problem, Computer Aided Detection (CAD) uses digital mammography, which captures an electronic image of the breast and saves it on a computer, improving image storage and transmission.
Screening mammography, which is used for early detection of breast cancer, improves the treatment given and reduces mortality. However, human involvement in the screening and identification process inevitably introduces errors and, consequently, erroneous diagnoses. Kerlikowske et al. [6] reported that an error rate of about 10% to 30% occurs in cancer determination performed by a typical radiologist.
In recent years, CAD has been used as a second reader for radiologists. This has made the radiologist's detection of abnormalities in medical images faster, more sensitive and less costly [7]. The common steps of a CAD system are image acquisition, pre-processing if needed, and segmentation. An evaluation method may then be applied to test the system.
In breast cancer screening, the mammogram is the input image of the CAD system. After the CAD system has segmented the image to identify suspicious regions, the radiologist reviews it [8]. Earlier researchers have demonstrated the importance of analysing mammograms with CAD to detect breast cancer at an early stage [8,9].
Machine learning can handle various data problems such as noise, complexity and large volume. For these reasons, machine learning is used in CAD systems to improve their handling of mammogram data. It is highly useful in cancer detection because it can learn from past examples [9,10,11]. Breast cancer medical data comprise not only the mammogram but also a large amount of information describing the patient's medical condition. Moreover, the data are often incomplete, so they may be inadequate for deriving a conclusion.
To handle uncertainty, machine learning methods such as type-2 fuzzy logic and rough set theory are used to classify BIRADS [12,13,14]. These two methods were selected because of their ability to handle uncertainty, and they have previously been used to classify BIRADS based only on features extracted from mammograms [15]. Important information and signs are extracted from the data. The signs of breast cancer that can be seen in a mammogram are mass, calcification and architectural distortion [2]. The result from the mammogram report determines the severity of the patient's condition.
This paper presents a type-2 fuzzy logic classifier model, integrated with rules generated automatically by rough set theory, to classify BIRADS using mammogram images and reports. The model is evaluated using its accuracy, specificity and sensitivity, compared with those obtained using rules provided by experts.

II. MATERIAL AND METHOD
The proposed method is shown in Fig. 2.

A. Data collection:
The study commenced with the collection of patient data at the Radiology Department of UKM Medical Center. Patients with BIRADS 1, 2, 3, 4 and 5 were considered in this study. In more detail, 100 mammogram images with their reports were collected. In general, the data were incomplete and rife with noise and uncertainty for BIRADS classification of the patients. All of the data therefore had to be pre-processed, as shown in Fig. 3 [16,17]. There were 13 attributes in total, consisting of Density, Mass, Calcification, Others, Impression (normal, benign, most likely benign, suspicious of malignancy, malignant), Recommendation (follow-up mammogram, biopsy and others), and Result (BIRADS class 1 to 5).
To develop a good model, the cleaned data were divided into training and testing datasets. The model first learned the class attribute from the training dataset and was then used to classify the testing data, in order to obtain the most accurate model.

B. Pre-processing:
Pre-processing started by filling in incomplete data with mean values. To avoid missing data, discretization was then employed to define the data ranges. Thorough data cleaning was needed to enhance the model.
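The two pre-processing operations named above, mean imputation and discretization, can be sketched as follows. This is a minimal illustration only: the function names, the 0 to 10 score scale, the cut points and the labels are assumptions made for the example and are not taken from the study's data.

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def discretize(values, cut_points, labels):
    """Map each numeric value to the label of the interval it falls into.

    cut_points must be sorted ascending and len(labels) == len(cut_points) + 1.
    """
    out = []
    for v in values:
        idx = sum(v >= c for c in cut_points)  # count of cut points at or below v
        out.append(labels[idx])
    return out

# Hypothetical mass scores; None marks a missing report entry.
mass_scores = [2.0, None, 7.0, 9.0]
filled = impute_mean(mass_scores)
levels = discretize(filled, cut_points=[3.0, 8.0], labels=["less", "several", "lot"])
```

Imputing before discretizing means every record receives a category, which is one plausible reading of how the two steps interact in the study.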
The attributes to be used in the model were then selected, as shown in Fig. 4. As shown in Table 1, the expert rated each attribute according to its degree of seriousness. Examples of mammogram images for the selected attributes are given in Fig. 5.

C. Development of Proposed Type-2 Fuzzy Logic Model
Type-2 fuzzy logic was selected to classify BIRADS for breast cancer. This fuzzy rule-based system is also known as the Mamdani model. According to Caramihai et al. [18], a fuzzy logic model can be developed in three steps. The first step is to determine the input and output variables that describe the breast cancer tissue problem and to choose an interval for each variable. The second step is to define the linguistic values of breast cancer tissue, along with their membership functions, which are mapped onto the range of each fuzzy variable. The final step is to construct the rule sets, using radiologist rules and rules auto-generated by rough set in parallel, by associating their inputs and outputs.
In this study, the following assumptions were made to develop the type-2 fuzzy logic model for mammogram images:
1. All fuzzy sets (distortion, mass, calcification) were interval type-2 fuzzy sets.
2. Antecedents and consequents were changed according to their membership functions.
3. Inputs were tested on three membership functions (Gaussian, triangular and trapezoidal).
4. The fuzzy operations were the t-norm and product implication methods.
An interval type-2 fuzzy set has upper and lower membership functions. Three types of membership function were selected: trapezoidal, triangular and Gaussian.
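The Gaussian membership function has a fixed standard deviation $\sigma$ and an uncertain mean $m$ that takes values in $[m_1, m_2]$, as in Eq. 1:
$$\mu(x) = \exp\left[-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2\right], \quad m \in [m_1, m_2] \tag{1}$$
The upper membership function was defined by:
$$\overline{\mu}(x) = \begin{cases} N(m_1, \sigma; x), & x < m_1 \\ 1, & m_1 \le x \le m_2 \\ N(m_2, \sigma; x), & x > m_2 \end{cases} \tag{2}$$
The lower membership function was defined by:
$$\underline{\mu}(x) = \begin{cases} N(m_2, \sigma; x), & x \le \frac{m_1 + m_2}{2} \\ N(m_1, \sigma; x), & x > \frac{m_1 + m_2}{2} \end{cases} \tag{3}$$
where $N(m, \sigma; x) \triangleq \exp\left[-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2\right]$.
The triangular membership function uses three parameters $\{a, b, c\}$:
$$\mathrm{triangle}(x; a, b, c) = \max\left(\min\left(\frac{x-a}{b-a}, \frac{c-x}{c-b}\right), 0\right) \tag{4}$$
The parameters $\{a, b, c\}$ (with $a < b < c$) determine the coordinates of the three corners of the triangular membership function. The trapezoidal membership function is given by Eq. 5:
$$\mathrm{trapezoid}(x; a, b, c, d) = \max\left(\min\left(\frac{x-a}{b-a}, 1, \frac{d-x}{d-c}\right), 0\right) \tag{5}$$
where the parameters $\{a, b, c, d\}$ (with $a < b \le c < d$) determine the coordinates of the four corners of the trapezoidal membership function. The membership functions employed for the model are shown in Fig. 6.
The first step is fuzzification, in which the fuzzifier maps each crisp input value onto its fuzzy sets. The three inputs, calcification, distortion and mass, are obtained from mammogram reports.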
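As an illustration of the three membership function families described above, the following sketch evaluates them in Python. It assumes the standard textbook forms: a Gaussian with fixed standard deviation and a mean in [m1, m2], and triangular and trapezoidal shapes defined by their corner points. Building the interval type-2 triangle from an inner and an outer type-1 triangle is an assumption made here for illustration; the paper does not state how its footprint of uncertainty was constructed, and all function names are hypothetical.

```python
import math

def triangular(x, a, b, c):
    """Type-1 triangular membership with corners a < b < c."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def trapezoidal(x, a, b, c, d):
    """Type-1 trapezoidal membership with corners a < b <= c < d."""
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

def gaussian(x, m, sigma):
    """Type-1 Gaussian membership with mean m and standard deviation sigma."""
    return math.exp(-0.5 * ((x - m) / sigma) ** 2)

def it2_gaussian(x, m1, m2, sigma):
    """Interval type-2 Gaussian with uncertain mean in [m1, m2].

    Returns the (lower, upper) membership grades of x.
    """
    if x < m1:
        upper = gaussian(x, m1, sigma)
    elif x > m2:
        upper = gaussian(x, m2, sigma)
    else:
        upper = 1.0
    midpoint = (m1 + m2) / 2
    lower = gaussian(x, m2, sigma) if x <= midpoint else gaussian(x, m1, sigma)
    return lower, upper

def it2_triangular(x, inner, outer):
    """Interval type-2 triangle from an inner (lower) and outer (upper) triangle.

    inner and outer are (a, b, c) tuples; this construction is an assumption.
    """
    up = triangular(x, *outer)
    lo = min(triangular(x, *inner), up)  # keep lower grade <= upper grade
    return lo, up

# Hypothetical parameters on a 0-10 input scale.
g_lower, g_upper = it2_gaussian(5.0, m1=4.0, m2=6.0, sigma=1.0)
t_lower, t_upper = it2_triangular(3.0, inner=(2.0, 5.0, 8.0), outer=(0.0, 5.0, 10.0))
```

Each function returns a pair of grades, so a fuzzifier built on them yields an interval rather than a single number for every crisp input.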

D. Rules generation by expert:
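The next phase was rule generation. The rules were developed based on expert knowledge. In this study, the $M$ rules can be written as:
$$R^l: \text{IF } x_1 \text{ is } \tilde{F}_1^l \text{ and } \dots \text{ and } x_p \text{ is } \tilde{F}_p^l, \text{ THEN } y \text{ is } \tilde{G}^l, \quad l = 1, \dots, M$$
They can also be written as:
$$R^l: \tilde{F}_1^l \times \dots \times \tilde{F}_p^l \rightarrow \tilde{G}^l$$
where $\tilde{F}_i^l$ is the antecedent and $\tilde{G}^l$ is the consequent of fuzzy rule $l$.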
The details of the linguistic variables for the fuzzy membership function using expert rule are shown in Table 2.

E. Proposed Rules Generation using Rough Set:
Besides using rules from experts, the study also generated rules automatically from mammogram reports from the Radiology Department. From the existing mammogram reports, a set of rules was derived by grouping records that shared many criteria. The RN patient numbers and their mammogram reports, noting density, mass, calcification and BIRADS decision, are shown in Fig. 4. Here, density, mass and calcification constituted the conditions, while BIRADS was the target decision.
Rough set theory was used to produce rules from a set of attribute associations. This method was selected because of its ability to handle inconsistent data in a dataset. It is also capable of recognizing redundant information and matching it together.
The first step was to identify equivalence classes, each containing the patients (P numbers) that share the same set of condition values. For example, P7, P10 and P12 are patient numbers (Table 3). An equivalence class was generated by identifying the patients with identical attribute values. In this context, the equivalence classes E1, E2 and E3 each represent a set of patient numbers together with the attribute values they share.
The second step was to develop a discernibility matrix, as shown in Table 4. Here, the attributes that differ between each pair of equivalence classes were identified in detail. It can be observed that architectural distortion is the attribute that discriminates between classes E1 and E2, while mass and calcification are the attributes that differentiate between classes E1 and E3.
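The first two rough set steps, grouping patients into equivalence classes and building the matrix of distinguishing attributes, can be sketched as below. The patient records are hypothetical and are not the contents of Table 3 or Table 4; they are chosen only so that the resulting matrix mirrors the E1/E2 and E1/E3 pattern described in the text. The function and variable names are likewise assumptions.

```python
from collections import defaultdict
from itertools import combinations

CONDITIONS = ("distortion", "mass", "calcification")

# Hypothetical patient records, for illustration only.
patients = {
    "P7":  {"distortion": "yes", "mass": "spiculated", "calcification": "large", "birads": 5},
    "P10": {"distortion": "yes", "mass": "spiculated", "calcification": "large", "birads": 5},
    "P12": {"distortion": "no", "mass": "spiculated", "calcification": "large", "birads": 4},
    "P15": {"distortion": "yes", "mass": "less", "calcification": "little", "birads": 2},
}

def equivalence_classes(records, attributes):
    """Group patients whose values agree on every condition attribute."""
    groups = defaultdict(list)
    for pid, rec in records.items():
        key = tuple(rec[a] for a in attributes)
        groups[key].append(pid)
    return dict(groups)

def discernibility_matrix(classes, attributes):
    """For each pair of classes, collect the attributes on which they differ."""
    matrix = {}
    for k1, k2 in combinations(classes, 2):
        matrix[(k1, k2)] = {a for a, v1, v2 in zip(attributes, k1, k2) if v1 != v2}
    return matrix

classes = equivalence_classes(patients, CONDITIONS)
e1, e2, e3 = list(classes)  # insertion order: E1, E2, E3
matrix = discernibility_matrix(classes, CONDITIONS)
```

With these records, `matrix[(e1, e2)]` contains only the distortion attribute and `matrix[(e1, e3)]` contains mass and calcification, which is the kind of entry from which a discernibility function is later formed.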
The third step was to calculate the relative discernibility function (f), which was used to obtain the minimal set of attributes that distinguishes between classes. Thus, from Table 4, the function for E1 is f(E1) = Architectural Distortion ∧ (Mass ∨ Calcification).
The last step was to generate rules using the relative reducts. The OR operator splits a condition into two rules, while the AND operator unites conditions into one rule. For example, from RED(E1): IF Architectural Distortion = yes AND Mass = spiculated THEN BIRADS = 5. A total of 21 rules were generated by rough set theory for the mammogram reports. Examples of the automatically generated rules are as follows:
1. IF calcification distribution is large AND no distortion AND less mass THEN BIRADS 2.
2. IF calcification distribution is medium AND no distortion AND less mass THEN BIRADS 2 OR BIRADS 3.
3. IF calcification distribution is little AND no distortion AND several mass THEN BIRADS 2 OR BIRADS 4.
4. IF calcification distribution is very large AND no distortion AND lot of mass THEN BIRADS 5 OR BIRADS 4.
Table 5 shows the details of the linguistic variables of the fuzzy membership functions using the rules generated by rough set.

F. Inferencing Process
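The inference engine was developed next. It is the basic component of a fuzzy logic system and calculates the firing level of every rule from the inputs and the rule antecedents. The firing level is then applied to the consequent fuzzy set. The fuzzy inference engine for BIRADS, which is the output of the model, can be written as:
$$\mu_{\tilde{B}^l}(y) = \left[ \sqcap_{i=1}^{p} \mu_{\tilde{F}_i^l}(x_i') \right] \sqcap \mu_{\tilde{G}^l}(y)$$
where $x_i'$ ($i = 1, \dots, p$) are the inputs for mass, calcification and distortion, $\mu_{\tilde{F}_i^l}$ is the type-2 membership function of antecedent $i$ of rule $l$, and $\mu_{\tilde{G}^l}$ is the type-2 membership function of the consequent of rule $l$.
The equation above can also be written as:
$$\mu_{\tilde{B}^l}(y) = F^l(\mathbf{x}') \sqcap \mu_{\tilde{G}^l}(y)$$
where $F^l(\mathbf{x}')$ is the firing level for the set of mass, calcification and distortion data. When interval type-2 sets are used for the mammogram data, the firing level is an interval:
$$F^l(\mathbf{x}') = \left[ \underline{f}^l(\mathbf{x}'), \overline{f}^l(\mathbf{x}') \right], \quad \underline{f}^l = \prod_{i=1}^{p} \underline{\mu}_{\tilde{F}_i^l}(x_i'), \quad \overline{f}^l = \prod_{i=1}^{p} \overline{\mu}_{\tilde{F}_i^l}(x_i')$$
The fourth step was type reduction. The study used the center-of-sets type-reduction method because of its ability to reduce the consequent of every rule to a type-1 interval set $[y_l^i, y_r^i]$, where the rules are indexed by $i = 1, \dots, M$ to avoid confusion with the subscript $l$:
$$Y_{\cos}(\mathbf{x}') = [y_l, y_r], \quad y_l = \frac{\sum_{i=1}^{M} f_l^i y_l^i}{\sum_{i=1}^{M} f_l^i}, \quad y_r = \frac{\sum_{i=1}^{M} f_r^i y_r^i}{\sum_{i=1}^{M} f_r^i}$$
where $f_r^i$ and $f_l^i$ are the firing levels associated with $y_r^i$ and $y_l^i$ for rule $i$, chosen from the interval $[\underline{f}^i, \overline{f}^i]$ so that $y_r$ is maximized and $y_l$ is minimized. The Karnik-Mendel algorithm was also applied in the type-reduction step.
The last step was defuzzification. This step used the centroid method, in which the range of the output is discretized into points. In the study, the interval set $[y_l, y_r]$ was obtained from type reduction, and the crisp output of the type-2 fuzzy logic system for the model was taken as the average of $y_l$ and $y_r$:
$$y(\mathbf{x}') = \frac{y_l + y_r}{2}$$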
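The inference, type-reduction and defuzzification steps can be sketched in Python as follows. This is a sketch under stated assumptions: the product t-norm is used for the firing interval, each rule consequent is summarized by a hypothetical centroid interval, and the endpoints of the type-reduced set are found by trying every switch point. That exhaustive search returns the same endpoints that the Karnik-Mendel iteration converges to, but it is not the iterative algorithm itself. All names and numbers are illustrative.

```python
def firing_interval(memberships):
    """Product t-norm over the (lower, upper) grades of one rule's antecedents."""
    lo, up = 1.0, 1.0
    for m_lo, m_up in memberships:
        lo *= m_lo
        up *= m_up
    return lo, up

def _endpoint(points, firing, left):
    """Endpoint of the type-reduced interval, found by trying every switch point."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    y = [points[i] for i in order]
    f = [firing[i] for i in order]
    best = None
    for k in range(len(y) + 1):
        # Left endpoint: upper firing below the switch point, lower firing above it.
        # Right endpoint: the reverse assignment.
        w = [fi[1] if (i < k) == left else fi[0] for i, fi in enumerate(f)]
        total = sum(w)
        if total == 0:
            continue  # no rule fires under this assignment
        val = sum(wi * yi for wi, yi in zip(w, y)) / total
        if best is None or (val < best if left else val > best):
            best = val
    return best

def type_reduce(consequents, firing):
    """consequents: list of (y_l_i, y_r_i); firing: list of (f_lower_i, f_upper_i)."""
    y_l = _endpoint([c[0] for c in consequents], firing, left=True)
    y_r = _endpoint([c[1] for c in consequents], firing, left=False)
    return y_l, y_r

def defuzzify(y_l, y_r):
    """Crisp output as the midpoint of the type-reduced interval."""
    return (y_l + y_r) / 2

# Two hypothetical rules with consequent centroids near BIRADS 2 and BIRADS 4.
rule_firing = [
    firing_interval([(0.2, 0.4), (0.5, 1.0)]),
    firing_interval([(0.6, 0.8), (0.5, 0.5)]),
]
consequents = [(1.8, 2.2), (3.8, 4.2)]
y_l, y_r = type_reduce(consequents, rule_firing)
birads_score = defuzzify(y_l, y_r)
```

The crisp score would still need to be mapped to a BIRADS class, for example by rounding; how the study performs that final mapping is not stated in the text.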

III. RESULTS AND DISCUSSION
To test the proposed model, self-collected data from the Radiology Department of Hospital UKM were used. The dataset consists of 100 mammogram images with their reports. The objective of the model is to classify the data into the corresponding BIRADS based on three input variables: mass, calcification and distortion. These inputs were selected because, according to the expert, they affect the determination of the BIRADS class. K-fold cross-validation was used to divide the data into training and testing data so that every record is used for both training and testing. The method starts by randomly dividing the data into k blocks of equal size. One block is used for testing while the others are used for training. The k-fold method is shown in Fig. 7.
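The k-fold procedure described above (shuffle, split into k equal blocks, hold out one block per round) can be sketched with the standard library alone. The function name, the seed and the choice of k = 10 are assumptions for the example; the study's own fold count is not restated here.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each block serves as the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment of records to blocks
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

# 100 records, as in the collected dataset; k = 10 is a hypothetical choice.
splits = list(k_fold_indices(100, k=10))
```

Because every record appears in exactly one test block, each of the 100 reports is used for testing once and for training in the remaining rounds.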
Whenever a user wishes to use the system, the user must first choose which type of membership function to use and which rules to apply in the system, either expert-based rules or rough set rules.
The user is first required to log into the system as in Fig. 8(a) and select the preferred classifier as in Fig. 8(b).
Thereafter, the user fills in the form about the mammogram report as in Fig. 8(c). The form consists of the three inputs used by the model and one output, BIRADS, which is displayed later.
The confusion matrix for every membership function is detailed in Table 7.
Sensitivity, specificity and accuracy can be calculated from the confusion matrix using the formulae below. Sensitivity is the probability of correctly detecting breast cancer among all patients who actually have it:
$$\text{Sensitivity} = \frac{\#TP}{\#TP + \#FN}$$
where #TP (true positive) is the number of patients predicted as having breast cancer who actually have it, and #FN (false negative) is the number of patients predicted as healthy who actually have breast cancer.
Specificity is the probability of correctly identifying healthy patients among all patients who are actually healthy:
$$\text{Specificity} = \frac{\#TN}{\#TN + \#FP}$$
where #TN (true negative) is the number of healthy patients predicted as healthy, and #FP (false positive) is the number of healthy patients predicted as having breast cancer.
Accuracy is the probability of a correct detection (positive or negative) over the total population:
$$\text{Accuracy} = \frac{\#TP + \#TN}{\#TP + \#TN + \#FP + \#FN}$$
A comparison of rough set rules using standard voting and Naive Bayes is tabulated in Table 7. This was the best result achieved with data folding. At fold 8, Naive Bayes shows the highest classification result, 71.79%, with the data divided into training and testing sets (60:40). This split was used because of the small sample size and to ensure that the highest accuracy was obtained. It also produced the 21 rules that were applied in the model. Naive Bayes is also a good classifier, as it can give good results on small datasets and reduce error more effectively.
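The three evaluation measures used in this section follow directly from the four confusion matrix counts, using the standard definitions: sensitivity is TP / (TP + FN), specificity is TN / (TN + FP), and accuracy is (TP + TN) divided by all cases. A minimal sketch is given below; the counts are hypothetical and are not the study's confusion matrix.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) from confusion matrix counts."""
    sensitivity = tp / (tp + fn)  # detected cancers among actual cancers
    specificity = tn / (tn + fp)  # cleared patients among actually healthy patients
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

# Hypothetical counts for illustration only.
sens, spec, acc = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

Since BIRADS has five classes, applying these binary measures presupposes some grouping of classes into positive and negative; the text does not say which grouping the study used.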
The study also compared the rules given by the expert with the rules generated automatically by rough set theory. Triangular, trapezoidal and Gaussian membership functions were used in the model. Table 8 shows that rough set rules outperform expert rules for all three fuzzy membership functions in terms of accuracy and specificity. Although the sensitivity using expert rules is slightly higher than that using rough set rules, the difference is small. Moreover, the sensitivity obtained by rough set rules is still acceptable and comparable for medical applications.
A t-test was applied to assess the significance of the difference between the rules given by the expert and the rules generated automatically by rough set theory. Based on the t-test result, the proposed method using rough set rules is significantly better than expert rules (P = 0.0047 < 0.05).
Rough set theory is a good method because it has an efficient algorithm for recognizing patterns in mammogram data. It can identify relationships within the data used and evaluate the data clearly. For example, while generating rules, rough set identifies patterns and relationships in the data to yield sound rules. In other words, fuzzy rough set can generalize the rules and overcome the limitations of rules generated by an expert. Consequently, the model that uses rules from rough set performs better than the one using expert rules. Furthermore, there is a higher probability of the expert diagnosing a patient with the lowest BIRADS, which lowers the accuracy of the model.
Breast cancer detection at an early stage can reduce the mortality rate among women. The uncertainty in determining the BIRADS of breast cancer can be handled by applying fuzzy logic. The study used type-2 fuzzy logic with rules produced by rough set and by an expert. The models were compared to identify the better one for BIRADS classification. The model also used three different types of membership function, and the triangular membership function was observed to perform better than the others. This project has shown that rules generated by rough set produce a better model and accuracy for predicting BIRADS classification. Expert rules, on the other hand, are still insufficient to model real-life scenarios. A second opinion on BIRADS classification is highly important to support and substantiate the decision made by a specific expert. Therefore, the intelligent BIRADS classification system has shown the significance of using general rules derived from a knowledge base instead of from a single expert.