A Comparative Analysis of Rough Sets for Incomplete Information System in Student Dataset

Rough set theory is a mathematical model for dealing with vague, imprecise, and uncertain knowledge, and it has been successfully used to handle incomplete information systems. In real-world problems, it is common to find situations where the user is not able to provide all the necessary preference values. In this paper, we compare the accuracy of four extensions of rough set theory, namely Tolerance Relation, Limited Tolerance Relation, Non-Symmetric Similarity Relation, and New Limited Tolerance Relation of Rough Sets, for handling an incomplete information system in a real-world student dataset. The results show that New Limited Tolerance Relation of Rough Sets outperforms the previous techniques.


I. INTRODUCTION
Data mining is the process of collecting historical data and using it to find regularities, patterns, and relationships in large data sets, and of representing the resulting knowledge so that the data become understandable [1]. It has been widely used and has solved many real-life problems [2][3][4]. Besides enhancing the methods used in learning systems [5], data mining has also been used in the field of education to better understand how students learn and to identify the settings in which they learn, with the aim of improving educational outcomes [6].
One popular and viable approach to data mining is rough set theory. Rough set theory [7] was introduced by Professor Zdzislaw Pawlak on the basis of an information system, i.e. a pair $(U, A)$, where $U$ is the universe of objects and $A$ is the set of conditional attributes. In recent years, it has been successfully applied to knowledge discovery, information system analysis, artificial intelligence, decision analysis, pattern recognition, etc. Various real-life applications of rough set theory have shown its usefulness in many domains, e.g. clustering [8][9][10][11], modeling conflict analysis [12] with an improvement in computational time by using soft set theory [13], association rules [14], medical diagnosis [15], [16], and supplier and distributor selection in supply chain management [17], [18].
The main advantage of rough set theory is that it does not need any preliminary or additional information about the data. The standard rough set theory, however, can only deal with complete information systems, where every object in the system has a value for every attribute. It is based on the indiscernibility relation, which is reflexive, symmetric, and transitive. In real-world problems, however, it is common to find situations where the user is not able to provide all the necessary preference values, and thus we need to deal with incomplete information systems.
Considerable effort has been devoted to coping with incomplete information systems. The simplest method is to remove the objects with unknown or missing values [19]. However, this approach reduces the sample size of the data. Apart from that, a well-known approach is an extension of rough set theory called the tolerance relation [20]. Its disadvantage is that it leads to poor results in terms of approximation. Subsequently, Stefanowski and Tsoukias [21], [22] proposed the similarity relation to improve the results obtained by means of the tolerance relation. However, Wang [23] and Yang et al. [24] showed that some information is lost when the similarity relation is applied, and hence they introduced the limited tolerance relation. Nevertheless, some information is still lost, since the limited tolerance relation does not consider the similarity precision between two objects. Nguyen et al. [25] improved the tolerance relation by considering the probability of matching between two objects. This approach, however, first requires the probability distribution of the data to be known. Consequently, Deris et al. [26] proposed the New Limited Tolerance Relation of Rough Sets. This proposal is based on the limited tolerance relation and takes into consideration the similarity precision between two objects, which is evaluated against a given threshold value. Therefore, this study aims to compare multiple techniques for incomplete information systems based on rough set theory using real-world student data with missing or incomplete values.
This study reports an experimental comparison of several techniques for incomplete information systems, namely Tolerance Relation (TR), Limited Tolerance Relation (LTR), Non-Symmetric Similarity Relation (NSSR), and New Limited Tolerance Relation of Rough Sets (NLTRS), with respect to their accuracy. The student data analyzed were obtained from the Directorate of Information Systems (SISFO), Telkom University.
The rest of the paper is organized as follows. In Section II, the basic notions of rough set theory and of incomplete information systems are introduced. Afterward, the tolerance relation, limited tolerance relation, non-symmetric similarity relation, and new limited tolerance relation of rough sets for handling incomplete information systems are briefly described. Section III presents the results and compares them in terms of accuracy. Finally, the conclusion of this work is presented in Section IV.

II. MATERIALS AND METHODS
In this section, we recall the notion of an information system and the idea of rough set theory, and continue with the essential definitions of the techniques for incomplete information systems based on rough set theory, namely Tolerance Relation, Non-Symmetric Similarity Relation, Limited Tolerance Relation, and New Limited Tolerance Relation of Rough Sets.

A. Information System
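The notion of an information system gives an appropriate tool for the representation of objects in terms of their attribute values. An information system is a 4-tuple (quadruple) $S = (U, A, V, f)$, where $U$ is a non-empty finite set of objects, $A$ is a non-empty finite set of attributes, $V = \bigcup_{a \in A} V_a$, where $V_a$ is the domain (value set) of attribute $a$, and $f : U \times A \to V$ is a function such that $f(x, a) \in V_a$ for every $(x, a) \in U \times A$, called the information (knowledge) function [27]. If $U$ in $S$ contains at least one object with an unknown or missing value, then $S$ is called an incomplete information system. The unknown or missing value is denoted by "*" in an incomplete information system. In this paper, we use the quadruple $S^* = (U, A, V^*, f)$, with $V^* = V \cup \{*\}$, to denote an incomplete information system. Having presented the idea of an information system, we recall the notion of rough set theory in the following section.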
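As a small illustration of this representation, the sketch below stores an incomplete information system as a Python dictionary that maps each object to its tuple of attribute values, with "*" marking a missing value. The table, the object names, and the helper names `is_incomplete` and `complete_objects` are hypothetical and are not taken from the student dataset.

```python
MISSING = "*"  # marker for an unknown or missing value, as in the text

# Hypothetical toy table: four objects described by four attributes.
TABLE = {
    "s1": ("A", "B", "*", "C"),
    "s2": ("A", "B", "B", "*"),
    "s3": ("A", "*", "B", "*"),
    "s4": ("B", "A", "A", "C"),
}

def is_incomplete(table):
    """True if at least one object has at least one missing value."""
    return any(MISSING in row for row in table.values())

def complete_objects(table):
    """Keep only the objects whose attribute values are all known."""
    return {name: row for name, row in table.items() if MISSING not in row}

print(is_incomplete(TABLE))            # the toy table has missing values
print(sorted(complete_objects(TABLE))) # objects that survive deletion
```

In this toy table, deleting every object with a missing value leaves a single object, which illustrates why simple deletion can shrink the sample considerably.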

B. Rough Set Theory
Rough set theory is founded on the assumption that every object of the universe of discourse can be associated with some information (data, knowledge). Objects characterized by the same information are indiscernible (similar) in view of the available information about them. This indiscernibility relation is the mathematical basis of rough set theory. Any set of all mutually indiscernible objects is called an elementary set and forms a basic granule (atom) of knowledge about the universe. Any union of elementary sets is called an exact set; otherwise, the set is called rough (not exact, not clear). A rough set has boundary-line objects, that is, objects that cannot be classified with certainty, using the existing knowledge, as members of the set or of its complement.
First, we recall some fundamental definitions of rough set theory. Formal definitions and a detailed description of rough set theory can be found in [7].
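An information table is a quadruple $S = (U, A, V, f)$, where $U$ is a non-empty finite set of objects, $A$ is a non-empty finite set of attributes, $V = \bigcup_{a \in A} V_a$ is the union of the attribute domains $V_a$, and $f : U \times A \to V$ is an information function such that $f(x, a) \in V_a$ for every $(x, a) \in U \times A$. For any subset $B \subseteq A$, two objects $x, y \in U$ are indiscernible with respect to $B$ if $f(x, a) = f(y, a)$ for every $a \in B$. This indiscernibility relation is an equivalence relation, and $[x]_B$ denotes the equivalence class of $x$. The $B$-lower and $B$-upper approximations of a subset $X \subseteq U$ are

$$\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\}, \qquad \overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\}.$$

The accuracy of approximation of any subset $X \subseteq U$ with respect to $B \subseteq A$ is measured by

$$\alpha_B(X) = \frac{|\underline{B}(X)|}{|\overline{B}(X)|},$$

where $|X|$ denotes the cardinality of $X$. For the empty set $\emptyset$, it is defined as $\alpha_B(\emptyset) = 1$ [28]. Clearly, $0 \le \alpha_B(X) \le 1$. If $X$ is a union of some equivalence classes of $U$, then $\alpha_B(X) = 1$; thus, the set $X$ is exact with respect to $B$. If $X$ is not a union of some equivalence classes of $U$, then $\alpha_B(X) < 1$; thus, with respect to $B$, the set $X$ is not exact. This means that the higher the accuracy of approximation of a subset $X \subseteq U$, the more precise (the less imprecise) it is [29].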
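The accuracy of approximation for a complete table can be sketched in a few lines of Python. The table, the attribute positions, and the function names below are hypothetical and serve only to illustrate the classical (complete-data) case: objects are grouped into equivalence classes, the lower and upper approximations are collected from those classes, and the accuracy is the ratio of their sizes.

```python
from collections import defaultdict

# Hypothetical complete table: each object has a value for every attribute.
TABLE = {
    "s1": ("A", "B"),
    "s2": ("A", "B"),
    "s3": ("B", "B"),
    "s4": ("C", "A"),
}

def equivalence_classes(table, attrs):
    """Group objects that share the same value on every attribute in attrs."""
    groups = defaultdict(set)
    for name, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(name)
    return list(groups.values())

def approximations(table, attrs, target):
    """Lower and upper approximations of target under indiscernibility."""
    lower, upper = set(), set()
    for cls in equivalence_classes(table, attrs):
        if cls <= target:      # class lies entirely inside the target
            lower |= cls
        if cls & target:       # class overlaps the target
            upper |= cls
    return lower, upper

def accuracy(table, attrs, target):
    """Ratio |lower| / |upper|; defined as 1 when the upper approximation is empty."""
    lower, upper = approximations(table, attrs, target)
    return 1.0 if not upper else len(lower) / len(upper)

X = {"s1", "s3"}
print(approximations(TABLE, [0, 1], X), accuracy(TABLE, [0, 1], X))
```

Here s1 and s2 are indiscernible, so s1 cannot be placed in the lower approximation of X, and the accuracy is one third.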

C. Tolerance Relation
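Definition 1. Given an incomplete decision system $S^* = (U, A, V^*, f)$ with $A = C \cup \{d\}$, where $C$ is a set of condition attributes and $d$ is the decision attribute, such that $f : U \times A \to V^*$, for any subset $B \subseteq C$, the tolerance relation $T$ is defined as

$$T(B) = \{(x, y) \in U \times U \mid \forall a \in B,\ f(x, a) = f(y, a) \lor f(x, a) = * \lor f(y, a) = *\}. \quad (1)$$

Obviously, $T$ is reflexive and symmetric but does not need to be transitive. From Definition 1, we describe the notion of a tolerance class as follows.

Definition 2. Let $S^* = (U, A, V^*, f)$ be an incomplete information system. The tolerance class $I_B^T(x)$ of an object $x$ with reference to an attribute set $B$ is defined as follows:

$$I_B^T(x) = \{y \in U \mid (x, y) \in T(B)\}. \quad (2)$$

From Definition 2, the notions of lower and upper approximations based on the tolerance class are described as follows.

Definition 3. The lower approximation $\underline{X}_B^T$ and the upper approximation $\overline{X}_B^T$ of an object set $X \subseteq U$ with reference to attribute set $B$ can be defined, respectively, as follows:

$$\underline{X}_B^T = \{x \in U \mid I_B^T(x) \subseteq X\}, \qquad \overline{X}_B^T = \{x \in U \mid I_B^T(x) \cap X \neq \emptyset\}.$$

The above ideas can be illustrated using the incomplete information system of Wang [23].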
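A minimal Python sketch of the tolerance relation, its tolerance classes, and the resulting approximations follows. It uses a hypothetical four-object table, not the table of Wang [23] and not the student dataset, and all function names are illustrative.

```python
MISSING = "*"  # marker for an unknown value

# Hypothetical toy table: four objects, four condition attributes.
TABLE = {
    "s1": ("A", "B", "*", "C"),
    "s2": ("A", "B", "B", "*"),
    "s3": ("A", "*", "B", "*"),
    "s4": ("B", "A", "A", "C"),
}
ATTRS = range(4)

def tolerant(x, y, attrs):
    """x and y agree on each attribute, or at least one of the two values is missing."""
    return all(x[a] == y[a] or MISSING in (x[a], y[a]) for a in attrs)

def tolerance_class(table, attrs, name):
    """All objects that are in the tolerance relation with the named object."""
    return {o for o, row in table.items() if tolerant(table[name], row, attrs)}

def approximations(table, attrs, target):
    """Lower: classes contained in target. Upper: classes that meet target."""
    classes = {n: tolerance_class(table, attrs, n) for n in table}
    lower = {n for n, c in classes.items() if c <= target}
    upper = {n for n, c in classes.items() if c & target}
    return lower, upper

X = {"s1", "s2", "s4"}
lower, upper = approximations(TABLE, ATTRS, X)
print(sorted(lower), sorted(upper), len(lower) / len(upper))
```

In this toy table, s1, s2, and s3 all fall into one tolerance class because missing values match anything, so only s4 reaches the lower approximation of X and the accuracy is 0.25.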

D. Non-Symmetric Similarity Relation
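An object $x$ is considered to be similar to an object $y$ only if every known attribute value of $x$ is also the value of $y$ for that attribute. Since one object may have a more complete description than the other, the inverse relation does not necessarily hold [30]. The notion of a non-symmetric similarity relation is given as follows.

Definition 4 (See [30], [31]). Let $S^* = (U, A, V^*, f)$ be an incomplete information system and $B \subseteq A$. A non-symmetric similarity relation $S$ is defined as

$$S(B) = \{(x, y) \in U \times U \mid \forall a \in B,\ f(x, a) = * \lor f(x, a) = f(y, a)\}.$$

It is clear that $S$ is transitive and reflexive but not symmetric. From Definition 4, we can induce two similarity sets as given in Definitions 5 and 6 below.

Definition 5. The set of objects similar to $x$ is $R_B(x) = \{y \in U \mid (y, x) \in S(B)\}$.

Definition 6. The set of objects to which $x$ is similar is $R_B^{-1}(x) = \{y \in U \mid (x, y) \in S(B)\}$.

Definition 7. The lower and upper approximations of an object set $X \subseteq U$ based on the non-symmetric similarity relation are defined, respectively, as

$$\underline{X}_B^S = \{x \in U \mid R_B^{-1}(x) \subseteq X\}, \qquad \overline{X}_B^S = \bigcup_{x \in X} R_B(x).$$

The approximations obtained with the non-symmetric similarity relation are more informative than those obtained with the tolerance relation.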
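The following Python sketch illustrates the non-symmetric similarity relation under the usual formulation of Stefanowski and Tsoukias: x is similar to y when every known value of x is matched by y, the lower approximation collects the objects whose "similar to" set lies inside X, and the upper approximation is the union of the sets of objects similar to the members of X. The toy table and the function names are hypothetical.

```python
MISSING = "*"  # marker for an unknown value

# Hypothetical toy table: four objects, four condition attributes.
TABLE = {
    "s1": ("A", "B", "*", "C"),
    "s2": ("A", "B", "B", "*"),
    "s3": ("A", "*", "B", "*"),
    "s4": ("B", "A", "A", "C"),
}
ATTRS = range(4)

def similar(x, y, attrs):
    """x is similar to y: every known value of x equals the value of y."""
    return all(x[a] == MISSING or x[a] == y[a] for a in attrs)

def similar_to(table, attrs, name):
    """Objects that are similar to the named object."""
    return {o for o, row in table.items() if similar(row, table[name], attrs)}

def similar_from(table, attrs, name):
    """Objects to which the named object is similar."""
    return {o for o, row in table.items() if similar(table[name], row, attrs)}

def approximations(table, attrs, target):
    lower = {n for n in table if similar_from(table, attrs, n) <= target}
    upper = set().union(*(similar_to(table, attrs, n) for n in target))
    return lower, upper

X = {"s1", "s2", "s4"}
lower, upper = approximations(TABLE, ATTRS, X)
print(sorted(lower), sorted(upper), len(lower) / len(upper))
```

On this toy table the only non-trivial pair is that s3 is similar to s2 (but not the reverse), so the lower approximation of X is X itself and the accuracy is 0.75, compared with 0.25 for the tolerance relation on the same data.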

E. Limited Tolerance Relation
In an information system, two objects may be treated as distinct merely because a little information is missing. For example, two objects whose known values never conflict, but each of which is missing a value that the other has, are intuitively similar, yet they do not satisfy the non-symmetric similarity relation. To avoid this problem, Wang [23] developed a limited tolerance relation based on the following definition.
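Definition 8. Let $S^* = (U, A, V^*, f)$ be an incomplete information system, $B \subseteq A$, and let $P_B(x) = \{a \in B \mid f(x, a) \neq *\}$ be the set of attributes in $B$ whose values are known for $x$. A binary relation $L$ (limited tolerance relation) defined on $U$ is given by

$$L(B) = \{(x, y) \in U \times U \mid \forall a \in B\,(f(x, a) = f(y, a) = *) \lor ((P_B(x) \cap P_B(y) \neq \emptyset) \land \forall a \in B\,((f(x, a) \neq *) \land (f(y, a) \neq *) \to f(x, a) = f(y, a)))\}.$$

Obviously, the limited tolerance relation is symmetric and reflexive but not transitive. In Definition 8, the condition that $P_B(x) \cap P_B(y) \neq \emptyset$ is imposed in addition to the agreement on all attributes known for both objects, which is the tolerance condition. Thus, two objects that satisfy the tolerance relation but not the limited tolerance relation are only those with $P_B(x) \cap P_B(y) = \emptyset$. In other words, there are two cases where two objects are in a limited tolerance relationship: the first case is where both objects are missing all attribute values, and the second case is where at least one attribute has a known value for both objects and the two objects have the same value on every such attribute. The notion of a limited tolerance class is given as follows.

Definition 9. The limited tolerance class $I_B^L(x)$ of an object $x$ with reference to an attribute set $B$ is defined as

$$I_B^L(x) = \{y \in U \mid (x, y) \in L(B)\}.$$

From Definition 9, the notions of lower approximation and upper approximation based on the limited tolerance class are given in the following definition.

Definition 10. The lower approximation $\underline{X}_B^L$ and the upper approximation $\overline{X}_B^L$ of an object set $X \subseteq U$ with reference to attribute set $B$ are respectively defined as:

$$\underline{X}_B^L = \{x \in U \mid I_B^L(x) \subseteq X\}, \qquad \overline{X}_B^L = \{x \in U \mid I_B^L(x) \cap X \neq \emptyset\}.$$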
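The two cases of the limited tolerance relation described above can be sketched in Python as follows. The helper names and the toy table are hypothetical, and the code follows the verbal description: two objects are related if both miss every value, or if they share at least one known attribute and agree on all attributes known for both.

```python
MISSING = "*"  # marker for an unknown value

# Hypothetical toy table: four objects, four condition attributes.
TABLE = {
    "s1": ("A", "B", "*", "C"),
    "s2": ("A", "B", "B", "*"),
    "s3": ("A", "*", "B", "*"),
    "s4": ("B", "A", "A", "C"),
}
ATTRS = range(4)

def known(x, attrs):
    """Attributes whose value is known for object x."""
    return {a for a in attrs if x[a] != MISSING}

def limited_tolerant(x, y, attrs):
    kx, ky = known(x, attrs), known(y, attrs)
    if not kx and not ky:
        return True                      # case 1: both objects miss every value
    common = kx & ky
    # case 2: at least one shared known attribute, and agreement on all of them
    return bool(common) and all(x[a] == y[a] for a in common)

def limited_class(table, attrs, name):
    return {o for o, row in table.items() if limited_tolerant(table[name], row, attrs)}

def approximations(table, attrs, target):
    classes = {n: limited_class(table, attrs, n) for n in table}
    lower = {n for n, c in classes.items() if c <= target}
    upper = {n for n, c in classes.items() if c & target}
    return lower, upper

# A pair that is tolerant (no conflicting values) but shares no known attribute.
print(limited_tolerant(("A", "*"), ("*", "B"), range(2)))
print(approximations(TABLE, ATTRS, {"s1", "s2", "s4"}))
```

The pair ("A", "*") and ("*", "B") is rejected because the two objects have no known attribute in common, which is exactly the situation the limited tolerance relation excludes.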

F. New Limited Tolerance Relation of Rough Sets
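Definition 11. Let $S^* = (U, A, V^*, f)$ be an incomplete information system with $A = C \cup \{d\}$, where $C$ is a set of condition attributes and $d$ the decision attribute, such that $f : U \times A \to V^*$, where $V_a$ is called the domain of an attribute $a$. For a subset $B \subseteq C$ and objects $x, y \in U$, the similarity precision $\delta$ is defined as:

$$\delta(x, y) = \frac{|P_B(x) \cap P_B(y)|}{|C|},$$

where $|\cdot|$ represents the cardinality of the set and $P_B(x) = \{a \in B \mid f(x, a) \neq *\}$ is the set of attributes in $B$ whose values are known for $x$.
From Definition 11, it is clear that $0 \le \delta(x, y) \le 1$. Meanwhile, the limited tolerance relation with similarity precision is given as follows.

Definition 12. Given an incomplete information system $S^* = (U, A, V^*, f)$ and a threshold $\lambda \in (0, 1]$, the limited tolerance relation with similarity precision $L_\delta$ is defined as follows:

$$L_\delta(B) = \{(x, y) \in U \times U \mid \forall a \in B\,(f(x, a) = f(y, a) = *) \lor ((P_B(x) \cap P_B(y) \neq \emptyset) \land \forall a \in B\,((f(x, a) \neq *) \land (f(y, a) \neq *) \to f(x, a) = f(y, a)) \land (\delta(x, y) \ge \lambda))\}.$$

If $(x, y) \in L_\delta(B)$, then $(x, y) \in L(B)$ holds, but not vice versa if the particular threshold value of the similarity is given.
Afterward, we recall the tolerance class and the approximations extended by using similarity precision with a threshold:

$$I_B^{L_\delta}(x) = \{y \in U \mid (x, y) \in L_\delta(B)\}, \quad \underline{X}_B^{L_\delta} = \{x \in U \mid I_B^{L_\delta}(x) \subseteq X\}, \quad \overline{X}_B^{L_\delta} = \{x \in U \mid I_B^{L_\delta}(x) \cap X \neq \emptyset\}. \quad (12)$$

In the next section, we discuss the performance of the four (4) techniques in terms of accuracy.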
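As an illustration of the limited tolerance relation with similarity precision, the Python sketch below adds a threshold test to the limited tolerance check. It assumes that the similarity precision is the number of attributes known for both objects divided by the number of condition attributes, and that a pair is accepted when this value is at least the threshold; the threshold value 0.5, the toy table, and all names are hypothetical.

```python
MISSING = "*"  # marker for an unknown value

# Hypothetical toy table: four objects, four condition attributes.
TABLE = {
    "s1": ("A", "B", "*", "C"),
    "s2": ("A", "B", "B", "*"),
    "s3": ("A", "*", "B", "*"),
    "s4": ("B", "A", "A", "C"),
}
ATTRS = range(4)

def known(x, attrs):
    """Attributes whose value is known for object x."""
    return {a for a in attrs if x[a] != MISSING}

def similarity_precision(x, y, attrs, n_condition):
    """Assumed form: shared known attributes over all condition attributes."""
    return len(known(x, attrs) & known(y, attrs)) / n_condition

def new_limited_tolerant(x, y, attrs, n_condition, threshold):
    kx, ky = known(x, attrs), known(y, attrs)
    if not kx and not ky:
        return True                      # both objects miss every value
    common = kx & ky
    return (bool(common)
            and all(x[a] == y[a] for a in common)
            and similarity_precision(x, y, attrs, n_condition) >= threshold)

def approximations(table, attrs, target, threshold):
    n = len(attrs)
    classes = {
        name: {o for o, row in table.items()
               if new_limited_tolerant(table[name], row, attrs, n, threshold)}
        for name in table
    }
    lower = {n_ for n_, c in classes.items() if c <= target}
    upper = {n_ for n_, c in classes.items() if c & target}
    return lower, upper

X = {"s1", "s2", "s4"}
lower, upper = approximations(TABLE, ATTRS, X, threshold=0.5)
print(sorted(lower), sorted(upper), len(lower) / len(upper))
```

With the threshold at 0.5, s1 and s3 are no longer related because they share only one known attribute out of four, so s1 enters the lower approximation of X and the accuracy on this toy table rises to 0.5. Under the assumed formula, an object with very few known values may fail the threshold even against itself, so the threshold should be chosen with the sparsity of the data in mind.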

III. RESULT AND DISCUSSION
In this section, we compare all the techniques for incomplete information systems based on accuracy. A real-world dataset that contains missing values is used. This dataset was obtained from the Directorate of Information Systems (SISFO), Telkom University. It contains 200 instances and eight (8) categorical attributes. The attributes used are Student ID, 1st GPA, 2nd GPA, 3rd GPA, 4th GPA, 5th GPA, 6th GPA, and Performance of Student. Irrelevant attributes such as name, gender, and student residential address have been removed. Missing values may occur for several reasons, for example the student was on leave, the GPA score is not final, or the student was not enrolled in a certain semester.
The Performance Status field represents the performance of students during their studies. The description of each attribute of the dataset is shown in Table 2. The values of GPA are letter representations of the actual numeric scores (on a 4.0 scale). The conversion of the actual GPA score to a letter representation follows the standard implemented by Telkom University and is shown in Table 3. A sample of 10 out of the 200 student records used as the dataset in this paper is shown in Table 4. In order to apply all the techniques, the experiments were developed using MATLAB version 7.14.0.334 (R2012a). The techniques were executed sequentially on an Intel 1.5 GHz processor with 2 GB of main memory, running Windows 7. The computation results comparing all four (4) techniques in terms of accuracy are shown in Figure 1.
Based on the above results, all techniques recorded low accuracy, but New Limited Tolerance Relation of Rough Sets outperformed the other techniques, i.e. Tolerance Relation, Limited Tolerance Relation, and Non-Symmetric Similarity Relation. We observed that the low accuracy is due to the properties of the dataset, e.g. the data are homogeneous, and to the small number of instances used, which is only 200. Although all of the techniques achieved low accuracy, the improvement of NLTRS is substantial. The percentage of improvement of NLTRS as compared to TR, LTR, and NSSR in terms of accuracy is shown in Table 5.
In summary, based on the experiments that we have carried out using the real-world student dataset, NLTRS achieved higher accuracy than the other techniques. The average improvement in accuracy of NLTRS over the other techniques is 76.99%.
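A percentage improvement of this kind is commonly computed as a relative gain. The expression below is a sketch under that assumption, which the text does not confirm: the gain of NLTRS is measured relative to the accuracy of the technique it is compared against, and the average is taken over the three baselines TR, LTR, and NSSR. The symbols $\alpha_{\text{NLTRS}}$ and $\alpha_{t}$ are introduced here only for illustration.

```latex
\[
\text{Improvement}_t\,(\%) =
  \frac{\alpha_{\text{NLTRS}} - \alpha_{t}}{\alpha_{t}} \times 100,
  \qquad t \in \{\text{TR}, \text{LTR}, \text{NSSR}\},
\]
\[
\text{Average improvement}\,(\%) =
  \frac{1}{3} \sum_{t \in \{\text{TR}, \text{LTR}, \text{NSSR}\}} \text{Improvement}_t .
\]
```

Under this reading, a baseline accuracy that is already low makes even a modest absolute gain appear as a large percentage.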

IV. CONCLUSION
A number of techniques that extend rough set theory to handle incomplete information systems have been proposed, namely Tolerance Relation, Limited Tolerance Relation, Non-Symmetric Similarity Relation, and New Limited Tolerance Relation of Rough Sets. However, these techniques had not been implemented on a real-world dataset that contains missing values. In this paper, we applied all of the techniques and compared the performance of each technique in terms of accuracy. The results show that New Limited Tolerance Relation of Rough Sets achieved higher accuracy than Tolerance Relation, Limited Tolerance Relation, and Non-Symmetric Similarity Relation, with an average improvement of 76.99%.