Classification Techniques for Predicting Graduate Employability

Abstract: Unemployment is a global issue with adverse impacts worldwide, and graduate employability is one of its most significant aspects. Several factors affect graduate employability. Traditionally, excellent academic performance (i.e., cumulative grade point average, CGPA) has been regarded as the dominant factor in determining an individual's employment status. However, research has shown that CGPA is not the only determinant; other factors may also influence a graduate's success in getting a job. The objective of this study is to use data mining techniques to identify the factors that influence graduate employability. Seven years of data (2011 to 2017) were collected through the tracer study of Malaysia's Ministry of Education, and a total of 43863 data instances were used to develop the employability classification model. Three classification algorithms, Decision Tree, Support Vector Machine and Artificial Neural Network, were applied and compared to find the best model. The results show that the J48 decision tree produces higher accuracy than the other techniques, with a classification accuracy of 66.0651%, which increased to 66.1824% after parameter tuning. In addition, the algorithm is easily interpreted and the time to build the model is short (0.22 seconds). This paper identified seven factors affecting graduate employability, namely age, faculty, field of study, co-curriculum, marital status, industrial internship and English skill. Among these factors, age, industrial internship and faculty carry the most information about the final class, i.e. employability status. The results of this study will help higher education institutions in Malaysia to equip their graduates with the necessary skills before they enter the job market.


I. INTRODUCTION
According to [1], the concept of marketability refers to the various skills that graduates need in order to be hired. Skills such as communication, teamwork, continuous learning, critical thinking, entrepreneurship and information management are crucial for a graduate to be hired. The number of graduates from Malaysian universities increased steadily from 2006 to 2017: the total was 132899 in 2006 and rose to 299537 in 2017. Based on a report published in [2], the unemployment rate among undergraduate students in Malaysia decreased from 36.4% in 2006 to 26.27.3% in 2017. Even though the unemployment rate is decreasing, unemployment in certain disciplines remains high, and graduate unemployment is commonly perceived to be due to a lack of generic skills. To address these issues and increase the employability rate, the Malaysian Ministry of Education has taken several steps, such as revising curricula, promoting entrepreneurship courses, and emphasizing skills and competencies such as English language, teamwork and analytical skills. In addition, successful collaboration between universities, industry and government may benefit graduates by promoting their skills to employers [3].
Employability skills have been a subject of research in which the skills acquired by graduates are determined and measured [4]. Many approaches, quantitative or qualitative, can be employed in such studies. One current approach is to employ data mining techniques. Data mining, or Knowledge Discovery in Databases (KDD), is the process of extracting knowledge or hidden patterns from large datasets. It has proven to be an effective process for solving real-life problems. Several domains, such as finance, climate change, health and safety, and the stock market, benefit from the data mining approach. Examples of how data mining has been used to predict rainfall are given in [5,6,7]. Besides prediction tasks, data mining has been used to detect anomalies in e-learning courses, as explored in [8].
In this work, data mining is used to study graduate employability. The technique requires relevant data such as the graduates' background, their experiences while studying at university, the effectiveness of the system, self-readiness, current status, employment status (working or unemployed) and others. These data were collected from Malaysia's tracer study, Sistem Kebolehpasaran Graduan (SKPG), which is managed by the Ministry of Education. Every graduate must complete the survey before convocation day. A data mining approach based on classification can produce a model of graduate employability. By using different classification techniques such as Artificial Neural Networks (ANN), Decision Tree and Support Vector Machine (SVM), factors that affect graduate employability, such as academic achievement, academic discipline, family background and many more, can be identified.
In [9], Bayes' theorem and decision trees were used to build a model classifying graduates as working, not working or undetermined. The study used data from Maejo University, Thailand, covering three academic years and consisting of 11,853 instances. Ten algorithms were used in the classification modelling, i.e. five types of decision tree and five types of Naïve Bayes. J48 showed the highest accuracy (98.31%) among the decision trees, while the WAODE algorithm showed the highest accuracy overall (99.77%).
Meanwhile, the study in [10] compared the Bayes approach with a number of decision-tree-based algorithms. Information gain was used to evaluate the attributes, and three main attributes affecting employability were found: job sector, job status and reason for not being employed. Data from the 2009 tracer study were used, containing 12830 instances with 20 background attributes related to 19 public and 138 private universities. The results showed that J48 has the highest accuracy (92.3%) compared with the Bayes algorithms. The authors concluded that the J48 decision tree algorithm is suitable for the tracer data because of its information enquiry strategy.
The study in [11] used a classification approach with Bayesian techniques to build a model of graduate employability and predict graduates' employment status. The graduate data were collected from Khon Kaen University, Thailand, in 2009 and consist of 3090 examples and 17 attributes. Of the six Bayesian algorithms tested, Averaged One-Dependence Estimators with subsumption resolution (AODEsr) achieved the highest accuracy, 98.3%, followed by the AODE algorithm (96.1%). The study identified three factors that affect employment: the place of the job, the type of job and the time of the job.
Another example of employability research used data from 633 students of MARA Profesional College Malaysia [12]. The objective was to classify graduates as working, not working or pursuing further study. Five WEKA algorithms were used: Naïve Bayes, Logistic Regression, MLP, k-nearest neighbor and J48. The results showed that Logistic Regression gives the highest accuracy, i.e. 92.5%.
Data on 1400 Master of Computer Application (MCA) students from colleges in India were collected and used in [13]. A number of classification techniques were used to predict the employability of MCA graduates. The authors concluded that J48 is the most suitable technique for predicting employability, with 70.19% accuracy. Besides its accuracy, J48 is easily interpreted and takes less time to build the model than Random Forest. The study identified empathy, drive and stress management as the main emotional skill parameters that affect employability.
The study in [14] also used a data mining approach. Two clustering algorithms, X-Means and Support Vector Clustering, and one classification algorithm, Naïve Bayes, were used. The study concluded that X-Means predicts better than the other algorithms. Table 1 summarizes a number of techniques used in predicting graduate employability. This paper focuses on identifying the factors that affect graduate employability and on comparing classification techniques.

II. MATERIAL AND METHOD
Data mining is important because usable patterns can be extracted from many kinds of data sets. The most basic forms of data for data mining applications are databases, data warehouses and transactional data. Many people treat data mining as a synonym for knowledge discovery from data, while others regard it as one essential step in the knowledge discovery process [15].
Classification is one of the most important data mining tasks, especially for prediction. The approach not only handles large data sets but also finds hidden patterns that support conclusions and reduces the data to a simpler structure. It is a process that assigns objects to categories based on their characteristics. For example, a classification model can be used to classify graduates as employed, unemployed or uncertain. Decision Tree, Random Forest, Naïve Bayes, Support Vector Machine (SVM), Artificial Neural Network (ANN) and many other algorithms can be used in classification modelling [16].
In this study, three approaches are used: Decision Tree, ANN and SVM. A decision tree is a tree-like flowchart in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label [17]. Instances are classified by sorting them down the tree from the root node to a leaf node, whose label gives the predicted class.
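The root-to-leaf traversal described above can be sketched in a few lines. This is a minimal illustration only: the tree, the attribute names (`age_group`, `internship`) and the class labels are hypothetical and are not the tree learned in this study.

```python
# Toy decision tree: internal nodes test an attribute, branches carry the
# test outcomes, and leaves carry class labels. The structure below is
# hypothetical and is NOT the tree learned in this study.
TOY_TREE = {
    "attribute": "age_group",
    "branches": {
        "older": "working",
        "younger": {
            "attribute": "internship",
            "branches": {"yes": "working", "no": "not working"},
        },
    },
}


def classify(tree, instance):
    """Walk from the root to a leaf, following the branch that matches
    the instance's value for each tested attribute."""
    node = tree
    while isinstance(node, dict):          # internal node: apply its test
        value = instance[node["attribute"]]
        node = node["branches"][value]     # follow the matching branch
    return node                            # leaf: a plain class-label string


if __name__ == "__main__":
    print(classify(TOY_TREE, {"age_group": "younger", "internship": "no"}))
```

Storing leaves as plain strings keeps the traversal loop to a single type check; a learned tree would also need handling for attribute values that were never seen during training.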
An ANN is a mathematical model that tries to simulate the structure and function of biological neural networks. The building block of every artificial neural network is the artificial neuron, which is a basic mathematical model (function). This model follows three simple rules: multiplication, summation and activation. At the entrance of the artificial neuron, each input value is multiplied by an individual weight. In the middle section of the neuron, a sum function adds all the weighted inputs. At the exit of the neuron, the sum of the weighted inputs passes through an activation function, also called a transfer function [18]. SVM was first introduced by Vapnik in the 1960s as a classification model and has recently become an area of intense research, with developments in techniques and theory that range from regression estimation to density estimation. SVM developed from statistical learning theory with the aim of solving the problem of interest without solving a more difficult problem as an intermediate step [19].
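The multiply, sum and activate steps of a single artificial neuron can be sketched as follows. This is a minimal illustration under stated assumptions: the sigmoid activation and the optional `bias` term are choices made for this example and are not specified in the text.

```python
import math


def sigmoid(z):
    """Logistic activation (an assumed choice of transfer function)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias=0.0, activation=sigmoid):
    """Single artificial neuron: multiply, sum, activate."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # 1) multiplication: each input is scaled by its own weight
    weighted = [x * w for x, w in zip(inputs, weights)]
    # 2) summation: add the weighted inputs (plus the assumed bias term)
    total = sum(weighted) + bias
    # 3) activation: pass the sum through the transfer function
    return activation(total)


if __name__ == "__main__":
    print(neuron([1.0, 2.0], [0.5, -0.25]))
```

A multilayer perceptron such as the MLP compared later in this paper chains many such units in layers, feeding each layer's outputs to the next.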
This research consists of three phases. The first phase is to identify the issues and to collect and select data from SKPG. The second phase is to clean and process the data; in this phase the data are analyzed, grouped, cleaned and transformed. The last phase, pattern identification, is where patterns are interpreted and evaluated using a data mining approach with classification techniques, namely Decision Tree, SVM and ANN.

A. Data Pre-processing
The first four steps are the preprocessing phases used to prepare the data sets for mining. Pre-processing is important because the quality of the results depends on the quality of the data. Detecting and correcting data anomalies early, and reducing the amount of data to be analyzed, is an advantage when drawing conclusions.
Data collection is the first phase in the model development methodology. This research used data from the SKPG, specifically data on Universiti Kebangsaan Malaysia (UKM) graduates, as a case study. The data sets cover seven years, from 2011 to 2017. Table 2 shows the amount of data collected from the SKPG reports; the total number of data instances is 43863. Data integration is the first step in planning and preprocessing the data. It is a technique for combining data sets from different sources. The data from 2011 to 2017 were integrated because they came from separate datasets. They were first arranged by year in Excel format and then combined in WEKA through the Simple CLI in the application menu, using the append method.
Because this research focuses only on undergraduate students, records of graduates from other levels of study, such as Diploma, Ph.D., Master, Advanced Diploma, Medical Degree and other certificates, were removed.
Data cleaning removes or corrects data errors, inconsistent data, missing data and duplicate records, and identifies outliers. Missing data can be replaced with the mean of the attribute involved. The mean values were calculated over the whole data set and were used to reduce noise in the data. Outliers found in the data were also replaced by the mean values. After this statistical processing, the data sets used in this research are clean and consistent, with no duplicate records.
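The mean-replacement steps described above can be sketched as follows. This is a minimal illustration, not the procedure actually run in the study: the text does not say how outliers were detected, so the z-score rule and its `z_threshold` parameter are assumptions made for this example, and missing values are represented here as `None`.

```python
from statistics import mean, pstdev


def impute_missing_with_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    m = mean(observed)
    return [m if v is None else v for v in values]


def replace_outliers_with_mean(values, z_threshold=3.0):
    """Replace values whose z-score exceeds z_threshold with the mean.

    The z-score rule is an assumed outlier definition for illustration.
    """
    m = mean(values)
    sd = pstdev(values)
    if sd == 0:                      # all values identical: nothing to flag
        return list(values)
    return [m if abs(v - m) / sd > z_threshold else v for v in values]


if __name__ == "__main__":
    print(impute_missing_with_mean([2.0, None, 4.0]))
    print(replace_outliers_with_mean([10] * 9 + [100], z_threshold=2.0))
```

Both functions apply to a single numeric attribute at a time; nominal attributes would need a different replacement value, such as the most frequent category.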
Data transformation ensures that all attributes in continuous form are converted into nominal or numbered values divided into specific scales. This makes the modelling process easier, because the resulting data sets are easier to understand and can be used to learn the patterns needed to build a predictive model.
Data discretization converts continuous attributes into numbered or nominal values divided into specific scales. The purpose of this process is to simplify data analysis. The last step of data preparation is data transformation, which involves normalization and attribute construction. Normalization rescales data values into a specific range using the minimum and maximum values of the attribute. This process also simplifies the data sets by mapping them onto a scale of 0.0 to 1.0.
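Min-max normalization onto the 0.0 to 1.0 scale can be sketched as below. This is a minimal illustration; mapping a constant attribute to 0.0 is a convention chosen for this example, not something stated in the text.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: the formula would divide by zero, so map
        # everything to 0.0 (an assumed convention for this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    print(min_max_normalize([2.0, 3.0, 4.0]))
```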
In this work, some of the attributes were transformed into categories. For example, the cgpa attribute is originally continuous, but in this project it is transformed into four grade ranges: 2.00-2.49, 2.50-2.99, 3.00-3.66 and 3.67-4.00. Similarly, e_umur is transformed into four ranges: 16-25, 26-35, 36-45 and >46, and e_pendapatan is classified into three classes: less than RM1501, RM1501-3000 and more than RM3000. The continuous attributes were transformed into nominal attributes to prepare the data for classification. In addition, the attributes e_bidang and e_40 were changed from their previous numeric code values to nominal values.
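The binning of cgpa and e_pendapatan into the ranges listed above can be sketched as follows. This is a minimal illustration: the function names and the returned labels are hypothetical, and the sketch assumes CGPA is recorded to two decimal places and that values outside 2.00 to 4.00 do not occur for graduates.

```python
def cgpa_band(cgpa):
    """Map a continuous CGPA to one of the four grade ranges in the text."""
    if not 2.00 <= cgpa <= 4.00:
        # Assumption: graduates always fall inside 2.00-4.00.
        raise ValueError("CGPA outside the expected 2.00-4.00 range")
    if cgpa < 2.50:
        return "2.00-2.49"
    if cgpa < 3.00:
        return "2.50-2.99"
    if cgpa < 3.67:
        return "3.00-3.66"
    return "3.67-4.00"


def income_band(income_rm):
    """Map a monthly income in RM to one of the three classes in the text."""
    if income_rm < 1501:
        return "<RM1501"
    if income_rm <= 3000:
        return "RM1501-3000"
    return ">RM3000"


if __name__ == "__main__":
    print(cgpa_band(3.66), cgpa_band(3.67))
    print(income_band(1500), income_band(3000), income_band(3001))
```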
Feature selection is used to discard irrelevant attributes when building a model and to keep the most useful ones. With only relevant attributes, classification algorithms achieve higher prediction accuracy, take less time to run and produce a simpler concept. The aim of the feature selection process is therefore to choose important and useful attributes that increase the accuracy of the models.
Before feature selection, the 357 original attributes were reduced to 26. Attributes such as e_nama, e_kp, e_bulan_umur, e_hari_umur, e_matrik, e_alamat, e_emel and e_tel_rumah, along with other attributes of no use for modelling, were removed. Table 3 shows the 26 attributes of graduate employability before the feature selection process. In this work, WEKA is used to select attributes by means of an Attribute Evaluator. InfoGainAttributeEval was selected to evaluate the attributes; it evaluates the worth of an attribute by measuring the information gain with respect to the class:

InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute)
Meanwhile, Ranker is used to rank the attributes from most to least informative, and the Attribute Selection Mode is set to 10-fold cross-validation. The results of the WEKA feature selection show that e_umur, e_17 and e_fakulti are the top three attributes, with the highest average merit values. Table 4 shows the attributes chosen for the classification modelling after considering expert opinion, the feature selection results and past research.
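The information gain formula, H(Class) - H(Class | Attribute), and the ranking of attributes by that score can be sketched as below. This is a minimal re-implementation for illustration and is not WEKA's code; the function names and the toy data in the usage example are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(Class) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(attribute_values, labels):
    """H(Class) - H(Class | Attribute) for one nominal attribute."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute_values, labels):
        groups.setdefault(a, []).append(y)
    # Conditional entropy: entropy within each attribute value,
    # weighted by how often that value occurs.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_attributes(columns, labels):
    """Return attribute names sorted from most to least informative."""
    return sorted(columns, key=lambda name: info_gain(columns[name], labels),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the SKPG data set.
    y = ["working", "working", "not working", "not working"]
    cols = {"informative": ["a", "a", "b", "b"],
            "uninformative": ["a", "b", "a", "b"]}
    print(rank_attributes(cols, y))
```

An attribute that splits the classes perfectly has a gain equal to H(Class), while one that is independent of the class has a gain of zero.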

III. RESULT AND DISCUSSION
In this section, the results for the three algorithms are compared. The testing mechanism used is 10-fold cross-validation. In WEKA, cross-validation divides the data samples once into a fixed number of pieces, here 10. Nine pieces are used for training and the remaining piece for testing. Then, with the same division, a different piece is held out for testing and the other nine are used for training. The procedure is repeated 10 times, so that each piece is used for testing exactly once, and the 10 results are averaged. This is 10-fold cross-validation.
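The fold rotation described above can be sketched as follows. This is a simplified illustration rather than WEKA's implementation: it omits shuffling and stratification, and `evaluate` is a hypothetical callback that trains on the training indices and returns a score for the test indices.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train, test) index lists; each fold is the test set once."""
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    indices = list(range(n_instances))
    # Divide the data once into k pieces (every k-th index per fold).
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Train on the other k-1 pieces, test on the held-out piece.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_instances, evaluate, k=10):
    """Average the score returned by evaluate(train, test) over k folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_instances, k)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for train, test in k_fold_indices(20, k=10):
        print(len(train), len(test))
```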
The performance of the algorithms is compared in terms of accuracy, ROC, RMSE and the time taken to build the model. Based on Table 5, the average accuracy for J48 is 66.0651%. Fig. 1 shows the trend of J48 accuracy, which is increasing; the best accuracy is obtained with the split of 40% training and 60% testing. Based on Table 6, 10-fold cross-validation gives an average accuracy of 65.2937%. In Table 7, the split of 20% training and 80% testing for SMO gives the highest percentage of correctly classified instances, i.e. 65.9674%. On average, the accuracy is 66.0967%. In Table 7, the highest accuracy is at the 20/80 split and the lowest is at the 80/20 split. The performance of the three techniques is shown in Fig. 7. J48 and SMO perform well in terms of accuracy, whereas MLP does not perform well in this study. In addition, the RMSE value for J48 (0.4584) is the lowest compared with SMO and MLP. The ROC performance metric is also highest for J48 (values closer to 1 are better): 0.707, compared with 0.661 for SMO. This is shown in Fig. 8.
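The three quality measures used in this comparison can be sketched for a two-class problem as follows. This is a minimal illustration under stated assumptions: classes are encoded as 1 and 0, RMSE is taken over the predicted probability of class 1 (one common convention; WEKA's exact computation may differ), and the ROC area is computed by pairwise comparison of scores.

```python
from math import sqrt


def accuracy(y_true, y_pred):
    """Percentage of correctly classified instances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


def rmse(y_true, p_positive):
    """Root mean squared error between 0/1 labels and predicted
    probabilities of class 1 (assumed convention for this sketch)."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, p_positive))
    return sqrt(squared / len(y_true))


def roc_auc(y_true, scores):
    """Area under the ROC curve: the probability that a random positive
    instance is scored above a random negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y = [1, 1, 0, 0]
    p = [0.9, 0.4, 0.6, 0.1]
    print(accuracy(y, [int(v >= 0.5) for v in p]), rmse(y, p), roc_auc(y, p))
```

Lower RMSE and a ROC area closer to 1 both indicate better probability estimates, which is why they are reported alongside accuracy.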
In addition, J48 was tuned by trial and error to improve its performance. For example, as shown in Table 9, the confidenceFactor parameter was varied manually between 0.1 and 0.50, and 0.1 was found to be the best value. Setting the binarySplits parameter to TRUE means that binary splits are used on nominal attributes when building the tree. Table 10 compares the results before and after parameter tuning. The accuracy increased by 0.1173 percentage points (from 66.0651% before tuning to 66.1824% after), as shown in Fig. 9. The other two measures, RMSE and ROC, show no significant difference before and after tuning, as shown in Fig. 10.
J48 suits the problem of identifying the factors that lead to employment; hence it is worthwhile to consider the rules generated by J48. These rules give insight into the attributes that affect the employability of the students.
Rules derived from J48 for the WORKING class: the generated rules show that the most influential attribute in classifying working or not working is age. For age = 20-29, some instances are working and some are not, but for ages above 29 the instances are working. In classifying the working class for age 20-29, the factors considered are industrial internship, faculty, English skill and involvement in co-curricular activities. Meanwhile, two other factors, marital status and field of study, influence the not working class.

IV. CONCLUSIONS
In this work, data mining techniques were used to identify the factors affecting graduate employability, particularly for UKM graduates. Three methods were used, i.e. J48, MLP and SMO. The results showed that J48 performed better than the other techniques, with an accuracy of 66.0651% that increased to 66.1824% after parameter tuning. This paper identified several factors affecting UKM graduate employability, namely age, faculty, field of study, co-curriculum, marital status, industrial internship and English skill. Among these factors, age, industrial internship and faculty carry the most information about the final class, i.e. employability status.