Improved Support Vector Machine Using Multiple SVM-RFE for Cancer Classification

Support Vector Machine (SVM) is a machine learning method widely used in cancer studies, especially with microarray data. A common problem with microarray data is that the number of genes is far larger than the number of samples. Although SVM is capable of handling a large number of genes, better classification accuracy can be obtained with a small gene subset. This research proposes Multiple Support Vector Machine-Recursive Feature Elimination (MSVM-RFE) as a gene selection method to identify a small number of informative genes, in order to improve the performance of SVM during classification. The effectiveness of the proposed method has been tested on two gene expression datasets, leukemia and lung cancer, and compared against other methods such as Random Forest and the C4.5 Decision Tree. The results show that MSVM-RFE is effective in reducing the number of genes in both datasets, thus providing better accuracy for SVM in cancer classification.
Keywords: support vector machine (SVM); multiple support vector machine-recursive feature elimination (MSVM-RFE); leukemia; lung cancer


I. INTRODUCTION
Cancer, also called malignancy, is a group of diseases involving uncontrolled and abnormal cell growth [1]. Furthermore, cancer is one of the main causes of death worldwide. However, not all uncontrolled and abnormal cell growths are cancerous; it depends on the number of active and inactive cells involved. There are more than one hundred types of cancer, such as skin cancer, breast cancer, leukemia, and prostate cancer. Presently, cancer analysis is a major area of medical research, and there is demand for powerful methods for cancer diagnosis.
Customarily, physical analyses of tissues are performed for cancer or tumour prognosis and diagnosis using Computed Tomography (CT) scans, chest X-rays, and Magnetic Resonance Imaging (MRI) [2], [3]. However, these can only identify malignant cells in the late stage of cancer, which brings about low survival rates [4]. Thus, studying cancer at the molecular level (DNA, RNA, and protein) was proposed to identify tumour formation at an earlier stage. However, because such studies relied on low-throughput data, prior knowledge of the disease was required to obtain candidate markers. Moreover, finding novel markers is limited because only a small number of markers is available [5].
Therefore, the microarray method was introduced. Gordon et al. [6] investigated a cancer diagnosis method based on gene expression and showed that it performed best in terms of classification accuracy. The usefulness of microarray data has motivated many researchers to perform large-scale studies in this area. The era of microarray technologies for measuring genome-wide expression profiles has prompted the development and improvement of various methods and techniques to distinguish between different classes of a complex disease like cancer through transcriptome analysis [7]-[10]. In addition, many classification methods such as Support Vector Machine (SVM) [11], Directed Random Walk (DRW) [12], and Artificial Neural Network (ANN) [13] have been developed for this purpose. Fig. 1 shows the process of microarray analysis. However, the preparation of microarray data is a critical step in biological function analyses, especially in cancer classification. Microarray experiments produce huge, high-dimensional datasets that contain informative, redundant, irrelevant, and noisy genes. Gene selection is a method to alleviate the problem of irrelevant or redundant genes.
The standard Support Vector Machine (SVM), also known as the support vector network, is a machine learning model introduced by Vapnik [14]. It analyzes data and identifies patterns, and it is used for classification and regression analysis [11]. An SVM model represents the data as points in space. It constructs a hyperplane that separates the classes and predicts the class of a new point based on which side of the hyperplane it falls, as in Fig. 2. Using the kernel trick, it can also perform non-linear classification effectively. Four kernel types are commonly used: linear, polynomial, radial basis function (RBF), and sigmoid.
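The four kernels named above can be written down directly. The following is a minimal sketch in plain Python, assuming the common textbook parameterization; the parameter names gamma, coef0, and degree are illustrative and are not taken from this paper.

```python
import math

def dot(x, z):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    # K(x, z) = x . z
    return dot(x, z)

def polynomial_kernel(x, z, gamma=1.0, coef0=1.0, degree=3):
    # K(x, z) = (gamma * x . z + coef0) ** degree
    return (gamma * dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    # K(x, z) = tanh(gamma * x . z + coef0)
    return math.tanh(gamma * dot(x, z) + coef0)
```

Each function returns a similarity value for a pair of samples; a kernel SVM uses these values in place of plain inner products, which is what allows a non-linear decision boundary.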
Besides, it is robust to a high variable-to-sample ratio and a large number of variables, is able to learn complex classification functions efficiently, and uses strong regularization principles to avoid overfitting [14]-[17]. It is widely used in cancer studies, commonly with microarray data [24]-[26], [18]-[20].
Regrettably, the number of features or genes in microarray data is far larger than the number of samples. The sparsity of microarray gene expression data is so severe that even an SVM classifier may be unable to achieve satisfactory performance. Thus, gene selection (feature selection) as a preprocessing step before classification is vital for more trustworthy cancer classification [27], [21].

A. Multiple Support Vector Machine Recursive Feature Elimination (MSVM-RFE)
Currently, most microarray studies for cancer classification generate bewildering amounts of raw data, and the number of genes is larger than the number of samples. To guard against spurious results, gene selection is a good way to address this machine learning problem. The objective of gene selection is to identify a small number of informative genes. Support Vector Machine-Recursive Feature Elimination (SVM-RFE) has evolved considerably, from basic SVM-RFE to two-stage SVM-RFE and multiple SVM-RFE.
Reducing the dimensionality of the dataset yields a better analysis [24], [18]. Multiple SVM-RFE (MSVM-RFE) [28], [22] is an upgraded version of the original SVM-RFE. MSVM stands for multiple SVMs; like SVM-RFE, it uses a backward elimination procedure that removes the gene with the lowest weight. However, at each step, the feature ranking score is computed from a statistical analysis of the weight vectors of multiple linear SVMs, each trained on a subset of the training data. This makes the results of MSVM-RFE better and more accurate than those of SVM-RFE.
Furthermore, repeating the selection procedure on several subsamples obtained by bootstrap resampling of the training data is one way to stabilize a gene selection method. In MSVM-RFE this idea is applied at every step of the recursion, rather than to SVM-RFE as a whole. Moreover, MSVM-RFE uses cross-validation instead of bootstrap sampling as the resampling method, which increases the chance of finding a better gene subset during the recursive procedure. Hence, MSVM-RFE is a powerful approach for selecting informative genes for cancer classification. For these reasons, MSVM-RFE was chosen as the gene selection method in this research to enhance SVM.
To train on different subsamples of the original training data, we have $t$ linear SVMs. Let $w_j$ be the weight vector of the $j$th linear SVM, let $w_{ji}$ be the corresponding weight value associated with the $i$th feature, and let $v_{ji} = w_{ji}^2$. The feature ranking score is computed with the following formula:
$$c_i = \frac{\bar{v}_i}{\sigma_{v_i}} \qquad (1)$$
where $\bar{v}_i$ is the mean and $\sigma_{v_i}$ is the standard deviation of $v_{ji}$ over the $t$ SVMs. However, it is important to normalize the weight vectors before computing the ranking score for each gene.
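The ranking score can be illustrated with a small numeric sketch. It assumes unit-length (Euclidean) normalization of each weight vector and the sample standard deviation; both choices, the function names, and the small eps guard against division by zero are assumptions made for illustration and are not specified in the text.

```python
import math
import statistics

def normalize(w):
    """Scale a weight vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w] if norm > 0 else list(w)

def ranking_scores(weight_vectors, eps=1e-12):
    """Score each feature as mean(v_i) / std(v_i), where v_ji = w_ji ** 2.

    weight_vectors holds one weight vector per linear SVM (at least two).
    eps is an illustrative guard against a zero standard deviation.
    """
    normed = [normalize(w) for w in weight_vectors]
    scores = []
    for i in range(len(normed[0])):
        v = [w[i] ** 2 for w in normed]
        scores.append(statistics.mean(v) / (statistics.stdev(v) + eps))
    return scores

# Three hypothetical weight vectors (t = 3 SVMs, 2 features).
weights = [[3.0, 4.0], [4.0, 3.0], [3.0, 4.0]]
scores = ranking_scores(weights)
```

In this toy case the second feature carries the larger squared weight in two of the three SVMs, so it receives the higher score and would be kept longer during elimination.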
The MSVM-RFE procedure starts with an empty ranked gene list, R = [ ], and the gene subset S = [1,…,d]. The following steps are repeated until all features or genes are ranked. First, multiple linear SVMs are trained on subsamples of the original training data, with the genes in S as input variables. Second, the weight vectors are computed and normalized. Third, the ranking scores c are computed for the genes in S using the first equation. Next, the gene with the smallest ranking score is found and eliminated from S. Lastly, the ranked list R is updated. Fig. 3 shows the recursive procedure of MSVM-RFE.
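The recursive procedure can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the implementation used in this work: train_linear_svm is a tiny stand-in sub-gradient trainer rather than a production SVM solver, subsamples are formed by leaving out every n-th sample, the standard deviation is the sample standard deviation, and eps guards against division by zero. All names and the toy data are hypothetical.

```python
import math
import statistics

def train_linear_svm(X, y, epochs=50, lam=0.01, lr=0.1):
    """Stand-in sub-gradient trainer for a linear SVM (hinge loss, L2 penalty)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(a * c for a, c in zip(w, xi)) + b)
            w = [a * (1 - lr * lam) for a in w]        # shrink weights (L2 term)
            if margin < 1:                              # sample violates the margin
                w = [a + lr * yi * c for a, c in zip(w, xi)]
                b += lr * yi
    return w

def ranking_scores(weight_vectors, eps=1e-12):
    """Normalize each weight vector, then score feature i as mean(v_i) / std(v_i)."""
    normed = []
    for w in weight_vectors:
        norm = math.sqrt(sum(a * a for a in w)) or 1.0
        normed.append([a / norm for a in w])
    scores = []
    for i in range(len(normed[0])):
        v = [w[i] ** 2 for w in normed]
        scores.append(statistics.mean(v) / (statistics.stdev(v) + eps))
    return scores

def msvm_rfe(X, y, n_subsamples=3):
    """Return gene indices ordered from most to least useful."""
    surviving = list(range(len(X[0])))   # S, zero-based here
    ranked = []                          # R, built from the back
    while surviving:
        weight_vectors = []
        for j in range(n_subsamples):
            # Deterministic subsample: leave out every n_subsamples-th sample.
            rows = [r for r in range(len(X)) if r % n_subsamples != j]
            Xs = [[X[r][g] for g in surviving] for r in rows]
            ys = [y[r] for r in rows]
            weight_vectors.append(train_linear_svm(Xs, ys))
        scores = ranking_scores(weight_vectors)
        worst = min(range(len(surviving)), key=scores.__getitem__)
        ranked.insert(0, surviving.pop(worst))   # eliminated earlier = ranked lower
    return ranked

# Toy data: gene 0 separates the classes, genes 1 and 2 are small noise.
toy_X = [[2.0, 0.1, -0.2], [1.5, -0.3, 0.1], [2.5, 0.2, 0.3],
         [-2.0, 0.1, -0.1], [-1.5, -0.2, 0.2], [-2.5, 0.3, -0.3]]
toy_y = [1, 1, 1, -1, -1, -1]
ranking = msvm_rfe(toy_X, toy_y)
```

The returned list plays the role of R: the gene eliminated last appears first. Eliminating one gene per round, as here, matches the description above but becomes slow for thousands of genes.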

B. Improvement of SVM Using MSVM-RFE as Gene Selection
The accuracy of SVM has been improved in this work by using MSVM-RFE as gene selection. A classifier package that implements SVM has been used in this research. This package is a collection of functions for creating and applying highly optimized, robustly evaluated ensembles of SVMs. It creates a highly optimized ensemble of Radial Basis Function (RBF) SVM classifiers. The flow of MSVM-RFE is shown in Fig. 4, while Fig. 5 shows the enhancement of SVM using MSVM-RFE as gene selection.
First, a preprocessing step is performed to arrange the dataset in the standard input format, a data frame with one row per observation and one column per gene or feature. For MSVM-RFE, the first column should be the true class label, whereas for the SVM classifier the data should be separated into the feature values and the class labels. Afterwards, the data are transposed and sorted according to the input settings. MSVM-RFE is then performed. The first step is to set up the folds: 10-fold cross-validation is applied to the training set to define which observations are in which fold. The folds are then reformatted into a list that contains the test set indices for each fold. Using these folds, gene ranking is performed on all 10 training sets. Setting k=10 for the k-fold cross-validation is the "multiple" part of MSVM-RFE; standard SVM-RFE corresponds to k=1. The function also has a halve.above parameter, which removes half of the remaining features or genes each round rather than one at a time. In this work, halve.above=100 is used, the same as in prior work, so the genes are cut in half each round until fewer than 100 remain. The output of this step is a vector of genes or features sorted from most to least useful. Lastly, the top-feature output is the list of genes ordered by average rank across the 10 folds, where a lower average rank is better.
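Three pieces of this step lend themselves to a short sketch: building the list of test-set indices per fold, the elimination schedule implied by halve.above, and ordering genes by average rank across folds. The function names are hypothetical, and the boundary behaviour (halving only while more than halve_above genes remain) is an assumption; the actual package may differ.

```python
import random

def make_folds(n_samples, k=10, seed=0):
    """Return k lists; list f holds the test-set indices of fold f."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::k] for f in range(k)]

def halving_schedule(n_genes, halve_above=100):
    """Number of surviving genes after each elimination round, down to zero."""
    sizes = [n_genes]
    while sizes[-1] > 0:
        n = sizes[-1]
        removed = n // 2 if n > halve_above else 1   # halve first, then one by one
        sizes.append(n - removed)
    return sizes

def order_by_average_rank(fold_rankings):
    """fold_rankings: one list of gene ids per fold, each ordered best first."""
    genes = fold_rankings[0]
    avg_rank = {
        g: sum(r.index(g) + 1 for r in fold_rankings) / len(fold_rankings)
        for g in genes
    }
    return sorted(genes, key=avg_rank.get), avg_rank

folds = make_folds(72, k=10)                 # 72 = 47 ALL + 25 AML samples
sizes = halving_schedule(12533)              # 12533 genes in the lung dataset
order, avg = order_by_average_rank([[2, 0, 1], [0, 2, 1], [2, 1, 0]])
```

For the 12533 lung genes the schedule halves seven times (12533, 6267, 3134, 1567, 784, 392, 196, 98) and then removes one gene per round, which is why halving saves most of the training rounds.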
In the SVM step, the dataset is split into separate training and testing subsets. A bootstrapping approach combined with a heuristic optimization algorithm and parallel processing is used to reduce the computation time. The test set is kept aside during the training of the SVM model; generally, it comprises one-third of the original samples. Bootstrapping is then repeated until a reasonable winning parameter combination is found. The optimal parameters from the bootstrap step are used to train another classifier on the full training set, which is then evaluated on the test set. Because the test set already contains known values for the attribute to be predicted, it is easy to determine whether the model's predictions are correct and to measure the accuracy.
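The split-then-bootstrap idea can be sketched as follows. This is a simplification under stated assumptions: a fixed number of bootstrap rounds and an explicit candidate list replace the heuristic optimization and stopping rule described above, no parallel processing is shown, and score_fn is a hypothetical callable that would train on the in-bag samples and return an accuracy on the out-of-bag samples.

```python
import random

def holdout_split(n_samples, test_fraction=1 / 3, seed=0):
    """Shuffle sample indices and set aside a test set; returns (train, test)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def bootstrap_sample(train_indices, rng):
    """Draw with replacement; samples never drawn form the out-of-bag set."""
    in_bag = [rng.choice(train_indices) for _ in train_indices]
    chosen = set(in_bag)
    out_of_bag = [i for i in train_indices if i not in chosen]
    return in_bag, out_of_bag

def select_parameters(train_indices, candidates, score_fn, n_boot=20, seed=0):
    """Pick the candidate with the highest total score over n_boot bootstrap rounds."""
    rng = random.Random(seed)
    totals = {c: 0.0 for c in candidates}
    for _ in range(n_boot):
        in_bag, out_of_bag = bootstrap_sample(train_indices, rng)
        for c in candidates:
            totals[c] += score_fn(c, in_bag, out_of_bag)
    return max(candidates, key=totals.get)

train_idx, test_idx = holdout_split(181)     # 181 lung samples
```

The test indices are never passed to select_parameters, which mirrors keeping the test set aside until the final classifier has been trained on the full training set.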
In this research, the classifier that implements SVM classification was first run without MSVM-RFE gene selection, using randomly chosen genes from the dataset in varying numbers (10, 20, 30, 40, and so on up to 100). Then, SVM was run with MSVM-RFE: the top features from the MSVM-RFE output, in varying numbers (10, 20, 30, 40, and so on up to 100 genes), were passed to the classifier package to perform SVM. For the best gene subset, the classification process was repeated 20 times to obtain the average accuracy.
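The evaluation protocol for the ranked genes can be outlined as a short loop. In this sketch, evaluate is a hypothetical callable standing in for one run of the SVM classifier on a given gene subset; how the best subset size is chosen (a single first-pass run per size) is an assumption, since the text does not spell it out.

```python
import statistics

def best_subset_accuracy(ranked_genes, evaluate, sizes=range(10, 101, 10), repeats=20):
    """Try the top-k genes for each k, then average repeated runs on the best k.

    evaluate(genes, run) is assumed to return an accuracy between 0 and 1.
    """
    first_pass = {k: evaluate(ranked_genes[:k], 0) for k in sizes}
    best_k = max(first_pass, key=first_pass.get)
    runs = [evaluate(ranked_genes[:best_k], run) for run in range(repeats)]
    return best_k, statistics.mean(runs)
```

The baseline without gene selection follows the same loop with ranked_genes replaced by a random ordering of the genes.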
In this work, two datasets have been used: leukemia and lung. Basic information about the datasets, including the total number of genes, the number of samples, and the class sizes, is shown in Table 1. The leukemia dataset consists of 47 patients with acute lymphoblastic leukemia (ALL) and 25 patients with acute myeloid leukemia (AML). The lung dataset consists of 150 patients with adenocarcinoma (ADCA) and 31 patients with malignant pleural mesothelioma (MPM).
Table 1, lung dataset: 12533 genes, 181 samples [6], [29], [23].

III. RESULT AND DISCUSSION
To evaluate the performance, the accuracy of the results is calculated according to [12].
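As a point of reference, accuracy is usually defined as the fraction of test samples whose predicted class matches the true class. The sketch below assumes that definition; the exact formula in [12] is not reproduced here, and the labels are a made-up example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels for four test samples: three of four are correct.
example = accuracy(["ALL", "ALL", "AML", "AML"], ["ALL", "AML", "AML", "AML"])
```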
The results for the leukemia dataset are compared across several methods: standard SVM, enhanced SVM with MSVM-RFE as gene selection, Random Forest by Moorthy and Mohamad [30], [24], Random Forest with MSVM-RFE [31], [25], and varSelRF [32], [26]. The comparison is based on accuracy and computational time and is tabulated in Table 2. The shaded row in the table indicates the best method, based on the highest accuracy and shortest time.
The overall comparison of accuracy and computational time is presented to demonstrate the enhancement achieved. Based on the results, the enhanced SVM combined with MSVM-RFE shows better accuracy and lower computational time, which can lead to lower computational cost.
Random Forest has many advantages, such as good predictive performance even when most predictive genes are noisy, and the ability to handle a large number of input genes without gene deletion [32], [26]. However, the accuracy of the enhanced SVM is higher than that of both standard Random Forest and Random Forest with MSVM-RFE. The enhanced SVM achieves the highest accuracy, 0.986, and the shortest time, only 0.013 hours. It is followed by Random Forest with MSVM-RFE with an accuracy of 0.949 in 0.060 hours, standard Random Forest with 0.925 in 0.013 hours, varSelRF with 0.911 in 1.280 hours, and lastly standard SVM with 0.881 in 1.830 hours. Thus, this result shows that MSVM-RFE is a powerful gene selection method that improves the classification result.
Meanwhile, the results for the lung dataset are compared across standard SVM, enhanced SVM with MSVM-RFE as gene selection, and the C4.5 Decision Tree [33], [27]. The accuracy results are tabulated in Table 3. The shaded row in the table indicates the best method based on the highest accuracy. The findings show that the enhanced SVM outperforms both the SVM without gene selection and the C4.5 Decision Tree, with a classification accuracy of 0.989. It is followed by the C4.5 Decision Tree with an accuracy of 0.926 and standard SVM with 0.920. Thus, better accuracy can be gained by reducing the number of genes through gene selection.

IV. CONCLUSIONS
In this study, MSVM-RFE is applied as a gene selection step for the standard SVM to handle the large number of genes in microarray data and to identify a small set of informative genes. Multiple SVM-RFE (MSVM-RFE) is an upgraded version of the original SVM-RFE. Like SVM-RFE, it uses a backward elimination procedure that removes the gene with the lowest weight. However, at each step, the feature ranking score is computed from a statistical analysis of the weight vectors of multiple linear SVMs trained on subsets of the training data. MSVM-RFE yields better and more accurate results than SVM-RFE. It is used here to enhance the performance of SVM in terms of accuracy and computational time. The enhanced method has been compared with several methods, namely Random Forest, Random Forest with MSVM-RFE, varSelRF, and the C4.5 Decision Tree, using two gene expression datasets, leukemia and lung cancer. The findings show that the enhanced SVM outperforms the SVM without gene selection, with higher classification accuracy on both datasets and lower computational time on the leukemia dataset. However, this research still has some limitations: larger datasets take longer to compute, and all the datasets require preprocessing before gene selection and classification. Several directions remain for future work. First, the results can be evaluated with more performance measures, such as error rate, specificity, and sensitivity. Second, MSVM-RFE can be implemented, tested, and analyzed with other classifiers, and the results compared with this research. Lastly, more types of datasets beyond leukemia and lung can be added.