Intrusion Detection System using Multivariate Control Chart Hotelling's T² based on PCA

Abstract: Statistical Process Control (SPC) has been widely used in industry and services. SPC can be applied not only to monitor manufacturing processes but also to build an Intrusion Detection System (IDS). In network monitoring and intrusion detection, SPC can be a powerful tool for ensuring the security and stability of a network. In theory, Hotelling's T² chart can be used for intrusion detection. However, there are two reasons why the chart is not well suited to this task. First, intrusion detection data involve large volumes of high-dimensional process data. Second, intrusion detection requires fast computation so that an intrusion can be detected as soon as possible. To overcome the problems caused by a large number of quality characteristics, Principal Component Analysis (PCA) can be used. PCA not only reduces the dimension, which leads to faster computation, but also eliminates the multicollinearity among the characteristic variables. This paper focuses on the use of the multivariate T² control chart based on PCA for IDS. The KDD99 dataset is used to evaluate the performance of the proposed method, which is also compared with the conventional T² control chart. The empirical results show that the multivariate control chart using Hotelling's T² based on PCA performs well in detecting network anomalies. The T² chart based on PCA achieves a hit rate of about 97 percent, similar to that of the conventional T² control chart, while requiring shorter computation time.


I. INTRODUCTION
Statistical Process Control (SPC) has been widely used in many fields, particularly in industry and services. SPC can be applied not only to monitor manufacturing or industrial processes but also to build an Intrusion Detection System (IDS). In network monitoring and intrusion detection, SPC can be used as a powerful tool to guarantee the safety and stability of a network system [1]. Many studies have implemented SPC in IDS [2]. SPC has the advantage that it does not require prior knowledge of an unprecedented attack. In addition, using SPC in IDS can support real-time attack detection [3]. Moreover, SPC can be used to monitor intrusions in both the univariate and the multivariate case.
A univariate control chart monitors only one characteristic; examples include the X chart [4], the Exponentially Weighted Moving Average (EWMA) control chart [5], and the Cumulative Sum (CUSUM) control chart [6]. To monitor the stability of a univariate attribute process, control charts such as the p chart [7,8] and the np chart [9] have been developed. A multivariate control chart, in turn, is used to control a production process with more than one characteristic, whether correlated or uncorrelated. Recent investigations of multivariate control charts include [10]-[18].
Multivariate control charts have also been applied to detect anomalies in networks. Ye et al. [19] employed Markov chain, T², and chi-square multivariate test strategies for network anomaly detection. Ye et al. [20] proposed a technique based on Hotelling's T² test that can detect both counter-relationship and mean-shift anomalies. Qu et al. [21] used Hotelling's T² chart to monitor network intrusions. Furthermore, a real-time system called the Multivariate Analysis for Network Attack (MANA) detection algorithm is used in Hariri and Yousif [21]. The MANA control limits are updated continuously at certain time intervals. The Chi-Square Distance Monitoring (CSDM) method was developed by Ye et al. [22] and applied to monitor uncorrelated, correlated, autocorrelated, normally distributed, and non-normally distributed data. In general, CSDM performs better than Hotelling's T² in detecting a shift in the mean, especially for uncorrelated, autocorrelated, and non-normally distributed data. Meanwhile, Hotelling's T² performs better than CSDM for correlated and normally distributed data [22]. Sivasamy and Sundan [23] compared the performance of Hotelling's T² control chart with the Support Vector Machine (SVM) and Triangle Area-based Nearest Neighbors (TANN) methods and found that Hotelling's T² achieved high accuracy for all types of attack classes. In addition, Ahsan et al. proposed Hotelling's T² control chart based on the Successive Difference Covariance Matrix (SDCM) with a bootstrap control limit to monitor anomalies in the network [24].
In theory, network intrusions can be monitored using Hotelling's T² chart. Nevertheless, there are two reasons why this method is not well suited to this case ([19], [25]). First, an intrusion detection system involves large volumes of high-dimensional connection data. Second, a network monitoring system requires fast computation so that an anomaly can be detected quickly. In fact, conventional multivariate control charts such as Hotelling's T² are most effective for a small number of quality characteristics. When a large number of quality characteristics is used, the ability of the control chart to detect a shift in the process may decrease [26]. Large numbers of highly correlated quality characteristics often occur in modern manufacturing processes. As a result, the computation of the T² statistic becomes difficult due to the singularity of the covariance matrix ([27], [28]).
To overcome the problems that arise in monitoring a large number of quality characteristics, Principal Component Analysis (PCA) can be used as an alternative solution. The PCA procedure reduces the number of features so that faster computation can be achieved. It can also eliminate the multicollinearity problem in the process. PCA is a multivariate method that extracts a new set of variables by projecting the input variables onto the principal component space. The extracted variables, called principal components (PCs), are linear combinations of the original variables in which the coefficients are obtained from the eigenvectors of the covariance or correlation matrix of the input data [29].
PCA is widely used to monitor anomalies in networks. Wang et al. [30] developed a PCA approach for intrusion detection with fast calculation and high efficiency. PCA can also be used for feature reduction [31] and feature selection [32]. In addition, PCA can be combined with machine learning methods such as SVM [33], genetic algorithms [34], and naïve Bayes [35]. Chen et al. [36] used Multi-Scale Principal Component Analysis (MSPCA) to identify Denial of Service (DoS) attacks.
Based on the above, integrating PCA with the T² chart is a good alternative for solving the problems caused by a large number of quality characteristics and inefficient computation time. The PCA technique for constructing Hotelling's T² charts uses the first k principal components (PCs) [37]. This paper focuses on building an IDS using the multivariate T² control chart based on PCA. The KDD99 DARPA dataset is used to evaluate the performance of the proposed IDS. Moreover, the performance of the proposed method is compared with that of the existing T² chart. The rest of this paper is organized as follows. Section II gives a brief review of the T² chart based on PCA, the KDD99 Cup DARPA dataset, and the IDS method. Section III contains the results and discussion of the performance of the PCA control chart in intrusion detection. Finally, Section IV is devoted to the conclusions.

A. Hotelling's T² control chart
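Hotelling's T² control chart [38] is one of the multivariate control charts that can be used to monitor the mean of a production process [26] and to detect multivariate outliers [39].
The T² statistic [40] for the i-th observation $\mathbf{x}_i$ can be calculated according to the following equation:

$$T_i^2 = (\mathbf{x}_i - \bar{\mathbf{x}})^{\top} \mathbf{S}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}),$$

where:

$$\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i, \qquad \mathbf{S} = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{\top}.$$

With the assumption that the data are multivariate normally distributed, the T² chart control limit is formulated as follows:

$$CL = \frac{p(n+1)(n-1)}{n(n-p)}\, F_{\alpha;\,p,\,n-p},$$

where n denotes the number of observations, p denotes the number of variables, and α denotes the false alarm rate. The process is said to be out of control when the statistic falls outside the control limit [26].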
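The computation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in this study: the function names are hypothetical, and the control limit assumes the common F-based form for individual observations, p(n+1)(n-1)/(n(n-p)) times the upper-α F quantile with p and n-p degrees of freedom. The F quantile is passed in as a number so that the block depends on NumPy only; in practice it could be obtained from a statistics library, for example `scipy.stats.f.ppf(1 - alpha, p, n - p)`.

```python
import numpy as np

def fit_normal_profile(X):
    """Estimate the in-control mean vector and sample covariance matrix."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))  # divides by n - 1
    return mean, cov

def hotelling_t2(X, mean, cov):
    """T^2 = (x - mean)' S^{-1} (x - mean) for every row x of X."""
    diff = np.asarray(X, dtype=float) - mean
    # Solve S a = (x - mean) instead of forming the inverse explicitly.
    solved = np.linalg.solve(cov, diff.T).T
    return np.einsum("ij,ij->i", diff, solved)

def f_control_limit(n, p, f_quantile):
    """Scale an upper F quantile F(alpha; p, n - p) into a T^2 limit.

    Assumed form: p (n + 1)(n - 1) / (n (n - p)) * F.
    """
    return p * (n + 1) * (n - 1) / (n * (n - p)) * f_quantile
```

A connection would then be flagged when `hotelling_t2(...)` exceeds `f_control_limit(...)`. Note that `np.linalg.solve` fails when the covariance matrix is singular, which is exactly the difficulty with many highly correlated variables that motivates the PCA-based chart.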

B. T² control chart based on PCA
PCA is the most widely used procedure for high-dimensional, noisy, and highly correlated process data. Its popularity is due to its ability to handle such input data by projecting it onto a lower-dimensional subspace that contains most of the variance of the input data [41]. The new variables are linear combinations of the original variables [42]. Data standardization is often suitable when the variables are measured in different units or when the variances of the columns of the data differ substantially. The standardized data can be calculated as follows:

$$\mathbf{Z} = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^{\top})\,\mathbf{D}^{-1},$$

where $\mathbf{D}$ is the diagonal matrix with the standard deviation of each variable on its diagonal. It is worth pointing out that the covariance matrix $\mathbf{R}$ of the standardized data $\mathbf{Z}$ is exactly the correlation matrix of the original data, and it can be computed as

$$\mathbf{R} = \frac{1}{n-1}\,\mathbf{Z}^{\top}\mathbf{Z}.$$

PCA is then performed by employing the eigendecomposition of the matrix $\mathbf{R}$ as follows:

$$\mathbf{R} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{\top}.$$

The T² based PCA control chart uses the first k PCs to create the control chart. The statistic of the T² based on PCA control chart can be computed by using the following formula:

$$T_{PCA}^2 = \sum_{l=1}^{k} \frac{y_l^2}{\lambda_l},$$

where the first k PCs are $y_l$, $l = 1, \ldots, k$, and $\lambda_l$ is the eigenvalue corresponding to the l-th PC. Under the assumption that the data follow the multivariate normal distribution, the control limit of $T_{PCA}^2$ can be obtained as follows:

$$CL_{PCA} = \frac{k(n+1)(n-1)}{n(n-k)}\, F_{\alpha;\,k,\,n-k},$$

where n is the number of observations, k is the number of PCs retained, and α is the false alarm rate.
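The PCA-based statistic can be sketched as follows, assuming NumPy and hypothetical function names; this is an illustration of the steps in this subsection (standardize, form the correlation matrix, eigendecompose, keep the first k PCs) rather than the exact code used in this study. Columns with zero variance are assumed to have been removed beforehand, since they cannot be standardized.

```python
import numpy as np

def fit_pca_chart(X, k):
    """Learn the standardization and the first k PCs from in-control data."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)            # assumes no constant columns
    Z = (X - mean) / std                   # standardized data
    R = Z.T @ Z / (len(X) - 1)             # correlation matrix of X
    eigvals, eigvecs = np.linalg.eigh(R)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]  # indices of the k largest
    return {
        "mean": mean,
        "std": std,
        "eigvals": eigvals[order],
        "loadings": eigvecs[:, order],
    }

def t2_pca(X, chart):
    """Sum of squared PC scores, each divided by its eigenvalue."""
    Z = (np.asarray(X, dtype=float) - chart["mean"]) / chart["std"]
    scores = Z @ chart["loadings"]         # y_l for l = 1, ..., k
    return np.sum(scores**2 / chart["eigvals"], axis=1)
```

Because only k score columns are computed for each new connection, the monitoring cost grows with k rather than with the full number of variables, which is the source of the expected speed-up.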

C. Intrusion Detection System using Control Chart
In general, the intrusion detection process using a control chart [2] is presented in Figure 1. The first step of this procedure is to determine the objective of the system. The main purpose of an IDS is to detect intrusions on the network correctly and quickly with a low rate of false alarms. The second step is data preparation, which is one of the most difficult and time-consuming parts of the IDS process. Data preparation consists of two steps: data sourcing and data acquisition. Data sourcing refers to identifying the sources and selecting the target data. Data acquisition refers to transforming the target data into input data that can be used by the control chart method.
The next step is the construction of the control chart, which is divided into two steps: data pre-processing and creating the control chart. The control limits estimated in this step are then applied to monitor network traffic. Finally, identification and corrective actions are executed.
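The workflow above can be sketched as a small pipeline: pre-process the data, build the normal profile from normal connections, and then compare new connections with a control limit. The sketch below assumes NumPy, a plain T² statistic, and a control limit supplied by the caller; all function names are hypothetical and the code is meant only to make the sequence of steps concrete.

```python
import numpy as np

def drop_constant_columns(X):
    """Pre-processing: remove variables that never change."""
    X = np.asarray(X, dtype=float)
    keep = X.std(axis=0) > 0
    return X[:, keep], keep

def phase_one(normal_X):
    """Chart construction: estimate the normal profile."""
    mean = normal_X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(normal_X, rowvar=False))
    return mean, cov_inv

def phase_two(new_X, profile, control_limit):
    """Monitoring: flag connections whose statistic exceeds the limit."""
    mean, cov_inv = profile
    diff = new_X - mean
    t2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return t2 > control_limit
```

The boolean mask returned by `drop_constant_columns` has to be applied to new connections as well, so that training and monitoring use the same variables.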

D. KDD99 Dataset
The Knowledge Discovery and Data Mining 99 (KDD99) dataset [43] is the most widely used and accepted benchmark dataset for network IDS. KDD99 is a feature extraction of the Defense Advanced Research Projects Agency (DARPA) dataset. The dataset includes several types of attacks that occur in networks, so many researchers use it to test the merits of newly proposed methods.
The characteristics of the KDD99 dataset are described in [44].

Step 3 of the algorithm: determine the α and k that will be used in the analysis.

Moreover, the performance of the IDS is evaluated using the confusion matrix shown in Table 3. The accuracy of a classification method can be measured by its degree of accuracy and degree of error.
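The evaluation measures used in the following section can be computed from the four cells of a confusion matrix. The sketch below rests on assumptions that are not stated explicitly in the surviving text: an attack is treated as the positive class, the hit rate is taken to be the overall proportion of correctly classified connections, and the FP and FN rates are normalized by the number of normal and attack connections, respectively. The counts in the example are made up for illustration only.

```python
def ids_metrics(tp, tn, fp, fn):
    """Hit rate, FP rate and FN rate from confusion-matrix counts.

    tp: attacks flagged as attacks      fn: attacks missed
    tn: normal connections passed       fp: normal connections flagged
    """
    total = tp + tn + fp + fn
    return {
        "hit_rate": (tp + tn) / total,
        "fp_rate": fp / (fp + tn),
        "fn_rate": fn / (fn + tp),
    }

# Illustrative counts: a chart that never signals misses every attack.
example = ids_metrics(tp=0, tn=195, fp=0, fn=805)
```

With these made-up counts the hit rate equals the share of normal connections and the FN rate is 1, which is the pattern a chart shows when its limit is never exceeded.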

III. RESULT AND DISCUSSION
This section presents the results and discussion of the performance evaluation of the proposed IDS using T² based on PCA. The performance of the proposed T² based on PCA chart is compared with that of the conventional T² control chart.

A. Result
The intrusion detection process using T² based PCA is analyzed by first determining α and the number of PCs (denoted as k) used in the analysis. Using α = 0.00273, which corresponds to three sigma, the optimal k is determined by its hit rate, false positive (FP) rate, and false negative (FN) rate. Table 4 shows the results of intrusion detection using different numbers of PCs. For k = 5, the hit rate is only 0.195 with an FN rate of 1. For k = 6, the hit rate rises to 0.894. However, this result is still not suitable for an IDS. For k from 7 to 25, the hit rate of the IDS is about 0.978. The FP rate increases as the number of principal components increases, whereas the FN rate decreases. Because the hit rate is similar for seven or more principal components, this research uses seven principal components to speed up the detection process. In addition, the FP rate and FN rate are more balanced when seven principal components are used. Figure 2 shows the α selection for the IDS. The horizontal axis represents the value of α, while the vertical axis represents the FP and FN rates. The figure shows that a larger value of α produces a higher FP rate. In contrast, the FN rate becomes smaller as α increases. Therefore, in this study a small value of α = 0.001 was used in order to balance the FP rate and the FN rate of the IDS.
With k = 7 and α = 0.001, both charts produce a similar hit rate on the training dataset (Figure 3). Although the hit rate of the T² chart (0.9799) is slightly higher than that of the T² based on PCA chart (0.9779), there is little difference in the accuracy of attack prediction. The FP rate of the T² chart is higher than that of the T² based on PCA chart. For the FN rate, the two charts produce almost the same value, although the T² based on PCA chart has a slightly higher value.
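The link between "three sigma" and a false alarm rate close to 0.0027 can be checked with a short calculation. The sketch below uses only the Python standard library and the fact that, for a standard normal variable, the two-sided tail probability beyond s standard deviations equals erfc(s/√2); the function name is hypothetical.

```python
import math

def two_sided_tail(sigmas):
    """P(|Z| > sigmas) for a standard normal variable Z."""
    return math.erfc(sigmas / math.sqrt(2))

alpha_three_sigma = two_sided_tail(3.0)
print(round(alpha_three_sigma, 5))  # 0.0027
```

The value is about 0.0027, which is close to the 0.00273 used above for choosing k; the smaller α = 0.001 adopted afterwards corresponds to a somewhat wider limit.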
Figure 4 compares the computation time of the T² chart and the T² based PCA chart. The T² based on PCA chart requires only 2.9152 seconds to analyze 489,843 connections. In contrast, the T² chart requires 3.2462 seconds to analyze the same number of connections. Thus, it can be concluded that the T² based on PCA chart produces a high hit rate with a faster computation time than the conventional T² chart.
Furthermore, the performance of the proposed IDS is evaluated on new connections using the testing dataset of KDD99. A comparison of the T² chart and the T² based on PCA chart for the testing dataset can be seen in Figure 5. As with the training dataset, the two charts have similar hit rates and FP rates, while the FN rate of the T² based on PCA chart is higher than that of the T² chart. In addition, the hit rate on the testing data is lower than on the training dataset: it decreases from 0.9799 to 0.9216 for the T² chart and from 0.9779 to 0.9125 for the T² based on PCA chart. The FN rate for the testing dataset is also higher than the FN rate for the training dataset.
A comparison of the computation time of the T² based PCA chart and the T² chart for the testing dataset is shown in Figure 6. For the testing dataset, the T² based PCA chart is again faster than the T² chart. The T² based PCA chart requires 1.9175 seconds of analysis time, while the T² chart requires 2.0928 seconds to complete the analysis of the same number of connections.
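To put the reported timings on a common scale, the relative saving in computation time can be computed directly from the four values given in the text. The helper name below is hypothetical; the numbers are those reported for the training and testing datasets.

```python
def relative_saving(baseline_seconds, proposed_seconds):
    """Fraction of the baseline time saved by the proposed chart."""
    return (baseline_seconds - proposed_seconds) / baseline_seconds

training = relative_saving(3.2462, 2.9152)  # T² chart vs. T² based PCA chart
testing = relative_saving(2.0928, 1.9175)
print(f"{training:.1%}, {testing:.1%}")     # 10.2%, 8.4%
```

The PCA-based chart is therefore roughly 8 to 10 percent faster on these runs.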

B. Discussion
Table 5 summarizes the intrusion detection results for the training and testing datasets. The evaluation shows that the T² based on PCA control chart performs very well on the training dataset. Nevertheless, the performance of the proposed IDS decreases when it is used to evaluate the testing dataset. The results suggest three possible causes of this degradation.
First, the normal profile used to evaluate the testing dataset is built from the normal connections in the training dataset. The decrease in performance on the testing dataset may result from the inability of this normal profile to capture pattern changes in the testing dataset. The normal profile therefore needs to be updated with new normal connections from the testing dataset. One method that can be employed to overcome this problem is an incremental learning algorithm, which can update existing patterns based on new data [46].
Second, the control limits in this study were built under the assumption of multivariate normality. In reality, however, computer network data do not always follow a multivariate normal distribution, because the attacks that occur on a network produce extreme values [47]. The control limit of Hotelling's T² is calculated from the F distribution by assuming that the monitored process data follow the multivariate normal distribution [28]. When this assumption does not hold, the F-based control limit used in this study may be inaccurate, and a control limit determined this way can increase the rate of false alarms [48]. The inaccuracy is reflected in the high FN rate on the testing dataset. The inability of the control limits to capture intrusions degrades the performance of the proposed IDS. A high FN rate is dangerous for an IDS because it allows attacks to pass without warning. To overcome this condition, the Kernel Density Estimation (KDE) method can be adopted to increase the precision of the IDS, as demonstrated in [49].
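One way a KDE-based control limit could be obtained is sketched below. This is not the procedure of [49]; it is a minimal illustration under stated assumptions: a Gaussian kernel, a normal-reference rule-of-thumb bandwidth (1.06 times the sample standard deviation times n^(-1/5)), and a control limit defined as the point where the estimated cumulative distribution of the in-control statistics reaches 1 - α. Only the Python standard library is used, and the function names are hypothetical.

```python
import math
import statistics

def kde_cdf(x, sample, bandwidth):
    """CDF of a Gaussian kernel density estimate, evaluated at x."""
    return sum(
        0.5 * (1.0 + math.erf((x - s) / (bandwidth * math.sqrt(2))))
        for s in sample
    ) / len(sample)

def kde_control_limit(sample, alpha, tol=1e-9):
    """Point where the KDE CDF reaches 1 - alpha, found by bisection.

    `sample` holds in-control chart statistics with non-zero spread.
    Returns the control limit and the bandwidth that was used.
    """
    n = len(sample)
    bandwidth = 1.06 * statistics.stdev(sample) * n ** (-1 / 5)
    lo = min(sample)                    # CDF here is below 0.5
    hi = max(sample) + 10 * bandwidth   # CDF here is numerically 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kde_cdf(mid, sample, bandwidth) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return hi, bandwidth
```

Unlike the F-based limit, this limit depends only on the empirical distribution of the in-control statistics, so it does not rely on the multivariate normality assumption.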
Finally, PCA is built on the assumption of a linear relationship between variables. In reality, however, the relationships in network data are not always linear, and PCA performs poorly when its linearity assumption is violated. This can be seen in some complicated cases in manufacturing and chemical processes that have nonlinear relationships [50].
Therefore, the proposed IDS needs to be improved by paying attention to the normal profile data for the training dataset, to control limits for non-normal distributions, and to nonlinear process data. In this way, an IDS with a high hit rate, few false alarms, and short computation time can be constructed.

IV. CONCLUSIONS
This paper proposes integrating PCA with Hotelling's T² chart to create an IDS for network anomaly detection. In summary, the performance evaluation shows that the multivariate control chart Hotelling's T² based on PCA performs well in detecting anomalies in the network. Compared with the conventional Hotelling's T² chart, T² based on PCA has similar performance, with a 97 percent hit rate, and a shorter computation time. However, on the testing dataset, the performance of both the T² chart and the T² based on PCA chart decreases. Nevertheless, the decline is small, because the IDS can still detect about 91 percent of the intrusions that occur in the network.
Future research will address the drawbacks of the proposed IDS by combining an incremental learning algorithm with Kernel Density Estimation (KDE). Incremental learning can overcome the inability of the normal profile from the training dataset to capture pattern changes in the testing dataset. Meanwhile, the KDE method can be adopted to adaptively recalculate the control limit of the proposed IDS. The present work can also be extended to monitor multiclass attacks in the dataset.

Fig. 1 Intrusion Detection System using Control Chart Method

Fig. 2 False Positive and False Negative Rate with different α for k=7

Fig. 3 Performance Comparison of T² and T² Based PCA Using 7 principal components for the training dataset

The performance comparison of the T² chart and the T² based on PCA chart for the training dataset is shown in Figure 3, using k = 7 and α = 0.001.

Fig. 4 Time Comparison of T² and T² Based PCA Using 7 principal components for the training dataset

Based on the performance evaluation of T² based on PCA on the training dataset, it can be seen that, by using k = 7 and α = 0.001, the IDS produces a high hit rate with faster computation time than the conventional method.

Fig. 5 Performance Comparison of T² and T² Based PCA Using 7 principal components for the testing dataset

Fig. 6 Time Comparison of T² and T² Based PCA Using 7 principal components for the testing dataset

This study uses only 32 of the 34 quantitative variables because the other two take the same value (entirely zero) in every connection. Moreover, the testing dataset shown in Table 2 is used to evaluate the performance of the IDS. The steps employed to detect intrusions with the T² based on PCA chart are as follows. The first step is to form a normal profile, or in-control process, from the normal connections. Then, each new connection is compared with the normal profile. A new connection that differs significantly from the normal connections is suspected to be an intrusion. The algorithm for the IDS with the T² based PCA chart is divided into two phases.

FP causes a false alarm, while FN allows an attack on the system. The level of accuracy used is the hit rate, which can be calculated as follows:

$$\text{hit rate} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP and TN denote the numbers of correctly classified attack and normal connections, respectively.