Machine Learning Based Model to Identify Firewall Decisions to Improve Cyber-Defense

Abstract: A firewall system is a security system that controls the incoming and outgoing packets passing through communication networks by applying specific decisions to improve cyber-defense and block malicious packets. The filtration process matches the traffic packets against predefined rules to preclude cyber threats from getting into the network. Accordingly, the firewall system proceeds to either “allow,” “deny,” or “drop/reset” the incoming packet. This paper proposes an intelligent classification model that can be employed in firewall systems to produce the proper action for every communicated packet by analyzing packet attributes using two machine learning methods, namely, a shallow neural network (SNN) and an optimizable decision tree (ODT). Specifically, the proposed models were trained to classify the Internet Firewall-2019 dataset into three classes: “allow,” “deny,” and “drop/reset.” The experimental results demonstrated the strength of our classification model, which scored an overall accuracy of 99.8% and 98.5% for ODT and SNN, respectively. In addition, the suggested system was evaluated using several evaluation metrics, including confusion matrix parameters (TP, TN, FP, FN), true positive rate (TPR), false-negative rate (FNR), positive predictive value (PPV), false discovery rate (FDR), and the receiver operating characteristic (ROC) curves for the developed three-class classifier. Ultimately, the proposed system outperformed many existing up-to-date firewall classification systems in the same area of study.


I. INTRODUCTION
Data communication over the Internet is vulnerable to a wide range of potential cyber-attacks and intrusions. Once the network infrastructure is breached, hackers could distribute the data to unauthorized parties and manipulate the network data's accuracy and consistency over its entire life cycle. Consequently, various safety methods have been used in the different stages of defense to address security concerns, such as Internet Firewalls [1], Intrusion Detection/Prevention Systems (IDS/IPS) [2], and others.
Firewall devices/servers are crucial security systems that defend communication networks from external cyber-attacks [3]. They are usually installed at the network edge to monitor the traffic flow and protect the communication by filtering all incoming (and sometimes outgoing) data packets. The filtration process is typically performed by matching the network packets against predefined instructions and rules to preclude cyber threats from getting into the network. Hence, the firewall system proceeds to either "allow," "deny," or "drop/reset" the incoming packet. Once the firewall decides what to do with the received traffic, it records these decisions in log files. Further analysis of the firewall log files helps improve the cyber-defense against incoming threats: machine learning methods can be integrated to provide automated and early classification and prediction for the traffic passing through the firewall system. Indeed, the recent coupling of the cybersecurity field with machine learning methods has produced robust and efficient security solutions [4] for diverse applications and systems.
A classification task typically employs machine learning techniques to learn the class labels of examples from the problem space [5], for example, classifying emails as "inbox" or "junk." Several machine learning techniques can be employed to perform classification/prediction tasks for the data records in the problem space [6]. Examples of classification-based machine learning techniques used to provide cyber-security solutions include Artificial Neural Networks (ANN) [7], Shallow Neural Networks (SNN) [8], Convolutional Neural Networks (CNN) [9], K-Nearest Neighbors (KNN) [10], the Decision Tree Method (DTM) [11], the Majority Voting Method (MVM) [12], Support Vector Machines (SVM) [13], and others. The choice of a proper algorithm depends heavily on several factors related to the nature and complexity of the dataset, such as the size of the dataset, the number of features, the types of features, the structure of the data, the data labeling/clustering, and the data distribution [14]. These considerations also apply to the IFW-2019 dataset [15] used in this work.
Firewalls are essential devices that protect communication networks by means of filtering all incoming (and sometimes outgoing) traffic packets. The filtration process is performed by matching the traffic packets against predefined rules to preclude cyber-threats from getting into the network. Due to the rapid increase in security incidents and breaches [16], several recent research projects have addressed the diverse issues related to firewall systems. For instance, D. Appelt et al. [17] presented a machine learning and evolutionary algorithm-based approach to automatically identify the holes in Web application firewalls (WAFs) that allow SQL injection attacks into communication networks. They implemented their model using an open-source WAF tool called ModSecurity. Their simulation findings show the model's effectiveness in finding SQL injection attacks that bypass WAFs and in identifying attack patterns. Similarly, D. Ucar et al. [18] proposed a machine learning-based model for the detection of anomalies in the firewall rule repository. To do so, they applied several classification algorithms, including Naive Bayes, kNN, Decision Table, and HyperPipes, to a dataset of firewall logs. According to their analysis, the model performs best when configured with kNN, recording an F-Measure of 93%. They concluded that anomalies in firewall rules can be detected by automatically analyzing large-scale log files with machine learning methods.
Another notable related study is reported by Vartouni [19], who proposed a deep learning model for an anomaly-based web application firewall. Their proposed model is constructed mainly of a stacked auto-encoder (SAE) and a deep belief network (DBN), with one-class SVM, isolation forest, and elliptic envelope applied as classifiers. Their experimental results showed that the model achieves better accuracy and generalization in a reasonable time. In related research that employs a deep learning model, Ertam [20] proposed a new firewall data classification approach that uses 10 cases to obtain numerical results. The proposed approach consists of data acquisition from the firewall, feature selection, and classification steps. The author evaluated the model using several classifiers, including Long Short-Term Memory (LSTM), Bi-directional Long Short-Term Memory (Bi-LSTM), and Support Vector Machine (SVM). The author inferred that the deep learning approach based on a hybrid Bi-LSTM-LSTM network is more successful than the SVM classifier, scoring the highest classification accuracy of 97.38%, and concluded that an intelligent monitoring system is a very efficient approach for network security solutions.
Moreover, Reinforcement Learning (RL) techniques have also been employed in this area, as in the work conducted by J. Jeya Praise et al. [21]. In that paper, the authors developed a reinforcement learning and pattern matching (RLPM) based firewall for secured cloud infrastructure that blocks malicious attacks by validating the payload signature of arriving packets. Their hybrid model provides a two-way pattern matching algorithm that validates the signature in order to reach quick decisions. The simulation results showed that the proposed RLPM model improved firewall response time, throughput, and malicious attack blocking by 10% compared with the existing state-of-the-art methods. Furthermore, several other promising state-of-the-art studies have been conducted for cybersecurity using deep neural networks [22]-[31].
Unlike the aforementioned research, in this paper we propose an intelligent, self-reliant classification model that can be employed in firewall systems to produce the proper decision for every incoming traffic packet that passes through the communication network firewall by analyzing packet attributes (i.e., through firewall logs) using SNN and ODT techniques. Specifically, the proposed model has been trained to classify the Internet Firewall-2019 (IFW-2019) dataset into three classes: "allow," "deny," and "drop/reset." The proposed model contributes to this area through a well-defined design that scores a very high classification accuracy of 99.8% with low prediction overhead. We employ two different supervised machine learning techniques to train and classify the communication traffic records provided by the IFW-2019 dataset after a series of preprocessing operations: a shallow neural network (SNN) and an optimizable decision tree (ODT). The IFW-2019 dataset [15] comprises 65532 records from firewall log files divided into four categories, namely, "allow," "deny," "drop," and "reset-both." Based on the collected records, our ML models were trained to minimize the loss function, and we show that the overall accuracy of the model is superior to that of comparable existing systems.
In particular, the core contributions of the proposed work can be listed as follows:
• We provide a firewall prediction system that employs SNN and ODT architectures for classifying multi-class firewall log action records in communication networks.
• We evaluate our system's performance on a recent and important dataset of Internet firewall log files (IFW-2019 [15]), scoring 99.8% and 98.5% classification accuracy for ODT and SNN, respectively.
• We provide a detailed description of our implementation in conjunction with an extensive comparison with state-of-the-art solutions.

II. MATERIAL AND METHOD
In this work, we aim to provide a comprehensive machine learning based framework that ensures an automated and intelligent decision-making process for the firewall system, in order to improve the defense and security of the communication network. Fig. 1 shows the flowchart of the system development method, displaying the systematic stages of the proposed system from the initial stage, data gathering, to the final stage, classification. According to the figure, the system development comprises four modules: data gathering, data preparation, data learning, and data classification. The modules are discussed in the following subsections.

A. Data Gathering Module
Data is the vital element of any intelligent system since it allows systems and stakeholders to make the required decisions based on definite facts and records. Data (categorical, numerical, images, etc.) are usually collected into organized records in a systematic dataset [2]. A dataset can be analyzed and used to address research investigations, formulate problem statements, and validate theories and outcomes. In this work, our system aims to provide automated and intelligent classification of the security actions performed by firewall devices on the network traffic, and thus we have used the IFW-2019 dataset [15].
IFW 2019 [15] is a recently composed dataset built from the internet traffic records of a university's firewall devices (i.e., Firat University, Turkey) and used for the automated prediction of firewall actions in response to network traffic data. IFW 2019 accumulates 65532 firewall log records with four different categorical labels listed for the output field of each sample record: "allow", "deny", "drop", and "reset-both". The data distribution among the different classes is presented in Table I. IFW 2019 records comprise 11 features and one class label derived from firewall log files. Features are carefully selected as numerical datatypes so that they can be efficiently applied to the machine learning techniques. The selected features include source port, destination port, network address translation (NAT) source port, NAT destination port, elapsed time for flow (in seconds), total bytes, bytes sent, bytes received, total packets, packets sent, and packets received [32].
Indeed, the IFW-2019 dataset is selected for evaluation in this research since it is publicly available as a .csv file and comprises a sensible number of distinctive samples, which prevents the classifier from being influenced by a more recurrent class. Also, this dataset covers all common security actions that firewall devices/servers apply to network traffic. Moreover, it can be readily preprocessed and programmed to generate a multiclass classification of the firewall actions in communication networks. Finally, IFW 2019 can be tailored, extended, updated, and simulated.

B. Data Preparation Module
Like any machine learning-based system, the dataset undergoes a number of preprocessing operations to prepare it for the machine learning input layer and the subsequent processing and learning operations. In this work, our collected dataset has been processed as follows:

1) Dataset Transformation:
Since the dataset records are available as a .csv file with multiple comma-separated rows and columns, where rows represent the data samples and columns represent the features, the dataset needs to be transformed through the MATLAB system (our development platform) into a double matrix before any further calculation or machine learning processing. At this stage, the dataset was transformed into a matrix of features with corresponding samples (11 × 65532) and a vector of labels (1 × 65532).

2) Dataset Labeling:
Since the dataset class feature is stored as a categorical datatype, it needs to be encoded into numerical labels (labeling) so that it can be processed mathematically by the machine learning algorithms and calculations. Therefore, we have applied the one-hot encoding technique [17] to provide proper labeling for the target classes as follows: Allow (100), Deny (010), Drop/Reset (001).
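The labeling step above can be illustrated with a minimal Python sketch (the authors worked in MATLAB; this is not their code). The names `CLASSES`, `one_hot`, and `decode` are hypothetical; only the three bit patterns come from the text.

```python
# Class order chosen so that the vectors match the encoding listed in the text:
# Allow (100), Deny (010), Drop/Reset (001). The label strings are assumptions.
CLASSES = ["allow", "deny", "drop/reset"]


def one_hot(label, classes=CLASSES):
    """Return a list with a 1 at the position of `label` and 0 elsewhere."""
    if label not in classes:
        raise ValueError(f"unknown label: {label!r}")
    return [1 if c == label else 0 for c in classes]


def decode(vector, classes=CLASSES):
    """Map a one-hot (or probability) vector back to the class with the largest entry."""
    best_index = max(range(len(vector)), key=lambda i: vector[i])
    return classes[best_index]


print(one_hot("deny"))
print(decode([0, 0, 1]))
```

Because `decode` picks the largest entry, the same helper also works on a vector of class probabilities.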

3) Dataset Randomization:
This stage is performed to redistribute the dataset samples in a random fashion in order to avoid any classification preference and thus improve the validation and testing stages by ensuring randomized dataset samples. To do so, we have used a random shuffling algorithm as the data randomization policy, which moves the data samples of the dataset to random locations.
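As a rough illustration of the randomization stage, the sketch below shuffles features and labels with one shared permutation so that each sample keeps its label. It is a Python stand-in, not the MATLAB routine the authors used; the function name `shuffle_together` and the `seed` parameter are assumptions made for reproducibility.

```python
import random


def shuffle_together(features, labels, seed=0):
    """Shuffle samples and labels with the same random permutation.

    `features` is a list of per-sample feature lists and `labels` a list of
    the same length. A seeded generator makes the shuffle repeatable.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)  # in-place random permutation of indices
    return [features[i] for i in order], [labels[i] for i in order]


X = [[i, i * 2] for i in range(6)]
y = ["allow", "deny", "allow", "drop/reset", "deny", "allow"]
X_shuffled, y_shuffled = shuffle_together(X, y, seed=42)
print(y_shuffled)
```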

4) Dataset Splitting Up:
This stage is performed to divide the data into three datasets, namely, the training dataset, the validation dataset, and the testing dataset. To do so, we have used a random-index division algorithm as the dataset distribution policy, which divides the targets into three sets using random indices. Thus, the dataset distribution proportions are Training: 70%, Validation: 15%, Testing: 15%, and the dataset distribution numbers are Training: 45,872, Validation: 9,830, Testing: 9,830.
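The split sizes quoted above can be checked with a short sketch that divides random indices into 70/15/15 partitions. The function name `split_indices` and the rounding rule (round the training and validation counts, give the remainder to testing) are assumptions; the source does not say how the fractional counts were resolved.

```python
import random


def split_indices(n_samples, train=0.70, val=0.15, seed=0):
    """Split the indices 0..n_samples-1 into train/validation/test lists."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    n_train = round(train * n_samples)
    n_val = round(val * n_samples)
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]  # whatever remains goes to testing
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(65532)
print(len(train_idx), len(val_idx), len(test_idx))
```

With 65532 records this rule gives the same three counts as reported in the text, and the three index lists are disjoint by construction.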

C. Data Learning Module
In this work, we developed our inference system using an SNN and an ODT to train and classify the communication traffic records provided by the IFW-2019 dataset into three classes: Allow, Deny, and Drop/Reset. In the third class, we have combined the "reset-both" and "drop" actions into one class since "reset-both" has a small number of samples (i.e., only 54 samples).
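The merging of the four logged actions into three classes amounts to a simple lookup. The sketch below is illustrative only; the dictionary `MERGE` and the function `merge_actions` are hypothetical names, and the sample list is made up.

```python
from collections import Counter

# Map each of the four logged actions to one of the three target classes.
MERGE = {
    "allow": "allow",
    "deny": "deny",
    "drop": "drop/reset",
    "reset-both": "drop/reset",
}


def merge_actions(actions):
    """Replace every logged action with its three-class label."""
    return [MERGE[a] for a in actions]


sample = ["allow", "drop", "reset-both", "deny", "allow"]  # made-up example
merged = merge_actions(sample)
print(Counter(merged))
```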

1) Shallow Neural Network (SNN):
In an SNN, data introduced to the network goes through a single hidden layer of pattern recognition. Our SNN is composed of an input vector with 11 inputs ($x_1; x_2; x_3; \ldots; x_{11}$) that connects the 11 features of IFW 2019 to the 150 neurons at the hidden layer ($h_1; h_2; h_3; \ldots; h_{150}$) in a fully connected fashion. Also, every hidden neuron is fully connected to the 3 neurons at the output layer ($o_1; o_2; o_3$), producing the output probabilities for the corresponding classes (Class_1 = "allow"; Class_2 = "deny"; Class_3 = "drop/reset"). Finally, each of the two parametrized layers has its own trainable weight vector, ordered from left to right. Moreover, to demonstrate the symbolic representation of the individual neurons, we consider a neuron unit with an input vector $x$ and a single output $h$. For every neuron, all elements of the input vector $x$ are multiplied by the corresponding weights in the weight vector $W$ and subsequently supplied to the summation junction, producing the dot product of weights and inputs ($W^{T} \cdot x$). Finally, the bias $b$ is added to the dot product, forming the $net$ value: $net = W^{T} \cdot x + b$.
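To make the neuron computation concrete, here is a minimal pure-Python sketch of a forward pass through an 11-150-3 network. The layer sizes come from the text; the random weight initialization, the zero biases, and the helper names (`neuron_net`, `dense`, `forward`) are assumptions, and the sigmoid hidden activation follows the activation named in the classification module. Training is not shown.

```python
import math
import random


def neuron_net(weights, inputs, bias):
    """Dot product of weights and inputs plus the bias (the neuron's net value)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def sigmoid(net):
    """Numerically stable logistic function."""
    if net >= 0:
        return 1.0 / (1.0 + math.exp(-net))
    e = math.exp(net)
    return e / (1.0 + e)


def dense(inputs, weight_rows, biases):
    """Net values of a fully connected layer: one row of weights per neuron."""
    return [neuron_net(w, inputs, b) for w, b in zip(weight_rows, biases)]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (an arbitrary illustrative choice)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
W1, b1 = init_layer(150, 11, rng)  # input (11 features) -> hidden (150 neurons)
W2, b2 = init_layer(3, 150, rng)   # hidden (150 neurons) -> output (3 classes)


def forward(x):
    """Return the three output-layer net values for one 11-feature sample."""
    hidden = [sigmoid(v) for v in dense(x, W1, b1)]
    return dense(hidden, W2, b2)


print(forward([0.5] * 11))
```

The three returned values are unnormalized scores; turning them into class probabilities is the job of the SoftMax stage.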

2) Optimizable Decision Tree (ODT):
Decision trees are powerful and popular tools for classification and prediction. Decision trees represent rules that humans can understand and use in knowledge systems such as databases. In our ODT, we have configured the tree with 11 predictors and one response variable (target variable). The tree uses maximum deviance reduction as its split criterion, with a maximum of 30 splits, and its hyperparameters were optimized over 30 iterations with 5-fold cross-validation.
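The "maximum deviance reduction" criterion can be sketched for a single numeric predictor as follows. This assumes the common definition of node deviance as the entropy $-\sum_k p_k \log_2 p_k$ of the class proportions; the function names and the toy data are hypothetical, and a full tree would apply this search recursively over all 11 predictors.

```python
import math
from collections import Counter


def deviance(labels):
    """Entropy-style deviance of a node: -sum(p_k * log2(p_k)) over its classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def deviance_reduction(values, labels, threshold):
    """Parent deviance minus the size-weighted deviance of the two children."""
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    weighted = (len(left) / n) * deviance(left) + (len(right) / n) * deviance(right)
    return deviance(labels) - weighted


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; keep the best one."""
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if not midpoints:
        return None  # a constant predictor cannot be split
    return max(midpoints, key=lambda t: deviance_reduction(values, labels, t))


# Toy data: one made-up predictor that separates two classes cleanly.
values = [1, 2, 3, 10, 11, 12]
labels = ["allow"] * 3 + ["deny"] * 3
print(best_threshold(values, labels))
```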

D. Data Classification Module
In order to calculate the probabilities for the output classes, we have used the SoftMax activation function (multi-class classifier). SoftMax is a normalized exponential function that maps a vector of $K$ real numbers ($\mathbb{R}^{K}$) into a probability distribution of $K$ real-valued probabilities ($\mathbb{R}^{K}$) that are proportional to the exponentials of the input numbers [9]. To calculate the numerical probabilities for each class, we first consider the final neuron outputs from the previous layer, which are activated using the Sigmoid function $\sigma(net)$ as follows:
$$\sigma(net) = \frac{1}{1 + e^{-net}}$$
The SoftMax probability $p_i$ of class $i$ is then obtained from the $K$ output values $z_1, \ldots, z_K$ as
$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \ldots, K.$$
A sample of the SoftMax classification output is provided in Table II. According to the numerical probabilities provided in the table, the classifier always selects the label with the highest probability value for each instance.
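A minimal sketch of the SoftMax step and the "pick the highest probability" rule is given below. The class names and the example scores are made up; subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math

CLASSES = ("allow", "deny", "drop/reset")  # assumed label order


def softmax(scores):
    """Normalized exponential: non-negative outputs that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # shift by max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores, classes=CLASSES):
    """Return the label with the highest SoftMax probability, plus all probabilities."""
    probs = softmax(scores)
    return classes[probs.index(max(probs))], probs


label, probs = predict([2.0, 0.5, -1.0])  # made-up output-layer scores
print(label, [round(p, 3) for p in probs])
```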

III. RESULTS AND DISCUSSION
In order to develop and evaluate the proposed IFW classification system, the training and testing phases were carried out using the IFW 2019 dataset. The predictive model is specified to differentiate between three classes of network packets: 'allow', 'deny', and 'drop/reset'. The proposed predictive model is implemented using MATLAB 2020b on a commodity laptop. Also, to optimize neural network training speed and memory, MEX calculation (MATLAB executable) has been used to train and simulate the network as well as for gradient calculations. In addition, the original dataset has undergone a preprocessing stage prior to its use in the machine learning techniques. The preprocessing module is responsible for converting the raw traffic records of IFW 2019 into a matrix of labeled features that can be used by the supervised learning part of the classification system. To sum up, the specifications and configurations of the test-bench environment are shown in Table III.
Since the objective of the classification model is to produce output values as close as possible to the true values, the trainable parameters of the model are iteratively adjusted to minimize the Cross-Entropy Loss ($L_{CE}$) value in the case of the SNN model and the Mean Squared Error (MSE) value in the case of the ODT model. Hence, the SoftMax probability ($p_i$) for each predicted class ($i$) is compared to the true class label ($y_i$), and the loss ($L_{CE}$ or MSE) is calculated, which penalizes the probability based on how far it is from the true value [35]. A perfect model has an $L_{CE}$ or MSE loss of 0. The plot of mean squared error versus iteration number, showing the best-point hyperparameters for the ODT model, is illustrated in Fig. 3 (A), while the plot of cross-entropy for the training, validation, testing, and best curves is illustrated in Fig. 3 (B). Moreover, see Table IV.
Furthermore, we have investigated the receiver operating characteristic (ROC) curve for our 3-class classifier.
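Returning to the two loss values discussed above, the sketch below computes them for a single sample. It assumes the usual definitions, cross-entropy as $-\sum_i y_i \log p_i$ over a one-hot target and MSE as the mean of squared differences; the function names and the `eps` clamp that guards against `log(0)` are illustrative choices.

```python
import math


def cross_entropy(probs, one_hot_target, eps=1e-12):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    return -sum(y * math.log(max(p, eps)) for p, y in zip(probs, one_hot_target))


def mean_squared_error(predicted, target):
    """Mean of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


target = [1, 0, 0]  # true class "allow" in one-hot form
confident = [0.98, 0.01, 0.01]
unsure = [0.40, 0.35, 0.25]

# The further the probability of the true class is from 1, the larger the loss.
print(cross_entropy(confident, target), cross_entropy(unsure, target))
print(mean_squared_error(confident, target), mean_squared_error(unsure, target))
```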
The ROC curve represents the relative trade-off between the true positive rate (benefits) on the y-axis and the false positive rate (costs) on the x-axis at various threshold settings. Typically, the classification model is based on a continuous random variable ($X$) that is compared with a predefined threshold ($T$); the instance is classified as "positive" if $X > T$ and "negative" otherwise. Accordingly, the true positive rate (TPR) and false positive rate (FPR) for a given threshold ($T$) can be computed. Consequently, the ROC curve plots TPR($T$) versus FPR($T$) parametrically, with $T$ as the varying parameter. The ROC curves of our three-class classifier for the SNN model and the ODT model are illustrated in Fig. 4. Since almost all experiments yield a point in the upper left corner (0,1) of the ROC space, the classifier provides an almost perfect classification, recording 99.0% and 100.0% for the area under the curve (AUC) of the SNN model and the ODT model, respectively.
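The threshold sweep described above can be sketched for one class treated as "positive" against the rest. This is an illustrative one-vs-rest computation on made-up scores, not the procedure that produced Fig. 4; the function names are hypothetical and the area is computed with the trapezoidal rule.

```python
def roc_points(scores, positives):
    """Return (FPR, TPR) points obtained by sweeping the threshold T downward.

    An instance is called positive when its score is strictly greater than T.
    """
    n_pos = sum(1 for p in positives if p)
    n_neg = len(positives) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative instance")
    thresholds = sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for t in thresholds:
        tp = sum(1 for s, p in zip(scores, positives) if s > t and p)
        fp = sum(1 for s, p in zip(scores, positives) if s > t and not p)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up scores for the class "allow": both true positives outrank the negatives.
scores = [0.9, 0.8, 0.3, 0.1]
positives = [True, True, False, False]
points = roc_points(scores, positives)
print(points, auc(points))
```

When the positives are perfectly separated from the negatives, as in this toy input, the curve passes through the corner (0, 1).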
In addition, Fig. 5 shows a histogram of MSE errors for the training dataset, validation dataset, and testing dataset. The entire range of residuals has been divided into 20 bins. According to the figure, most MSE values are close to zero. Moreover, the histogram error bars appear to follow a normal distribution curve, which reflects the quality of the proposed machine learning model. Most counted errors correspond to the training dataset since it holds most of the records within the employed dataset (i.e., 70% of the data samples belong to the training dataset, 15% to the validation dataset, and 15% to the testing dataset). These error cases are illustrated using color conventions where blue refers to the training error residuals, green refers to the validation error residuals, red refers to the testing error residuals, and the yellow line refers to the zero-error value. Fig. 6 shows the neural network training state in terms of gradient analysis and validation failures during the 124 training epochs. This figure represents the status of the training at a specific time while training is in progress. In our case, the limit is set to six validation failures, which means that training stops once six consecutive validation checks fail. As can be clearly seen, the training process stopped after 124 epochs, the first point at which the model had accumulated six validation check failures. Finally, to gain more insight into the proposed solution's advantages, we benchmarked the IFW classification system by comparing its performance with other state-of-the-art machine-learning-based firewall-action classification systems in terms of the classification accuracy metric. The comparisons are provided in Table V below.
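The validation-check stopping rule can be sketched as a small function over a sequence of per-epoch validation losses. This assumes that a "validation check error" means an epoch whose validation loss does not improve on the best value so far, and that training stops after six such epochs in a row; the function name, the `max_fail` parameter, and the loss values are all hypothetical.

```python
def early_stopping(val_losses, max_fail=6):
    """Return (stop_epoch, best_epoch) using 1-based epoch numbers.

    stop_epoch is None when the failure limit is never reached.
    """
    best_loss = float("inf")
    best_epoch = None
    fails = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, fails = loss, epoch, 0  # improvement resets the count
        else:
            fails += 1
            if fails >= max_fail:
                return epoch, best_epoch
    return None, best_epoch


# Made-up validation losses: improvement for 3 epochs, then 6 epochs without any.
losses = [5.0, 4.0, 3.0, 3.5, 3.5, 3.5, 3.5, 3.5, 3.5]
print(early_stopping(losses))
```

Under this rule the model kept would be the one from `best_epoch`, not from the epoch at which training stopped.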

IV. CONCLUSION
This paper has proposed and discussed a dependable, automated, machine-learning-based internet firewall model to classify the packet traffic of communication network systems. The proposed system uses a shallow neural network (SNN) and an optimizable decision tree (ODT) with 11 attributes/predictors at the input stage and 3 classes at the output classification layer. The proposed system employs the multi-class internet firewall (IFW-2019) dataset, with 70% of the records used for the training dataset, 15% for the validation dataset, and 15% for the testing dataset. After training, the ODT and SNN models recorded maximum accuracies of 99.8% and 98.5%, respectively, for the 3-class classifier. In addition, other machine learning metrics, such as positive predictive value and true positive rate, were evaluated to gain more insight into the system's performance. Finally, based on the comparison with the existing state of the art in the field, the achieved outcomes surpassed the existing automated classification models for firewall actions, which contributes to this area of study.