The Effect of Pre-Processing Techniques and Optimal Parameters Selection on Back Propagation Neural Networks

Abstract: Artificial Neural Networks (ANNs) have gained tremendous attention from researchers, particularly because their architecture makes them a powerful technique for handling problems such as classification, pattern recognition, and data analysis. ANNs are known for their data-driven, self-adaptive, and non-linear capabilities, for processing at high speed, and for the ability to learn the solution to a problem from a set of examples. Recently, neural network training has become a dynamic area of research, with the Multi-Layer Perceptron (MLP) trained with Back-Propagation (BP) being the most popular model and the one studied by the most researchers. In this study, the performance of two BP training algorithms, gradient descent and gradient descent with momentum, each using the sigmoidal and hyperbolic tangent activation functions and coupled with pre-processing techniques, is analyzed and compared. The Min-Max, Z-Score, and Decimal Scaling pre-processing techniques are analyzed. The simulation results generated from selected benchmark datasets reveal that pre-processing the data greatly improves ANN convergence, with Z-Score producing the best overall performance on all datasets.


I. INTRODUCTION
Recently, Artificial Neural Networks (ANNs) have gained tremendous attention from researchers in diverse applications. Most of the interest in ANNs arose from their use in performing useful computations. Roughly speaking, these computations fall into two categories: natural problems such as pattern recognition, and optimization problems. An ANN is an information-processing paradigm motivated by biological nervous systems. The human learning process may be partially automated with ANNs, which can be configured for a specific application, such as pattern recognition or data classification, through a learning process [1]. ANNs and their techniques have become popular and important for modeling and optimization in many areas of science and engineering, and this popularity is largely attributed to their ability to exploit the tolerance for imprecision and uncertainty in real-world problems, coupled with their robustness and parallelism [2]. Accordingly, ANNs have been implemented for a variety of classification and learning tasks [3]. The main reason for using ANNs rests on their inherent properties, such as the capability of learning and generalizing from training data even where the rules are not known a priori [4].
Many kinds of ANN exist. However, the main architecture of an ANN is constructed from an input layer, a hidden layer, and an output layer [3]. Over the last few years, the ANN methodology has been widely accepted for solving problems such as prediction, classification, and pattern recognition, and the ANN has become one of the most highly parameterized models to attract considerable attention [4]. Because of its ability to generalize, to self-learn, and to self-organize and adapt, the ANN has the characteristics that researchers have been looking for, the most important being that it can be trained. Furthermore, an ANN can absorb experience by learning from historical data and previous project information, which can then be used in a new prediction period. The back-propagation (BP) algorithm and the feed-forward network are two popular and widely applied ANN estimation technologies [3]. An ANN consists of active layers and hidden layers, with many connected nodes in each layer. Each connection between two nodes carries a weight, and each node applies an activation function such as the hyperbolic tangent or the sigmoid; of the two, the sigmoid is the more widely used. The most significant ability of an ANN is self-learning, in which it modifies each layer's weights using training samples. The most widely used algorithm, which can be considered the traditional method, is back-propagation [3]. The nodes in an ANN can be arranged in different ways, such as the Single-Layer Perceptron (SLP) and the Multi-Layer Perceptron (MLP). This paper focuses on the back-propagation algorithm built on the MLP, as discussed further in the sections below.
The extension of the single-layer feed-forward structure to the multilayer feed-forward structure is depicted in Fig. 1. As can be seen, the input layer of nodes and the output layer of nodes still exist, as in the single-layer case. However, between these two layers are one or more layers of nodes known as hidden layers. These layers of nodes are denoted layer 0 (input layer), layer 1 (first hidden layer), layer 2 (second hidden layer), and finally layer M (output layer) [3].
To date, the multilayer feed-forward network, also known as the MLP, has become the major and most widely used supervised learning neural network architecture [4]. MLPs rely on computationally intensive training algorithms (such as error back-propagation), and these algorithms can easily get stuck in local minima. In addition, this architecture has problems in dealing with large amounts of training data, while demonstrating poor interpolation properties when using reduced training sets [4]. Attention must also be drawn to the use of biases. Neurons can be chosen with or without biases. Since a bias gives the network an extra variable, networks with biases can be expected to be more powerful [4].

Fig. 1 Feed Forward Neural Network with one hidden layer and one output layer
Many different types of ANN exist, and they are used in many applications. Since the first neural model was proposed by McCulloch and Pitts [5], hundreds of different improved models considered as ANNs have been developed. These models may differ in their functions, accepted values, topology, learning algorithms, and so on. Many hybrid models, in which each neuron has more properties, have also been introduced recently, but the focus here is on an ANN that learns the appropriate weights using the back-propagation algorithm [6], [7], [8].
Back-Propagation (BP), the most common and popular neural network learning technique, is one of the most effective algorithms accepted by researchers at present, and is also the basis of pattern identification in BP neural networks. The BP algorithm uses gradient-based methods, which are among the most commonly used error minimization methods for training back-propagation networks. Despite its popularity, the algorithm still faces drawbacks such as convergence to local optima and slow convergence speed [9]. Since 1986, much research has aimed at improving the traditional Back-Propagation Neural Network (BPNN) by introducing additional parameters such as the learning rate and momentum, or by using different activation functions. Moreover, adequately pre-processing the datasets before training the network also influences performance positively [10]-[14].
Therefore, this study analyzes and discusses the effect of data pre-processing techniques on the performance of the back-propagation algorithm trained with different algorithms and activation functions. The paper is structured as follows: Section II describes the material and method, focusing on the back-propagation training algorithm and its improvements. The experimental setup, which comprises the data pre-processing techniques, simulation results, and analysis, is presented in Section III. Section IV concludes with a summary of the contributions of the study.

II. MATERIAL AND METHOD
The back-propagation (BP) algorithm has emerged as one of the most efficient learning procedures for multi-layer networks, and it is also one of the most common algorithms used in the training of artificial neural networks [15]-[19]. BP learning became efficient with the establishment of its mathematical formulation as the standard method for adjusting weights and biases when training an ANN in many domains [20]. The formulation of the back-propagation algorithm can be described as follows. Given a set of training data propagated to the MLP, the network first calculates the output, where h is the hidden node, x is the input node, w is the weight, and y is the output node. The BP training algorithm is an iterative gradient algorithm designed to minimize the mean square error between the actual output of a multilayer feed-forward perceptron and the desired output. Once the output is calculated, the network computes the error, which is the difference between the expected value t and the actual value, and then computes the error information term δ for both the output and hidden nodes.
where δ_j represents the error information term of node j.
Once the error information terms for each node have been calculated, the network back-propagates this error by adjusting all of the weights, starting from the weights to the output layer and ending at the weights to the input layer.
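For reference, one standard textbook form of the forward pass and the error information terms described above is sketched below. The activation f, the index names i, j, k, the hidden-to-output weights v_jk, and the net inputs net_j and net_k are notational assumptions introduced here, and biases are omitted for brevity; these are not necessarily the equations as originally typeset.

```latex
\begin{align}
h_j &= f\Big(\sum_i w_{ij}\, x_i\Big), &
y_k &= f\Big(\sum_j v_{jk}\, h_j\Big), \\
\delta_k &= (t_k - y_k)\, f'(\mathrm{net}_k), &
\delta_j &= f'(\mathrm{net}_j) \sum_k \delta_k\, v_{jk}.
\end{align}
```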
where η is the learning rate.
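The forward pass, error terms, and weight adjustment described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the network has one hidden layer and no biases, the activation is the uni-polar sigmoid, and the update rule is the common form in which each weight changes by η times the receiving node's δ times the sending node's output. The function name `train_step` and the toy data are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Uni-polar sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def train_step(x, t, w, v, eta):
    """One on-line BP update for a bias-free MLP with one hidden layer.

    w[j][i] is the weight from input i to hidden node j,
    v[k][j] is the weight from hidden node j to output node k.
    Returns the squared error measured before the update.
    """
    # Forward pass: hidden outputs h, then network outputs y.
    h = [sigmoid(sum(wj[i] * x[i] for i in range(len(x)))) for wj in w]
    y = [sigmoid(sum(vk[j] * h[j] for j in range(len(h)))) for vk in v]

    # Error information terms; the sigmoid derivative is f(z) * (1 - f(z)).
    delta_out = [(t[k] - y[k]) * y[k] * (1.0 - y[k]) for k in range(len(y))]
    delta_hid = [
        h[j] * (1.0 - h[j]) * sum(delta_out[k] * v[k][j] for k in range(len(y)))
        for j in range(len(h))
    ]

    # Weight updates: weights to the output layer first, then to the hidden layer.
    for k in range(len(y)):
        for j in range(len(h)):
            v[k][j] += eta * delta_out[k] * h[j]
    for j in range(len(h)):
        for i in range(len(x)):
            w[j][i] += eta * delta_hid[j] * x[i]

    return sum((t[k] - y[k]) ** 2 for k in range(len(y)))


# Toy usage: 2 inputs, 3 hidden nodes, 1 output, a single training pattern.
random.seed(0)
w = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
v = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(1)]
errors = [train_step([1.0, 0.0], [1.0], w, v, eta=0.5) for _ in range(200)]
```

Repeating the step on the same pattern should make the recorded squared error shrink, which is the behaviour the learning rate η controls.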
The generally good performance of the BP algorithm is considered somewhat surprising by many researchers, and applications of ANNs with the BP algorithm have gained immense popularity in different areas. These areas include, but are not limited to, voice recognition, face detection, control systems, medicine, cause and effect analysis, engineering, time series prediction, and cryptosystems.
Multilayer Perceptron (MLP) training is an iterative process, and the most common method used to train an MLP for classification is the back-propagation (BP) algorithm. The basic process in the BP algorithm is that, at each epoch, the network outputs are calculated for the patterns in the training set, and the network weights are adjusted according to the difference between the actual network output and the desired output. The BP algorithm has been independently derived by several researchers working in different fields, and it is capable of organizing the representation of the data in the hidden layers with high power of generalization [15].
The BP algorithm implements the gradient descent method, which is the most venerable, but also one of the least effective, of the classical optimisation strategies. The weights can be updated using either a batch method or an on-line method. In batch training, weight changes are accumulated over an entire presentation of the training data (an epoch) before being applied, while in on-line training, the weights are updated after the presentation of each training example (instance). To this day, Back Propagation Gradient Descent (GD) is probably the simplest of all learning algorithms usable for training multi-layered neural networks. Even though it is not the most efficient algorithm, it converges fairly reliably, and its update equations are well established. This technique is often attributed to Rumelhart, Hinton, and Williams [5].
The main aim of the BP training algorithm is to reduce the error function by iteratively adjusting the network weight vectors. In each training iteration, the weight vectors are adjusted one layer at a time, from the output level towards the network inputs. In the gradient descent version of BP, the change in the network weight vector in each layer is in the direction of the negative gradient of the error function with respect to each weight. The learning rate η multiplies the negative of the gradient to determine the changes to the weights and biases.
Another parameter that plays an important part in BP training is momentum. The back-propagation with momentum algorithm (GDM) has been extensively analysed in the neural network literature and compared with other methods. A momentum term is usually included in simulations of connectionist learning algorithms. It is well known that such a term can greatly improve the speed of learning: the use of momentum can speed up and stabilize the training iterations of the gradient method. The momentum term is added to the increment formula for the weights, so that the present weight update combines the present gradient of the error function with the previous weight update.
where α is the momentum parameter, and r is the iteration number. The momentum parameter is analogous to the mass of a Newtonian particle moving through a viscous medium in a conservative force field. The performance of GDM therefore depends on two training parameters. One is the learning rate, as in simple gradient descent. The other is the momentum, a constant that determines how much of the previous weight change is carried into the current update.
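The difference between batch and on-line updating can be illustrated with a deliberately tiny model. The sketch below assumes a single linear neuron with one weight and a squared-error loss, chosen only to keep the example short; the function names and the toy data are hypothetical and are not taken from the study.

```python
def online_epoch(w, data, eta):
    """On-line mode: update the weight after every training example."""
    for x, t in data:
        w += eta * (t - w * x) * x
    return w


def batch_epoch(w, data, eta):
    """Batch mode: accumulate the changes over the epoch, then apply them once."""
    total_change = 0.0
    for x, t in data:
        total_change += eta * (t - w * x) * x
    return w + total_change


# Toy data generated by t = 2 * x, so the ideal weight is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_online = 0.0
w_batch = 0.0
for _ in range(100):
    w_online = online_epoch(w_online, data, eta=0.05)
    w_batch = batch_epoch(w_batch, data, eta=0.05)
```

Both modes reach the same weight on this problem; they differ only in when the accumulated change is applied.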
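A common way to write the momentum update is Δw(r) = -η ∂E/∂w + α Δw(r-1), which matches the verbal description above but is an assumed form rather than a quotation of the original equation. The sketch below applies it to a one-dimensional quadratic error E(w) = (w - 3)^2 / 2, a hypothetical test function chosen so the effect of α is easy to see.

```python
def gdm_minimize(grad, w0, eta, alpha, steps):
    """Gradient descent with momentum on a scalar weight.

    Each update combines the current negative gradient (scaled by eta)
    with the previous update (scaled by alpha). alpha = 0 gives plain GD.
    """
    w = w0
    prev_dw = 0.0
    for _ in range(steps):
        dw = -eta * grad(w) + alpha * prev_dw
        w += dw
        prev_dw = dw
    return w


def grad(w):
    """Gradient of the toy error E(w) = (w - 3)^2 / 2."""
    return w - 3.0


w_gd = gdm_minimize(grad, w0=0.0, eta=0.1, alpha=0.0, steps=30)
w_gdm = gdm_minimize(grad, w0=0.0, eta=0.1, alpha=0.5, steps=30)
```

With the same learning rate and number of steps, the run with momentum ends closer to the minimum at w = 3 on this toy problem. Larger α values can also cause oscillation, which is why α is tuned alongside η.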
Even though the BP algorithm has produced satisfactory results in many training tasks, it has several important limitations. Since the BP algorithm uses the gradient descent method to update weights, it is not guaranteed to find the global minimum of the error function. The problem of improving the learning efficiency and convergence rate of the BP algorithm has been investigated by a number of researchers. One proposed approach incorporates learning rate adaptation methods and applies the Goldstein-Armijo line search in back-propagation algorithms [21]. The reported advantages of these methods are stable learning, robustness to oscillations, and an improved convergence rate. The experiments indicate that the proposed algorithms can ensure global convergence (that is, convergence to a minimizer from any starting point).
Since the sigmoid derivative, which appears in the error term of the original BP algorithm, has a bell shape, it sometimes causes slow learning progress when the output of a unit is near '0' or '1'. The importance of the activation function within the back-propagation algorithm was therefore emphasized by Sibi et al. [22]. They carried out a performance analysis with different activation functions and confirmed that the choice of activation function plays a great role in the performance of the neural network; other factors, such as the training algorithm, network size, and learning parameters, also come into play.
One of the main reasons for the slow convergence of BP algorithms is the derivative of the activation function, which leads to premature saturation of the network output units. The activation function (also called a transfer function) can be linear or nonlinear. The activation function f(.) is also known as a squashing function, since it keeps the cell's output between certain limits, as is the case in the biological neuron [21]. The relationship between the net input and the output of the artificial neuron is what is called its activation function. Different functions can be used to determine the output produced for a given net input. Of the different types of activation function [22], this study considers the Uni-Polar Sigmoidal Function (S-shaped function) and the Hyperbolic Tangent Function.
The sigmoid function is by far the most common form of activation function used in the construction of artificial neural networks [15]. The uni-polar sigmoid function is given by $f(x) = \frac{1}{1 + e^{-x}}$. This function has many advantages, especially in neural networks trained by back-propagation algorithms. First of all, it is easily differentiated, which reduces the computation required for training. The term sigmoid denotes 'S-shaped', and the logistic form of the sigmoid maps the interval (-∞, ∞) onto (0, 1) [18], as seen in Fig. 2.
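The ease of differentiation mentioned above comes from the identity f'(x) = f(x)(1 - f(x)) for the logistic sigmoid, so the derivative can be computed from the already-available output. The following short sketch, with hypothetical function names, shows this and the bell-shaped derivative that becomes very small when the output is near 0 or 1.

```python
import math


def sigmoid(x):
    """Uni-polar (logistic) sigmoid, mapping any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid, computed from its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at x = 0 and shrinks as the unit saturates.
slope_at_zero = sigmoid_derivative(0.0)
slope_when_saturated = sigmoid_derivative(10.0)
```

The small slope in the saturated region is the source of the slow learning progress discussed in the preceding paragraphs.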
In other applications, the activation function is chosen such that the output y is in the range from -1 to +1 rather than 0 to +1 [22]. This activation function is known as the hyperbolic tangent function and is shown in Fig. 2. It is defined as the ratio between the hyperbolic sine and cosine functions or, expanded, as the ratio of the half-difference and half-sum of two exponential functions at the points x and -x: $\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. Among the various attempts to enhance the efficiency of the BP algorithm mentioned above, those using the gain value are among the easiest to implement. Researchers have paid attention to the gain value as one of the parameters that influence the performance of BP training. The gain parameter, c, directly influences the slope of the activation function [23]. For large gain values (c >> 1), the activation function approaches a 'step function', whereas for small gain values (0 < c << 1), the output values change from zero to nearly unity over a large range of the weighted sum of the input values, and the sigmoid function approximates a 'linear function', as shown in Fig. 3.
Fig. 3 The effect of gain on sigmoid activation function
It has been shown that a BP algorithm using a variable gain in the activation function converges faster than the standard BP algorithm. The main reason for this improvement is that a change in the gain value effectively causes a change in the momentum and learning rate [23].
Many simulation results have shown that changing the gain improves the convergence behaviour and also helps the network slide through local minima. This includes work in pattern recognition, where the identification and recognition of complex patterns by the adjustment of weights was investigated [24]-[25]. The experimental results showed that using the gain value yielded high accuracy and a better tolerance factor, but may take a considerable amount of time. The research was taken to the next level by Nawi et al. [26], who proposed a cuckoo search optimized method for training the back-propagation algorithm. The simulation results showed that the proposed method was more effective in terms of convergence rate, simplicity, and accuracy.
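The exponential form of the hyperbolic tangent and the effect of the gain on the sigmoid can be checked numerically. In the sketch below, the gain is assumed to enter the sigmoid as f(x) = 1 / (1 + e^(-c x)), which is the usual convention but is not stated explicitly in the text; the function names are hypothetical.

```python
import math


def tanh_from_exponentials(x):
    """Hyperbolic tangent written as a ratio of exponentials."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def sigmoid_with_gain(x, c):
    """Sigmoid with gain c (assumed form); c scales the slope at x = 0."""
    return 1.0 / (1.0 + math.exp(-c * x))


# Large gain: the output jumps almost straight to 1 for positive input.
steep = sigmoid_with_gain(0.5, c=50.0)
# Small gain: the output stays close to the line 0.5 + c * x / 4.
flat = sigmoid_with_gain(0.5, c=0.01)
```

The large-gain value sits near 1, as a step function would, while the small-gain value is close to its linear approximation around 0.5.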

A. Data Processing Technique
In data analysis, the role of data pre-processing cannot be ignored, since training data may contain noise and outliers and need to be cleaned before training the network. The main purpose of data pre-processing is to remove irrelevant information and extract the key features of the data, simplifying a pattern recognition problem without throwing away any important information. Data pre-processing is therefore a significant step in the data mining process. Most data gathering methods are loosely controlled, resulting in outliers, impossible data combinations, missing values, and so on. As a result, analyzing data that has not been carefully screened can produce misleading results. Thus, the representation and quality of the data come first, before running any analysis [27]. Quality, reliability, and availability are some of the factors that may lead to a successful interpretation of the data by a neural network. Furthermore, if inappropriate information or noisy and unreliable data is present, knowledge discovery during training becomes very difficult.
Data preparation and filtering steps can take a considerable amount of processing time, but once pre-processing is done, the data become more reliable and more robust results are achieved [28]. Therefore, to improve the training efficiency of the BP algorithm, this study employed three pre-processing techniques: Min-Max Normalization (Equation 11), Z-Score Normalization (Equation 12), and Decimal Scaling Normalization (Equation 13).
Here mean(p) denotes the mean of attribute P, and std(p) represents the standard deviation of attribute P.
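In their standard textbook forms, the three normalizations can be written as below. The primed symbols for normalized values, the target range [new_min, new_max] in the Min-Max formula, and the use of p_min and p_max for the attribute's minimum and maximum are notational assumptions; the equation numbers follow the references in the text.

```latex
\begin{align}
p' &= \frac{p - p_{\min}}{p_{\max} - p_{\min}}
      \,(\mathit{new\_max} - \mathit{new\_min}) + \mathit{new\_min} \tag{11} \\
p' &= \frac{p - \mathrm{mean}(p)}{\mathrm{std}(p)} \tag{12} \\
d' &= \frac{d}{10^{m}} \tag{13}
\end{align}
```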
where m is the smallest integer such that Max(|d'|) < 1.
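The three pre-processing techniques can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used in the study: the Min-Max target range defaults to [0, 1], the Z-Score uses the population standard deviation, the decimal-scaling exponent m is searched from 0 upwards, and constant features are mapped to a fixed value to avoid division by zero. All function names are hypothetical.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-Max normalization into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: the range is zero, so return a fixed value.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


def z_score(values):
    """Z-Score normalization using the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by 10**m, with m the smallest m >= 0 giving max(|d'|) < 1."""
    max_abs = max(abs(v) for v in values)
    m = 0
    while max_abs / 10 ** m >= 1:
        m += 1
    return [v / 10 ** m for v in values]


scaled = min_max([2.0, 4.0, 6.0])
standardized = z_score([2.0, 4.0, 6.0])
shifted = decimal_scaling([150.0, -30.0, 7.0])
```

In practice the statistics (minimum, maximum, mean, standard deviation, m) would typically be computed on the training set and reused for the testing set; the text does not say which convention the study followed.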

III. RESULTS AND DISCUSSION
Experiments were set up and performed to provide empirical evidence for the comparative study of different data pre-processing methods in the MLP back-propagation model for classification problems. Two different training algorithms and two different activation functions were selected. The algorithms were implemented in MATLAB (MATrix LABoratory). The selected datasets were retrieved from the UCI Machine Learning Repository, namely the Iris plant, Balance-Scale, and Car Evaluation datasets. For all datasets, the data were split into 70% for the training set and 30% for the testing set. The parameter values used with standard Gradient Descent (GD) and Gradient Descent with Momentum (GDM) on all datasets are η = 0.1 to 0.7 (for GD) and α and η = 0.1 to 0.7 (for GDM).
The performance analyses were performed across 10 trials, and the averages over all trials were recorded. The effect of combining data pre-processing techniques with the ANN training algorithm is evaluated through simulations on benchmark datasets. The pre-processing techniques are evaluated in combination with different activation functions and with the GD and GDM training algorithms. The aim of the simulations is to determine which model performs best under the three evaluation criteria considered. The performance criteria for the analysis are the classification accuracy (ACC) and the Mean Squared Error (MSE), given in Equation (14) and Equation (15).
In addition, the number of epochs (i.e. the number of times all the training vectors are used once to update the weights) makes up the third metric, with the limit set at 5000. If the model performs well and reaches the targets, it can then be applied to new data to make predictions.
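The experimental setup described above can be sketched as a 70/30 split plus a parameter grid. The sketch below is in Python rather than the MATLAB used in the study, and it makes several assumptions: the split is a seeded random shuffle, the parameter values run from 0.1 to 0.7 in steps of 0.2, and GDM is shown with a full grid over η and α even though the study may have fixed η at previously chosen values. The function name `split_70_30` is hypothetical.

```python
import random


def split_70_30(instances, seed=0):
    """Shuffle the instances and split them 70% / 30% (assumed random split)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


# Candidate values for the learning rate (eta) and momentum (alpha).
rates = [0.1, 0.3, 0.5, 0.7]
gd_grid = [{"eta": eta} for eta in rates]
gdm_grid = [{"eta": eta, "alpha": alpha} for eta in rates for alpha in rates]

# Example with 150 instance indices, the size of the Iris dataset.
train, test = split_70_30(range(150))
```

Each configuration in the grids would then be trained and evaluated over repeated trials, with the averages recorded.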
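$\mathrm{ACC} = (C / A) \times 100$ (14)
where C represents the number of correctly classified instances, and A the total number of instances.
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (P_i - P_i^{*})^2$ (15)
where P_i is the vector of n predictions, P_i* is the vector of the true values, and n is the number of instances.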
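The two performance criteria can be computed directly from their definitions. The sketch below uses hypothetical function names and toy values; it assumes class labels are compared for exact equality and that predictions and true values are aligned lists of equal length.

```python
def accuracy(predicted_labels, true_labels):
    """Classification accuracy in percent: (C / A) * 100."""
    correct = sum(1 for p, a in zip(predicted_labels, true_labels) if p == a)
    return correct / len(true_labels) * 100.0


def mean_squared_error(predictions, targets):
    """Mean of the squared differences between predictions and true values."""
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n


acc = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
mse = mean_squared_error([1.0, 2.0], [0.0, 4.0])
```

In the first call three of the four labels match, and in the second the squared differences are 1 and 4, so the two values are 75.0 and 2.5.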
The simulation experiments were carried out using MATLAB (R2012a) on a Pentium 4 Core i7 CPU, and all results from the series of experiments are tabulated and discussed.
The pre-processing techniques were applied with the GD and GDM training algorithms and different activation functions (tansig = hyperbolic tangent and logsig = sigmoidal) for the MLP on the Iris dataset, which contains 150 instances partitioned into a training set and a testing set with a distribution of 70% and 30% respectively. The network is built with one hidden layer. For GD, the learning rate started at 0.1 and was increased in steps of 0.2 until it reached a maximum of 0.7. For GDM, the best learning rates were chosen from previous experience, while the momentum started at 0.1 and was increased in steps of 0.2 until it reached a maximum of 0.7. The network has 3 output nodes for both training algorithms.
In Table 1, the accuracies obtained with the data pre-processing techniques are mostly above 90% for GD and GDM. The exceptions, which fall below 90% accuracy, are Min-Max-Logsig at a learning rate of 0.1, Decimal Scaling-Tansig at 0.1, and Decimal Scaling-Logsig at learning rates of 0.1 and 0.7. With GD, Z-Score-Logsig with a learning rate of 0.3 outperformed the other pre-processing techniques in terms of accuracy, while Z-Score-Tansig with a learning rate of 0.5 gave the minimum error. With GDM, only Decimal Scaling-Logsig with a momentum of 0.3 and a learning rate of 0.3 produced the lowest accuracy. Z-Score-Logsig with a learning rate of 0.3 and a momentum of 0.7 outperformed the other pre-processing techniques in terms of accuracy, while Z-Score-Tansig with a learning rate of 0.5 and a momentum of 0.5 gave the minimum error.
The classification accuracy on the Balance-Scale dataset varied with the pre-processing technique, as shown in Table 2. For all pre-processing techniques with the GD training algorithm and both activation functions (tansig and logsig), a learning rate of 0.5 gave the highest accuracy, with Z-Score-Logsig the highest overall. Z-Score-Tansig with a learning rate of 0.5 outperformed the other pre-processing techniques in terms of minimum error. With the GDM training algorithm, Z-Score-Tansig with a momentum of 0.5 and a learning rate of 0.5 gave the highest accuracy. Z-Score-Tansig also gave the lowest error, 0.037%, at a momentum of 0.7.
The results in Table 3 for the Car Evaluation data show that the pre-processing techniques with the GD and GDM training algorithms performed very well, with accuracy rates from 95% upwards. With GD, Z-Score-Logsig gave the highest accuracy, 96.69%, at a learning rate of 0.5. The MSE values are also at their lowest, with Z-Score-Tansig generating the minimum error of 0.059% at a learning rate of 0.7. With GDM, which used learning rates of 0.1 and 0.5, Decimal Scaling-Tansig outperformed the other pre-processing techniques with an accuracy of 96.61% at a learning rate of 0.1 and a momentum of 0.3, while Z-Score-Tansig gave the minimum error of 0.060% at learning rates of 0.1 and 0.3 with a momentum of 0.5.

IV. CONCLUSION
Motivated by the recurring problems of the Back Propagation (BP) neural network, namely getting stuck in local optima, slow convergence, and the computational cost of its learning process, this study explored the influence of data pre-processing in alleviating these shortcomings. Using the Min-Max, Decimal Scaling, and Z-Score pre-processing techniques, coupled with the gradient descent and gradient descent with momentum training algorithms and the uni-polar sigmoidal and hyperbolic tangent activation functions, the performance of the BP neural network improved greatly, with minimum errors. The experimental results align with the goal of this study. For all the datasets, at different learning rates and momentum values, the pre-processing techniques increased the accuracy of the BP classifier, with Z-Score outperforming the other techniques overall. The computational cost also diminished, which ultimately improves performance. Hence, it can be concluded that adequately pre-processing the data improves the overall efficiency of the neural network.

Fig. 2 The unipolar sigmoidal function (S-shaped) and the hyperbolic tangent function
In Equation 11, p_max is the maximum value of the attribute and p_min is the minimum value of the attribute; when p_max - p_min = 0, it indicates a constant value for that feature in the data.

TABLE III CLASSIFICATION PERFORMANCES FOR CAR EVALUATION DATASET