Optimal Parameter Selection Using Three-term Back Propagation Algorithm for Data Classification

Abstract: The back propagation (BP) algorithm is the most popular supervised learning method for multi-layered feed-forward neural networks. It has been successfully deployed in numerous practical problems and disciplines. Despite its popularity, BP is still known for some major drawbacks, such as easily getting stuck in local minima and slow convergence, since it uses the gradient descent (GD) method to train the network. Over the years, researchers have proposed many improved modifications of the BP learning algorithm, but the local minima problem remains unresolved. Therefore, to address the inherent problems of the BP algorithm, this paper proposes the BPGD-A3T algorithm, which introduces three adaptive parameters into BP: gain, momentum, and learning rate. The performance of the proposed BPGD-A3T algorithm is compared with BPGD with two-term parameters (BPGD-2T), BP with adaptive gain (BPGD-AG), and the conventional BP algorithm (BPGD) by means of simulations on classification datasets. The simulation results show that the proposed BPGD-A3T performs better than the other algorithms and achieves the highest accuracy on all datasets.


I. INTRODUCTION
An Artificial Neural Network (ANN) is based on a model of the human brain. An ANN is composed of several neurons that act as processors, interconnected by weighted links that are updated to obtain the required outputs [1]. It uses a mathematical model for information processing based on a computational approach inspired by the structure and operation of biological neurons organized into layers. Basically, there are three layers in a neural network: the input layer, the hidden layer, and the output layer. The most common and established neural network model is the multilayer perceptron (MLP). This type of neural network is known as a supervised network since it requires the desired output in order to learn. The main purpose of this type of network is to create a model that correctly maps the input to the output using historical data, so that the model can then be used to produce the output for unseen data when the desired output is unknown. The back propagation (BP) algorithm is a very popular supervised learning method for multi-layered feed-forward neural networks and is commonly used as the learning algorithm for training them. In back propagation, the input data is repeatedly presented to the neural network, and at each presentation the output of the neural network is compared with the desired output in order to compute the error. The error is then fed back (back propagated) through the neural network and used to adjust the weights such that the error decreases with each iteration. As a result, the neural model gets closer and closer to producing the desired output. The algorithm uses the gradient descent (GD) method, which minimizes the error of the network by moving down the gradient of the error curve: the weights of the network are adjusted at every iteration so that the error is reduced along a descent direction.
Recently, Artificial Neural Network (ANN) technology has gained much attention and has been improved by many researchers. Several modifications to the conventional BP algorithm have been proposed in order to improve the training performance of multilayer perceptron networks. The simplest and most significant improvements of BP focus on the development of ad hoc techniques [2]-[9].
Among these techniques, some researchers introduced a momentum term, while others used an alternative cost function or dynamic adaptation of the learning parameters. Many apply special weight initialization techniques.
Later, Nazri et al. [10] showed that adaptively changing the 'gain' value for each node can reduce the training time without modifying the network topology. This is due to the effect of the 'gain' parameter in reducing the number of steps needed to reach the minimum error. This research therefore takes a further step by proposing an improvement on [10] that adjusts the activation function of the neurons in the hidden layer in each training set. The activation functions are adjusted by combining the gain parameter with an adaptive momentum and an adaptive learning rate during the learning process. The proposed algorithm, known as BPGD-A3T, presents a better convergence rate and can prevent the network from being trapped in local minima. Its performance is compared with the conventional BP algorithm (BPGD), back propagation gradient descent with adaptive gain (BPGD-AG), back propagation gradient descent with adaptive momentum (BPGD-AM), and back propagation gradient descent with adaptive learning rate (BPGD-ALR). The simulations were performed on five classification datasets: the glass, card, diabetes, heart, and horse datasets.
The remainder of the paper is organized as follows. Section II explains the basic operation of the back propagation algorithm and the proposed algorithm. The simulation results are discussed in Section III. The paper is concluded in the final section.

II. MATERIAL AND METHOD
The Back Propagation (BP) algorithm is a well-known technique used in the implementation of artificial neural networks. The BP algorithm has attracted the attention of many researchers and has been implemented in diverse disciplines and applications. Its main strength is that it looks for the minimum of the error function in weight space using the method of gradient descent; the combination of weights that minimizes the error function is considered to be a solution to the learning problem. Since this method requires computation of the gradient of the error function at each iteration step, the continuity and differentiability of the error function must be guaranteed. Many activation functions can be used; one of the most popular for back propagation networks is the sigmoid, a real function $s_c : \mathbb{R} \to (0, 1)$ defined by the expression
$$s_c(x) = \frac{1}{1 + e^{-cx}}.$$
A differentiable activation function makes the function computed by a neural network differentiable, since the network itself computes only function compositions. The error function also becomes differentiable. Furthermore, since the sigmoid always has a positive derivative, the slope of the error function provides a greater or lesser descent direction that can be followed. In most cases, local minima appear because the targets for the outputs of the computing units are values other than 0 or 1. For example, if a network for the computation of XOR is trained to produce 0.9 at the inputs (0,1) and (1,0), then the surface of the error function develops some protuberances where local minima can arise. In the case of binary target values, some local minima are also present, as shown by Lisboa and Perantonis [11], who analytically found all local minima of the XOR function. The network model represents a chain of function compositions which transform an input vector into an output vector.
The network is a particular implementation of a composite function from input to output space, which is called the network function. The learning problem consists of finding the optimal combination of weights so that the network function α approximates a given function f as closely as possible [11].
The learning rate (LR) is one of the crucial factors in accelerating the convergence of BP learning; it controls the size of the neuron weight adjustments at each iteration during the training process. The convergence speed depends on the choice of LR. If the LR is too small, the algorithm will take a long time to converge or may never converge. If the LR is too high, convergence may accelerate significantly, but training may become unstable. The LR is usually set to be constant for all weights throughout the learning process. Adding a momentum coefficient (MC) to the network can speed up convergence, stabilize the training procedure, and help avoid local minima. The MC is usually set to a constant in the interval [0,1]. However, simulations have shown that a fixed momentum coefficient can only speed up learning when the recent downhill gradient of the error function and the last weight change point in similar directions. When the recent negative gradient is in a crossing direction to the previous update, the MC may cause the weight to be moved up the slope of the error surface rather than down the slope as desired [12]. This has led to the emergence of diverse schemes for adjusting the MC value adaptively instead of keeping it constant throughout the training process [13][14].
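The interplay between LR and MC can be illustrated with the usual two-term update, in which the weight change is the negative gradient scaled by the learning rate plus the previous weight change scaled by the momentum coefficient. The sketch below applies this update to a toy one-dimensional error function E(w) = (w - 3)^2. The toy function, the function name `gd_momentum`, and the parameter values are assumptions chosen purely for illustration.

```python
def gd_momentum(grad, w0, lr, mc, steps):
    """Gradient descent with a momentum term.

    Update rule: delta(t) = -lr * grad(w) + mc * delta(t - 1).
    Returns the list of weights visited, starting with w0.
    """
    w, delta = w0, 0.0
    history = [w]
    for _ in range(steps):
        delta = -lr * grad(w) + mc * delta
        w += delta
        history.append(w)
    return history


def toy_gradient(w):
    """Gradient of the toy error function E(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


# Same learning rate, without and with a fixed momentum coefficient.
plain = gd_momentum(toy_gradient, w0=0.0, lr=0.05, mc=0.0, steps=30)
with_mc = gd_momentum(toy_gradient, w0=0.0, lr=0.05, mc=0.5, steps=30)
```

On this smooth error surface, successive gradients keep pointing the same way, so the run with momentum ends much closer to the minimum at w = 3 after the same number of steps. This is the favourable case described above; it does not show the crossing-direction case in which a fixed MC can hurt.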
Yu and Liu [15] proposed a back propagation algorithm with adaptive learning rate and momentum, in which the learning rate and the momentum are adjusted at each iteration to reduce the training time. The modified algorithm outperforms the conventional back propagation with fixed momentum or without momentum in terms of learning speed. Shamsuddin et al. [16] improved the convergence rate of the two-term BP model with some modifications to the learning strategies. Their experimental results show that the modified two-term BP achieves a much better convergence rate than standard BP. Iranmanesh and Mahdavi [17] proposed a differential adaptive learning rate method for BP to speed up learning.
A few researchers have also combined optimization methods such as Particle Swarm Optimization and the random walk algorithm with BP [18]-[19]. However, the calculation for finding an optimum solution is complex and causes extra overhead. For this reason, this paper focuses only on parameters such as momentum, learning rate, and activation function for improving BP.
Moreover, the proposed method does not cause any additional overhead, since it employs a large learning rate at the beginning of training and gradually decreases the learning rate using the differential adaptive method. Considering the advantages that each of the three parameters brings to BP performance, we believe that combining all three will further improve the performance of the BP algorithm and make it converge faster.
Therefore, this paper proposes the BPGD-A3T algorithm, which modifies the BP algorithm with three adaptive terms: gain, momentum coefficient, and learning rate. The advantages of using an adaptive gain value together with the momentum coefficient and learning rate have been investigated. Gain updates, implemented for the output and hidden nodes in the same way as the weight and bias updates, have also been explored. The iterative algorithm is proposed for the batch mode of training: for the whole training set presented to the network, the weights, biases, gains, momentum coefficients, and learning rates are calculated and updated [17].
The pseudocode of the proposed algorithm is as follows:
Update the weights, biases, gains, momentum coefficients and learning rates using the summed updating terms, and repeat this procedure on an epoch-by-epoch basis until the error on the entire training data set is reduced to a predefined value.
End
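The batch procedure described above (sum the update terms over the training set, apply them, and repeat epoch by epoch until the error falls to a predefined value) can be sketched in Python as follows. This is only a minimal sketch under strong assumptions and is not the BPGD-A3T implementation: it trains a single sigmoid unit instead of a three-layer network, uses the logical AND problem as toy data, and holds the learning rate and momentum coefficient fixed, because their adaptation rules are not spelled out in the text. All names (`train_batch`, `AND_PATTERNS`) are hypothetical.

```python
import math


def sigmoid(net, gain):
    """Logistic sigmoid of gain * net, written to avoid overflow."""
    z = gain * net
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_batch(patterns, lr=0.2, mc=0.5, target_error=0.01, max_epochs=5000):
    """Batch-mode gradient descent for one sigmoid unit with trainable gain."""
    n_inputs = len(patterns[0][0])
    w = [0.0] * n_inputs
    b, gain = 0.0, 1.0
    prev_dw = [0.0] * n_inputs
    prev_db = prev_dg = 0.0
    errors = []
    for _ in range(max_epochs):
        # Sum the gradient terms over the entire training set (batch mode).
        gw = [0.0] * n_inputs
        gb = gg = 0.0
        error = 0.0
        for x, t in patterns:
            net = sum(wi * xi for wi, xi in zip(w, x)) + b
            o = sigmoid(net, gain)
            error += 0.5 * (t - o) ** 2
            delta = -(t - o) * o * (1.0 - o)  # dE/d(gain * net)
            for i, xi in enumerate(x):
                gw[i] += delta * gain * xi
            gb += delta * gain
            gg += delta * net
        errors.append(error)
        if error <= target_error:
            break
        # Apply the summed terms to weights, bias, and gain, with momentum.
        prev_dw = [-lr * g + mc * d for g, d in zip(gw, prev_dw)]
        prev_db = -lr * gb + mc * prev_db
        prev_dg = -lr * gg + mc * prev_dg
        w = [wi + d for wi, d in zip(w, prev_dw)]
        b += prev_db
        gain += prev_dg
    return w, b, gain, errors


# Toy data (an assumption for illustration): the logical AND function.
AND_PATTERNS = [((0.0, 0.0), 0.0), ((0.0, 1.0), 0.0),
                ((1.0, 0.0), 0.0), ((1.0, 1.0), 1.0)]
w, b, gain, errors = train_batch(AND_PATTERNS)
```

The gain is treated like any other trainable parameter: its gradient is accumulated over the batch and applied with the same momentum rule as the weights and bias. An adaptive scheme would additionally recompute `lr` and `mc` inside the epoch loop.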

III. RESULTS AND DISCUSSION
The simulations were carried out using MATLAB on five classification datasets taken from the UCI machine learning repository: the glass, card, diabetes, heart, and horse datasets. The following algorithms are analysed and simulated on the datasets: the conventional BP algorithm (BPGD), BP with adaptive gain (BPGD-AG), BP with adaptive momentum (BPGD-AM), BP with adaptive learning rate (BPGD-ALR), and the proposed BPGD-A3T. Three-layer back-propagation neural networks are used to test the models. The hidden layer is kept constant at 5 nodes, while the numbers of input and output nodes differ according to the dataset, and the sigmoid activation function is used for all nodes. The maximum number of iterations for each problem is set to 5000 epochs, and 30 trials are run for each dataset. For each trial, the results are stored in a result file, and the CPU time and accuracy are recorded for every dataset.
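The trial protocol described above (repeated trials per dataset, with CPU time and accuracy recorded for each) could be organised along the following lines. The original experiments were run in MATLAB; this Python sketch is only an assumed equivalent, and `run_trials` and `dummy_trainer` are hypothetical names. The dummy trainer returns random numbers and stands in for a real training routine.

```python
import random
import statistics
import time


def run_trials(train_and_eval, n_trials=30, seed=0):
    """Run repeated trials and average epochs, CPU time, and accuracy.

    train_and_eval(rng) must return a tuple (epochs, accuracy).
    """
    records = []
    for trial in range(n_trials):
        rng = random.Random(seed + trial)  # reproducible RNG per trial
        start = time.process_time()
        epochs, accuracy = train_and_eval(rng)
        cpu = time.process_time() - start
        records.append((epochs, cpu, accuracy))
    return {
        "epochs": statistics.mean(r[0] for r in records),
        "cpu_time": statistics.mean(r[1] for r in records),
        "accuracy": statistics.mean(r[2] for r in records),
        "n_trials": len(records),
    }


def dummy_trainer(rng):
    """Hypothetical stand-in for a real training routine (random output)."""
    return rng.randint(900, 1100), rng.uniform(0.90, 0.95)


summary = run_trials(dummy_trainer)
```

Using `time.process_time` measures CPU time rather than wall-clock time, which matches the quantity reported in the experiments; seeding each trial separately keeps the random initial values reproducible.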
For all training with the conventional BPGD algorithm, the initial values of the momentum coefficient and learning rate are fixed. Likewise, for all training with BPGD-AG, the initial values of the momentum coefficient and learning rate are fixed. The initial value of the gain parameter for the BPGD-AG, BPGD-AM, BPGD-ALR, and BPGD-A3T algorithms is set to 1. For all training with the BPGD-AM, BPGD-ALR, and BPGD-A3T algorithms, the weights and biases are updated using the new values of the gain, momentum coefficient, and learning rate as these are modified. For BPGD-AM and BPGD-ALR, the initial value of the momentum coefficient is fixed and the learning rate is randomly generated. For BPGD-A3T, the initial values of the momentum coefficient and learning rate are randomly generated. The target error is set to 0.01.

A. Glass Dataset
This dataset was collected by B. German on fragments of glass encountered in forensic work. The glass dataset is used for separating glass splinters into six classes, namely float processed building windows, non-float processed building windows, vehicle windows, containers, tableware, or headlamps [20]. The selected network architecture is 9-5-6, with the target error set to 0.01 and the maximum number of epochs set to 5000. The best momentum coefficient and learning rate values for conventional BPGD and BPGD-AG on the glass dataset are 0.2 and 0.4 respectively. For BPGD-AM, the best momentum coefficient and learning rate values are found in the interval [0.1,0.2] and at 0.4 respectively, while the best momentum coefficient and learning rate values for BPGD-ALR are found in the interval 0.
Table 1 and Fig. 1 show that the proposed algorithm (BPGD-A3T) gives the best performance. Its accuracy is 81.08%, compared with 81.01%, 79.36%, 79.11%, and 75.02% for BPGD-ALR, BPGD-AM, BPGD-AG, and BPGD respectively. Moreover, the proposed algorithm needs 1997 epochs to converge, as opposed to about 2312 epochs for conventional BPGD, 2022 for BPGD-ALR, 2071 for BPGD-AM, and 2095 for BPGD-AG. The time required for training on the classification dataset is also an important factor when analysing performance. The results clearly show that the proposed algorithm (BPGD-A3T) has the best total convergence time compared with conventional BPGD, BPGD-AG, BPGD-AM, and BPGD-ALR.

B. Card Dataset
This dataset is used to predict the approval or non-approval of a credit card for a customer [21]. Descriptions of the attribute names and values are not disclosed for confidentiality reasons. The selected network architecture is 51-5-2, with the target error set to 0.01 and the maximum number of epochs set to 5000. The best momentum coefficient and learning rate values for conventional BPGD and BPGD-AG on the card dataset are 0.
Table 2 and Fig. 2 show that BPGD needs 44.98 seconds and 1175 epochs to converge, whereas BPGD-AG needs 13.08 seconds and 1243 epochs, BPGD-AM needs 12.44 seconds and 1176 epochs, and BPGD-ALR needs 11.91 seconds and 1041 epochs. In contrast, the proposed algorithm (BPGD-A3T) performs better, needing only 10.39 seconds and 970 epochs to converge. Furthermore, the accuracy of the proposed algorithm is better, at 93.21%, compared with 90.60%, 93.00%, 92.39%, and 90.94% for BPGD-ALR, BPGD-AM, BPGD-AG, and BPGD respectively.
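To put the reported card dataset figures in relative terms, the short Python calculation below derives the CPU-time speedup and the reduction in epochs of BPGD-A3T over conventional BPGD. Only the four reported numbers come from the text; the variable names are illustrative.

```python
# Reported card dataset results for conventional BPGD and BPGD-A3T.
bpgd = {"cpu_seconds": 44.98, "epochs": 1175}
a3t = {"cpu_seconds": 10.39, "epochs": 970}

# How many times faster BPGD-A3T is in CPU time.
speedup = bpgd["cpu_seconds"] / a3t["cpu_seconds"]

# Fraction of epochs saved relative to conventional BPGD.
epoch_reduction = (bpgd["epochs"] - a3t["epochs"]) / bpgd["epochs"]

print(f"CPU-time speedup: {speedup:.2f}x")       # about 4.33x
print(f"Epoch reduction: {epoch_reduction:.1%}")  # about 17.4%
```

The reduction in CPU time is much larger than the reduction in epochs, which suggests that the cost per epoch, and not only the number of epochs, differs between the two runs.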

C. Diabetes Dataset
This dataset was selected from a larger data set held by the National Institute of Diabetes and Digestive and Kidney Diseases. The constraints of this dataset are that all the patients are Pima Indian women, at least 21 years old, and living near Phoenix, Arizona, USA [22]. The selected network topology for the diabetes classification dataset is 8-5-2, with 8 input nodes, 5 hidden nodes, and 2 output nodes. 384 instances were used as the training dataset and 192 as the testing dataset. The target error was set to 0.01, and the maximum number of epochs was 5000. The best momentum coefficient and learning rate values for conventional BPGD and BPGD-AG on the diabetes dataset are 0.3 and 0.3 respectively. For BPGD-AM, the best momentum coefficient and learning rate values are found in the interval
Table 3 and Fig. 3 show that the proposed algorithm (BPGD-A3T) still outperforms the other algorithms in terms of CPU time and number of epochs. The proposed algorithm needs only 2036 epochs to converge, as opposed to about 3965 epochs for conventional BPGD, 3755 for BPGD-AG, 3272 for BPGD-AM, and 2593 for BPGD-ALR. Moreover, the time required for training on the classification dataset is an important factor when analysing performance. The results clearly show that the proposed algorithm (BPGD-A3T) has the best total convergence time. Furthermore, the accuracy of BPGD-A3T is much better than that of BPGD, BPGD-AG, BPGD-AM, and BPGD-ALR.

D. Heart Dataset
The selected network architecture is 36-5-2, with the target error set to 0.01 and the maximum number of epochs set to 5000. The best momentum coefficient and learning rate values for conventional BPGD and BPGD-AG on the heart dataset are 0.7 and 0.3 respectively. For BPGD-AM, the best momentum coefficient and learning rate values are found in the interval [0.5,0.7] and at 0.3 respectively, while the best momentum coefficient and learning rate values for BPGD-ALR are found at 0.7 and in the interval [0.2,0.3] respectively. Meanwhile, for the proposed BPGD-A3T, the best momentum coefficient and learning rate values are found in the intervals [0.5,0.7] and [0.2,0.3] respectively.
Table 4 and Fig. 4 show that the proposed algorithm (BPGD-A3T) delivers the best performance. Its accuracy is 90.90%, compared with 90.66%, 90.38%, 88.76%, and 88.58% for BPGD-ALR, BPGD-AM, BPGD-AG, and BPGD respectively. Moreover, the proposed algorithm (BPGD-A3T) needs 1717 epochs to converge, compared with about 1702 epochs for conventional BPGD, 1691 for BPGD-AG, 1502 for BPGD-ALR, and 1438 for BPGD-AM. Apart from the speed of convergence, the time required for training on the classification dataset is an important factor when analysing performance. The results clearly show that the proposed algorithm (BPGD-A3T) has the best total convergence time compared with conventional BPGD, BPGD-AG, BPGD-AM, and BPGD-ALR.

E. Horse Dataset
The selected network architecture is 58-5-2, with the target error set to 0.01 and the maximum number of epochs set to 5000. The best momentum coefficient and learning rate values for conventional BPGD and BPGD-AG on the horse dataset are 0.5 and 0.4 respectively. For BPGD-AM, the best momentum coefficient and learning rate values are found in the interval [0.5,0.6] and at 0.4 respectively, while the best momentum coefficient and learning rate values for BPGD-ALR are found in the interval 0.
Table 5 shows that the proposed algorithm required 2404 epochs and 27.74 seconds of CPU time to reach the target error, with 80.99% accuracy. BPGD-ALR required 2602 epochs and 28.12 seconds of CPU time with 80.86% accuracy, while BPGD-AM required 2636 epochs and 28.60 seconds with 79.91% accuracy. BPGD-AG required 2717 epochs and 29.76 seconds with 79.64% accuracy, and BPGD required 2829 epochs and 140.13 seconds with 79.37% accuracy. Fig. 5 shows that the proposed algorithm (BPGD-A3T) still outperforms the other algorithms in terms of the number of epochs, CPU time, and accuracy.

IV. CONCLUSION
As a popular and widely used algorithm, back propagation (BP) is known to be able to train Artificial Neural Networks (ANN) successfully. However, BP algorithms have some drawbacks, namely getting stuck in local minima and a slow convergence rate, and still need improvement. In this paper, the BPGD-A3T algorithm is proposed to train back propagation neural networks (BPNN) in order to achieve fast convergence, avoid local minima, and enhance accuracy. The proposed algorithm adaptively changes the gain parameter of the activation function together with the momentum and learning rate to overcome the inherent problems of BP. The performance of the BPGD-A3T algorithm is compared with the BPGD-AM, BPGD-ALR, BPGD-AG, and conventional BP algorithms. The performance of the proposed BPGD-A3T is verified by means of simulations on the Glass, Card, Diabetes, Heart, and Horse classification datasets. The simulation results show that the proposed BPGD-A3T performs better and achieves the highest accuracy on all datasets compared with the other algorithms.