The Performance of Binary Artificial Bee Colony (BABC) in Structure Selection of Polynomial NARX and NARMAX Models

— This paper explores the capability of the Binary Artificial Bee Colony (BABC) algorithm for feature selection of Nonlinear Autoregressive Moving Average with Exogenous Inputs (NARMAX) model, and compares its implementation with the Binary Particle Swarm Optimization (BPSO) algorithm. A binarized modification of the BABC algorithm was used to perform structure selection of the NARMAX model on a Flexible Robot Arm (FRA) dataset. The solution quality and convergence were compared with the BPSO optimization algorithm. Fitting and validation tests were performed using the One-Step Ahead (OSA), correlation and histogram tests. BABC was able to outperform BPSO in terms of convergence consistency with equal solution quality. Additionally, it was discovered that BABC was less prone to converge to local minima while BPSO was able to converge faster. Results from this study showed that BABC was better-suited for structure selection in huge dataset and the convergence has been proven to be more consistent relative to BPSO.


I. INTRODUCTION
SI is a process of building a mathematical model of a system from observed input and/or output data [1]- [3]. Among its applications is simulation, forecasting, controller design, process planning, monitoring, and modelling as well as an explanation of naturally-occurring phenomena in various fields [4]- [7].
Recently, research in non-linear SI has increased as science increasingly discovers complex systems that require these modelling techniques to understand their behavior and to closely approximate their behaviour [8].
A highly efficient and accurate model for modelling nonlinear systems called the NARMAX model was introduced [9] as a unified and accurate representation for a wide variety of nonlinear systems [10]- [18]. The NARMAX model represents its current output as a combination of its past inputs, outputs, and residuals [19], [20]. The residuals provide a feedback mechanism that may further improve the NARMAX model prediction.
The NARMAX model can be constructed using various methods such as polynomials [21]- [24], Multilayer Perceptrons (MLP) [25]- [28] and Wavelet ANNs (WNN) [29], [30]. From all, the polynomial approach is more attractive and versatile because of the simplicity of the model structure, and it reveals the dynamical properties of the system in a straightforward manner [31]. Other than that, it has been proven to work well in practical application with a smaller number of parameters used compared to the Volterra or Wiener series. NARMAX identification involves three major processes namely structure selection, parameter estimation, and model validation. Of particular interest in this paper is model structure selection that is concerned with selecting a set of the most representative regressors from a large number of candidates. In addition, the selected regressors must obey the principle of parsimony in which the structure must be as small as possible, yet has the ability to represent the system well [32].
This research proposes a binarized modification of the Artificial Bee Colony (ABC) algorithm to perform structure selection of a polynomial NARMAX model on an FRA dataset. The ABC algorithm is an optimization algorithm that mimics the intelligent behavior of bee colonies in finding the best food source around its perimeter by dividing the responsibilities of the swarm and working together to achieve a common goal [33], [34]. A bee colony consists of three groups of bees that fulfil specific roles within the colony: scout bees are responsible for searching for potential food sources, while employed bees are tasked with collecting honey from discovering food sources. Onlooker bees stay in the hive and rotate their roles with scout and employed bees as the scout and employed bees relay information through a special dance that indicates the direction and quality of available food sources [35], [36].
The performance of ABC is compared with another swarm-inspired optimization algorithm, the BPSO [37] with the experiment design focusing on solution quality and convergence consistency. This work is motivated by previous works [38], [39] that suggest that the ABC algorithm is capable of outperforming PSO in terms of local and global optimization problems.

II. MATERIAL AND METHOD
All experiments were performed on a personal computer with 3.10 GHz Intel Xeon E3-1220 v3 microprocessor and 4 GB of RAM. The operating system was Linux Mint XFCE version 17.1 with MATLAB 2014a as the development platform. The flowchart for the feature selection process is shown in Fig. 1 and parameter settings for BABC and BPSO is shown in Table 1. These parameters were selected to test the robustness of both algorithms under different initialization and exploration conditions.

A. Dataset Description
The dataset used for the structure selection experiments were the FRA dataset. FRA is a single input and single output (SISO) nonlinear system which torque input is effecting the acceleration in operation. The data have been used as a benchmark data for comparing seven different of identification program. The lag used was reported as 4 [40]. The possible combination of the regressor is 2x10 44 from 44 regressors.

B. NARMAX Structure Selection Using BABC and BPSO
The polynomial representation of the NARX/NARMA/NARMAX model for a given input-output series is: where is the number of terms in the polynomial expansion, is the -th regression term with and is the -th regression parameter.
is formed by a combination of input, output, and residual terms. In matrix form, identification involves the formulation and solution of the Least Squares (LS) problem: (2) where is a regressor matrix, is a coefficient vector and is the vector of actual observations. is arranged such that its columns represent the lagged regressors. is the white noise residuals.
For feature selection, suppose that the feature selection problem is defined as: where consists of columns representing each coefficient. To select a feature subset, using OA, a binary string of length is defined so that each column has a bit assigned to it. The initial value of the binary string can be randomly defined during initialization. A value of 1 will be given to the binary string to indicate the column will be considered in the construction of , while the value of 0 is written to show the column is ignored.
In the swarm/colony, each particle carries a vector in solutions . This vector contains the "probabilities of change" defined earlier. During optimization, the vector elements will be used as a reference to altering the binary string from its initial state (zero). If the particle vector element is more than 0.5, the binary bit will be changed to 1, otherwise the bit value is maintained. Table 1 shows the parameter values for the NARMAX experiments using BABC and BPSO. Four different preprocessing methods were tested in the experiments:

C. Experiment Setup
• No magnitude scaling, 50:50 training and testing division ratio using block division method (PP0). • No magnitude scaling, 50:50 training and testing division ratio using interleaving division method (PP1). • Magnitude scaling between -1 and 1, 50:50 training and testing division ratio using block division method (PP2). • Magnitude scaling between -1 and 1, 50:50 training and testing division ratio using interleaving division method (PP3). After the regressor matrix was created, the BABC and BPSO algorithms were used to select the best possible structure guided by the Akaike Information Criterion (AIC), Final Prediction Error (FPE) and Model Descriptor Length (MDL) as the fitness functions.
Several tests namely the One Step Ahead (OSA) prediction, residual plot, correlation tests and residual histogram analysis were performed to validate the model.

III. RESULTS AND DISCUSSION
Because NARMAX is a recursive model, it has to be implemented in stages. Initially, the NARX model is constructed (section A). The residuals from the NARX model was then used to create the Moving Average (MA) model (section B). Finally, the combined NARX and MA model was used to construct the NARMAX model (section C).
A total of 45 and 180 parameter combinations were analyzed for BPSO and BABC, respectively. Additionally, experiments on each of these parameter combinations were repeated for each of the information criteria (AIC, FPE, MDL). BABC had more parameter combinations because of the additional limit parameter, which is not applicable for BPSO.

A. NARX Part
For the NARX model, the optimal results were obtained using FPE as the fitness criterion and PP0 as the preprocessing method. Both BABC and BPSO obtained similar solutions with a fitness of 1.60×10 -5 . The training and testing MSE values of the optimal solution were 2.69×10 -5 and 2.72×10 -5 respectively as shown in Table 2. The significant result is shown in Fig. 2 and Fig. 3, which demonstrates the ability of BABC to consistently be able to find the optimal solution many times over the many experiments with different parameter settings. In comparison to BPSO, the solution distribution was inconsistent as the solutions were much more scattered and unfocused. This observation is further demonstrated in Fig. 3, where it shows the convergence versus the number of iterations for all experiments with different parameter settings with the FPE information criterion and PP0 as the pre-processing method. It can be seen that BPSO has a tendency to converge faster (as demonstrated by the rapid decrease in the fitness function), but fails to improve its fitness anymore after 500 iterations. In comparison to BABC, the reduction in the fitness function is more gradual with the ability to search the solution space in a more in-depth manner which resulting in overall better-quality solutions. Fig. 3 BABC and BPSO convergence across iterations over various parameter combinations. BABC solutions are more gradual compared to BPSO, but BABC has the ability to explore in a more in-depth manner resulting in a much better fitness relative to BPSO

B. MA Part
Being a recursive model, NARMAX introduces the residuals back into the NARX model in an attempt to improve its prediction ability.  Table 3 shows the optimal MA result achieved using BABC and BPSO. The FPE fitness and MSE values were comparable. However, with the BABC algorithm, the number of correlation violations was lower compared to BPSO. This indicates that the MA residuals were more randomly distributed compared to BPSO. Furthermore, the BABC algorithm showed much more consistency in finding the best fitness as indicated in Fig. 4. This observation is similar to the experiment performed in section A.

C. NARMAX Part
The NARMAX model is a combination of the optimal NARX (section A) and MA (section B) models. The results of the NARMAX model are shown in Table 4.
Comparing Table 4 and Table 2, it appears that the addition of MA terms to the NARX model did not add a significant advantage to the NARX model as the MSE values were slightly increased. Because of this, NARX was selected as the final model for comparison between BABC and BPSO.

D. Model Validation
The validation results for the NARX model is presented in this section. Validation measures two aspects, which are the model fit (using the OSA test) and the whiteness of the residuals (residual plot, correlation tests, and residual histogram analysis).
The OSA results for BABC and BPSO training and testing sets are shown in Fig. 5 and Fig. 6 respectively. The model prediction results (red dotted line) closely follow the actual system output (solid blue line). This indicates that the model was able to approximate the actual system well. This observation is further supported by the R2 value nearing 100% for both training and testing sets.
For the model to be accepted, the residuals of the model need to exhibit properties similar to white noise. This can be validated using residual plots (Fig. 7), correlation ( Fig. 8(a) to Fig. 8(e)) and histogram tests (Fig. 9).
The correlation tests measure the similarity between two signals [41] across different lags. For the residuals to appear as white noise, all correlation coefficients (except for lag zero in autocorrelation tests) should be within the 95% confidence limits (dotted red line). However, in real-life cases, a perfect result is difficult to achieve, and a sufficiently good result is acceptable. Based on the results in Fig. 8(a) to 8(e), the total number of correlation violations is 45 from 160 lags tested (28%) which leaves the remaining 72% correlation coefficients within the 95% confidence limits. The error histogram in Fig. 9 also supports the observation that the residuals are random. This is because the distribution of residuals appears as a bell-shaped curve indicating the residuals are randomly distributed. Because of this, we consider the NARX model to be valid and acceptable. The optimal NARX solution similarly achieved by both BPSO and BABC (FPE criterion equal to 2.69×10 -5 and 2.72×10 -5 for training and testing sets respectively with PP0 pre-processing). Both algorithms managed to find the optimal solution from 2x10 44 possible model structures. The optimal solution had passed all the necessary tests for it to be considered a valid and acceptable model.
Comparing the two optimization algorithms, BPSO converged faster than BABC. However, in terms of convergence consistency, BABC was proven to significantly outperform BPSO by a huge percentage margin (BABC found the optimal solution 82.2% of the time, while BPSO managed 6.67% for the NARX model). This proves that BABC was the better method for feature selection.