Binary Particle Swarm Optimization Structure Selection of Nonlinear Autoregressive Moving Average with Exogenous Inputs (NARMAX) Model of a Flexible Robot Arm

The Nonlinear Auto-Regressive Moving Average with Exogenous Inputs (NARMAX) model is a powerful, efficient and unified representation of a variety of nonlinear models. The model’s construction involves structure selection and parameter estimation, which can be simultaneously performed using the established Orthogonal Least Squares (OLS) algorithm. However, several criticisms have been directed towards OLS for its tendency to select excessive or sub-optimal terms leading to nonparsimonious models. This paper proposes the application of the Binary Particle Swarm Optimization (BPSO) algorithm for structure selection of NARMAX models. The selection process searches for the optimal structure using binary bits to accept or reject the terms to form the reduced regressor matrix. Construction of the model is done by first estimating the NARX model, then continues with the estimation of the MA model based on the residuals produced by NARX. One Step Ahead (OSA) prediction, Mean Squared Error (MSE) and residual histogram analysis were performed to validate the model. The proposed optimization algorithm was tested on the Flexible Robot Arm (FRA) dataset. Results show the success of BPSO structure selection for NARMAX when applied to the FRA dataset. The final NARMAX model combines the NARX and MA models to produce a model with improved predictive ability compared to the NARX model. Keywords— system identification, NARMAX, structure selection, particle swarm optimization


I. INTRODUCTION
System Identification (SI) is a control engineering discipline concerned with the inference of mathematical models from dynamic systems based on their input and/or output measurements [1]- [5]. It is fundamental for system control, analysis and design where the resulting representation of the system can be used for understanding the properties of the system as well as prediction of the system's future behavior under given inputs and/or outputs [6].
SI is a significant research area in the field of control and modeling due to its ability to represent and quantify variable interactions in complex systems. Several applications of SI in literature are for understanding of complex natural phenomena [7]- [11], model-based design of control engineering applications [12]- [16] and project monitoring and planning [17]- [19].
Many SI models exist. Among them, the Nonlinear Auto-Regressive Moving Average with Exogenous Inputs (NARMAX) [20] model is a powerful, efficient and unified representations of a variety of nonlinear models [21]- [30]. A rich literature is available regarding its success in various electrical, mechanical, medical and biological applications [31]- [33].
Simultaneous structure selection and parameter estimation of NARMAX and derivative models are achievable through the Orthogonal Least Squares (OLS) algorithm [34], [35]. The OLS algorithm has since been widely accepted as a standard [36] and has been used in many works [37]- [43] due to its simplicity, accuracy and effectiveness.
OLS structure selection depends on evaluation of the Error Reduction Ratio (ERR) criterion score. Despite OLS's effectiveness, several criticisms have been directed towards its tendency to select excessive or sub-optimal terms based on ERR. It has been proven that OLS has a tendency selected incorrect terms when the data is contaminated by certain noise sequences or when the system input is poorly designed. The suboptimal selection of regressor terms leads to models that are inaccurate or non-parsimonious in nature.
This study proposes the Binary Particle Swarm Optimization (BPSO) algorithm to perform structure selection for the NARMAX. Using the BPSO approach, the structure selection problem is considered as a binary optimization problem to minimize an objective function. Unlike OLS (which uses ERR to rank regressors that significantly reduce the error variance in order to select the best regressors for the model), no preference is given to any regressor. Rather, BPSO treats the regressors equally (all regressors have an equal chance of being selected) and the model structure is defined towards minimizing certain objective criteria.

II. THE NARMAX MODEL
The NARMAX model output is dependent on its past inputs, outputs and residuals [44], [45]. Construction of the model can employ various methods such as polynomials [46]- [49], Multilayer Perceptrons (MLP) [50]- [52] and Wavelet ANNs (WNN) [53], [54] although the polynomial approach is the only method that can explicitly define the relationship between the input/output data.
The identification method for NARMAX is performed in three steps. Structure selection is performed to detect the underlying structure of a dataset. This is followed by parameter estimation to optimize some objective function (typically involving the difference between the identified model and the actual dataset). The NARMAX model recursively adds residual terms to the Nonlinear Auto-Regressive with Exogenous Inputs (NARX) model to eliminate the bias and improve the model prediction capability [55]. Structure selection and parameter estimation for NARMAX are recursively repeated on the residual set until a satisfactory model is obtained. Finally, the model is validated to ensure that it is acceptable.
A major advantage of NARMAX and its derivatives is that it provides physically interpretable results that can be compared directly, thus providing a way to validate and enhance analytical models where first principle models lack the completeness because of assumptions and omissions during derivations. Furthermore, among all the models studied, it is the only model that embeds the dynamics of nonlinearities directly into the model [56]. Other advantages include model representativeness [57], flexible model structure, reliability over a broad range of applications as well as reasonable parameter count and computational costs (for reasonably-sized model structures) [58].

A. Particle Swarm Optimization
Optimization is generally defined as a task to search for an optimal set of solutions for a parameter-dependent problem, typically by minimizing a cost function related to the problem [59].
The PSO algorithm is a global stochastic optimization algorithm based on the swarming behavior of animals in nature. The algorithm defines particles as agents responsible for searching the solution space. The movement of particles is directed by the particles' best individual achievement, as well as the best swarm achievement [60], [61]. The iterative search process continues until the objective is met or the number of iterations is exhausted. Among the advantages of PSO are: 1. Usage of simple mathematical operators makes it easy to implement [62]- [69]. 2. It is computationally inexpensive and fast compared to other more complex optimization algorithms [70]- [73]. 3. It has a successful track record in solving complex optimization problems [74], [75]. 4. It is versatile: can be easily adapted to solve continuous, binary or discrete problems. No requirement of gradient information thus can be used for a wide range of problems otherwise non-optimizable by gradient-based algorithms. 5. It requires a minimum amount of parameter adjustments. 6. It is robust in solving complex optimization problem [76]. 7. It can be implemented in a true parallel fashion [77].

B. Binary PSO for Structure Selection
The polynomial representation of the NARMAX model for a given input-output series is: where is the number of terms in the polynomial expansion, is the -th regression term with = 1, and is the -th regression parameter.
is formed by a combination of input, output and residual terms. In matrix form, identification involves the formulation and solution of the Least Squares (LS) problem: where is a × regressor matrix, is a × 1 coefficient vector, is the × 1 vector of actual observations, is arranged such that its columns represent the lagged regressors and is the white noise residuals. The NARMAX model construction is performed in two steps, namely model structure selection and parameter estimation. Structure selection involves selecting which columns in that best describes the observations, . After a subset of has been selected, the parameter estimation step estimates the parameters of the function ∎ that gives the best fit for (Equation (3)): The most common method for solving binary optimization problems is to represent particle solutions as probabilities of change. These values indicate the probability of a bit flip occurring from the initial binary string. This type of representation is adopted in this work.
The use of BPSO for model structure selection is described. Consider the identification problem in Equation (2) defined in an LS matrix form. The BPSO algorithm defines a binary string of length 1 × so that each column in has a bit assigned to it. A value of 1 indicates that the column is included in the reduced regressor matrix ( & , while the value of 0 indicates that the regressor column is ignored. The initial binary string is a predefined parameter prior to optimization.

A. Experiment Hardware
All experiments were performed on a personal computer with 3.10GHz Intel Xeon E3-1220 v3 microprocessor and 4GB RAM. The operating system was Linux Mint XFCE version 17.1 with MATLAB 2014a as the development platform.

B. Methodology
A general overview of the methodology is presented in Fig. 1, and a description of each process is presented in the following sections. Pre-processing is an important step for data conditioning prior to identification. In experiments conducted in this work, the division for the Flexible Robot Arm (FRA) dataset was 50% for training set and 50% for testing set based on the recommendation by [51].

C. Create Regressor Matrices
This process constructs the full regressor matrix ( ) from which BPSO can choose candidate terms to form the reduced regressor matrix, & . After the first-level structure selection process, a second-level MA structure selection was performed recursively using lagged residual terms.

D. Find Optimal Parameters for Structure Selection
After the regressors matrices have been constructed, BPSO was applied to evaluate the candidate structures and determine the best structure.
BPSO convergence depends on several parameters, which need to be tested to determine their optimal values. Therefore, experiments were done by performing optimization under various combinations of parameters swarm size, maximum iterations and several random seeds. Table 1 shows the tested parameter values.
The choice of swarm size and maximum iterations were based on preliminary test to balance between speed and solution quality. These values were considered optimal given the limited computational hardware resources available.
Three random seeds were chosen for the Mersenne-Twister Algorithm. The seeds are used to generate traceable random numbers to ensure repeatability of experiments performed. The values are arbitrary, but important to ensure that the experiments are repeatable.
The values of particle minimum value (5 ( and particle maximum value (5 CD were set to 0 and 1 respectively. They are within the range of probability values for a bit change to occur. E ( and E CD represent the movement range of the particles. Since the value of 5 ( is between the range of 0 and 1 (based on 5 ( and 5 CD ), the values of E ( and E CD were set to -1 (when 5 ( moves from 1 to 0) and +1 (when 5 ( moves from 0 to 1) respectively. The values of * and * 3 were both set to 2.0. This parameter is well-accepted as optimal based on literature [80].
The optimization is guided by Akaike Information Criterion (AIC), Final Prediction Error (FPE) and Model Descriptor Length (MDL) [81] criteria: ' PQR = S1 + lU9 V J Theoretically, the values of ' FGH , ' PQR and ' XYN are minimum when ' LMMN , O L = 0 at which 2 is irrelevant. However, this never happens in real modeling scenarios thus the value of 2 must be taken into account. With the inclusion of 2, the values of ' FGH , ' PQR and ' XYN are minimum when ' LMMN , O L → 0 and 2 → 1. Models with the lowest ' FGH , ' PQR and ' XYN scores obey the principle of parsimony as the least amount of parameters were required to provide the best fit for the data.

E. Model Validation and Analysis
After structure selection has been performed, the resulting candidate models need to be validated and analyzed to determine the best model. One Step Ahead (OSA) and residual tests were performed to select the best model that fulfills the validation criteria. Several tests, namely the OSA prediction, Mean Squared Error (MSE) and residual histogram analysis were performed to validate the model.
OSA is a test that measures the ability of a model to predict future values based on its previous data. It takes the form of [82]: a t = 9 a c (11) where 9 a is the estimated nonlinear model, and c are the regressors. Representation of c for the NARMAX model is shown in Equation (16): MSE is a standard method for testing the magnitude of the residuals for regression and model fitting problems. The MSE equation for a residual vector of length is given by: where ( is the observed value and a ( is the estimated value at point 7. As MSE values are calculated from the magnitude of the residuals, low values indicate a good model fit. The ideal case for MSE is zero (when ( − a ( = 0, 7 = 1, 2, … , ). However, this rarely happens in actual modeling scenarios and a sufficiently small value is acceptable. The third test is the histogram test to measure the whiteness of the residuals. A histogram is a graphical method to present a distribution summary of a univariate data set [83]. It is drawn by segmenting the data into equalsized bins (classes), then plotting the frequencies of data appearing in each bin. The horizontal axis of the histogram plot shows the bins, while the vertical axis depicts the data point frequencies.
In SI, the prediction model can only be accepted when the residuals are randomly distributed (appears as white noise). This type of residuals indicates that the dynamics of a system has been fully captured by the SI model, leaving only un-modeled white noise as the residuals. The purpose of histogram analysis is used to view the distribution of the residuals. The histogram exhibits white noise as a Gaussian distribution, with symmetric bell-shaped distribution with most of the frequency counts grouped in the middle and tapering off at both tails [83].

A. NARX Modeling Results
The NARX model produced by BPSO selected five terms of the generated following equation to represent the FRA system: The OSA plots for the training and testing sets were generated based on Equation (18) (Fig. 3). The red line indicates the predicted output of the model, while the blue line indicates the actual FRA system output. Based on the low MSE between the system output and the predicted model, it was concluded that the model managed to represent the system well. However, another important criteria of system identification is the whiteness of the residuals. This is because non-random residuals indicate model bias as not all dynamics in the original system is sufficiently captured by the model. As shown in the histogram plots in Fig. 4, the distribution appears as a Gaussian indicating the residuals are random similar to white noise. The addition of the Moving Average (MA) terms described in the next section is presented to improve the prediction accuracy of the model.

B. MA Modeling Results
The second experiment produced the MA model presented here. The result of the NARX residuals is being fed back to the model in order to improve its prediction. The MA model obtained is defined in Equation (19) The OSA results of the MA model are presented in Fig. 5. The selection of higher-order terms in Equation (21) (lags above 2) indicates the complexity of the un-modeled dynamics still present in the system. The OSA prediction of the MA model indicates that the model managed to approximate the residuals well. The MA results were finally combined back into the NARX model to form the final NARMAX model (described in the next section).

C. NARMAX Modeling Results
The final NARMAX model was constructed by combining the results from the NARX and MA models (Equation 20): y t = 0.0203u t − 2 + 3.1204y t − 1 − 4.3509y t − 2 + 3.0420y t − 3 − 0.9453y t − 4 − 2.7440 t − 2 + 2.2573ε t − 1 + 1.8697ε t − 3 − 4.0561ε t − 2 ε t − 4 − 0.6180ε t − 4 + 0.0501ε t − 1 u t − 2 − 0.4490ε t − 3 u t − 2 + 0.6687ε t − 3 u t − 3 − 0.5049ε t − 3 u t − 4 − 0.0901ε t − 4 y t − 2 + ε 3 t The OSA for the NARMAX model and a histogram of residuals are shown in Fig. 6 and Fig. 7 respectively. It can be seen that the introduction of the MA terms had a positive effect on the model as the MSE was smaller compared to NARX. The model was also considered valid as the residual histogram also shows a Gaussian distribution, which signifies that the residuals are randomly distributed.

VI. CONCLUSIONS
A BPSO-based structure selection method for NARMAX model was successfully implemented in this paper. The selection process searches for the optimal structure using binary bits to accept or reject the terms to form the reduced regressor matrix. Construction of the model is done by first estimating the NARX model, then continues with the estimation of the MA model based on the residuals produced by NARX. The final NARMAX model combines the NARX and MA models to produce a model with improved predictive ability compared to the NARX model.