Nonparametric Regression Mixed Estimators of Truncated Spline and Gaussian Kernel based on Cross-Validation (CV), Generalized Cross-Validation (GCV), and Unbiased Risk (UBR) Methods

Most current nonparametric regression research involves more than one predictor variable and generally uses the same type of estimator for all predictors. In real cases, each predictor variable is likely to have a different form of regression curve, so forcing a single estimator on all predictors can produce an estimate that does not match the data pattern. It is therefore necessary to develop a regression curve estimator that follows the data pattern of each predictor, namely the mixed estimator. This study focuses on an additive nonparametric regression model that mixes the Truncated Spline and the Gaussian Kernel. The Truncated Spline has a knot point, while the Gaussian Kernel has a bandwidth. Several methods can be used to choose the optimal knot point and bandwidth in a mixed estimator model, including Cross-Validation (CV), Generalized Cross-Validation (GCV), and Unbiased Risk (UBR). This research proposes optimal knot point and bandwidth estimation for the mixed estimator model of the Truncated Spline and Gaussian Kernel, and compares CV, GCV, and UBR to validate the proposed method. The simulation study generates data from the Truncated Spline function and the Gaussian Kernel over a combination of sample sizes and error variances. The simulation results show that the GCV method gives a higher coefficient of determination (R²) and better accuracy for each combination of sample size and variance.


I. INTRODUCTION
Regression analysis is a statistical method used to determine the pattern of the relationship between two or more variables in functional form [1]. The relationship can be expressed as an equation that states the functional relationship between the response and predictor variables [2]. The relationship pattern can be identified early by inspecting a scatter plot [3]. If the form of the relationship pattern is known, the parametric regression approach is used. However, the relationship pattern cannot always be identified clearly from the data, and nonparametric regression was proposed for such cases [4].
With the development of computing technology, nonparametric regression models, which are generally computationally demanding, are becoming popular. The nonparametric regression approach has advantages, such as being easy to apply to data whose relationship pattern is unknown [2]. The approach is flexible: the data are expected to determine the form of the estimated regression curve by themselves, without being influenced by the researcher's subjectivity [5]. The purpose of modeling with regression analysis is to find an appropriate form of the estimated regression curve [6]. Researchers have developed many nonparametric regression curve estimators, including the Spline [2], [7]-[11], Kernel [12]-[15], and Fourier series [16]-[20].
Nonparametric regression models developed by previous researchers assume that every predictor has the same form of regression curve, so only one type of estimator is used for all predictor variables. In real cases, however, each predictor variable is likely to have a different form of regression curve. Forcing a single estimator can therefore produce an estimate that does not match the data pattern [3], [21]. It is thus necessary to develop a mixed estimator of nonparametric regression curves, in which each data pattern in the model is approximated by an appropriate curve estimator [1], [3], [4], [20]-[24]. The mixed estimator nonparametric regression model used in this study combines the Truncated Spline and the Gaussian Kernel.
In nonparametric Gaussian Kernel regression, choosing the right bandwidth is very important, while in the Truncated Spline the important choice is the optimal knot point. The knot point and bandwidth, hereafter referred to as smoothing parameters, can greatly affect the resulting regression curve. A smoothing parameter that is too small can produce a very rough curve that tends to fluctuate. A smoothing parameter that is too large can produce a curve that is too smooth and does not match the data pattern [13]. Determining the optimal smoothing parameter is therefore an interesting problem [1]. The optimal smoothing parameter can be determined using several methods, such as Cross-Validation (CV) [25], Generalized Cross-Validation (GCV) [7], and Unbiased Risk (UBR) [26].
In this research, we simulate relationship patterns between the response and predictor variables that follow the characteristics of the Truncated Spline and Kernel patterns. The simulation covers many combinations of sample size and variance. The relationship pattern in the simulated data is then modeled using a mixed estimator of the Truncated Spline and Gaussian Kernel, with Cross-Validation (CV), Generalized Cross-Validation (GCV), and Unbiased Risk (UBR) used to determine the optimal smoothing parameter. This research aims to compare the performance of the CV, GCV, and UBR methods in estimating the optimal knot point and bandwidth in the mixed estimator model of nonparametric regression. The coefficient of determination (R²) is used as the goodness-of-fit criterion.
The remainder of this paper is organized as follows. Section II gives a brief explanation of the material, namely the Truncated Spline, Gaussian Kernel, Mixed Estimator Nonparametric Regression, Cross-Validation (CV), Generalized Cross-Validation (GCV), Unbiased Risk (UBR), and the Research Methodology. Section III gives the simulation results of several case studies, and Section IV gives the conclusions.

A. Truncated Spline
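A spline is a segmented polynomial model. The Truncated Spline function retains the properties of a polynomial function [6]. In general, a Truncated Spline nonparametric regression model can be written as follows:

$$y_i = f(x_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

where $f(x_i)$ is the Truncated Spline function of degree $m$ and $\xi_1, \xi_2, \ldots, \xi_r$ are the knot points:

$$f(x_i) = \sum_{j=0}^{m} \beta_j x_i^{j} + \sum_{k=1}^{r} \beta_{m+k} (x_i - \xi_k)_+^{m}, \qquad (x_i - \xi_k)_+^{m} = \begin{cases} (x_i - \xi_k)^{m}, & x_i \ge \xi_k \\ 0, & x_i < \xi_k \end{cases}$$

Suppose paired data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, follow a Truncated Spline nonparametric regression model:

$$y_i = f(x_i) + \varepsilon_i \tag{4}$$

with the Truncated Spline function:

$$f(x_i) = \beta_0 + \beta_1 x_i + \cdots + \beta_m x_i^{m} + \beta_{m+1} (x_i - \xi_1)_+^{m} + \cdots + \beta_{m+r} (x_i - \xi_r)_+^{m}$$

The regression model in Equation 4 can be written in matrix form as:

$$\mathbf{y} = \mathbf{X}(\xi)\boldsymbol{\beta} + \boldsymbol{\varepsilon} \tag{5}$$

where $\mathbf{y}$ is the response vector of size $n \times 1$, $\mathbf{X}(\xi)$ is a matrix of size $n \times (m + r + 1)$, $\boldsymbol{\beta}$ is the vector of regression coefficients to be estimated, of size $(m + r + 1) \times 1$, and $\boldsymbol{\varepsilon}$ is a random error vector of size $n \times 1$.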
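The design matrix $\mathbf{X}(\xi)$ can be illustrated with a short sketch. The function name `truncated_spline_design` and the example knot are assumptions made for illustration; they are not taken from the original study. The sketch builds the polynomial columns followed by one truncated column per knot, so the result has $m + r + 1$ columns.

```python
import numpy as np

def truncated_spline_design(x, knots, degree=1):
    """Return X(xi) with columns 1, x, ..., x^m, (x - xi_1)_+^m, ..., (x - xi_r)_+^m."""
    if degree < 1:
        raise ValueError("degree must be at least 1 in this sketch")
    x = np.asarray(x, dtype=float)
    knots = np.atleast_1d(np.asarray(knots, dtype=float))
    # Polynomial part: n x (m + 1), increasing powers of x.
    poly = np.vander(x, N=degree + 1, increasing=True)
    # Truncated part: n x r, equal to zero to the left of each knot.
    trunc = np.maximum(x[:, None] - knots[None, :], 0.0) ** degree
    return np.hstack([poly, trunc])

# Illustrative data: five points on [0, 1], one knot at 0.5, linear spline (m = 1).
x = np.linspace(0.0, 1.0, 5)
X = truncated_spline_design(x, knots=[0.5], degree=1)
print(X.shape)  # (5, 3), that is n x (m + r + 1)
print(X)
```

The last column is zero up to the knot and grows linearly after it, which is what lets the fitted curve change slope at the knot.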

B. Gaussian Kernel
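Suppose paired data $(t_i, y_i)$, $i = 1, 2, \ldots, n$, are given. A Gaussian Kernel nonparametric regression model can then be written:

$$y_i = h(t_i) + \varepsilon_i \tag{6}$$

The Kernel estimator has advantages such as flexibility, a simple mathematical form, and faster convergence [13].
The regression curve $h(t_i)$ is approximated by the Kernel function, and the regression curve estimate is given in Equation 7:

$$\hat{h}_{\alpha}(t) = n^{-1} \sum_{i=1}^{n} W_{\alpha i}(t)\, y_i \tag{7}$$

where:

$$W_{\alpha i}(t) = \frac{K_{\alpha}(t - t_i)}{n^{-1} \sum_{j=1}^{n} K_{\alpha}(t - t_j)}, \qquad K_{\alpha}(z) = \frac{1}{\alpha} K\!\left(\frac{z}{\alpha}\right)$$

The kernel function used is the Gaussian Kernel:

$$K(z) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{z^{2}}{2}\right), \quad -\infty < z < \infty$$

Applying the kernel estimator in Equation 7 at each $t = t_1, t = t_2, \ldots, t = t_n$ gives the matrix form:

$$\hat{\mathbf{h}}_{\alpha}(t) = \mathbf{G}(\alpha)\,\mathbf{y} \tag{9}$$

where $\mathbf{y}$ is the response vector of size $n \times 1$ and $\mathbf{G}(\alpha)$ is a matrix of size $n \times n$ whose $(i, j)$ entry is $n^{-1} W_{\alpha j}(t_i)$.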
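As a minimal sketch of the matrix $\mathbf{G}(\alpha)$, the code below assumes the usual normalized (Nadaraya-Watson type) Gaussian weights, so each row of the matrix sums to one. The function name `gaussian_kernel_matrix`, the sine test curve, and the bandwidth value are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel_matrix(t, bandwidth):
    """Return G(alpha) with G[i, j] = K_alpha(t_i - t_j) / sum_k K_alpha(t_i - t_k)."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    t = np.asarray(t, dtype=float)
    z = (t[:, None] - t[None, :]) / bandwidth
    # Scaled Gaussian kernel K_alpha(z) = K(z / alpha) / alpha.
    K = np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * bandwidth)
    # Normalize each row so the weights for one target point sum to 1.
    return K / K.sum(axis=1, keepdims=True)

# Illustrative data: a smooth curve sampled at six points.
t = np.linspace(0.0, 1.0, 6)
y = np.sin(2.0 * np.pi * t)
G = gaussian_kernel_matrix(t, bandwidth=0.2)
h_hat = G @ y  # kernel estimate of the curve at each t_i
print(G.shape, h_hat.shape)  # (6, 6) (6,)
```

A very large bandwidth makes every weight close to 1/n, so the estimate flattens toward the sample mean, while a very small bandwidth makes the matrix approach the identity and the estimate follows the data point by point.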

C. Mixed Estimator Nonparametric Regression
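The nonparametric regression mixed estimator is a multipredictor nonparametric regression model whose regression curve is additive and is approximated by two or more types of estimators [3], [27].
Suppose paired data $(x_i, t_i, y_i)$ are given, where the relationship between the predictor variables $(x_i, t_i)$ and the response variable $y_i$ follows a nonparametric regression model:

$$y_i = \mu(x_i, t_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{10}$$

The regression curve $\mu(x_i, t_i)$ is assumed to be additive, so that it can be written in the form:

$$\mu(x_i, t_i) = f(x_i) + h(t_i) \tag{11}$$

The mixed estimator model used in this study is a combination of the Truncated Spline and the Gaussian Kernel. Based on Equations 5 and 9, Equation 10 can be written in matrix form as follows:

$$\mathbf{y} = \mathbf{X}(\xi)\boldsymbol{\beta} + \mathbf{G}(\alpha)\mathbf{y} + \boldsymbol{\varepsilon} \tag{12}$$

The error can be written as follows:

$$\boldsymbol{\varepsilon} = (\mathbf{I} - \mathbf{G}(\alpha))\mathbf{y} - \mathbf{X}(\xi)\boldsymbol{\beta} \tag{13}$$

The estimate of $\boldsymbol{\beta}$ can be obtained through least squares (LS) optimization as follows:

$$\min_{\boldsymbol{\beta}} \left\{ \boldsymbol{\varepsilon}^{T}\boldsymbol{\varepsilon} \right\} = \min_{\boldsymbol{\beta}} \left\{ \left\| (\mathbf{I} - \mathbf{G}(\alpha))\mathbf{y} - \mathbf{X}(\xi)\boldsymbol{\beta} \right\|^{2} \right\} \tag{14}$$

Based on Equation 14, the sum of squared errors can be written:

$$Q(\boldsymbol{\beta}) = \mathbf{y}^{T}(\mathbf{I} - \mathbf{G}(\alpha))^{T}(\mathbf{I} - \mathbf{G}(\alpha))\mathbf{y} - 2\boldsymbol{\beta}^{T}\mathbf{X}(\xi)^{T}(\mathbf{I} - \mathbf{G}(\alpha))\mathbf{y} + \boldsymbol{\beta}^{T}\mathbf{X}(\xi)^{T}\mathbf{X}(\xi)\boldsymbol{\beta} \tag{15}$$

The estimator of $\boldsymbol{\beta}$ is obtained from the partial derivative of $Q(\boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$:

$$\frac{\partial Q(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -2\mathbf{X}(\xi)^{T}(\mathbf{I} - \mathbf{G}(\alpha))\mathbf{y} + 2\mathbf{X}(\xi)^{T}\mathbf{X}(\xi)\boldsymbol{\beta} \tag{16}$$

Setting Equation 16 equal to zero gives the estimate of $\boldsymbol{\beta}$:

$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}(\xi)^{T}\mathbf{X}(\xi)\right)^{-1}\mathbf{X}(\xi)^{T}(\mathbf{I} - \mathbf{G}(\alpha))\mathbf{y} \tag{17}$$

Equation 17 can be written as:

$$\hat{\boldsymbol{\beta}} = \mathbf{A}(\xi, \alpha)\mathbf{y}, \qquad \mathbf{A}(\xi, \alpha) = \left(\mathbf{X}(\xi)^{T}\mathbf{X}(\xi)\right)^{-1}\mathbf{X}(\xi)^{T}(\mathbf{I} - \mathbf{G}(\alpha)) \tag{18}$$

Thus, the Truncated Spline component estimator can be written as:

$$\hat{\mathbf{f}}(x) = \mathbf{X}(\xi)\hat{\boldsymbol{\beta}} = \mathbf{X}(\xi)\mathbf{A}(\xi, \alpha)\mathbf{y} \tag{19}$$

Equation 19 can be written as

$$\hat{\mathbf{f}}(x) = \mathbf{S}(\xi, \alpha)\mathbf{y}, \qquad \mathbf{S}(\xi, \alpha) = \mathbf{X}(\xi)\mathbf{A}(\xi, \alpha) \tag{20}$$

As a result, the Gaussian Kernel estimator can be written as $\hat{\mathbf{h}}(t) = \mathbf{G}(\alpha)\mathbf{y}$, and the mixed estimator

$$\hat{\boldsymbol{\mu}}(x, t) = \hat{\mathbf{f}}(x) + \hat{\mathbf{h}}(t) = \mathbf{B}(\xi, \alpha)\mathbf{y}, \qquad \mathbf{B}(\xi, \alpha) = \mathbf{S}(\xi, \alpha) + \mathbf{G}(\alpha)$$

is highly dependent on the smoothing parameters (knot point and bandwidth).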
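The least squares step can be sketched numerically. The code below assumes the matrix form y = Xβ + G(α)y + ε, a linear spline with a single knot, and row-normalized Gaussian weights; the function name `mixed_estimator`, the test functions, the knot 0.5, and the bandwidth 0.1 are all hypothetical choices for illustration.

```python
import numpy as np

def mixed_estimator(X, G, y):
    """Return (beta_hat, B, fitted) for the model y = X beta + G y + error."""
    n = len(y)
    I = np.eye(n)
    # A = (X'X)^{-1} X'(I - G); solve() avoids forming the inverse explicitly.
    A = np.linalg.solve(X.T @ X, X.T @ (I - G))
    beta_hat = A @ y
    # Mixed hat matrix: spline component X A plus kernel component G.
    B = X @ A + G
    return beta_hat, B, B @ y

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(0.0, 1.0, n)
t = rng.uniform(0.0, 1.0, n)
# Hypothetical curves: piecewise-linear in x, smooth in t, plus Gaussian noise.
y = (2.0 * x + 3.0 * np.maximum(x - 0.5, 0.0)
     + np.sin(2.0 * np.pi * t) + rng.normal(0.0, 0.2, n))

# Spline design matrix: intercept, x, and (x - 0.5)_+ (degree 1, one knot).
X = np.column_stack([np.ones(n), x, np.maximum(x - 0.5, 0.0)])
# Gaussian kernel matrix with bandwidth 0.1, rows normalized to sum to 1.
z = (t[:, None] - t[None, :]) / 0.1
K = np.exp(-0.5 * z ** 2)
G = K / K.sum(axis=1, keepdims=True)

beta_hat, B, fitted = mixed_estimator(X, G, y)
print(beta_hat.shape, B.shape)  # (3,) (60, 60)
```

Because the fitted values are a linear function of y, the whole estimator is summarized by one n x n matrix, which is what smoothing-parameter criteria for linear smoothers operate on.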

D. Cross-Validation (CV)
Cross-Validation (CV) is a method developed by Craven and Wahba [25]. The original formula is limited to a single estimator form, but the CV method can be generalized to the mixed estimator form. The modified CV formula for the mixed estimator can be written:

$$CV(\xi, \alpha) = n^{-1} \sum_{i=1}^{n} \left( \frac{y_i - \hat{\mu}(x_i, t_i)}{1 - b_{ii}} \right)^{2}$$

where $b_{ii}$ is the $i$-th diagonal element of the matrix $\mathbf{B}(\xi, \alpha)$, which can be obtained based on Equation 20. The CV method does not require information on the variance $\sigma^2$. The CV method gives different weights to each observation according to its contribution [28].
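A CV criterion of this kind can be sketched for any linear smoother with fitted values By. The code assumes the usual leave-one-out form, in which each residual is divided by one minus the corresponding diagonal element of B. The helper `mixed_hat_matrix`, the simulated data, and the candidate bandwidths are hypothetical.

```python
import numpy as np

def mixed_hat_matrix(x, t, knot, bandwidth):
    """Hat matrix B = X A + G for a linear spline with one knot plus a Gaussian kernel."""
    X = np.column_stack([np.ones_like(x), x, np.maximum(x - knot, 0.0)])
    z = (t[:, None] - t[None, :]) / bandwidth
    K = np.exp(-0.5 * z ** 2)
    G = K / K.sum(axis=1, keepdims=True)
    A = np.linalg.pinv(X) @ (np.eye(len(x)) - G)
    return X @ A + G

def cv_score(y, B):
    """Mean of squared residuals, each scaled by 1 / (1 - b_ii)."""
    y = np.asarray(y, dtype=float)
    resid = y - B @ y
    leverage = np.diag(B)
    return float(np.mean((resid / (1.0 - leverage)) ** 2))

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0.0, 1.0, n)
t = rng.uniform(0.0, 1.0, n)
# Hypothetical data-generating curves, used only to exercise the criterion.
y = (2.0 * x + 3.0 * np.maximum(x - 0.5, 0.0)
     + np.sin(2.0 * np.pi * t) + rng.normal(0.0, 0.3, n))

for bw in (0.05, 0.2, 0.8):
    B = mixed_hat_matrix(x, t, knot=0.5, bandwidth=bw)
    print(f"bandwidth={bw:.2f}  CV={cv_score(y, B):.4f}")
```

The division by 1 - b_ii is what weights observations differently: points with high leverage have their residuals inflated more. For ordinary least squares this scaled residual equals the exact leave-one-out prediction error.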

E. Generalized Cross-Validation (GCV)
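Generalized Cross-Validation (GCV) is a generalization of the CV method developed by Wahba [7]. The GCV formula developed by Wahba is limited to a single estimator form. The GCV method can also be used with a mixed estimator, for which the modified GCV formula can be written:

$$GCV(\xi, \alpha) = \frac{MSE(\xi, \alpha)}{\left( n^{-1}\, \mathrm{trace}\left[ \mathbf{I} - \mathbf{B}(\xi, \alpha) \right] \right)^{2}}$$

with

$$MSE(\xi, \alpha) = n^{-1} \sum_{i=1}^{n} \left( y_i - \hat{\mu}(x_i, t_i) \right)^{2}$$

where $\mathbf{B}(\xi, \alpha)$ is the matrix that maps the response vector to the fitted values of the mixed estimator, $\hat{\boldsymbol{\mu}} = \mathbf{B}(\xi, \alpha)\mathbf{y}$.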
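A GCV criterion for a linear smoother with fitted values By can be sketched as follows, assuming the usual form: mean squared error divided by the squared average of the diagonal of I - B. The helper `mixed_hat_matrix`, the simulated data, and the bandwidth values are hypothetical.

```python
import numpy as np

def mixed_hat_matrix(x, t, knot, bandwidth):
    """Hat matrix B = X A + G for a linear spline with one knot plus a Gaussian kernel."""
    X = np.column_stack([np.ones_like(x), x, np.maximum(x - knot, 0.0)])
    z = (t[:, None] - t[None, :]) / bandwidth
    K = np.exp(-0.5 * z ** 2)
    G = K / K.sum(axis=1, keepdims=True)
    A = np.linalg.pinv(X) @ (np.eye(len(x)) - G)
    return X @ A + G

def gcv_score(y, B):
    """GCV = MSE / (trace(I - B) / n)^2 for fitted values B @ y."""
    y = np.asarray(y, dtype=float)
    n = y.size
    mse = np.mean((y - B @ y) ** 2)
    denom = (np.trace(np.eye(n) - B) / n) ** 2
    return float(mse / denom)

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0.0, 1.0, n)
t = rng.uniform(0.0, 1.0, n)
# Hypothetical data-generating curves, used only to exercise the criterion.
y = (2.0 * x + 3.0 * np.maximum(x - 0.5, 0.0)
     + np.sin(2.0 * np.pi * t) + rng.normal(0.0, 0.3, n))

for bw in (0.05, 0.2, 0.8):
    B = mixed_hat_matrix(x, t, knot=0.5, bandwidth=bw)
    print(f"bandwidth={bw:.2f}  GCV={gcv_score(y, B):.4f}")
```

Unlike the CV form, GCV replaces the individual diagonal elements with their average, so every observation receives the same correction factor.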

G. Research Methodology
In this section, the steps of the proposed method are presented as follows:
(27) and error is: Estimation of parameter β can be obtained using the
10. The number of knot points tested is only one, for the variable set as the Truncated Spline component.
11. Determine the optimal smoothing parameter using CV, GCV, and UBR. The CV, GCV, and UBR formulas have been modified for use with the nonparametric regression model mixed estimator.
12. Calculate the coefficient of determination (R²) for each modeling process carried out.
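Steps 10 to 12 can be sketched as a grid search over one knot and one bandwidth, followed by the coefficient of determination. The sketch uses GCV only as an example criterion and computes R² as 1 - SSE/SST, which is an assumption about the definition used. All function names, grids, and simulated curves are hypothetical.

```python
import numpy as np

def mixed_hat_matrix(x, t, knot, bandwidth):
    """Hat matrix B = X A + G for a linear spline with one knot plus a Gaussian kernel."""
    X = np.column_stack([np.ones_like(x), x, np.maximum(x - knot, 0.0)])
    z = (t[:, None] - t[None, :]) / bandwidth
    K = np.exp(-0.5 * z ** 2)
    G = K / K.sum(axis=1, keepdims=True)
    A = np.linalg.pinv(X) @ (np.eye(len(x)) - G)
    return X @ A + G

def gcv_score(y, B):
    """GCV = MSE / (trace(I - B) / n)^2 for fitted values B @ y."""
    n = y.size
    mse = np.mean((y - B @ y) ** 2)
    return float(mse / (np.trace(np.eye(n) - B) / n) ** 2)

def r_squared(y, fitted):
    """Coefficient of determination computed as 1 - SSE / SST."""
    sse = np.sum((y - fitted) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return float(1.0 - sse / sst)

def select_smoothing(x, t, y, knots, bandwidths, criterion):
    """Return (score, knot, bandwidth, B) minimizing the criterion over the grid."""
    best = None
    for knot in knots:
        for bw in bandwidths:
            B = mixed_hat_matrix(x, t, knot, bw)
            score = criterion(y, B)
            if best is None or score < best[0]:
                best = (score, knot, bw, B)
    return best

rng = np.random.default_rng(42)
n = 50
x = rng.uniform(0.0, 1.0, n)
t = rng.uniform(0.0, 1.0, n)
# Hypothetical curves: slope change in x at 0.5, smooth curve in t.
y = (2.0 * x + 3.0 * np.maximum(x - 0.5, 0.0)
     + np.sin(2.0 * np.pi * t) + rng.normal(0.0, np.sqrt(0.05), n))

# Candidate knots are interior quantiles of x, so the truncated column is never all zero.
knots = np.quantile(x, np.linspace(0.1, 0.9, 9))
bandwidths = np.linspace(0.05, 0.5, 10)

score, knot, bw, B = select_smoothing(x, t, y, knots, bandwidths, gcv_score)
r2 = r_squared(y, B @ y)
print(f"knot={knot:.3f}  bandwidth={bw:.3f}  GCV={score:.4f}  R2={r2:.4f}")
```

Passing the criterion as a function argument means the same search can be rerun with a CV or UBR function of the same signature, so the three methods can be compared on identical grids.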

III. RESULTS AND DISCUSSION
This section describes a simulation study of a nonparametric regression model using a mixed estimator. The proposed method is a combination of the Truncated Spline and Gaussian Kernel. Simulations are carried out under various controlled conditions. The sample sizes n tested are 25, 50, 100, and 200. The error variances σ² tested are 0.05, 0.5, and 1, and each generated data set is replicated 20 times.
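The simulation design can be sketched as a loop over the sample sizes, error variances, and replications listed above. The generating functions below (a piecewise-linear curve in x and a sine curve in t), the uniform predictors, and the seed are hypothetical stand-ins, since the functions used in the study are not restated here.

```python
import numpy as np

def simulate(n, sigma2, rng):
    """Generate one data set (x, t, y) with additive spline-like and smooth components."""
    x = rng.uniform(0.0, 1.0, n)
    t = rng.uniform(0.0, 1.0, n)
    f = 2.0 * x + 3.0 * np.maximum(x - 0.5, 0.0)      # hypothetical spline-type curve
    h = np.sin(2.0 * np.pi * t)                       # hypothetical smooth curve
    eps = rng.normal(0.0, np.sqrt(sigma2), n)         # normal errors with variance sigma2
    return x, t, f + h + eps

sample_sizes = [25, 50, 100, 200]
variances = [0.05, 0.5, 1.0]
replications = 20

rng = np.random.default_rng(2024)
datasets = {
    (n, s2): [simulate(n, s2, rng) for _ in range(replications)]
    for n in sample_sizes
    for s2 in variances
}
print(len(datasets))  # 12 combinations of sample size and variance
```

Each of the 12 scenarios holds 20 replicated data sets, so criterion values and R² can be averaged per scenario.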
For example, scatter plots of the response variable against each predictor variable for a sample size of n = 50 and σ² = 0.05 are shown in Fig. 1 and Fig. 2. The figures show that each predictor variable has a different form of regression curve. Variable x1 shows the characteristics of the Truncated Spline estimator: its data pattern changes at certain sub-intervals. In comparison, variable t1 does not follow any particular pattern, so it is modeled with the Kernel estimator. Based on the scatter plots in Fig. 1 and Fig. 2, a nonparametric regression model was therefore fitted using a mixed estimator of the Truncated Spline and Gaussian Kernel.
Only one knot point is tested for the variable defined as the Truncated Spline component. The simulation results, in the form of the average CV, GCV, UBR, and coefficient of determination (R²), are presented in Tables 1, 2, and 3. For all sample sizes n (25, 50, 100, and 200) and all variances tested, the GCV method gives better knot point and bandwidth estimates than the CV and UBR methods. This is indicated by the coefficient of determination (R²) obtained in each experiment with GCV, which is higher than that of the other two methods. Furthermore, the residuals of each fitted model, for every combination of sample size and variance, follow a normal distribution.
For example, for a sample size of n = 25 and one of the tested error variances σ², selecting the optimal knot point and bandwidth with the GCV method gives an average GCV value of 1.473 with an R² value of 86.48%. Under the same conditions, the CV method gives an average CV value of 0.136 with an R² value of 85.65%, and the UBR method gives an average UBR value of 0.009 with an R² value of 83.71%.
The error variance σ² affects the simulation results. As the tested variance increases, the R² values of all three methods (CV, GCV, and UBR) tend to decrease. The variance measures the deviation of the data from the mean, so the higher the variance, the more the data tend to spread away from the mean. Generated data with n = 200 under the various variance conditions are illustrated in Fig. 3. Based on Fig. 3, it can be concluded that both the sample size and the variance size are important. At σ² = 0.05, the Truncated Spline component clearly shows a pattern that changes at certain sub-intervals, while the Gaussian Kernel component does not appear to follow any particular pattern. As the variance increases, for example to σ² = 1, the Truncated Spline component still shows a changing pattern at certain sub-intervals, but the pattern tends to spread out, and the Gaussian Kernel component looks even more dispersed and patternless. Across the tested variances and sample sizes, the GCV method still estimates the knot point and bandwidth well, so it gives a better coefficient of determination (R²) than the other two methods in each condition.
Based on the simulation results, the knot point and bandwidth estimates from the CV, GCV, and UBR methods are all quite good. However, the GCV method gives better performance and accuracy for every combination of sample size and variance tested. The GCV method produces the knot point and bandwidth that yield the largest coefficient of determination (R²) for each combination. As a result, the GCV method is more suitable for estimating the knot point and bandwidth in the nonparametric regression model mixed estimator of the Truncated Spline and Gaussian Kernel.

IV. CONCLUSION
Simulation studies on the nonparametric regression model mixed estimator of the Truncated Spline and Gaussian Kernel have been carried out to compare the performance of the Cross-Validation (CV), Generalized Cross-Validation (GCV), and Unbiased Risk (UBR) methods in estimating the optimal smoothing parameters (knot point and bandwidth). Based on the simulation results, with errors following the Normal distribution and across the tested combinations of sample size and error variance, the GCV method gives better performance and accuracy than the CV and UBR methods. The GCV method produces optimal smoothing parameters, so that the largest coefficient of determination (R²) is obtained for each combination. The results of this study have the potential to contribute to the development of statistics, especially in the field of nonparametric regression.