Statistical Challenges of Sustainable Development in Iraq

Iraq faces many challenges on the road to sustainable development. Data limitations severely constrain the monitoring of the sustainable development goals, because about 70% of the indicators are missing in Iraq. Statistical indicators are the basis for assessing sustainable development in any country. A healthy population is the third of the sustainable development goals, because a healthy population supports a high level of sustainable development. Fertility is an important indicator of a healthy population because it is one of the three principal components of population dynamics that increases the size of any country's population. An important indicator of fertility is the number of children born to married women. The problem addressed by this research is that about 70% of the sustainable development goal indicators are currently missing in Iraq, especially inferential indicators, which involve modelling the association between phenomena and their risk factors. The goal of this research is to build a model of the association between fertility and its risk factors, in order to provide an inferential indicator of this phenomenon and to identify the most influential risk factors. Several risk factors affect pregnancy before childbirth, but which of them are the most influential? Count regression models are a statistical technique that can express the association between the number of children and its risk factors. A simple random sample of size 200 was drawn from the mothers who visited the primary health care centres in Babylon province, Iraq, in 2018. Biological, behavioural and lifestyle factors were recorded. The results of applying the Poisson regression model show that the data are overdispersed, follow a nonlinear functional form and contain a number of outliers. Therefore, the nonparametric Poisson model was used as an alternative model for these data. The results of applying this model show that the most influential risk factors are five linear risk factors in addition to three nonparametric smoothing factors.

Keywords: Poisson regression; influential risk factors; nonparametric Poisson regression; statistical challenges; sustainable development.


I. INTRODUCTION
At the meeting on Sustainable Development (SD) held in September 2015, the United Nations (UN) announced 17 Sustainable Development Goals (SDGs) and 169 targets to achieve inclusive development in all fields of life [1]. Attaining these goals requires a development strategy in which statistical data plays an essential role, provided that the data are delivered effectively to the managers responsible for SD. Statistical modeling has an active role in delivering indicators on social, economic, and environmental subjects and in expressing the relationships between the phenomena involved in achieving these goals.
Over the last four decades, the major challenges in Iraq have been the damage to institutions, wealth, and organizations resulting from conflict, war, hardship, insecurity, poor governance, deliberate fragility, and the lack of many of the foundations of public peace [2]. In 2015, Iraq ranked 121st out of the 188 countries in the Human Development Index (HDI), below many other countries. All these factors have affected the capability and facilities of the health and education sectors. Indicators of child and maternal health care and of education did not meet the SDGs.
Consequently, it is difficult to acquire complete data for assessing progress towards the SDGs and for classifying the indicators by characteristics, and this limits statistical analyses and their assumptions. The cumulative data gap exceeds 70% of the overall SDG indicators. Methodological initiatives therefore need to be re-evaluated, and an integrated database needs to be set up to track the SDGs and to incorporate previously unavailable indicators into new surveys, particularly those relating to gender and to the development situation in various areas of Iraq. However, the future of the statistical system and its impact on SDG data collection in Iraq are unclear. Moreover, Iraqi organizations do not provide complete and reliable organizational data [3], [4].
There is a wealth of published material on SD in general and on the SDGs in particular, from the UN, international private organizations, many other concerned organizations, and individuals more locally. There is also a UN website on SD, as well as an SD knowledge platform with updates on the high-level political forum and on individual topics, and a directory of resources including recent publications [6], [7].
The third goal of SD, "Ensure healthy lives and promote well-being for all at all ages", includes many targets, among them "Reduce the global maternal mortality ratio to less than 70 per 100,000 live births by 2030, and end preventable deaths of newborns and children under 5 years of age, with all countries aiming to reduce neonatal mortality to at least as low as 12 per 1,000 live births and under-5 mortality to at least as low as 25 per 1,000 live births". However, the Iraqi Ministry of Health has stated that the data and information collected are not used in health policymaking decisions (NVR, 2019). Fertility is one of the three principal components of population dynamics that increases the size of a country's population (Reddy and Alemayehu, 2015). Three groups of risk factors influence fertility: biological, behavioral, and lifestyle factors [8].
The problem addressed by this research is that about 70% of the SDG indicators are currently missing in Iraq. Inferential indicators, which involve modeling the association between phenomena and their risk factors, are especially lacking, because SD monitoring depends on descriptive indicators that consist of ratios and rates only. This research aims to build a model of the association between fertility (the number of live children per mother) and its risk factors, in order to provide an inferential indicator of this phenomenon. The second goal is to identify the most influential risk factors affecting fertility, so that health care managers can be advised to focus on them. Fertility is an important component of population development and of healthy lives and well-being for all ages, which is the third goal of the SDGs.
The rest of this paper is organized as follows. Section 2 presents the materials and method, including data collection and statistical analysis. Section 3 discusses the results of analyzing a real data set. Finally, Section 4 gives some concluding remarks.

II. MATERIALS AND METHOD
The model specification, the dataset description, and the descriptive statistics for the risk factors and the dependent variable Y are presented in the following subsections.

A. Model Specification
Fertility is measured by the number of live children per mother, and the risk factors are categorized into three groups: biological, behavioral, and lifestyle factors. First, descriptive statistics were obtained for the risk factors and the dependent variable Y to give an overview of the phenomenon. Second, the Standard Poisson Regression (SPR) model, which is designed to represent the association between a count dependent variable and its risk factors, was fitted. The fitted regression model relates the dependent variable (number of live children per mother, $Y$) to one or more risk factors $X_j$, $j = 1, 2, \dots, p$, which may be either quantitative or categorical. The SPR model thus tries to explain the count variable $y_i$ using the risk factors $x_{ij}$, for $i = 1, \dots, n$ and $j = 1, \dots, p$. The $p$-dimensional vector of risk factors $x_i$ contains the characteristics of the $i$th observation. The SPR model represents the conditional mean of $y_i$ given $x_i$ as an exponential function [8]:

$$E(y_i \mid x_i) = \mu_i = \exp\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right) \tag{1}$$

This is sometimes referred to as a log-linear model, since:

$$\log(\mu_i) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \tag{2}$$

where $y_i$ is the number of live children per mother, $i = 1, 2, \dots, n$, $x_{ij}$ is the $j$th risk factor and $\beta_j$ is the unknown parameter of the $j$th risk factor. The SPR model should be used under two restrictive assumptions. The first is the equidispersion assumption:

$$E(y_i \mid x_i) = \mathrm{Var}(y_i \mid x_i) = \mu_i \tag{3}$$

This is a very restrictive assumption of the SPR model: the conditional mean equals the conditional variance. To test this assumption, the increase in variance is represented in the model by a constant multiple of the variance-covariance matrix [10], [11]:

$$\mathrm{Var}(y_i \mid x_i) = \phi \, \mu_i \tag{4}$$

The overdispersion parameter $\phi$ can be estimated as the average Pearson statistic $\chi^2$ or the average deviance $D$, where the average is taken over the residual degrees of freedom [5]:

$$\hat{\phi} = \frac{\chi^2}{n - p - 1} \quad \text{or} \quad \hat{\phi} = \frac{D}{n - p - 1} \tag{5}$$

This is analogous to the estimation of the error variance in the linear model, with $\chi^2$ or $D$ replacing the residual sum of squares. The Pearson or deviance statistic is often used as a test for dispersion: a value of $\hat{\phi} = 1$ indicates equidispersion, whereas values below 1 indicate underdispersion and values greater than 1 indicate overdispersion. This assumption is violated in many applications. Generally, two sources of violation are identified: heterogeneity of the population and an excess of zeros. Heterogeneity is detected when the data set can be divided into several homogeneous subgroups. An excess of zeros is detected when the number of observed zeros is much larger than the number of zeros expected under the fitted Poisson distribution [12]. In practice, the variance observed when applying the Poisson regression model is often higher than the mean. The conditional variance of the response variable is then modeled with an overdispersion parameter $\phi$, as in Eq. (4). Reference [13] describes several Negative Binomial (NB) models for count data that take this additional variability into account. The traditional model, named NB2, is the subject of this paper.
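The dispersion check described above can be sketched in Python. This is a minimal illustration and not the analysis used in this study, which was run in SPSS and SAS: it fits a one-covariate Poisson regression by Newton-Raphson and then computes the Pearson and deviance estimates of the dispersion parameter. The data values and the names `fit_poisson` and `dispersion` are hypothetical.

```python
import math

def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log(mu) = b0 + b1 * x by Newton-Raphson on the Poisson log-likelihood."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: derivatives of the log-likelihood w.r.t. b0 and b1
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix (2 x 2, symmetric)
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system information * step = score
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def dispersion(x, y, b0, b1):
    """Return (Pearson chi-square / df, deviance / df) for the fitted model."""
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    df = len(y) - 2  # n - p - 1 with p = 1 covariate
    pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    deviance = 2 * sum(
        (yi * math.log(yi / mi) if yi > 0 else 0.0) - (yi - mi)
        for yi, mi in zip(y, mu)
    )
    return pearson / df, deviance / df

# Hypothetical data: a centered covariate and a count response
covariate = [-3, -2, -1, 0, 1, 2, 3, -3, -2, -1, 0, 1, 2, 3]
children = [0, 1, 1, 2, 3, 4, 7, 1, 0, 2, 3, 2, 6, 5]

b0, b1 = fit_poisson(covariate, children)
phi_pearson, phi_deviance = dispersion(covariate, children, b0, b1)
print(f"b0={b0:.3f}, b1={b1:.3f}")
print(f"Pearson/df={phi_pearson:.3f}, deviance/df={phi_deviance:.3f}")
```

Values of either ratio well above 1 would point towards overdispersion, and values well below 1 towards underdispersion.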
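This model is better than the SPR model because it allows for random variation in the Poisson conditional mean $\lambda_i$, by letting $\lambda_i = \mu_i \varepsilon_i$, where $\mu_i = \exp(x_i^{\top} \beta)$, $x_i$ is the $i$th row of the matrix $X$ of order $n \times (p + 1)$ with $p$ covariates, $\beta$ is the parameter vector of order $(p + 1) \times 1$ including the intercept, and $\varepsilon_i$ is a gamma noise variable with a mean of 1 and a scale parameter of $\nu$ [14]. The NB2 density function of the dependent variable is given by [14]:

$$f(y_i; \alpha, \mu_i) = \frac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\,\Gamma(\alpha^{-1})} \left(\frac{\alpha^{-1}}{\alpha^{-1} + \mu_i}\right)^{\alpha^{-1}} \left(\frac{\mu_i}{\alpha^{-1} + \mu_i}\right)^{y_i} \tag{6}$$

where the overdispersion parameter is $\alpha = 1/\nu$, the conditional mean is $E(y_i \mid x_i) = \mu_i$ and the conditional variance is $\mathrm{Var}(y_i \mid x_i) = \mu_i (1 + \alpha \mu_i)$. The negative binomial regression (NBR) model is used to represent the association between covariates and a count response variable when the conditional variance of the count response is higher than its mean. The NBR model is a kind of GLM in which the response variable Y is a count of occurrences. The traditional NBR model is:

$$\ln(\mu) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \tag{7}$$

where $x_1, x_2, \dots, x_p$ are the explanatory variables, and $\beta_1, \beta_2, \dots, \beta_p$ are the unknown population regression parameters to be estimated. The fundamental NBR model for an observation $i$ is written as [15]:

$$\Pr(Y = y_i \mid \mu_i, \alpha) = \frac{\Gamma(y_i + \alpha^{-1})}{\Gamma(\alpha^{-1})\,\Gamma(y_i + 1)} \left(\frac{1}{1 + \alpha \mu_i}\right)^{\alpha^{-1}} \left(\frac{\alpha \mu_i}{1 + \alpha \mu_i}\right)^{y_i} \tag{8}$$

The maximum likelihood estimation (MLE) method is used to estimate the parameters $\alpha$ and $\beta$; the likelihood function is [16]:

$$L(\alpha, \beta) = \prod_{i=1}^{n} \frac{\Gamma(y_i + \alpha^{-1})}{\Gamma(\alpha^{-1})\,\Gamma(y_i + 1)} \left(\frac{1}{1 + \alpha \mu_i}\right)^{\alpha^{-1}} \left(\frac{\alpha \mu_i}{1 + \alpha \mu_i}\right)^{y_i} \tag{9}$$

The NB2 ML estimate $(\hat{\beta}, \hat{\alpha})$ is the solution to the first-order conditions, which are obtained by setting the first derivatives of the logarithm of equation (9) with respect to $\beta$ and $\alpha$ to zero:

$$\sum_{i=1}^{n} \frac{y_i - \mu_i}{1 + \alpha \mu_i}\, x_i = 0 \tag{10}$$

and

$$\sum_{i=1}^{n} \left\{ \frac{1}{\alpha^{2}} \left( \ln(1 + \alpha \mu_i) - \sum_{j=0}^{y_i - 1} \frac{1}{j + \alpha^{-1}} \right) + \frac{y_i - \mu_i}{\alpha (1 + \alpha \mu_i)} \right\} = 0 \tag{11}$$

Like other forms of regression, NBR requires testing the assumptions of linearity, homoscedasticity, and normality.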
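As a numerical check on the NB2 density and its stated moments, the following Python sketch evaluates the density on a log scale and confirms that it sums to one, with mean $\mu$ and variance $\mu(1 + \alpha\mu)$. The function name `nb2_log_pmf` and the parameter values are hypothetical and chosen only for illustration.

```python
import math

def nb2_log_pmf(y, mu, alpha):
    """Log of the NB2 probability of count y, given mean mu and overdispersion alpha."""
    inv = 1.0 / alpha
    return (
        math.lgamma(y + inv) - math.lgamma(y + 1) - math.lgamma(inv)
        + inv * math.log(inv / (inv + mu))
        + y * math.log(mu / (inv + mu))
    )

mu, alpha = 3.0, 0.5  # hypothetical mean and overdispersion parameter

# Probabilities over a range wide enough that the remaining tail is negligible
probs = [math.exp(nb2_log_pmf(y, mu, alpha)) for y in range(400)]

total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print(f"total={total:.6f}, mean={mean:.6f}, variance={variance:.6f}")
print(f"mu * (1 + alpha * mu) = {mu * (1 + alpha * mu):.6f}")
```

Working with `math.lgamma` rather than the gamma function itself avoids overflow for large counts.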
On the other hand, the SPR and NBR models assume that the link function linearizes the conditional mean of the response in terms of the regression parameters, meaning that it is linear in the predictors. Sometimes, some predictors enter the model with an unknown form. When the form in which the covariates enter the model is unknown, the SPR or NBR model becomes:

$$\log(\mu_i) = \beta_0 + \sum_{j=1}^{p} \beta_j f_j(x_{ij})$$

where $f_j(\cdot)$ is a real-valued function of $x_j$ only. If the functional form of $f_j(\cdot)$ is known, the design matrix is completely specified, and the MLE of $\beta$ can be obtained. If the functional form of $f_j(\cdot)$ is unknown for any $j$, the design matrix is not completely specified. Therefore, when the two assumptions (equidispersion and linearity in the predictors) are not satisfied, we propose nonparametric methods to fit the SPR model. Many nonparametric spline regressions have been developed, differing in the strategy used to transform the variables. Among them, the cubic spline (CS) method can be used to test whether a relationship is nonlinear or can be adequately summarized by a linear relationship.
Splines are defined as piecewise polynomial functions that fit together at 'knots'; for the CS, the first and second derivatives are also continuous at the knots [17]. Using a CS method in a regression analysis fits the generalized linear model in the variables $X$ and in $k - 2$ piecewise cubic variables in $X$, together with the intercept, to give what is called the generalized additive model (GAM) defined by [18]. GAMs focus on exploring the data nonparametrically, which makes them more suitable for analyzing the data and visualizing the relationship between the dependent variable and the independent variables. The additive model generalizes the linear model by modeling the dependency as:

$$Y = \beta_0 + \sum_{j=1}^{p} f_j(X_j) + \epsilon$$

where $f_j$, $j = 1, 2, \dots, p$, are smooth functions, $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$. In order to be estimable, the smooth functions $f_j$ must satisfy standardization conditions such as $E[f_j(X_j)] = 0$. These functions are not given a parametric form but are instead estimated in a nonparametric fashion.
Thus, a cubic spline Poisson regression (CSPR) model with $k$ knots at $t_1 < t_2 < \dots < t_k$, with $p$ independent variables and a dependent variable Y that follows a Poisson distribution, consists of the original independent variables and the spline-transformed variables, and takes the form given in [18].
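To make the spline-transformed variables concrete, the sketch below builds, for a single covariate, the linear term plus $k - 2$ piecewise cubic terms. It assumes the restricted (natural) cubic spline parameterization, which is one common construction that yields exactly $k - 2$ cubic variables and is linear beyond the outer knots; the source does not state which construction was used. The function name `rcs_basis` and the knot locations are hypothetical.

```python
def positive_cube(u):
    """Truncated cubic: u**3 when u is positive, otherwise 0."""
    return u ** 3 if u > 0 else 0.0

def rcs_basis(x, knots):
    """Return [x, s_1(x), ..., s_{k-2}(x)] for a restricted cubic spline.

    The k - 2 nonlinear terms are built so that the fitted curve is
    linear below the first knot and above the last knot.
    """
    if len(knots) < 3:
        raise ValueError("at least three knots are required")
    t_pen, t_last = knots[-2], knots[-1]
    span = t_last - t_pen
    row = [x]
    for t_j in knots[:-2]:
        term = (
            positive_cube(x - t_j)
            - positive_cube(x - t_pen) * (t_last - t_j) / span
            + positive_cube(x - t_last) * (t_pen - t_j) / span
        )
        row.append(term)
    return row

# Hypothetical knots for age at marriage (years)
knots = [18.0, 22.0, 26.0, 32.0]
for age in (17.0, 20.0, 30.0, 40.0):
    print(age, rcs_basis(age, knots))
```

The columns returned by `rcs_basis` would enter the log-linear predictor of the Poisson model in place of the single raw covariate.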
After fitting the models, we use several criteria to assess the goodness of fit of the statistical models. First, the Pearson chi-square and the deviance are used to assess the hypothesis that the models fit the data adequately. If the p-value of these tests is less than some significance level, such as 0.10 or 0.05, there is a significant lack of fit. Second, the likelihood ratio test is used to assess the hypothesis that none of the explanatory variables influences the dependent variable. Third, the Wald test is used to assess the hypothesis that a model parameter equals zero, which means that the corresponding risk factor does not have a significant effect on the dependent variable.
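The Wald test for a single parameter can be sketched as follows. This is a minimal illustration that assumes the usual large-sample normal approximation for the estimate divided by its standard error; the function name `wald_test` and the numerical inputs are hypothetical and are not results from this study.

```python
import math

def wald_test(estimate, std_error):
    """Return (z, two-sided p-value) for the hypothesis that the parameter is zero."""
    z = estimate / std_error
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical coefficient and standard error for one risk factor
z, p_value = wald_test(0.07, 0.02)
print(f"z = {z:.3f}, p-value = {p_value:.5f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

The squared statistic `z ** 2` is the equivalent chi-square form of the test with one degree of freedom.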

B. Dataset Description
A simple random sample of size 200 was drawn from all the women who visited the primary health care centers in Babylon province in 2018. The biological, behavioral, and lifestyle factors representing the characteristics of the women under study were collected. This information was taken from the records of the women who visited these centers during 2018. Table 1 describes the response variable, which is the number of live children per mother during married life, and the risk factors affecting this variable, as follows:
• Age of mother, years (X1)
• Age of mother at marriage, years (X2)
• Mother's weight, kg (X3)
• Husband's age, years (X4)
• Number of miscarriages (X5)
• Exercise per week, hours (X6)
• Sleeping per day, hours (X7)
• Duration of breastfeeding, months (X8)
The dependent variable is the number of children born (Y). These data were analyzed with SPSS and SAS software using the SPR, NBR, and CSPR models; the results and discussion are given in the following section.

III. RESULTS AND DISCUSSION
The dataset included a sample of 200 mothers aged 17-49 in Babylon Governorate. The response variable considered in this study was fertility (the number of live children per mother), which is a count dependent variable. The main goal of the study is to choose the appropriate model to represent the association between the dependent variable and its risk factors, and to identify the most influential risk factors for fertility. Therefore, descriptive and inferential statistics are summarized and described in this section.
The descriptive statistics include tables and bar charts that describe the frequency distribution and percentages of the study variables. Table 1 and Figure 1 show the frequencies of the dependent variable (Y), the number of live children per mother, which is a count variable. The data set consists of quantitative variables with the descriptive statistics given in Table 2. The assumptions of linearity, homoscedasticity, and normality must be met for the SPR model. Therefore, residual analysis is very important for testing these assumptions of the SPR model. If there are no significant deviations away from 0 and 95% of the residuals are under the absolute value of 2.0, then the model fits the data. Figure 3 plots the predicted value of the mean of the response against the standardized deviance and chi-square residuals. This figure shows that only some of the values are under the absolute value of 2; therefore, the model did not meet the assumptions. In addition, [19] states that there are potential outliers if the absolute value of the standardized Pearson residuals is greater than 2 or 3. Some values are above 2, indicating that there are outliers, as shown in Figure 2. These results show that SPR is not the appropriate model. The NBR model, which accommodates overdispersion, is therefore used as an alternative. The results of applying the NBR model to the dataset show that the model fits the data well because the Chi-Square/df value is 0.187, which is greater than 0.05, so we can continue to interpret the other results of the fitted model. Also, the p-value of the likelihood ratio test is less than 0.05, which confirms the statistical significance of the model. Table 3 shows the estimated parameters from these results.
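The residual screening rule used above can be sketched as follows. This is a minimal illustration that uses plain Pearson residuals for a Poisson fit and, as a simplifying assumption, omits the leverage adjustment that fully standardized residuals include. The observed counts, the fitted means, and the function names are hypothetical.

```python
import math

def pearson_residuals(y, mu):
    """Pearson residuals for a Poisson fit: (observed - fitted) / sqrt(fitted)."""
    return [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu)]

def screen_residuals(residuals, cutoff=2.0):
    """Return the indices whose absolute residual exceeds the cutoff,
    and the share of residuals lying within the cutoff."""
    flagged = [i for i, r in enumerate(residuals) if abs(r) > cutoff]
    share_within = 1 - len(flagged) / len(residuals)
    return flagged, share_within

# Hypothetical observed counts and fitted means
observed = [0, 2, 3, 10]
fitted = [1.0, 2.0, 3.0, 2.5]

residuals = pearson_residuals(observed, fitted)
flagged, share_within = screen_residuals(residuals)
print("residuals:", [round(r, 3) for r in residuals])
print("potential outliers at indices:", flagged)
print("share within |2|:", share_within)
```

Under the rule stated in the text, a share within the cutoff below 95% would count against the adequacy of the fitted model.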
The results in Table 4 show that only two covariates have a significant effect ($X_1$ and $X_2$), with p-values less than 0.05, but their effects are in opposite directions. The exponentiated coefficient $\exp(\hat{\beta}_1) = 1.072$ is above one, so for every one-unit increase in $X_1$ the expected number of live children is multiplied by 1.072, an increase of about 7.2%. Meanwhile, the exponentiated coefficient $\exp(\hat{\beta}_2) = 0.946$ is below one, so for every one-unit increase in $X_2$ the expected number of live children is multiplied by 0.946, a decrease of about 5.4%. The other covariates did not have a significant effect because their p-values are greater than 0.05. Finally, residual analysis is very important for testing the assumptions of linearity, normality, and homogeneity of variance of the NBR model. The results of analyzing the residuals, together with Figure 3, show standardized deviance residuals above the absolute value of 2.0, so the model does not fit the data. The analysis of the standardized Pearson residuals also shows values above an absolute value of 2.0, which indicates outliers. Consequently, these results do not support the adequacy of the NBR model for this dataset, and we cannot consider it the best model to represent the association between the response variable and the covariates. Therefore, the CSPR model with $p$ independent variables and a dependent variable Y that follows a Poisson distribution, as in Eq. (7), was fitted to the data using SAS software [20]. The results of fitting this model show that five linear risk factors have a significant effect on the dependent variable, with p-value < 0.05. These most influential risk factors are the age of the mother (years), the age of the mother at marriage (years), the husband's age (years), the mother's weight (kg) and the number of miscarriages. Meanwhile, the other three risk factors have no significant effect on the response variable (p-value > 0.05), as shown in Table 4.
In addition, three nonparametric smoothing factors, namely the age of the mother at marriage, the number of miscarriages and the duration of breastfeeding, have a significant effect (p-value < 0.05) on the response variable, while the other nonparametric smoothing risk factors have no significant effect on the response variable (p-value > 0.05), as shown in Table 5. The goodness of fit of this model was tested using the null hypothesis that the model fits the data well against the alternative hypothesis that the model does not fit the data well. The deviance (D) of the final estimate was 78.462 with a p-value of 1. Since the p-value is greater than $\alpha = 0.05$, we do not have sufficient evidence to reject the null hypothesis that the model fits the data well. In other words, our model explains significantly more variation in the data, and it is an adequate model for these data.

IV. CONCLUSION
Iraq faces several challenges on the road to SD. One of them is data limitations, including the lack of classified data, which places a severe restriction on the monitoring of the SDGs. Another limitation is that SD monitoring is based on descriptive indicators (ratios and rates), which describe the current situation and do not give a future vision of the goal. This research focuses on the challenges facing Iraq with respect to the third goal of SD, "Ensure healthy lives and promote well-being for all at all ages." The available data in Iraq show developments in descriptive indicators (ratios and rates) of child and maternal mortality, but there are no inferential indicators to give a future vision of this phenomenon. Therefore, this research focuses on building a model of the association between one of the components of population dynamics (the number of newborn children) and its risk factors. Since the data are count data, the SPR model was used first. Fitting this model shows that it is not adequate, so the NBR model was used as an alternative. The results show that this model is also inadequate to represent the association between the response and its risk factors. Consequently, we propose using the CSPR model to fit these data. Fitting this model shows that the CSPR model is adequate and has a significant goodness of fit.
These results also show that there are five most influential risk factors in the linear part of the model and three risk factors in the nonparametric smoothing part, as shown in Eq. (15). Therefore, we advise health managers to focus on these risk factors in order to obtain good fertility rates in the population and good health care for mothers and for women generally. Future work should study the effect of other risk factors on this phenomenon and apply this model to produce inferential indicators for other SDGs. Consequently, interpolation studies among these target groups are necessary. More studies are needed to confirm the observed association between the SDGs and their risk factors and to give a future vision of SD in Iraq.