An Agglomerative Hierarchical Clustering with Various Distance Measurements for Ground Level Ozone Clustering in Putrajaya, Malaysia

Ground-level ozone is a common pollutant that has a negative influence on human health. The main difficulty in analyzing ozone levels lies in the complex representation of the data, which takes the form of a time series. Clustering is a common technique for meteorological and environmental time series data. How a clustering technique groups similar sequences depends on a distance or similarity criterion. Several distance measures have been integrated with various types of clustering techniques; however, identifying an appropriate distance measure for a particular field is a challenging task. Since hierarchical clustering has been considered the state of the art for meteorological and climate change data, this paper proposes an agglomerative hierarchical clustering for ozone level analysis in Putrajaya, Malaysia, using three distance measures: Euclidean, Minkowski and Dynamic Time Warping. Results show that Dynamic Time Warping outperformed the other two distance measures.
Keywords: agglomerative hierarchical clustering; Dynamic Time Warping; ozone analysis; time series


I. INTRODUCTION
Putrajaya is one of the developed cities of Malaysia. With rapid economic development and population growth, several environmental pollution issues have arisen, one of which is increasing ozone pollution. This increase has a significant impact on human health [1]. Several monitoring stations are now used to observe ozone trends. For analyzing such trends, machine learning techniques, and clustering in particular, offer a way to detect significant patterns. Clustering aims to aggregate similar data points within clusters [2]. In this manner, similar trends can be aggregated in a single group, which facilitates the analysis of causes. However, the key challenge in clustering ozone levels lies in the representation of the data, which are recorded over time [3].
Time series emerged in response to the growth of chronologically ordered data, in which observations are recorded at time intervals [4]. Time series data arise in many domains, such as finance, weather forecasting and pattern recognition [5]. A common task in time series data mining is identifying similar sequences, which is performed using clustering techniques.
Many clustering techniques can be used for this task. One common technique is hierarchical clustering, which has been considered the state of the art for various environmental and meteorological data in the literature [6]-[8]. Hierarchical clustering builds a hierarchy of clusters: either all data points are initialized as one cluster that is then split into multiple clusters (Divisive Hierarchical Clustering), or each data point is initialized as its own cluster and the clusters are then merged into a smaller number of clusters (Agglomerative Hierarchical Clustering) [9].
In addition, the similarity or distance function used by the clustering technique plays an essential role in the performance of the clustering task [10]. Several distance measures have been proposed, including the Euclidean, Minkowski and Dynamic Time Warping distance measures. Pairing an appropriate distance measure with an appropriate clustering technique is a challenging task [10]. Therefore, identifying a suitable distance measure for ozone level clustering is essential. This paper conducts a comparative analysis of three distance measures, Euclidean distance (ED), Minkowski distance (MD) and Dynamic Time Warping (DTW), using Agglomerative Hierarchical Clustering (AHC).

II. MATERIAL AND METHOD
Various studies have tackled the problem of detecting ozone trends. For instance, Solazzo et al. [11] conducted a comprehensive analysis of surface-level ozone based on air quality in Europe and North America, in which an ensemble clustering approach was used to group similar data.
Saithanu & Mekparyup [8] proposed an agglomerative hierarchical clustering with the Euclidean distance measure for clustering ozone levels in the east of Thailand. The authors concentrated on the significant factors that lead to increased ozone levels, such as temperature, wind direction, humidity and wind speed.
Similarly, Austin et al. [12] concentrated on factors associated with ozone levels, such as temperature, pressure and sea level, for ozone detection using k-means clustering. The data used in that study were daily data collected from Boston Logan airport. The authors attempted to identify the most appropriate number of clusters, and their results showed that five clusters gave the best performance.
In addition, Malley et al. [7] proposed a Hierarchical Clustering Analysis (HCA) with non-negative matrix factorization for classifying ozone levels in Europe. Multiple datasets of ozone variation measurements for the period 1991-2010 were used in that study. The clustering was used to identify relationships that influence ozone levels.
Finally, Ahmadi et al. [6] applied two kinds of clustering, k-means and agglomerative hierarchical clustering, for ozone level analysis. First, k-means was used to detect significant ozone patterns. Then, agglomerative hierarchical clustering was used to identify hourly ozone patterns. Finally, multiple regression was carried out to predict ozone based on seasons and zones.

A. Proposed Method AHC with DTW
The proposed method is an agglomerative hierarchical clustering carried out with complete (maximum) linkage and Dynamic Time Warping as the distance measure. The method has been applied to group the ground-level ozone observations in Putrajaya, Malaysia for the year 2006.

B. Research Method
The research method of this study consists of five main phases, as shown in Fig. 1. The first phase is data, which covers the collection, details and characteristics of the data used in the experiments. The second phase is preprocessing, which covers the cleaning tasks performed to turn the data into an appropriate form. The third phase is clustering, which covers the application of agglomerative hierarchical clustering. The fourth phase covers the distance measures, namely ED, MD and DTW. Finally, the fifth phase is evaluation, in which the clustering results are validated using an evaluation method.

C. Data
Data were collected from LESTARI [13], the Institute for Environment and Development in Malaysia and the Asia Pacific. The institute was established in 1994 within Universiti Kebangsaan Malaysia (UKM) to deal with environment and development issues. The data contain ozone levels for one year (2006) for Putrajaya city. The data are recorded at hourly intervals and contain 8544 instances.

D. Preprocessing
This phase prepares the data so that they are more suitable for processing. Raw data typically include irrelevant, noisy and incomplete instances, and handling them plays an essential role in improving clustering performance [14]. Hence, two tasks have been carried out for this purpose: cleaning and discretization. Cleaning handles missing values and calibration errors, which can cause incorrect matches during clustering [15]. Microsoft Excel was used to detect such values; 158 missing values and 431 calibration errors were identified and replaced using an ANN prediction algorithm in Matlab. Discretization, in turn, limits the values to a specific interval, which facilitates clustering by reducing the values to a particular range. Discretization is essential for specific algorithms such as hierarchical clustering [16].
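As a rough illustration of the discretization task, the sketch below maps raw values to a fixed number of equal-width bins. The paper does not state which discretization scheme was used, so equal-width binning is an assumption made here for illustration only, and the names `discretize` and `n_bins` are hypothetical.

```python
def discretize(values, n_bins):
    """Map each value to an equal-width bin index in the range [0, n_bins - 1].

    Assumption: equal-width binning; the paper does not specify its scheme.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so they all share the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


if __name__ == "__main__":
    # Hypothetical ozone readings, not taken from the paper's dataset.
    print(discretize([0, 10, 20, 30, 40], 4))
```

The number of bins is a free parameter in this sketch; in practice it would be chosen from the observed range of the ozone readings.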

E. Hierarchical Clustering
This phase aims to apply a hierarchical clustering technique. In general, hierarchical clustering algorithms work by aggregating the objects into a tree of clusters [17].
Hierarchical clustering can be categorized into two types: agglomerative and divisive. The categorization reflects whether objects are grouped bottom-up or top-down. AHC is a bottom-up approach in which each object starts in a separate cluster and the clusters are then merged into larger clusters [2]. This process continues until a specific termination condition is reached. The complete linkage algorithm measures the distance between two clusters as the distance between their two farthest data points, one from each cluster. The merge is then made between the two clusters with the minimum such distance, i.e. the most similar clusters.
In this paper, AHC has been applied with maximum (complete) linkage and three distance measures: Euclidean Distance (ED) [18], Minkowski Distance (MD) [19] and Dynamic Time Warping (DTW) [20]. These measures are described in the next sub-section.
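The bottom-up merging procedure can be sketched as follows. This is a naive illustration rather than the implementation used in the experiments: the function name `complete_linkage_ahc` and its parameters are hypothetical, and the distance function is passed in so that any of the three measures could be plugged in. Under complete linkage, the distance between two clusters is the largest pairwise distance between their members, and the pair of clusters with the smallest such distance is merged at each step.

```python
def complete_linkage_ahc(items, n_clusters, dist):
    """Naive agglomerative clustering with complete (maximum) linkage.

    items      : list of objects (e.g. time series)
    n_clusters : number of clusters at which merging stops
    dist       : function returning the distance between two items
    Returns a list of clusters, each a list of indices into `items`.
    """
    # Start with every object in its own cluster.
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the farthest pair.
                d = max(dist(items[i], items[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Merge the two clusters with the smallest complete-linkage distance.
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    # Hypothetical one-dimensional points, not taken from the paper's data.
    points = [0.0, 0.1, 5.0, 5.2, 10.0]
    print(complete_linkage_ahc(points, 3, lambda p, q: abs(p - q)))
```

This version recomputes all pairwise cluster distances at every step, which is simple to read but slow for large inputs; caching the distance matrix would be the usual refinement.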

F. Distance Measures
The key element of the clustering process is the function used to measure the similarity between two data objects. The data may take the form of raw values of equal or unequal length, or of vectors of feature pairs [21].
For Euclidean distance, let $x = (x_1, \dots, x_P)$ and $y = (y_1, \dots, y_P)$ be P-dimensional vectors; then the Euclidean distance can be measured as [21]:

$$d_E(x, y) = \sqrt{\sum_{i=1}^{P} (x_i - y_i)^2}$$

For Minkowski distance, let $x$ and $y$ again be P-dimensional vectors. Minkowski distance is a generalization of Euclidean distance, which is computed as follows [21]:

$$d_M(x, y) = \left( \sum_{i=1}^{P} |x_i - y_i|^q \right)^{1/q}$$

where $q$ is a positive integer.

On the other hand, DTW has been widely used to compare discrete sequences and sequences of continuous values [21]. Let $X = \{x_1, x_2, \dots, x_i, \dots, x_n\}$ and $Y = \{y_1, y_2, \dots, y_j, \dots, y_m\}$ be two time series. DTW minimizes the difference between these series by means of an $n \times m$ matrix [22]. In this matrix, the distance between $x_i$ and $y_j$ is calculated using Euclidean distance.

A warping path $W = \{w_1, w_2, \dots, w_k, \dots, w_K\}$, where $\max(m, n) \le K \le m + n - 1$, is a set of elements from the matrix that meets three constraints: boundary condition, continuity and monotonicity [22]. The boundary condition requires the warping path to start and finish in diagonally opposite corner cells of the matrix, that is, $w_1 = (1, 1)$ and $w_K = (n, m)$. The continuity constraint restricts the allowable steps to adjacent cells. The monotonicity constraint forces the points in the warping path to be monotonically spaced in time [23]. The warping path that has the minimum distance between the two series is of interest. Hence, the DTW can be computed as follows:

$$d_{DTW}(X, Y) = \min_{W} \frac{1}{K} \sum_{k=1}^{K} d(w_k)$$

where $d(w_k)$ is the matrix entry at the k-th element of the warping path.
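To make the three distance measures of this sub-section concrete, the sketch below implements them in plain Python. It is an illustration under stated assumptions, not the code used in the experiments: the function names are hypothetical, the series are lists of scalar readings, and the DTW function uses the common dynamic-programming recurrence and returns the unnormalized cumulative cost of the best warping path (some formulations additionally divide by the path length K).

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def minkowski(x, y, q):
    """Minkowski distance of order q; q = 2 gives the Euclidean distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def dtw(x, y):
    """Unnormalized DTW cost between two series of possibly unequal length."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cum[i][j] is the minimal cumulative cost of aligning x[:i] with y[:j].
    cum = [[inf] * (m + 1) for _ in range(n + 1)]
    cum[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: for scalars, Euclidean distance is the absolute
            # difference.
            cost = abs(x[i - 1] - y[j - 1])
            # Continuity and monotonicity: only adjacent predecessor cells.
            cum[i][j] = cost + min(cum[i - 1][j],
                                   cum[i][j - 1],
                                   cum[i - 1][j - 1])
    # Boundary condition: the path ends in the opposite corner cell.
    return cum[n][m]


if __name__ == "__main__":
    print(euclidean([0, 0], [3, 4]))
    print(minkowski([0, 0], [3, 4], 1))
    print(dtw([1, 2, 3], [1, 2, 2, 3]))
```

In the last call the second series repeats one reading, so the warping path can align the repeated value with its counterpart. This ability to absorb shifts in time is what distinguishes DTW from the two point-wise measures.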

A. Evaluation
One of the challenging tasks in clustering is evaluating its results, which requires answering the question 'what is the best way to group the data?' [24]. Two main approaches have been proposed for validating clusters: external and internal validation [25]. External validation assesses the clusters against a known class distribution using common information retrieval metrics such as precision, recall and F-measure. However, this kind of validation relies on labeled data. Since real-life data are usually unlabeled, external validation tends to be impractical. Internal validation, on the other hand, measures the similarity among objects within a cluster (intra-cluster) and the dissimilarity among objects in different clusters (inter-cluster). The main aim of clustering is to ensure that objects within a single cluster are as similar as possible and that objects in different clusters are as dissimilar as possible. Hence, the Root Mean Square Error Standard Deviation (RMSE-SD) is computed to measure the homogeneity of objects within a single cluster and across multiple clusters. A smaller intra-cluster RMSE-SD indicates better performance, because the objects within a cluster are very similar. In contrast, a larger intra-cluster RMSE-SD indicates lower performance, because the objects within a cluster are less homogeneous. Therefore, the best results are associated with a smaller intra-cluster RMSE-SD and a larger inter-cluster RMSE-SD.
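The intuition behind the intra-cluster and inter-cluster scores can be sketched in code. The paper does not give the exact RMSE-SD formula, so the definitions below are assumptions chosen only to illustrate the idea: the intra-cluster score is the root mean square deviation of members from their cluster centroid, averaged over clusters, and the inter-cluster score is the root mean square deviation of the cluster centroids from their overall mean. All function names are hypothetical.

```python
import math


def centroid(series_list):
    """Point-wise mean of a list of equal-length series."""
    length = len(series_list[0])
    return [sum(s[t] for s in series_list) / len(series_list)
            for t in range(length)]


def rms_deviation(series_list, center):
    """Root mean square deviation of the series from a given center."""
    squared = sum((s[t] - center[t]) ** 2
                  for s in series_list for t in range(len(center)))
    return math.sqrt(squared / (len(series_list) * len(center)))


def intra_cluster_score(clusters):
    """Assumed intra-cluster score: smaller means tighter clusters."""
    return sum(rms_deviation(c, centroid(c)) for c in clusters) / len(clusters)


def inter_cluster_score(clusters):
    """Assumed inter-cluster score: larger means better separated clusters."""
    centers = [centroid(c) for c in clusters]
    return rms_deviation(centers, centroid(centers))


if __name__ == "__main__":
    # Two hypothetical clusters of two-point series, not from the paper.
    clusters = [[[0.0, 0.0], [2.0, 2.0]], [[10.0, 10.0], [12.0, 12.0]]]
    print(intra_cluster_score(clusters))
    print(inter_cluster_score(clusters))
```

With these assumed definitions, a good clustering shows a small intra-cluster score together with a large inter-cluster score, which matches the selection rule stated above.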

B. Experiments
The experiments were conducted in the C# programming language; the data were transformed into columns and the noisy data were eliminated. The agglomerative hierarchical clustering was performed with maximum linkage using the three distance measures, Euclidean, Minkowski and DTW. The clustering was run with the number of clusters as a parameter, ranging from 3 to 15. This range was set as a result of analyzing the data and identifying the appropriate classes.
This section reports the results of the proposed AHC using ED, MD and DTW. The results were obtained for numbers of clusters ranging from 3 to 15, as determined from observation of the data. Table 1 shows the intra-cluster results and Table 2 shows the inter-cluster results. As shown in Table 1, the minimum RMSE-SD values were obtained with 3 clusters for ED, MD and DTW, at 0.00954, 0.0127 and 0.0039 respectively. As mentioned earlier, a smaller intra-cluster RMSE-SD indicates better performance; therefore, 3 clusters gives the most accurate result. Moreover, DTW shows the smallest RMSE-SD of the three distance measures, which means that DTW outperformed both ED and MD for the intra-cluster evaluation. As shown in Table 2, the maximum RMSE-SD was 0.118 for ED at 10 clusters, 0.253 for MD at 12 clusters, and 0.34 for DTW at 3 clusters. As mentioned earlier, a larger inter-cluster RMSE-SD indicates better performance. Comparing the three values, DTW has the greatest one, which means that DTW also outperformed the other distance measures for the inter-cluster evaluation. Fig. 2 shows the performance of the three distance measures for both the intra-cluster and inter-cluster evaluations.
[Fig. 2 legend: DTW, Euclidean, Minkowski]
Comparing the results of the proposed AHC with DTW against the related work is challenging for several reasons. First, the datasets used in the related work are different. Second, the evaluation of clustering varies among the studies. Third, the aims of applying clustering also differ; some studies addressed relationships that influence the variation of ozone level. Finally, the number of years and the regions covered differ among studies. However, since the Euclidean and Minkowski distance measures have been used with AHC in the related work, it can be concluded that the proposed DTW with AHC shows competitive performance.

C. Discussion
The US Office of Air and Radiation [26] has discussed the factors that lead to air pollution. In that investigation, ozone was one of the main factors that could harm human health. For this reason, AirNow (2009) provides 5 categories of air pollution, which are shown in Table 3. To provide a more critical analysis of the acquired clusters, the best number of clusters based on the RMSE-SD, which is 9, is considered alongside the AirNow (2009) categorization. A comparison is therefore made between 5 and 9 clusters, based on several variables: the starting values of ozone, the maximum peak, the maximum peak of the median and the ending values. Table 4 shows the values for 5 clusters. As shown in Table 4, the 'unhealthy' category covers nearly half of the days of the year, which appears to be an overestimate and suggests that this category should be divided into more categories. In contrast, the 'moderate' category contains only eight days, which appears to be an underestimate; this category would be expected to contain more days. Table 5 shows the values for 9 clusters. As shown in Table 5, unlike the standard 5-category scheme, the 9-category scheme provides a better description of the days of the year by offering more categories.
For instance, the 'unhealthy' category has been split into two categories, 'unhealthy' and 'very unhealthy for sensitive group', each of which contains a reasonable number of days. In addition, the 'moderate' category has been split into three categories, 'high moderate', 'moderate' and 'low moderate', which likewise contain a reasonable number of days. Finally, the 'good' category has been divided into two categories, 'very good' and 'good'. Fig. 3 and Fig. 4 show the distribution of the categories over the number of days.
This paper has conducted a comparative study of three distance measures, Euclidean Distance (ED), Minkowski Distance (MD) and Dynamic Time Warping (DTW), for clustering ozone levels in Putrajaya, Malaysia using Agglomerative Hierarchical Clustering (AHC). The data used in this paper are hourly observations of ozone level for one year (2006). Results showed that DTW has superior performance compared to the other two distance measures. In future work, a comparative analysis of different clustering techniques such as k-means, k-medoids, density-based and others would contribute toward improving the effectiveness of clustering results.