Comparative Analysis of Data Mining Techniques for Malaysian Rainfall Prediction

Abstract— Climate change prediction analyses the behaviour of weather over a specific period of time. Rainfall forecasting is a climate change task in which specific features such as humidity and wind are used to predict rainfall at specific locations. Rainfall prediction can be framed as a classification task in Data Mining. Different techniques lead to different performances depending on how the rainfall data are represented, including representations of long-term (monthly) patterns and short-term (daily) patterns. Selecting an appropriate technique for a specific duration of rainfall is therefore a challenging task. This study analyses multiple classifiers, namely Naïve Bayes, Support Vector Machine, Decision Tree, Neural Network and Random Forest, for rainfall prediction using Malaysian data. The dataset was collected from multiple stations in Selangor, Malaysia. Several pre-processing tasks were applied in order to resolve missing values and eliminate noise. The experimental results show that with a small portion of training data (10%) out of 1581 instances, Random Forest correctly classified 1043 instances. This is the strength of the ensemble of trees in Random Forest, where a group of classifiers can jointly beat a single classifier.


I. INTRODUCTION
Data Mining, or the Knowledge Discovery in Databases (KDD) process, is used to discover new patterns from large datasets and has had a profound impact on society by solving real-life problems [1]. Data mining aims to extract useful knowledge and represent it in an understandable form so that it can be utilized in the future [2]. Recently, a new wave of research has been conducted on time-series data mining, the process of analyzing sequences of data points that contain successive measurements made over a time interval [3]. Several domains nowadays rely on time-series data, such as finance, the stock market, climate change and others [4].
Climate change analysis examines the behavior of weather over a specific period of time [5]. The key characteristic of climate change lies in the nature of its data, which is captured at discrete time points [6]. One climate change task is rainfall forecasting, where specific features such as humidity and wind are used to predict rainfall at a specific location. Many techniques, such as Support Vector Machine (SVM), Naïve Bayes (NB), Neural Network (NN) and others, have been analyzed for rainfall forecasting; most are supervised learning techniques. The key point in supervised learning is selecting an appropriate technique with appropriate features. The performance among such techniques varies widely, which leaves room for improvement by combining multiple techniques or improving existing ones.
Rainfall forecasting is a challenging task that involves predicting rainfall from associated factors such as wind, humidity, and temperature. The task is usually performed using supervised learning techniques, and since there are many different supervised learning techniques, different performances can be obtained from them. In addition, rainfall data can come in different forms, including long-term (e.g., monthly) and short-term (e.g., daily) patterns. Therefore, selecting an appropriate technique for a specific duration of rainfall is a crucial task.
Several approaches have been proposed for rainfall forecasting in many locations, such as Korea, China, South Africa and others [7], [8], [9]. Current techniques for rainfall prediction include Neural Network [10], K-Nearest Neighbor and Naïve Bayes [11], Support Vector Machine [12] and others. Hence, there is a need to investigate multiple techniques in order to identify the best performance in terms of rainfall prediction. In addition, there is a need to investigate new locations for rainfall forecasting, such as Malaysia. Therefore, this study aims to evaluate multiple supervised learning techniques for rainfall forecasting using Malaysian data.
This study performs a comparative analysis among several supervised learning techniques including Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), Neural Network (NN), and Random Forest (RF) regarding their ability to predict rainfall data. The data has been collected from multiple stations in the state of Selangor in Malaysia.
The El Niño phenomenon has wreaked havoc on global weather patterns, including rainfall [13]. This has led to increased research effort addressing the rainfall prediction task. Past efforts have utilized many prediction techniques, several features/indicators, and multiple preprocessing approaches. For instance, [2] proposed a new preprocessing approach using moving average and singular spectrum analysis. The preprocessing is applied to the classes of the training data in order to transform them into low, medium and high categories. An Artificial Neural Network (ANN) then analyzes the data in order to predict the classes of an unseen (testing) portion of the data. Two daily mean rainfall series from the Zhenshui and Da'ninghe watersheds of China were used as experimental datasets.
A Modular Fuzzy Inference System that aims to predict monthly rainfall data collected from the northeast region of Thailand is proposed in [14]. The hypothesis of that study lies in the uncertainty of rainfall prediction, where the classes usually yield many potential instances. Fuzzy set theory was utilized to estimate the membership of each input variable: each instance was annotated with a membership value, and a rule-based approach was then implemented to predict the class of each input variable.
A multilayered Artificial Neural Network trained with the back-propagation algorithm has been used to analyze data from www.indiastat.com and the IMD website [7]. The input parameters for the ANN are the average humidity and average wind speed for the 8 months in each of the 50 years from 1960 to 2010. The output parameter is the average rainfall in the 8 months of every year over the same period. Results have shown that as the number of neurons in the ANN increases, the Mean Squared Error (MSE) decreases; in other words, a larger network yielded a lower prediction error on this data.
A hybrid method of feature extraction and prediction technique for predicting daily rainfall data collected from National Oceanic and Atmospheric Administration (NOAA) for more than 50 years was proposed by [12]. Basically, the features consist of humidity, pressure, temperature and wind speed. Neural Network has been used to classify the instances into low, medium and high classes based on a predefined training set.
A Bayesian algorithm for rainfall prediction in India using historical data collected from the Indian Meteorological Department is proposed by [15]. Six features were utilized, including temperature, pressure level, mean sea level, relative humidity, vapor pressure and wind speed. The Bayesian algorithm was trained on the data using these features. The prediction model was observed to be more accurate when the training dataset was very large.
More recently, [11] proposed a comparative study using Regression Tree (CART), Naïve Bayes, K-Nearest Neighbor and Neural Network. A dataset of 2245 samples of New Delhi rainfall records from June to September (the annual rainfall period) from 1996 to 2014 was used, including the features mean temperature, dew point temperature, humidity, sea pressure and wind speed. Neural Network performed best on this data with 82.1% accuracy, K-Nearest Neighbor was second best with 80.7%, Regression Tree (CART) scored 80.3%, while Naïve Bayes provided 78.9% accuracy.
Random Forest Ensemble Classification and Regression to improve rainfall assignment during the day, night and twilight based on cloud physical properties (remote sensing data) is proposed in [16]. The results proved that the proposed method is able to assign rainfall rates with good accuracy even on an hourly basis [16].
From the related work, machine learning techniques have been widely used for rainfall prediction. In particular, Support Vector Machine (SVM), Naïve Bayes (NB) and Neural Network (NN) are the most widely used in the related work [7], [11], [12], [15], which demonstrates the usefulness of these techniques. Hence, this study will use these techniques together with two additional prediction techniques, Decision Tree and Random Forest, in order to investigate the performance of all five methods for rainfall prediction.

II. MATERIALS AND METHODS
The data source was obtained from the Malaysia Meteorological Department and the Malaysia Drainage and Irrigation Department, spanning from January 2010 until April 2014. The location and description of the data obtained are shown in Table 1. The features in the dataset consist of temperature, relative humidity, flow, rainfall, and water level (Table 2). Table 3 shows the details of the attributes for each feature. The data pre-processing phase aims to prepare the data prior to further analysis. The weather data includes irrelevant data, noise, and incomplete instances. Pre-processing such data plays an essential role in improving the performance of the prediction process [17]. Hence, two tasks were performed for this purpose: cleaning and normalization. The cleaning phase processes missing values, which are represented by the characters '?', '*' or by negative values. Missing values can cause incorrect matches in the prediction process [18]. Table 4 shows a sample of data with missing values. To overcome the missing data, this study used mean imputation to fill in such instances: all observed values of the selected attribute are summed, the summation is divided by the number of samples, and the resulting mean replaces the missing entries.
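The mean-imputation step described above can be sketched as follows. This is an illustrative Python sketch, not the tooling actually used in the study (the experiments were run in Weka); the function name and the treatment of markers are assumptions based on the description of the cleaning phase.

```python
def mean_impute(values):
    """Replace missing entries (non-numeric markers such as '?' or '*',
    or negative values) with the mean of the valid observed entries."""
    cleaned = []
    for v in values:
        try:
            x = float(v)
        except (TypeError, ValueError):
            x = None  # '?' or '*' marker
        if x is None or x < 0:
            cleaned.append(None)  # treat as missing
        else:
            cleaned.append(x)
    observed = [x for x in cleaned if x is not None]
    mean = sum(observed) / len(observed)  # mean of observed values only
    return [mean if x is None else x for x in cleaned]
```

For example, `mean_impute(['1.0', '?', '3.0', '-5'])` treats both the `'?'` marker and the negative reading as missing and fills them with the mean of the remaining values.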
Normalization aims to limit the values to a specific interval. Such an interval facilitates the prediction process, since the values are mapped onto a particular range. Normalization is essential for specific algorithms such as Neural Network and Support Vector Machine [19]. In this study, the interval is set to the range between -1 and 1 based on the following formula [10]:

y = (y_max - y_min) × (x - x_min) / (x_max - x_min) + y_min   (1)

where x is the data value to be normalized, x_max is the maximum value of all the input data, x_min is the minimum value of all the input data, y is the normalized value, y_max is the desired maximum value, and y_min is the desired minimum value. Tables 5 and 6 show the values before and after normalization using formula (1). As shown in Table 6, the data has been normalized, which makes it ready for further processing. After the data has been prepared, rainfall prediction is performed with five techniques: Decision Tree (DT), Naïve Bayes (NB), Support Vector Machine (SVM), Neural Network (NN), and Random Forest (RF). The evaluation of these techniques was performed using 10-fold cross-validation and percentage split.
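Formula (1) is standard min-max normalization and can be sketched in a few lines of Python. This is an illustrative sketch only (the study used Weka); the function name and defaults are assumptions matching the [-1, 1] range used here.

```python
def normalize(data, y_min=-1.0, y_max=1.0):
    """Min-max normalization of data onto [y_min, y_max] per formula (1)."""
    x_min, x_max = min(data), max(data)
    return [(y_max - y_min) * (x - x_min) / (x_max - x_min) + y_min
            for x in data]
```

For example, `normalize([0.0, 5.0, 10.0])` maps the minimum to -1, the midpoint to 0, and the maximum to 1.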
Information Retrieval metrics such as recall, precision and F-measure have been used in this study to evaluate the techniques. Precision evaluates the True Positives (TP), the correctly classified entities, with respect to the False Positives (FP), the incorrectly classified entities. It can be calculated as follows:

Precision = TP / (TP + FP)   (2)

Recall evaluates the True Positives with respect to the False Negatives (FN), the entities that were not classified at all. It can be calculated as follows:

Recall = TP / (TP + FN)   (3)

However, using two values, we often cannot determine whether one algorithm is superior to another. For example, if one algorithm has higher precision but lower recall than another, how can we tell which is better? A solution to this is the F-measure, the harmonic mean of precision and recall, calculated as follows:

F-measure = 2 × Precision × Recall / (Precision + Recall)   (4)
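The three metrics can be computed directly from the raw counts. The following is a minimal illustrative sketch (the function name is an assumption; in the study these values were reported by Weka):

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F-measure (harmonic mean of the
    two) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Note that the F-measure rewards balance: a classifier with precision 0.8 and recall 0.5 gets an F-measure of about 0.615, noticeably below the arithmetic mean of 0.65.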

III. RESULTS AND DISCUSSION
The experiments were performed using Weka 3.7, a suite of Machine Learning software that includes various techniques. Some of the techniques used, such as Naïve Bayes and Decision Tree, are included in the software by default, whereas the others were installed as plugins via the package manager. Note that the experiments were performed using two evaluation approaches: 10-fold cross-validation and percentage split. The techniques are discussed below in terms of performance based on F-measure.
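The two evaluation approaches can be sketched as index-partitioning schemes: percentage split shuffles the data once and divides it by a training fraction, while k-fold cross-validation rotates a held-out fold over the whole dataset. This Python sketch is illustrative only (Weka performs these internally); the function names and the fixed seed are assumptions.

```python
import random

def percentage_split(n, train_frac, seed=0):
    """Shuffle instance indices and split into train/test portions."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]

def k_folds(n, k=10):
    """Yield (train, test) index pairs for k-fold cross-validation;
    each instance appears in the test fold exactly once."""
    idx = list(range(n))
    fold = n // k
    for i in range(k):
        # last fold absorbs the remainder when n is not divisible by k
        test = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[i * fold:]
        test_set = set(test)
        train = [j for j in idx if j not in test_set]
        yield train, test
```

A 30-70 percentage split of 1581 instances, for example, trains on 474 instances and tests on the remaining 1107, whereas 10-fold cross-validation averages ten models that each test on roughly a tenth of the data.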
The best result for Decision Tree was achieved when the data was split at 30% training and 70% testing, obtaining an F-measure of 73.7% (Table 7); therefore, the Decision Tree model at the 30-70 split is considered the best model for this technique. The best result for NB was achieved at 20% training and 80% testing, obtaining an F-measure of 67.3% (Table 8). The best result for SVM was achieved at 20% training and 80% testing, obtaining an F-measure of 67.1% (Table 9); therefore, the 20% training - 80% testing split will be considered the best model for SVM. The best result for NN was achieved at 60% training and 40% testing, obtaining an F-measure of 74.1% (Table 10); therefore, the 60% training - 40% testing split will be considered the best model for Neural Network. The best result for RF was achieved at 30% training and 70% testing, obtaining an F-measure of 71.9% (Table 11); the model from the 30% training - 70% testing split will be considered the best model for RF. For the cross-validation approach, NN outperformed the other techniques by obtaining the highest scores for Precision (72.1%), F-measure (72.5%) and Recall (74.4%). RF outperformed NB and SVM by achieving an F-measure of 70.7% and Precision of 70.1%. Finally, the lowest F-measure value was obtained by SVM (Fig. 1).
Fig. 1 Comparison of the cross-validation approach among the five techniques

On the other hand, in terms of percentage split (Fig. 2), the effect of the splitting approach on performance varies from one technique to another according to the percentage split between training and testing. The differences were affected by the behavior of each technique. The highest F-measure is 74.1%, from NN using 60% training data and 40% testing data. This indicates that NN depends on more training data to ensure a good model; this is the behavior of NN, which needs to adjust its weights to fit the training data, and a smaller training portion may not be enough to obtain optimal weights. Decision Tree comes second with an F-measure of 73.7% using 30% training data and 70% testing data. Decision Tree builds a tree from rules that represent the training data, at each step looking for the feature that carries the most information to split the data.
Random Forest behaves similarly to Decision Tree; however, Random Forest builds an ensemble of trees (i.e., a forest), and the main principle behind ensemble methods is that a group of "weak learners" can come together to form a "strong learner". Random Forest is a combination of separate trees; each tree is a weak learner, but when the trees are ensembled in a Random Forest, the resulting model is a strong learner. With this ensemble strength, Random Forest achieved a 72.3% F-measure with a model trained on 30% of the data and tested on the remaining 70%.
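The ensemble principle described above boils down to a majority vote over the individual trees' predictions. The following is a minimal illustrative sketch of that voting step (not Random Forest itself, which additionally trains each tree on a bootstrap sample with random feature subsets); the function name and class labels are assumptions.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine predictions from an ensemble of weak learners:
    the class predicted by the most trees wins."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For instance, if two of three trees predict "rain" and one predicts "no rain", the ensemble outputs "rain"; individual trees' mistakes are outvoted as long as most trees are right, which is why the combined model is stronger than any single tree.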
The Naïve Bayes classifier assumes that attributes have independent distributions, so it is not sensitive to irrelevant features. Naïve Bayes models also use the method of maximum likelihood, and therefore require only a small amount of training data for prediction. The best Naïve Bayes model, using 20% training data and 80% testing data, scored 67.3% for F-measure. By using kernel functions, SVM is able to learn high-quality decision boundaries that generalize efficiently to test data, so it can find an optimal hyperplane from a small training set. For the rainfall data, the SVM model based on 20% training data and 80% testing data achieved a 67.1% F-measure score.
Comparing the performance of these five techniques for rainfall prediction, Decision Tree and Random Forest are the top performers. This is supported by the fact that, although these models were trained on a small portion of the data, they were able to predict the larger testing portion with the top F-measure scores. In comparison, Support Vector Machine and Naive Bayes were also trained on a small portion of the data and predicted the larger testing portion, but scored lower F-measures. Neural Network is an efficient method, but it needs a large portion of training data in order to predict a very small portion of test data.
All techniques produced low prediction scores, between 63% and 75%. There are three possible reasons for this low performance. First is the size of the dataset used in these experiments: this study used data collected between January 2010 and April 2014 (less than 5 years), whereas previous research [7], [11], [12] used data spanning 10 to 50 years for rainfall prediction. The minimum period (10 years) is more than double the span of the data used in this research. A longer period means more data and produces a more informative model.
Second are the 117 missing values for water flow and water level. Third is the lack of other relevant features, such as wind speed, which were used in previous research. Therefore, to improve the results, a larger dataset covering at least 10 years is needed, with more relevant types of data and better methods to estimate the missing values.
As a final evaluation, we further analyzed the experiments for all classifiers using 10% training data and 90% testing data. Since our data is quite small, we focused on the setting that leaves the largest number of instances for testing the model (Table 12). From Table 12 we can conclude that with small training data (10%) out of 1581 instances in total, Random Forest correctly classified 1043 instances (the highest number of correct instances); therefore Random Forest is at the forefront of the five techniques in this study. This research has successfully accomplished its objectives: five classification techniques (Naïve Bayes, Decision Tree, Support Vector Machine, Neural Network and Random Forest) were applied to Malaysian rainfall prediction. The main objective of this study was to identify the best technique for rainfall prediction. Hence, after applying the five techniques, a comparative analysis was performed in order to determine the most appropriate one. The experimental results showed that for rainfall prediction, Decision Tree and Random Forest perform well because of their ability to train on little data and predict the larger portion of data with a higher F-measure. Support Vector Machine and Naive Bayes also trained on a small portion of data to predict the larger portion, but with lower F-measures. Neural Network is an efficient method, but it needs a large portion of training data to predict a very small portion of testing data. In addition, we can conclude that with small training data (10%) out of 1581 instances, Random Forest correctly classified 1043 instances. This result puts Random Forest at the forefront of the five techniques used.
For future work, the following suggestions can be considered: combining two or more prediction algorithms could enhance the prediction process; using more informative features that better generalize or discriminate the classes would have a significant impact on effectiveness; exploiting rainfall prediction for flood prediction is promising, since there is a direct correlation between rainfall and floods; and using larger datasets and exploring more areas and locations around the world would be valuable.