Big Data’s Tools for Internet Data Analytics: Modelling of System Dynamics

Abstract: In this paper, an application based on Apache Hadoop is deployed to gather, store and analyze data from the Internet, especially online and social media. Such applications are now a common tool for media analysis. In our case, it is used to assist in the modelling of system dynamics. Several tools are used, covering the file system, data crawling from the Internet, data indexing, data storage, and data analytics. The selection of technology follows the industrial trend. This is not necessarily the best approach, but it offers another perspective on the modelling of system dynamics. A system dynamics model is developed to study the profitability of a telecommunication company and how complaints or negative sentiment impact its profits. Clustering analytics is used to identify the components of the system. As part of continual improvement, the clustering analytics is not a one-time effort; it runs periodically to develop a better model of the system. A sentiment analysis tool provides the input for one of the components, the complaint component. The sentiments are sourced from online and social media. Manual investigation and analysis of Internet data is still required to develop the relations between the components.


I. INTRODUCTION
Three things should be noted in this study. First, Big Data is a concept related to the increasing amount of data (Volume), the diversity of data (Variety) and the growth rate of data (Velocity) [1]. Technologists, business people and academics need to address this issue in order to meet their targets and keep growing without constraint. Second, the Internet as a source of data is a large pool containing a variety of data: structured, semi-structured and unstructured. There are other types of data, such as streaming data, which are not our concern at this moment. In this paper, the focus is on online and social media data, which are not updated as frequently as streaming data. Third, System Dynamics [2]-[6] is a methodology based on system theory that studies the behaviour of the components of a system and their relations. The critical part of system dynamics is designing a model for simulation. A model is an imitation of the real system. The capability to analyse complex Internet data could help in creating a system dynamics model. It is hard, but it is possible [7]. The challenge is how to develop and run analytics tools for Internet data, which is considered Big Data, to assist in identifying the components of the system and their relations [8]. A framework is needed to make the overall process clear. The simplest option is to modify one of the existing system dynamics frameworks and design how it works together with Big Data. A clustering method is used to identify new components. It runs continually and drives improvement at the same time. The approach is still far from perfect, but this paper is intended as a starting step.
The strengths of Big Data, the Internet and system dynamics need to be combined to produce the optimum output. This requires the right tools to perform analytics on large data [8]. As a case study, the tool developed in this research is used to create a model of a simple system. The model concerns the profitability of one of the GSM companies in Indonesia. The questions addressed in this paper are: (1) How do the tools help in identifying the components? (2) How can the sentiments of people recorded in online and social media become the input of one of the components? (3) How can these complaints indirectly impact overall profits? Because the crawled data lacks some information, the designer also analyses data from the company's online annual report to establish the relations between components.

A. System Dynamics as the Method
Here, a process framework based on system dynamics is created for processing Internet data (Big Data) to build a model of a real system before running the simulation. In this paper, the framework is modified from Jay Forrester's [2] (see Figure 1). Other approaches are discussed by some experts as well [6], [9]. Big Data analytics is used to identify the components of the system, the equations or inputs to and from the components, and the relations between the components. The identified components and their relations become the foundation of the model. Designing a model is an art [10]. The subjectivity of the designer depends on his/her own mental construct [11]-[13], and it dominates for every designer. Each designer has a different point of view when creating a model of a system, even when the data is the same. A designer needs a double-loop learning [14] capability to become more objective.
Here, we run the identification process of the model using the framework. The system involved is the profitability system of one of the GSM companies in Indonesia. In step 1, the designer needs some inputs to identify the components of the system and their relations. Input can come from the designer's mental construct or from the opinions of others in other papers [15]-[19]. Another input is Big Data analytics of the crawled Internet data.
Step 2 is created based on data from the Internet, other sources and some assumptions. In our case, the Internet is the main data source. There is no ideal way to formulate the equations. Although the calculation itself is an objective process, it can be full of the designer's subjectivity. This is known as subjective modelling in some papers [20]-[24].
Steps 3 to 6 are the processes of running the simulation with multiple scenarios or cases. Improvement is conducted in every step, and it is not a one-time cycle. It may be repeated several times before a mature model is finalized.
The Big Data process supports step 1, to identify the components of the system and the relations between components, and step 2, to provide the numerical calculations and correlations between components [8]. This can be done using assumptions, logical calculation, estimation and prediction. In some cases, intelligent processes are conducted. Unfortunately, the data and information in Big Data do not always give the designer what is needed to achieve the purpose.

B. Source of Data
In this study, the data is acquired in large amounts from the Internet, sourced from online and social media. Each has its own characteristics. Online media, or digital media, is basically news on the Internet. The data and information are published on the media's official web sites. It is commonly unstructured data [25]. Social media, such as Facebook, Twitter, and YouTube, is very popular at this moment. There are many fake identities and much fake information in social media [26]-[30]. This happens mostly when the issues relate to crime and politics, especially in non-democratic countries. Because Indonesia has an Information Technology regulation that addresses these issues, together with an enforcement commitment from the government, the problem is set aside for now. Fortunately, most of the complaints about telecommunication services [31] really do come from unsatisfied customers, even when a fake identity is used. In this research, the online media are the popular ones in Indonesia. The only social media source is Twitter.

C. Software Application for Data Analytics
Several software applications are required to support the data analytics process:
• Operating System (OS)
• File System
• Application, such as Web Application, Data Crawler Application and Geospatial Application
• Database, such as NoSQL and SQL
• Analytics Application
• Cluster Management (optional)
Subsequently, the data flow diagram is made. It is a common process of online media analytics (see Figure 2). At this moment, a solution based on Apache Hadoop is used.
Eric Pruyt [35] has explained that there are three ways of analysing large data that are relevant to System Dynamics. First, develop smart methods that avoid analysing large data directly. Second, develop filtering, clustering and selection methods to reduce the large data. Third, develop methods to face large data directly. Another possible way is a combination of these. In this paper, it is called the fourth way, and it is the one relevant to our approach.
In this paper, a clustering method is used to identify new components of the system dynamics model. Several web document clustering algorithms, such as Agglomerative Hierarchical Clustering (AHC), Divisive Hierarchical Clustering, K-Means, Suffix Tree Clustering (STC), Semantic Hierarchical Online Clustering (SHOC), DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), STING (Statistical Information Grid) and Lingo, have their own advantages and disadvantages and are well studied in some papers [36], [37].
Before using one of the clustering analytics, the designer first identifies the initial components subjectively. In Figure 3, those components are C1, C2 and C3. Next, cluster analytics, as the Big Data tool, is used hierarchically to identify new components based on the individual components [38]. In phase 1, the cluster analytics is run on all the components.
After running the clustering analytics in phase 1 the first component  In the final process, the designer uses their subjectivity to eliminate the components that are not relevant or less relevant to the case. The components are chosen selectively to build a model. Unfortunately, there is no guarantee that the clustering analytics process will run smoothly with 100% accuracy. If it does not, the designer may add more components to make the model more relevant. This final process is the last phase and provides the input for creating a model. After a certain time, the designer repeats the same process with additional data to identify new components and build a revised model.
Another tool is sentiment analytics, which has been discussed in some papers [39], [40]. In our study, the designer enters into a database words or phrases that carry a negative, neutral or positive connotation. Next, the words and phrases of the crawled data are compared with the database before each comment or news item is classified as negative, neutral or positive. Manual online data analytics is used to provide additional data or information.
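The dictionary-based classification described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the project: the function name `classify`, the scoring rule (more negative than positive matches gives "negative", and vice versa) and the example lexicon entries are all assumptions made for the sketch.

```python
import re

# Hypothetical lexicon entries; in the study these are entered by the designer.
NEGATIVE = {"slow", "no signal", "disconnected"}
POSITIVE = {"fast", "thank you", "stable"}


def count_matches(text: str, lexicon: set[str]) -> int:
    """Count whole-word occurrences of every lexicon word or phrase in text."""
    return sum(
        len(re.findall(r"\b" + re.escape(entry) + r"\b", text))
        for entry in lexicon
    )


def classify(text: str, positive: set[str] = POSITIVE,
             negative: set[str] = NEGATIVE) -> str:
    """Label a comment or news item as negative, neutral or positive.

    Assumed rule: the polarity with more lexicon matches wins; ties
    (including no matches at all) are treated as neutral.
    """
    lowered = text.lower()
    pos = count_matches(lowered, positive)
    neg = count_matches(lowered, negative)
    if neg > pos:
        return "negative"
    if pos > neg:
        return "positive"
    return "neutral"


if __name__ == "__main__":
    for sample in (
        "Internet is slow and there is no signal again",
        "Thank you, the network is fast",
        "New tariff announced today",
    ):
        print(sample, "->", classify(sample))
```

Counting negative and positive items per week with such a function gives the kind of weekly totals that feed the complaint component.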

D. Software Development
The software applications used to collect, store, index and analyse the data from the Internet are listed in Table I. The Linux operating system is chosen in our case. HDFS, the Hadoop file system, is used for the NoSQL database (HBase+Phoenix), and NFS for the SQL database (MySQL). Apache Nutch is used for data crawling and Apache Solr for indexing. Eight servers are used, and their functions are defined in Table II. Figure 4 describes in detail the services on each physical server. For the clustering analytics, Carrot2, based on the Lingo algorithm, is chosen because it is easy to integrate with the overall solution, even though it is not the ideal algorithm [41], [42].

III. RESULTS AND DISCUSSION
Application software is only a tool. The expertise of a system dynamics designer is still required to develop a model. The designer needs his/her own idea or mental construct. In our case, one of the functions of the application software is to identify components of the system. Creating a system dynamics model is not a one-time effort but a continual process. In other papers [43], [44], it is called a system breakdown structure (SBS). The first state of the model is created by the intuition of the designer. It is full of subjectivity. However, the more expert the designer, the more objective the model. Objectivity becomes more dominant as more data is used, and used more intensively. Based on the feedback from some experts [15]-[19], the model of the profitability of the GSM company is simplified as shown in Figure 7. The SBS has not been run at this state.
The model evolves as the designer finds new components and new relations, or modifies the existing components and their relations. In our context, Cluster Analytics, one of the features of the application software, is used to identify the new components and relations. It identifies the words in the online media and clusters them. A few key words are used in this analytics. They relate to the components of model state 1, which are "Profits(1)", "Subscriber(2)" and "Advertisement/Promotion(3)" (see Figure 8 for one of the results, "Pelanggan/Subscriber"). Several components are identified in the cluster analytics of "Subscriber", such as "Cost", "Competitor" and "Cyber Crime". It is a continual process. If "Cost" is then used as the key for the cluster analytics, it gives "Operation Cost", "Internet", "Politics or Policy", "Price" and "Infrastructure Development". The process continues until no new component is identified.
The SBS is run using clustering analytics on the Big Data of online media. The identification of new components is strongly tied to the content of online media, because the input comes from online media. This tool is helpful in identifying new components associated with the system. After collecting several components, the designer needs to use their expertise to select the components that are relevant to the purpose. Next, model state 2 is created from the selected components (see Figure 10). It is recommended to run this clustering analytics periodically to identify new components in the future, as almost all systems are dynamic.
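The repeated expansion described above (cluster on a key word, take each new label as the next key, stop when nothing new appears) can be sketched as a breadth-first traversal. This is only an illustration: `cluster_labels` is a hypothetical stand-in for a call to the clustering tool, and the stub below simply returns the example labels quoted in the text rather than real clustering output.

```python
from collections import deque
from typing import Callable, Iterable

# Stub for the clustering step, filled with the example labels from the text.
# A real run would query the clustering engine on the crawled documents.
EXAMPLE_CLUSTERS = {
    "Subscriber": ["Cost", "Competitor", "Cyber Crime"],
    "Cost": ["Operation Cost", "Internet", "Politics or Policy",
             "Price", "Infrastructure Development"],
}


def example_cluster_labels(key: str) -> list[str]:
    """Hypothetical clustering call: labels found when clustering on `key`."""
    return EXAMPLE_CLUSTERS.get(key, [])


def expand_components(
    seeds: Iterable[str],
    cluster_labels: Callable[[str], list[str]],
) -> tuple[list[str], dict[str, str]]:
    """Expand seed components until no new component is identified.

    Returns the components in discovery order and, for each discovered
    component, the key word whose clustering first produced it.
    """
    components = list(seeds)
    seen = set(components)
    found_from: dict[str, str] = {}
    queue = deque(components)
    while queue:
        key = queue.popleft()
        for label in cluster_labels(key):
            if label not in seen:
                seen.add(label)
                components.append(label)
                found_from[label] = key
                queue.append(label)
    return components, found_from


if __name__ == "__main__":
    seeds = ["Profits", "Subscriber", "Advertisement/Promotion"]
    components, found_from = expand_components(seeds, example_cluster_labels)
    for name in components:
        print(name, "<-", found_from.get(name, "seed"))
```

The output is only a candidate list; as the text stresses, the designer still decides which components are relevant enough to enter the model.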
The model deals with the profit of a communications company and the effect of customer satisfaction on profitability. Another analytic, sentiment analytics, provides the input to one of the components. In Figure 10, the main components, "Profits (1.1)" and "Subscriber (2.1)", are reviewed. More subscribers bring not only more operating income but also more operating cost. The GSM company spends more money on "Advertisement/Promotion (3)" to attract more subscribers and retain existing ones. Customer complaints have a negative impact on the number of subscribers [15]-[19]. Usually, complaints relate to satisfaction with Customer Services, Performance of GSM (Voice) and Performance of 3G / LTE (Data). At this moment, the complaints are put into one component only. To overcome these complaints, the GSM company needs to provide better services, which require more cost. The next step is the creation of the equations, for which more data must be gathered. Several techniques are used; the major one is digging for information on the Internet. It is assumed that only 30% of the services and advertisement/promotion spending is effective, and that US$100 of spending can attract one person to register. Similarly for complaints, the effectiveness is 30% and it will influence 0.01% of existing subscribers to churn. The income is $4 per subscriber and the operating cost is $1 per subscriber. The company uses 10% of the income for services and another 10% for advertisement/promotion. Table III summarizes these assumptions.
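To make these assumptions concrete, the following Python sketch computes one simulation period. It is one possible reading of the parameters, not the model actually simulated in this paper. The assumptions are: all money values are per period; effective spending is 30% of services plus advertisement/promotion spending; every US$100 of effective spending adds one subscriber; and each complaint, weighted by 30% effectiveness, makes 0.01% of existing subscribers churn. The names `Params` and `step` are introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    income_per_sub: float = 4.0           # US$ income per subscriber
    op_cost_per_sub: float = 1.0          # US$ operating cost per subscriber
    services_share: float = 0.10          # share of income spent on services
    promo_share: float = 0.10             # share of income spent on promotion
    spend_effectiveness: float = 0.30     # effective fraction of spending
    spend_per_new_sub: float = 100.0      # US$ of spending per new subscriber
    complaint_effectiveness: float = 0.30
    churn_per_complaint: float = 0.0001   # 0.01% of subscribers (assumed per complaint)


def step(subscribers: float, complaints: float,
         p: Params = Params()) -> tuple[float, float]:
    """Advance one period; return (next subscribers, profit of this period)."""
    income = p.income_per_sub * subscribers
    op_cost = p.op_cost_per_sub * subscribers
    services = p.services_share * income
    promo = p.promo_share * income

    # Assumed acquisition rule: effective spending divided by cost per signup.
    gained = p.spend_effectiveness * (services + promo) / p.spend_per_new_sub

    # Assumed churn rule: churn fraction grows linearly with complaints.
    churn_rate = min(1.0, p.complaint_effectiveness
                     * p.churn_per_complaint * complaints)
    lost = churn_rate * subscribers

    profit = income - op_cost - services - promo
    return subscribers + gained - lost, profit


if __name__ == "__main__":
    subs = 1_000_000.0  # hypothetical starting subscriber base
    for week in range(1, 4):
        subs, profit = step(subs, complaints=100)
        print(week, round(subs), round(profit))
```

Under this reading, the subscriber base grows when complaints are absent and shrinks once weekly complaints are high enough for churn to exceed acquisition, which is the qualitative behaviour the model is meant to explore.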
There are several telecommunication companies in Indonesia, such as Telkom Indonesia (TI), Indosat (I), Telkomsel (T), Excelcomindo/Axis (E), Hutchison Telecom (HT) and Bolt (B). (Note: Bolt operates only in the 4G/LTE business.) At this moment, comparing companies is not our main focus, so one of the companies is picked for this study. Figures 11 and 12 show the sentiment analytics from online media and social media for one of the telecommunication companies. Table IV is created from the graphs in Figures 11 and 12. It shows the total complaint as the difference between total negative and positive comments, summarized for every week (the last 15 weeks).

TABLE IV. SENTIMENT TABLE OF ONE OF THE GSM COMPANIES ("TELKOMSEL")
Week | Positive | Negative | Positive | Negative | Total complaint
...  | ...      | ...      | ...      | 321      | 276
9    | 0        | 38       | 85       | 124      | 77
10   | 4        | 41       | 311      | 1156     | 882
11   | 29       | 47       | 306      | 1284     | 996
12   | 0        | 4        | 97       | 104      | 11
13   | 1        | 2        | 220      | 539      | 320
14   | 0        | 3        | 0        | 0        | 3
15   | 0        | 0        | 21       | 61       | 40

Because the simulation spans 100 weeks, Table III is reused. There are three scenarios in our simulation of the growth of subscribers. First, the complaints are not resolved and recur periodically. Second, the complaints drop by 90% after 50 weeks. Third, there are no complaints after 50 weeks. The results are shown in Figures 13-16. Based on the simulation, the total number of subscribers drops if the complaints are not managed. This has a negative impact on the income of the company. The operating cost drops as well, but not by as much as the income; the model assumes that fewer subscribers mean less operating cost. Overall, complaints have an impact on the profits of the company (see Fig 6). The faster the complaints drop, the better the profits.
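The weekly totals of Table IV and the three complaint scenarios can be reproduced with a short Python sketch. Two assumptions are made here. First, each table row is read as a week number followed by two pairs of (positive, negative) counts, a reading inferred from the fact that it reproduces the printed totals for weeks 9 to 15. Second, the 100-week complaint series is built by cycling through those seven weeks, which is an illustrative choice and not necessarily how the simulation input was prepared. The names `total_complaint` and `complaint_series` are hypothetical.

```python
# Rows for weeks 9-15 of Table IV: (week, positive, negative, positive, negative).
ROWS = [
    (9, 0, 38, 85, 124),
    (10, 4, 41, 311, 1156),
    (11, 29, 47, 306, 1284),
    (12, 0, 4, 97, 104),
    (13, 1, 2, 220, 539),
    (14, 0, 3, 0, 0),
    (15, 0, 0, 21, 61),
]


def total_complaint(pos_a: int, neg_a: int, pos_b: int, neg_b: int) -> int:
    """Total complaint = total negative comments minus total positive comments."""
    return (neg_a + neg_b) - (pos_a + pos_b)


def complaint_series(weekly: list[int], weeks: int = 100,
                     change_week: int = 50, factor: float = 1.0) -> list[float]:
    """Repeat the weekly totals for `weeks` periods.

    From `change_week` onward (counting from 0) the complaints are scaled
    by `factor`: 1.0 = unresolved, 0.1 = dropped by 90%, 0.0 = no complaints.
    """
    return [
        weekly[w % len(weekly)] * (factor if w >= change_week else 1.0)
        for w in range(weeks)
    ]


if __name__ == "__main__":
    totals = [total_complaint(*row[1:]) for row in ROWS]
    print(totals)

    scenarios = {
        "unresolved": complaint_series(totals, factor=1.0),
        "drop 90% after week 50": complaint_series(totals, factor=0.1),
        "none after week 50": complaint_series(totals, factor=0.0),
    }
    for name, series in scenarios.items():
        print(name, sum(series))
```

Each scenario series can then be fed, week by week, into the complaint component of the model.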

IV. CONCLUSIONS
Big Data, the Internet and System Dynamics are interesting topics, and combining them well brings greater benefit in understanding systems. Clustering Analytics, as a tool based on Big Data, helps in identifying the components of the system in order to assist the development of a system dynamics model representing a complex system. It can be regarded as the process of breaking down the system to identify all of its components, which is a good start to the modelling process. The modelling process is critical in the system dynamics method. In addition, Sentiment Analytics, another tool applied to Big Data, captures data to be used as input to one of the components in the system dynamics model. In this paper, the component related to Sentiment Analytics is "Complaint". Both of these analytics tools exhibit the collaboration between Big Data and System Dynamics, where the analytics tools assist in identifying the components and the input data for the model. Manual data analytics is still needed to complement these analytics tools in the model development process, especially in identifying the equations or interrelations between components. The simulation run of the completed system dynamics model reveals that good handling of customer complaints will increase profits.
Future enhancement of the work presented in this paper would be beneficial in producing better analytics and in improving the simulation results produced from the system dynamics model. First, improving the analytics capability and combining it with more intelligent tools can increase accuracy, and multiple sources of data will enrich the analytic capability. Second, complaints have several criteria that are not discussed in this paper. They should be differentiated into several categories based on the action performed by the subscriber, such as: complaint to the firm, complaint to a government agency, taking legal action, warning family and friends about the firm, and boycotting the product [45].

ACKNOWLEDGMENT
We would like to thank PT. ODP for permission to use the Big Data tools, for which the author serves as project manager.