Opinion Mining Summarization and Automation Process: A Survey

The internet is a powerful source of information. Roughly one-third of the world's population spends a significant amount of time and money online, using it for learning, entertainment, communication, shopping, and more. Users post remarks and views on products, services, and events based on their own experience, and these may be useful to other users. As a result, websites accumulate a huge amount of textual feedback that can be explored, evaluated, and used in decision making. Opinion Mining (OM) is a type of Natural Language Processing (NLP) that extracts the theme or idea of users' opinions in the form of positive, negative, and neutral comments. Researchers therefore try to present this information as summaries that are useful to different users. The research community has generated automatic summaries from the 1950s until now, and these automation processes fall into two categories: abstractive and extractive methods. This paper presents an overview of useful methods in OM and explains OM with regard to summarization and its automation process.

Keywords: opinion mining; natural language processing; automation summaries; summarization; decision making.


I. INTRODUCTION
The concept of data mining has steadily gained popularity in computer science. Data mining applies quantitative (mathematical) methods to mine usage patterns and trends from historical and temporal data. These methods include mathematical equations, algorithms, prime methodologies, traditional logistic regression, and neural networks [1]. Data mining is also the procedure of categorizing large sets of data in order to identify trends and solve problems through data analysis [2]. The four stages of data mining are shown in Fig. 1. Analysis is a method in which a statement is divided into small parts in order to learn how those parts behave and relate to one another [3]. The analysis of human behavior based on textual content is called Opinion Mining (OM). It is a type of Natural Language Processing (NLP) that examines the mood of the public about a product or a service [4], [5]. OM is sometimes called Sentiment Analysis (SA), which involves building a system to collect and categorize opinions about a product or a service. SA is a predecessor to the field of OM and examines how people feel about a given topic (positive, negative, or neutral). SA aims to determine the attitude of a speaker, writer, or other subject toward some topic, or the overall polarity of, or emotional reaction to, a document, interaction, or event.
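To make the positive, negative, and neutral categories concrete, the following is a minimal sketch of polarity assignment by word counting. The word lists and the function name `polarity` are hypothetical and chosen only for illustration; they are not taken from any of the surveyed systems, which use far richer features.

```python
# Tiny illustrative lexicons (assumed for this sketch, not from the survey).
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow"}


def polarity(review):
    """Label a review by counting positive and negative lexicon words."""
    # Lowercase each token and strip surrounding punctuation.
    words = [w.strip(".,!?;:").lower() for w in review.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A review with equal numbers of positive and negative words is labeled neutral under this scheme, which is one reason practical systems go beyond plain counting.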
The purpose of OM is to extract opinions from textual data. Many organizations spend money and resources on finding opinions and on SA. Organizations and individuals alike want to know public opinion about a specific product, service, problem, or event, and this kind of survey is beneficial for research. Data collection has become easier thanks to the many available sources, such as blogs, web forums, discussion platforms, and comment boxes. Information extraction and knowledge discovery are key areas of research. Web data is dynamic in nature, which makes extracting it a difficult task.
Over time, data changes with every new transaction and every use of the web, in every field of application. One significant web application is therefore to collect user opinions from different sources and, after extraction, to present them in a form that is useful as information. Over 75,000 new blogs are created daily, along with approximately 1.2 million new posts. In the modern world, 40% of the population trusts opinions, reviews, and recommendations collected from diverse sources [6].
Automatic recognition plays an important role in processing textual data. Organizations and companies pay close attention to learning what the public demands, and opinions are central to this question. Available sources are therefore used to regularly compile customer reviews of a product or service. From these reviews, companies learn the public's opinion of the good and bad aspects of their product or service. In business intelligence, classifying each opinion according to product features, such as the quality, the order, and the integrity of the product, plays a significant role [7]. Opinion summarization is entirely different from classical text summarization: OM relies on the aspects of a product about which reviewers give positive or negative opinions [8].
OM is used in every field of life, such as government matters, recommendation policies, citation criteria, human-machine interaction, and computer behavior [9]. In the same way, information extraction is a useful process for searching at different levels of exactness and exclusiveness [10]. This paper presents a review of the different methods and approaches used in OM and opinion summarization.
The remainder of this paper is organized as follows. First, the literature review briefly explains text classification. Second, we describe the evaluation process of this research. Finally, we sum up the results in the discussion and conclusion. These parts are discussed in turn.

II. MATERIAL AND METHOD
This section gives a brief overview of OM as discussed by different researchers. OM involves four types of issues, which are shown in Fig. 2. These issues relate to the dimensions of words or sentences, the sources from which researchers obtain data for their experiments, the target of the opinion, and the summary of the data. Many OM techniques have been proposed by different researchers, including multilingual work. Table 1 describes the degree of automation (automatic or semi-automatic) together with supervised, unsupervised, and semi-supervised classifiers, with heterogeneous forms highlighted. Text summarization is divided into two types: single-document and multi-document summarization [22], [23]. Because of the redundancy problem, multi-document summarization is considerably more complicated than single-document summarization. Carbonell reduces redundancy with the Maximal Marginal Relevance (MMR) approach [24]. Other researchers address the same problem with the help of similarity measures [25].
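The idea behind MMR can be illustrated with a short sketch: sentences are chosen greedily, and each candidate is rewarded for relevance to a query and penalized for similarity to sentences already selected. The bag-of-words cosine similarity, the function names, and the default values of `k` and `lam` are assumptions made for this illustration and are not taken from [24].

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mmr_select(sentences, query, k=2, lam=0.7):
    """Greedily pick up to k sentences, trading relevance against redundancy."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    q = Counter(query.lower().split())
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to any sentence already chosen.
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * cosine(vecs[i], q) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

With `lam=1.0` the selection reduces to pure relevance ranking, so near-duplicate sentences can both be chosen; lower values of `lam` push the second choice toward a sentence that adds new content.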
Numerous researchers have worked on multi-document summarization and tried to improve the accuracy of the automatic summarization process [26]-[31]. A variety of approaches solve text summarization problems at different levels, as shown in Fig. 3. These problems are further divided into four main types. Fig. 3 shows the hierarchical structure of text summarization in terms of solution methods, tasks, and generic and specific types. The focus of this research is to review previous work on text classification in its extractive and abstractive forms, studying existing and newly developed approaches together with their datasets and performance measures.
Text classification is divided into two levels: extractive and abstractive. The abstractive level first builds an understanding of the main ideas in the data and then generates an abstract summary of the text, which requires extensive NLP. The extractive level, in contrast, measures the relevance between selected sentences and the original documents, using learning algorithms to train different classifiers. This research focuses on OM approaches and describes the summarization of users' opinions in documents. Such summarization mines ideas from the reviews along with the resulting opinion orientations. Intensity features extract the entities and operate on the dimensions, and with the help of opinion strength, the dimensions yield a summary through ranking [32], [33]. Table 2 shows the evaluation processes that researchers have used for opinion summarization.

Abstractive
For opinion summarization, Carenini [34] presents and compares two traditional techniques. The first, based on MEAD [35], is a sentence extraction method for multi-document summarization; it cannot provide quantitative measurement. The second is a language generation method based on aspect-level sentiment analysis; it performs better and finds the aggregate, and a generator of evaluative arguments then produces the summary [36]. Researchers [37] examined the outcomes of human evaluation of summarizers and presented an improved approach that combines multiple summarizers. Gerani [38] considered discourse structure and presented a method for product reviews. The reviews are first parsed into trees, the trees are then modified so that each leaf holds feature words, and the trees are combined into a directed graph. In that graph, every node represents a feature and every edge represents a relation, and the selection among them is made with the PageRank algorithm. Based on the selected features and relations, a framework generates the summaries.

Extractive
Nishikawa et al. [39] formulate summarization as an optimization problem on a graph. Every node in the graph represents a sentence, and the proposed algorithm builds a route that touches every node. The objective function is the sum of text scores that measure the coherence among sentences based on their features. Extractive approaches work on the source documents and produce the summary directly from the source document. As shown in Fig. 4, there are five categories of extractive approaches, each important for specific tasks. The following sections discuss these categories and how they work.

A. Category-1: Statistical based Approaches
Statistical approaches are language-independent: the units of extraction are the sentences and words of the original document. Because of this independence, they do not require any additional or complex language-specific information. Instead they rely on statistical features such as sentence position, positive and negative keywords, sentence similarity, relative sentence length, the presence of numerical data and proper nouns in the sentence, the bushy path of the sentence, and aggregated similarity. Each sentence is then scored, and the highest-scoring sentences are used to generate a good summary [23], [40]. Fig. 5 shows the steps of the statistical approaches used for extractive text summarization.
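As a rough illustration of statistical scoring, the sketch below combines four of the features named above (term frequency, sentence position, relative length, and the presence of numerical data) into one score per sentence. The stopword list, the equal feature weights, and the function names are assumptions made for this example; the surveyed work [23], [40] uses more features and learned weights.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption of this sketch).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "but"}


def tokenize(text):
    """Lowercase, keep alphanumeric tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def score_sentences(sentences):
    """Return one score per sentence built from simple statistical features."""
    tokens = [tokenize(s) for s in sentences]
    freq = Counter(w for toks in tokens for w in toks)
    longest = max((len(t) for t in tokens), default=0) or 1
    scores = []
    for idx, toks in enumerate(tokens):
        tf = sum(freq[w] for w in toks) / len(toks) if toks else 0.0  # average term frequency
        position = 1.0 / (idx + 1)                                     # earlier sentences score higher
        length = len(toks) / longest                                   # length relative to the longest
        has_number = 1.0 if any(w.isdigit() for w in toks) else 0.0    # numerical data present
        scores.append(tf + position + length + has_number)
    return scores


def summarize(sentences, k=1):
    """Return the k highest-scoring sentences in their original order."""
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

The features are simply added here; in practice they would be normalized and weighted, since the raw term-frequency feature can otherwise dominate the others.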

B. Category-2: Topic-based Approaches
The topic is the subject area of a document, which all of the written material in the manuscript describes. Establishing a topic representation is the central idea of these procedures, which look for themes that appear recurrently [41]. Topic representation is divided into five steps, which are shown in Fig. 6. First, text content is selected as input. Next, topics are extracted from that content, where enhanced features are important for topic signatures. The next step is thematic signature extraction, in which thematic signatures are detected in the content; this helps the researcher design the model easily. Finally, the process produces templates for topic representation. Once the process completes, it yields a fully enhanced topic representation of the text content.

C. Category-3: Graph-based Approaches
In these approaches, sentences or words are represented as nodes, while semantic relations are represented as edges. Fig. 7 shows the multiple levels of graph-based approaches used for extractive text summarization.
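A minimal sketch of this node-and-edge view follows. Sentences are nodes, edge weights are word-overlap cosine similarities, and a PageRank-style iteration assigns each sentence a centrality score. The similarity measure, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration, not a description of any specific surveyed system.

```python
from collections import Counter
from math import sqrt


def similarity(a, b):
    """Cosine similarity of two sentences treated as bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va if t in vb)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration over a similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    # Weighted edges between distinct sentences; no self-loops.
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]  # total outgoing weight of each node
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            # Each neighbor j passes on its score in proportion to the edge weight.
            incoming = sum(scores[j] * w[j][i] / out[j] for j in range(n) if out[j] > 0)
            new.append((1 - damping) / n + damping * incoming)
        scores = new
    return scores
```

Sentences that share vocabulary with many others receive higher scores, while an isolated sentence keeps only the baseline term. This simplified version does not redistribute the score of isolated nodes, so the scores need not sum to one.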

D. Category-4: Discourse-based Approaches
This method determines relationships, in the form of links, between words or sentences. In computational linguistics, Mann and Thompson proposed Rhetorical Structure Theory (RST), in which the input text is first divided into small units called elements and then processed according to two basic rules. The first rule concerns coherent text, which consists of small parts of the text that are connected by links. The second rule is the examination of those links to check the performance of the structure [42]. The basic rules of RST are represented in Fig. 8.

E. Category-5: Machine Learning based Approaches
These approaches apply machine learning and fall into three broad categories: supervised, unsupervised, and semi-supervised. Under these categories, Fattah and Ren used regression [23], Decision Trees (DTs), and Multi-Layer Perceptron (MLP), and later extended their work to Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers [43]. The hierarchy of machine learning approaches is shown in Fig. 9.

Among compressive methods, numerous researchers have focused on extractive text summarization. The primary purpose of these methods is to transform a significant sentence into a shorter one that is grammatically correct and still keeps the main part of the text. This section compares the approaches year by year and then sums up their requirements. Table 3 shows the classification results for extractive text summarization approaches.

Abstractive text summarization generates an abstract-like summary from the source document, expressing the opinions or concepts of the source in a different form. This process requires NLP and is more difficult than the extractive process. Table 4 provides an overview of abstractive text summarization approaches.

Automatic summarization is a necessary step for summary assessment because, besides generating a summary of the input document, the system can also evaluate that summary, subject to factors such as limitation and association. It is a demanding task for a human to recognize, from the original text, whether the information in a summary is correct. Because knowledge changes with the purpose it serves, automatic summary generation is very difficult. Fig. 10 shows the hierarchy of summary generation at the extractive and abstractive levels, together with the arrangement of summary assessment measures.
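One simple summary assessment measure compares a machine-generated summary with a human-written one by counting the words they share. The sketch below computes unigram precision, recall, and F1 in the style of the ROUGE-1 metric. The choice of this particular metric and the function name are assumptions made for illustration; the survey does not prescribe a specific measure.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (precision, recall, F1) of word overlap between two summaries."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared word at most as often as it
    # appears in both summaries.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Recall shows how much of the human summary the system recovered, and precision shows how much of the system output is supported by the human summary. Word overlap does not capture paraphrase, which is a known weakness when abstractive summaries are scored this way.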
A question asked by most researchers is how best to examine the performance of text summarization, and many criteria must be considered in order to choose the best assessment measures. There are two options: extrinsic assessment and intrinsic assessment. Extrinsic assessment determines the quality of a generated summary by its effect on other tasks, such as text categorization, retrieval of important text, and question answering. It is further divided into two parts: relevance assessment and reading comprehension. Intrinsic assessment, on the other hand, checks quality by comparing the machine-generated summary with a human-generated summary.

Text summarization in OM is currently one of the most active areas of research. This paper summarizes researchers' work in OM along with its pros and cons, and it provides a good starting point for new researchers in the field. It gives an overview of extractive and abstractive text summarization approaches and categorizes the advances in the methods. It also presents a detailed comparison of the extractive and abstractive text summarization categories, along with their baseline approaches, datasets, calculation methods, and measures.
Based on this study, text summarization still needs further study and enhancements to current approaches in order to address aspects such as semantics (of words and sentences), changes in the categories of summary generation, linguistic methods, more coherent summary content, and advances in summary assessment processes. Hopefully, this paper will help researchers address those limitations, and more future research can be dedicated to these problems.