Social Network Mining (SNM): A Definition of Relation between the Resources and SNA

Social Network Mining (SNM) has become one of the main themes in the big data agenda. A social network can be extracted, as a resultant network, from different sources of information, but because the information sources grow dynamically, a flexible approach is required. Determining the appropriate approach requires data engineering in order to capture the behaviour associated with the data. Each social network has resources and an information source, but the relationship between resources and information sources requires explanation. This paper aims to address the behaviour of the resources as a part of social network analysis (SNA) in the growth of social networks, using statistical calculations to explain the evolutionary mechanisms. To represent the analysis unit of SNA, this paper considers only the degree of a vertex, which is the core of all analysis in SNA and the basis for defining the relation between resources and SNA in SNM. There is a strong effect on the growth of the resources of social networks. In total, the behaviour of the resources has positive effects. Thus, different information sources behave similarly and have relations with SNA.


I. INTRODUCTION
The growing number of aspects of life that involve social networks is not only in line with the popularity of social networking sites such as Facebook, Twitter, VKontakte, QZone, and Odnoklassniki, but is also a result of the growth of information on the Web (as an information source) [1], [2]. In this case, treating the computer network as a social network [3] implies that the Web, as social media containing big data [4], is a big picture of the real world, whose structure we can express through a social network [5], [6]. Currently, one kind of social network is the resultant of social network extraction methods, whereby actors/vertices, relationships/edges, and web/documents are the resources of the social network [6], [7]. A social network, as information about the behaviour of social actors, is important in decision making, as is an information source such as the Web.
In the social sciences, Social Network Analysis (SNA) developed from a conjecture based on anthropologists' observations [8] about relations in face-to-face groups [9], and it is based on mathematical graph theory [10]. Unlike generating a conventional social network, extracting a social network from the Web deals, on the one hand, with everything that changes dynamically [11], i.e. an enormous amount of information about the social actors and clues about the relations among them. On the other hand, each time we find not only new web pages but also new actors. This makes the extraction of social networks more complex and involves increasingly large data, while the extraction depends heavily on the limited services of tools such as search engines [12]. Therefore, extracting social networks from information sources like the Web is based only on samples [7], and it is useful for learning the behaviour of the social network. That is, Social Network Mining (SNM) provides a means of discovering the behaviour of either the information sources or the resources. SNM is not the same as SNA, but it helps to bridge the limited connection of information between them. Thus, this paper aims to express the relation among vertices, edges, and the Web in the growth of social networks.
This paper is based on a conceptual bridge between SNM and SNA. It consists of four sections. Section II presents what is already known, as the material and method. Section III presents the information through cognitive structures and outlines them as a basis for interpreting new findings in the results and discussion. The last section gives the conclusion, on the basis of which we raise an issue for future work.

II. MATERIAL AND METHOD
Social Network Mining (SNM), also described as social network data mining [11], refers to the definition of data mining: a process for automatically discovering useful information from large data repositories such as the Web. SNM techniques are deployed to explore large structures in order to formulate rules that capture useful patterns that may previously have been unknown [7]. One way to obtain social structures, of which a social network is one, from unstructured information is to extract the information (the social network) from the Web [12]. In other words, SNM, or simply network mining [13], is fully a part of knowledge discovery in a social structure: pre-processing transforms the raw input data into a social network, a subsequent analysis process converts the structured data into useful information [14], and post-processing ensures that the results are valid and useful for a decision support system [15]. The term "network" does not have the same meaning in different fields [16]. In a dynamic context, a social network is a collection of information about ties between all pairs of social actors, where a tie connects a pair of actors by one or more relations; thus the network data give a complete picture of the relations in the population [17]. The Web has become the largest text database containing information about social actors, but the Web is characterized by unstructured, insufficient, and incomplete information. Therefore, SNM involves information extraction, i.e. a systematic approach for the study of the indiscernibility of social actors. Hence, social network extraction requires a semantic technology such as the use of co-occurrence to collect clues about the relations between actors [12].
Naturally, a social network can be modelled by a graph $G\langle V,E\rangle$, whereby both SNA and SNM use a similar approach to visualize a network that consists of $V$ as a set of vertices and $E$ as a set of edges [18]. In the statistical literature, researchers use a general model of a social network as a Cartesian product of $n$ actors that generates their relations [19], [20]. The Cartesian product can be represented as an $n \times n$ matrix $M$ [21]: an entry $e_{kl} = m_{kl}$ in $M$ is 1 for all $e_{kl} \in E$, i.e. if the pair of vertices $v_k \in V$ and $v_l \in V$ is adjacent, and $e_{kl} = 0$ otherwise. For example, see Fig. 1. In pre-processing, as a part of the overall process of SNM, the extracted social network (the resultant) is represented as $SN = \langle V,E,A,R,\gamma_1,\gamma_2\rangle$, subject to conditions on the mappings $\gamma_1$ and $\gamma_2$ [22]. Therefore, social networks can grow in a pattern that can be compared with the star graph ($n-1$ edges) and the complete graph ($n(n-1)/2$ edges). In other words, the first condition states that each actor's name refers to an individual vertex, although an actor may literally be represented by different name texts; there are therefore several approaches for ensuring that each actor is represented by one text name, for example by adding keywords [23]. Next, unlike the classic relationship between social actors, a relationship based on an information source is generally expressed as a strength relation. Therefore, the second condition defines a relationship between two actors as consisting of a set of relations [22]. In view of these provisions, the extraction method exploits a search engine to obtain documents related to the participating actors through occurrence and co-occurrence and subsequently calculates the strength relation [19]; we call this the superficial method [12].
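As a minimal sketch of the matrix representation described above, the following Python fragment builds the symmetric adjacency matrix M for a small undirected network, derives the degree of each vertex as a row sum, and computes the star-graph and complete-graph edge counts. The five-actor edge list and all function names are hypothetical and serve only as an illustration; they are not taken from the paper's data.

```python
from typing import List, Tuple


def adjacency_matrix(n: int, edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Build the symmetric n x n matrix M with m_kl = 1 if v_k and v_l are adjacent."""
    m = [[0] * n for _ in range(n)]
    for k, l in edges:
        m[k][l] = 1
        m[l][k] = 1
    return m


def degrees(m: List[List[int]]) -> List[int]:
    """The degree of each vertex is the sum of its row in M."""
    return [sum(row) for row in m]


def star_edges(n: int) -> int:
    """Number of edges of a star graph on n vertices: n - 1."""
    return max(n - 1, 0)


def complete_edges(n: int) -> int:
    """Number of edges of a complete graph on n vertices: n(n - 1)/2."""
    return n * (n - 1) // 2


# Hypothetical network: actor 0 is the seed (centre of the star),
# and actors 1 and 2 share one additional tie.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
M = adjacency_matrix(5, edges)
deg = degrees(M)  # [4, 2, 2, 1, 1]
```

In this toy network the number of edges (5) lies between the star-graph count (4) and the complete-graph count (10), which is the kind of comparison the text refers to.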
Today, information about social actors and their relations is important everywhere in daily life. For any application, every social network, including the extracted social network (the resultant of extraction methods) [24], can be analysed to obtain the behaviour of a social community through measures such as size, density, degree, reachability, distance, diameter, and geodesic distance, which have been described in the SNA literature [25]. SNA is the study of social networks for interpreting the structural relationships between actors; that is, traditional SNA is descriptive. A kind of SNA that requires more sophisticated measurements, i.e. a study of social networks for discovering knowledge from information sources [26], we call SNM; it is predictive [27], using automatic learning and analysis of the content of information, or clustering techniques to identify the relevant content.
SNA, in general, involves only a part of the social network resources (vertices and edges) in a unit of analysis [15], whereas SNM involves all resources (vertices, edges, and information sources) and acts as a bridge between the social network resources and SNA. Information sources include the Web, corpora, and documents. As different units, vertices and edges each have their own characteristics. These characteristics determine not only their own behaviour but also the behaviour of the relations among them [28], [29]. Although the social network is conceptually designed to map observable relationships between actors, SNA marks patterns of ties between actors and presents a variety of social structures according to interest. To define the relation between the resources (seeds, actors, vertices, edges, papers, web pages, and degree of actors), based on Fig. 2, we conduct the following computations, with the degree of vertices as the basis of SNA [25]. First, we collect a set of actors as seeds (a sample) for generating social networks [7], and we conduct a randomness test of the sample against the sorted names. Seeds and papers are data obtained during the pre-processing of SNM. Second, we collect the papers of each seed and extract social networks from the documents by publication year. For each seed, we thus have a timeline of the social network: the growth of the vertices and edges with the growing number of papers. Third, we collect the hit counts for patterns of the actor name or the hit counts for actor names in a query [30], [31], [32]. We use a runs test to check whether the sample is random and can therefore represent the social communities. First, we test the sequence of actor names by academic level; then we test the sequences of the number of papers, the hit counts with quotation marks, and the hit counts without them, as validation that there is a relation between the seeds and the information sources (papers and web pages) [33]. In this case, the hit counts serve as a representation of the big data presented by any search engine [34], [35], [36]. The principle and assumption about the sample is as follows: if the population can converge into big data and can then become a sample [12], then a sample of the actor population is the seed, and a sample of the web page population is an online database such as DBLP. In this case, an online database can, according to its size, be referred to as big data. Therefore, the proposed method, being based on the principle of extraction, can be used for evaluating big data [7].
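The second step above, extracting a seed-based network year by year, can be sketched as follows. This is only an illustrative reading of the procedure, assuming that every pair of co-authors on a paper is joined by one undirected edge (co-occurrence) and that vertices and edges accumulate over time. The function name `growth_timeline` and the paper records are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def growth_timeline(papers: List[Tuple[int, List[str]]]) -> Dict[int, Tuple[int, int]]:
    """Return {year: (cumulative vertices, cumulative edges)} from (year, authors) records."""
    vertices: Set[str] = set()
    edges: Set[Tuple[str, str]] = set()
    timeline: Dict[int, Tuple[int, int]] = {}
    for year, authors in sorted(papers, key=lambda p: p[0]):
        vertices.update(authors)
        # Co-occurrence: each pair of co-authors on a paper yields one edge.
        # Sorting the names makes (a, b) and (b, a) the same edge.
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
        timeline[year] = (len(vertices), len(edges))
    return timeline


# Hypothetical papers of a seed "S"; names and years are illustrative only.
papers = [
    (2001, ["S", "A"]),
    (2002, ["S", "B", "C"]),
    (2002, ["S", "A"]),
    (2004, ["B", "C", "D"]),
]
timeline = growth_timeline(papers)
# {2001: (2, 1), 2002: (4, 4), 2004: (5, 6)}
```

The last record shows how edges between non-seed actors (here B, C, and D) make the network grow beyond the star graph centred on the seed.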
We use multiple regression to model the behaviour of the relations between resources. In multiple regression, with independent variables $x_i$, $i = 1,\dots,n$, and dependent variable $y$, the mean of $y \mid x_1,\dots,x_n$ is given by the linear regression model, i.e.
$$\mu_{Y \mid x_1,\dots,x_n} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$
and the estimated response obtained from the regression equation of a sample is
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_n x_n.$$
We calculate a total relation in which $\beta_1 = \prod_{j=1} \beta_j$ denotes a direct effect and $\prod_{j=2,\dots,n} \beta_j$ denotes the indirect effect. A sample of quantities that consists of $k$ clusters may be measured by $Y = \sum x_i$, $i = 1,\dots,n$.
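To make the regression step concrete, the sketch below estimates the coefficients of a linear model y = b0 + b1 x1 + ... + bn xn by ordinary least squares, solving the normal equations with Gauss-Jordan elimination. The paper does not state which estimation procedure or software was used, so ordinary least squares is an assumption here, and the data are synthetic, generated without noise from known coefficients.

```python
from typing import List


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def fit_multiple_regression(xs: List[List[float]], y: List[float]) -> List[float]:
    """Return [b0, b1, ..., bn] minimising the squared error of y = b0 + sum(b_i * x_i)."""
    design = [[1.0] + list(row) for row in xs]  # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Synthetic data generated from y = 1 + 2*x1 + 3*x2 (no noise, illustration only).
xs = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]
coefficients = fit_multiple_regression(xs, y)  # close to [1.0, 2.0, 3.0]
```

With real data, x1, ..., xn would be resources such as the number of papers or vertices and y a resource such as the number of edges; the signs of the fitted coefficients then indicate positive or negative effects.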
To summarize the internal consistency of the sample behaviour in a single constant, we use α-Cronbach,
$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i} \sigma^2_{x_i}}{\sigma^2_Y}\right),$$
in which $\sigma^2_Y$ is the variance of the observed total score and $\sigma^2_{x_i}$ is the variance of component $i$ for sample $x_i$. The variance is calculated using $\sigma^2_{x_i} = \frac{1}{n}\sum (x_i - \bar{x})^2$, where $\bar{x}$ is the average of $x_i$. The general rule for interpreting α is: (a) α > 0.9: the internal consistency of behaviour is very good; (b) 0.7 ≤ α ≤ 0.9: the internal consistency of behaviour is good; (c) 0.6 ≤ α ≤ 0.7: the internal consistency of behaviour is acceptable; (d) 0.5 ≤ α ≤ 0.6: the internal consistency of behaviour is poor; and (e) α < 0.5: the internal consistency of behaviour is not accepted.
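The internal consistency measure can be sketched in a few lines of Python. The fragment below uses the standard form of Cronbach's α, k/(k-1) multiplied by one minus the ratio of the summed component variances to the variance of the total score, with the population variance (division by n). The function names are hypothetical. Because the interpretation bands share their end points, the sketch assigns a boundary value to the better band, which is an assumption. The usage example takes k = 3 and the two variance totals reported in Section III.

```python
from typing import List


def population_variance(values: List[float]) -> float:
    """Variance with division by n: (1/n) * sum((x - mean)^2)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def cronbach_alpha(k: int, sum_component_variances: float, total_variance: float) -> float:
    """Cronbach's alpha from k, the summed component variances, and the total-score variance."""
    return (k / (k - 1)) * (1.0 - sum_component_variances / total_variance)


def cronbach_alpha_from_scores(components: List[List[float]]) -> float:
    """Cronbach's alpha from k lists of scores, one list per component, aligned by subject."""
    k = len(components)
    totals = [sum(column) for column in zip(*components)]
    item_variances = sum(population_variance(c) for c in components)
    return cronbach_alpha(k, item_variances, population_variance(totals))


def consistency_label(alpha: float) -> str:
    """Map alpha to the interpretation bands; boundary values go to the better band (assumption)."""
    if alpha > 0.9:
        return "very good"
    if alpha >= 0.7:
        return "good"
    if alpha >= 0.6:
        return "acceptable"
    if alpha >= 0.5:
        return "poor"
    return "not accepted"


# Values reported in Section III: k = 3 factors (n(p), "hc", hc).
alpha = cronbach_alpha(3, 11978412919, 20892233528)
label = consistency_label(alpha)  # alpha is about 0.64, so the label is "acceptable"
```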
Consistency means that information is maintained over time. In other words, the processed data can be replaced by similar data and yield similar characteristics. For example, random data can be replaced with other random data in a similar format, but not with non-random data. Randomness is a characteristic of the data and becomes a behaviour of the social actors associated with that data.

III. RESULT AND DISCUSSION
In this experiment, we defined 37 actor names as the seeds to generate other actors and to build social networks. We collected the actor names from the website of the Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia (http://www.ftsm.ukm.my/). The actors are divided into two categories based on academic level (al), namely 13 Professors (pr) and 24 Associate Professors (ap), as shown in Fig. 3. Suppose we take the name of one social actor as a seed, i.e. "Azuraliza Abu Bakar" (AAB). The number of papers (n(p)) written by the actor until 2015 can be obtained from any online database, such as DBLP [6], i.e. n(p) = 69. We structure the scientific papers published each year, from the first year until 2015, into a database, where the growth in the number of actors and the number of edges that form the seed-based social network is as shown in Table 1 (first rows). Every year, for each seed, we can collect the new actors and the relationships between them based on the concept of co-occurrence. We first obtain nes, the number of edges based on the number of actors when the seed is the centre of the social network (star graph) within a complete graph, and then ner, the number of edges actually generated (reality graph) within a complete graph. A comparison between nes and ner reveals the possible existence of sub-networks, i.e. small groups that build new collaborations, if the star graph and the reality graph do not coincide. For example, until 2015 there were 80 vertices for the actor "Azuraliza Abu Bakar", giving the seed a vertex degree of n-1 = 79, but the social network has 313 edges, so that more than one vertex besides the centre vertex has a degree > 1. Thus several leaves of the star graph are interconnected and have grown into branches towards other leaves. In this case, we generate relations between the papers, the social actors, and their relationships based on the seed as follows:
• If the starting value of reality/complete is 1, then the first paper has two authors;
• If the starting value of reality/complete is less than 1, then the first paper has more than two authors;
• If the starting value of reality/complete is 0, then the first paper has a single author (the paper was written independently).
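The edge counts in the AAB example can be checked with a short calculation. The sketch below compares the star-graph count n-1, the complete-graph count n(n-1)/2, and the 313 observed edges for n = 80 vertices. Two points are assumptions rather than statements from the text: that the seed is tied to every other actor (so that all 79 star edges are among the observed edges), and that the ratio of observed edges to complete-graph edges is one possible reading of "reality/complete".

```python
def star_edges(n: int) -> int:
    """Number of edges when the seed is the centre of a star graph on n vertices."""
    return max(n - 1, 0)


def complete_edges(n: int) -> int:
    """Number of edges of the complete graph on n vertices."""
    return n * (n - 1) // 2


n_vertices = 80        # vertices reported for the seed AAB until 2015
observed_edges = 313   # edges reported for the same network

star = star_edges(n_vertices)              # 79
complete = complete_edges(n_vertices)      # 3160
# Edges among non-seed actors, assuming the seed is tied to all 79 others.
extra_edges = observed_edges - star        # 234
# Observed edges over complete-graph edges (one reading of "reality/complete").
ratio = observed_edges / complete          # about 0.099
# Each edge contributes to the degree of two vertices.
average_degree = 2 * observed_edges / n_vertices   # about 7.8
```

Under these assumptions, 234 edges connect actors other than the seed, which is consistent with the observation that leaves of the star graph have become interconnected.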
A value of reality/complete smaller than 1 then indicates the growth of the social network and the continuous dissemination of knowledge. Each actor, as a seed, has published a number of papers as a source of information, while web pages as social media can also reflect the activities of the seed. That is, the hit counts returned by a search engine in response to a query submitted with or without quotation marks, i.e. "hc"("Azuraliza Abu Bakar") = 4,740 hits and hc(Azuraliza Abu Bakar) = 5,480 hits, respectively, give a social indication of the actor's name. The same applies to the social actor "Shahrul Azman Mohd Noah" (SAMN); see Table 1, second column. The growth behaviour of each seed-based social network shows a positive direct effect $\beta_{11}$ of papers on edges, but some seeds produce negative direct effects $\beta_{11}$, and likewise negative indirect effects $\beta_1\beta_{12}$; a negative direct effect for a seed means that the influence of the seed on the growth of other actors is diminishing. For all seeds of social networks, the influence of each factor on the growth of the social networks, as shown in Fig. 4, can be considered similar. Therefore, if the hit counts can represent a social actor and the papers of a seed can generate a social network, then the hit counts can also generate a social network with the same behaviour.
To examine behaviour based on data, or to use data to describe behaviour, the characteristics of the data must be disclosed. Therefore, the corresponding data are expressed through data engineering. To carry out data engineering in accordance with the needs of research related to big data, we must first review the basic behaviour of the population, namely the randomness of the sample: a test is used to see whether the sample was taken at random, so that it can be representative of the population. The sequence of actor names in alphabetical order (Fig. 3) has 18 runs (r = 18) with respect to the academic levels (al) pr and ap. With $n_{pr} = 13$ and $n_{ap} = 24$, the distribution of r approaches a normal distribution Z. The hypotheses are $H_0$: the order of pr and ap in the sequence is random; $H_1$: the order of pr and ap in the sequence is not random.
The hypothesis serves as a temporary answer to the issue and remains a presumption, because it still has to be verified. It thus not only encourages the emergence of the concept of the relationship between SNA and SNM, but also serves as a framework for drawing conclusions in research on social networks. In this case, the mean is $\mu_r = 17.87$ and the standard deviation is $\sigma_r = 2.73$. Using α = 0.05, $Z_{count} = 0.05 < Z_{0.025} = 1.96$. So $H_0$ is accepted, i.e. the sequence of actors as seeds (the sample) is random at a confidence level of 95 percent. For n(p), the number of papers in DBLP, we have $Z_{count} = 0.51 < Z_{0.025} = 1.96$, so $H_0$ is accepted. For "hc" and hc, the calculated values of $Z_{count}$ are negative, but $H_0$ is still accepted because $Z_{count} = -0.83 > -Z_{0.025} = -1.96$ and $Z_{count} = -0.16 > -Z_{0.025} = -1.96$; see Table 2. Thus, the sample is random with respect to all factors.
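The randomness test can be reproduced with the usual normal approximation of the runs test, in which the number of runs r has mean 2·n1·n2/(n1+n2) + 1 and variance 2·n1·n2·(2·n1·n2 - n1 - n2)/((n1+n2)²(n1+n2-1)). The paper does not print these formulas, so their use here is an assumption, although they agree with the reported figures to within rounding. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def count_runs(sequence: Sequence) -> int:
    """Count maximal blocks of equal consecutive symbols, e.g. 'ppaap' has 3 runs."""
    return sum(1 for i, s in enumerate(sequence) if i == 0 or s != sequence[i - 1])


def runs_test(n1: int, n2: int, runs: int) -> Tuple[float, float, float]:
    """Return (mean, standard deviation, z) under the normal approximation of the runs test."""
    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    variance = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1))
    sd = math.sqrt(variance)
    return mean, sd, (runs - mean) / sd


# Academic levels: 13 professors, 24 associate professors, 18 runs.
mean_al, sd_al, z_al = runs_test(13, 24, 18)   # z is about 0.05

# Above/below the median: 19 and 18 values; 21, 17, and 19 runs for n(p), "hc", and hc.
z_np = runs_test(19, 18, 21)[2]   # about 0.51
z_hcq = runs_test(19, 18, 17)[2]  # about -0.83
z_hc = runs_test(19, 18, 19)[2]   # about -0.16

# Two-sided test at the 5 percent level: H0 is kept when |z| < 1.96.
all_random = all(abs(z) < 1.96 for z in (z_al, z_np, z_hcq, z_hc))
```

All four z values fall inside the interval from -1.96 to 1.96, so the null hypothesis of randomness is retained in each case.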
As one of the factors, the scientific papers of the actors (acting as seeds) play a role in the growth of vertices and edges in social networks, so that each social network has its own behaviour. At the time of the experiments, we had 37 values of n(p), 37 of "hc", and 37 of hc as three collections of factors; thus k = 3. For the 37 seeds, we have Σn(p) = 781, Σ"hc" = 540692, and Σhc = 1468210, with $\sum_{i=1,\dots,37} \sigma^2_{x_i} = 11978412919$ and $\sigma^2_Y = 20892233528$. Therefore, the three factors are reliable for representing each other, because α-Cronbach is 0.64 (acceptable). In this case, the medians of n(p), "hc", and hc are 14, 2800, and 8520, respectively. Thus the number of values greater than the median is 19 for each of n(p), "hc", and hc, and the number of values less than the median is 18 for each, while along the sequence of actor names we have 21 transitions for n(p), 17 for "hc", and 19 for hc. Thus $-Z_{0.025} = -1.96 < Z_{count} = 0.51;\ -0.83;\ -0.16 < Z_{0.025} = 1.96$ for n(p), "hc", and hc, respectively; that is, the behaviour of the sample is random for the three factors.
In SNA, the degree of a vertex is a major category in determining the role of an actor in a community based on social networks; one such role is the centre of a research group. Thus, each seed in a social network has a degree greater than 1, and each actor connects with others by more than one edge. It is therefore possible that an actor, as a seed, first builds a social network in the form of a star graph, i.e. as the centre of other actors, and later becomes the centre of a research group. The behaviour of resources in a social network can be predicted using multiple regression. The resources are related to one another, while the information sources, according to the internal consistency test, also represent each other; the role of an actor is also represented by hit counts: each web page may contain more than one actor name, which forms co-occurrences and then relations for generating edges, the unit of analysis in SNA. Thus, there is more than one relation between resources and SNA.

IV. CONCLUSION
In this study, we have presented an analysis for finding the behaviour of the resources of a social network. The resources of social networks (actor/vertex, relation/edge, Web/document, or connection/path) behave differently with respect to the growth of seed-based social networks. Analysis with statistical computation produces a relation between resources and SNA (based on the degree of vertices). More than one relation among actor, vertex,

Fig. 1 Some of the relationships among five social actors

Fig. 1(a) and matrix $M_{5(a)}$ describe the relationships among five social actors in their separate activities. Fig. 1(b) and matrix $M_{5(b)}$ reveal that the five social actors share at least one paper. Fig. 1(c) and matrix $M_{5(c)}$ show the author relationship for the five social actors, while Fig. 1(d) and matrix $M_{5(d)}$ mark only the relationship between a supervisor and a student. Fig. 1(e) and matrix $M_{5(e)}$ reveal that some of the authors cite one another. Lastly, Fig. 1(f) and

n} is a set of social actors, and (b) $\gamma_2 : R \to E$ or $e_j = \gamma_2(r_s(a_k, a_l))$, where $R = \{r_s \mid s = 1,\dots,p\}$ is a set of relations and $a_k, a_l \in A$.

Fig. 2 An approach to defining the relation between resources and SNA

Fig. 3 List of seeds as a sample

Fig. 4 Network of direct and indirect effects for four factors (resources)

TABLE IV β FOR EQUATION (2) FOR DATA IN TABLE 3