Clustered Approach for Clone Detection in Social Media

With the popularity of Online Social Networks (OSN), the number of different types of digital attacks has been increased. Identity Clone Attack (ICA) is one of the leading among them that illegally uses a genuine user's information by duplicating it in another fake profile. These attacks severely affect an identity since another malicious profile can misuse it. Hence, these clone profiles need to be identified and removed to increase the protection of users. This research introduces a novel approach to detect clone profiles on Facebook by using a clustering technique on its profile attributes and network connections. The detection process included three main stages: filter by name, cluster using weighted categorical attributes, and measure the strength of friend relationships among profiles, which follow one after another. Finally, the list of possible clones with their percentages representing the amount of duplicability to a given victim profile was presented as the model's output. With the Agglomerative hierarchical clustering algorithm and Jaccard similarity measurement, a low-average within-cluster distance in cluster density performance and a precision of 88.75% was shown in the results. Instead of suggesting the exact clones, the duplicability percentages make this approach more practical since there are many similar profiles but not clones. This methodology increases the model's adjustability to any other dataset as the selection of weights, thresholds, and clustering algorithm is done based on considering the distribution of the dataset. Keywords— Clone profile; clustering; identity clone attack; network similarity; profile similarity. Manuscript received 13 Jul. 2019; revised 30 Aug. 2020; accepted 10 Dec. 2020. Date of publication 28 Feb. 2021. IJASEIT is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.


I. INTRODUCTION
Recently Online Social Networks (OSNs) have become a significant part of people living where 2.46 billion of the global population is using it and is expected to reach around 2.95 billion in 2020 [1]. In addition to the many benefits facilitated by these networks, there are some protection and security risks caused by hidden identities called fake profiles. According to statistical estimations, Facebook has 81million fake accounts, whereas 5 percent of Twitter accounts are forged [2].
Identity theft or Identity Clone Attack (ICA) is one of the most popular attacks in OSN, and it is performed by profile cloning. Profile cloning is a way of stealing information from an existing user and creating new similar fake profiles using those details. Cloning a profile in OSN can be done with several intentions: to trick users, abuse financially, damage a person's reputation, and steal sensitive data of others [3].
When cloning profiles in OSN, the adversary first creates a fake profile using the victim profile's publicly available attribute information. A social network platform profile has a name, most probably a first and last name with another set of attributes such as birthday, hometown, and school to represent its identity. In profile cloning, most of the victim profile attributes will be copied by the clone profile. Usually, the name is the main feature both clone and the victim should have in common [4]. However, some of the attributes will not be copied by the same value; instead, some will be kept as empty or private. This is because clones can duplicate victim's features and maintain their privacy setting by making some of the attributes private. Also, an adversary can make some attributes public in which the corresponding victim had set to private such as birthday, where most of the users try to keep it private. According to the study [4], these activities may make the faked identity more realistic.
Typically, after cloning a profile, it will send friend requests to friends of the victim. Since the clone profile looks more like the genuine profile, friends of the victim will tend to accept a friend request from the clone without noticing that it is a duplicate profile of their friend [3], [4]. Hence, the adversary gets the chance to publish misleading content to the victim's friend audience using a clone profile to damage his good profile. Also, there can be some other problems caused due to the exposition of the victim's friends' private data to the adversary.
Before adding a friend to the network, a cautious user will first look for his friend list to check whether that user already exists or not. In that case, adding a considerable number of friends of the victim may not be manageable. Hence, the adversary tries to add the victim's recommended friends so that the clone becomes more genuine and makes it difficult for the victim to add those recommended friends [4]. The OSN platform usually generates the recommended friend list. They are the list of people who are not yet friends of the victim but having similar backgrounds or mutual friends between them.
As mentioned above, now a clone profile and the genuine profile will be very similar to public attribute values, friend networks, and recommended friend networks. Under these assumptions, the purpose of this study is to introduce a novel detection model that can use similarities in profile attributes and network details to find the possible clones for a given victim. The initial search space will be reduced in a more massive amount by filtering only the profiles with names like the victim's name to increase the efficiency. Next, these filtered profiles will go through clustering based on public attributes and filtering using network similarities. Finally, the suspect profile list will be presented with the amount of duplicability as a percentage. The model was developed for Facebook, which has the highest popularity, the largest number of user-profiles, and the highest number of fake profiles [1], [2].
The rest of the paper is organized as follows. Section II is arranged into four main parts; where the first part provides the details about the previous work done to resolve the problem of clone profile detection. The second and third parts explain the overview of the proposed approach and the dataset's information, respectively. Finally, the detailed model building approach was explained in the latter part of section II. Next, the evaluation of the results with the limitations and future work has been presented in section III, and finally, the overall conclusion has been given in Section IV.

A. Motivation from previous work
The research area for detecting duplicate profiles in online social media networks has evolved recently, and most of the research findings were published after 2010. Since the research approaches are different depending on different OSNs, selecting the most suitable platform was the first most crucial step. Many researchers have addressed the problem of detecting identity clone attacks in a single platform, and the most common platform selections were Facebook [5]- [7], Twitter, Google+ [8], and LinkedIn. Moreover, some authors have used multiple platforms such as Google+ and Twitter and Facebook as their social environments [9], [10].
Different OSNs provide different types of quantitative and qualitative information such as name, gender, location, education, work, age, number of friends, comments, likes, and friend requests [2], [11]. However, due to different accessibilities for these data, researchers have used many techniques to gather them. The study [12] has used publicly available data, while in the study [7] the author has not used real data set for his implementation. Some researchers [6] have gathered data by creating fake experimental profiles called "Honey Profiles," and that was difficult than gathering data via online tools such as Facebook Graph API [9], [13], and Snapshot tool [6]. Finally, most of the past studies have gained datasets from external sources such as Barracuda Labs [14] and SNAP Library [9].
After gathering data, researchers have started their fake and clone detection process by experimenting with different techniques. The technique of modeling social graphs representing friend network connections was one of the commonly used approaches, and they identified duplicate profiles by analyzing friend patterns [3]. Another study [15] has used a social network of Facebook, and according to user similarities, it was divided into smaller communities. Inside these communities, the strength of the relationships was calculated. A case study [6] was performed to identify the fake nodes by considering network density, degree of nodes, and the correlation between nodes. Some algorithms like [16] have presented a method to detect clone profiles using a graph and network-based approach by analyzing the social network's structural similarity.
According to many researchers, the comparison of profiles based on calculated similarity measurements was a very effective clone detection technique. Some algorithms [17] have directly matched the strings in information fields to measure the similarities between profiles. The approach [18] has introduced a weighted dice similarity measurement to calculate the similarities and rank the selected attributes. An attribute similarity and friend network similarity approach has been discussed in some papers [4], [19]. They have considered three types of friend network features for analysis: friend list recommended friend list and excluded friend list.
Some algorithms have tried to solve this problem of identifying OSN fake clone profiles based on classification approaches. The study [7] proposes a three-step model to match two different profiles from different social media platforms. They have used a binary classifier for feature extraction based on users' information regarding friend requests and friend lists. Some approaches are there to find user profiles that belong to the same user over different social networks [12]. They have generated a similarity vector using a known dataset of paired accounts belongs to the same user across multiple networks. These vectors were then used as the training dataset for supervised classifiers such as KNN, Naïve Bayes, Decision trees, and SVM.
Finally, most of the OSN researchers were unable to validate their results on a real platform, while others [18] have performed result validation through social authentication in which general asking questions from the suspect clones about their profile friends' information. When these suspects are unable to answer the questions, they will be verified as clones. Another way of validation is asking for unique real-world ID from the suspects [15]. Furthermore, the researchers of some studies [20] have got the help of the Facebook security team to validate their findings.

B. The proposed approach
There are three main stages of the detection model. Not all the profiles will go through all these steps, but only the profiles filtered by each step will be forwarded to the next steps. First, the model will input the name of a victim profile who has claimed to find his clone profile/s. Under the assumption that the first step of making duplicate profiles is stealing another genuine identity, all the profiles with the same name as the victim will be filtered and forwarded. These selected set of profiles are referred to as the candidate profile set.
When a fake user wants to forge a genuine user, it is assumed that it will make the profile looks like that user. Hence, most of the public features will be the same in both profiles. Under these circumstances, the same name profiles, including the victim, will be sent to the second detection step, which is the clustering based on their weighted categorical attributes except the name. According to the cluster results, the candidate profiles grouped into the same cluster with the victim will be sent to the next step of detection as the suspect list. The method of selecting the clustering algorithm and the number of clusters will be discussed later in section D-3. By now, the filtered profiles have a higher similarity to the given victim based on their attribute features. Fig. 1 shows the steps of the detection stage.
The next step further verified the duplicability between the victim and each filtered suspect user profile by checking the combined friend and recommended friend network similarity. If this calculated profile similarity value between each victim and suspect is above a predefined threshold, then those profiles were selected as possible clone accounts of the victim, and the amount of duplicability of each suspect profile was given as a percentage. The way of defining the threshold value will be discussed later in section II-D-6 of the paper. For testing the accuracy of the model, an artificially generated profile set was used.  [21] for the present study. The dataset has consisted of 4039 users, 88234 connections between them, and 26 profile attributes. A unique integer ID represented the network users, and an anonymous integer replaced all the attribute values. Only the subset of attributes given in Table 1 was used for further processing, and they were found by the literature [3], [11], [12], [15], [18], [22], [23] as frequently used attributes in detection methodologies. 2) Artificial Clone Set: Due to the difficulty of finding an originally verified clone profile set, this research study modified some of the existing profiles in the dataset as the clone set, which is to be 2% of the original dataset, and it was around 80 profiles. According to the characteristics stated in section I, clone profiles were given the same name as the victim, similar values for many attributes, and few NULL values.
It is known that a clone will not only duplicate a victim's attributes; rather, it will have similar network details due to the addition of the same set of users. Thus, the clones' friend networks were also modified to be similar (not exactly) to the victims' network. Furthermore, the dataset was created so that one genuine user can have one to three corresponding clone profiles.

3) Generate Recommended List:
Recommended friend list for a victim was not randomly selected from their nonfriend profile list. Instead, a set of non-friend users with a higher number of mutual friends were selected as recommended friends of a particular user when they have the same values on attributes such as hometown, location, school, and work employer [11], [24].
D. Model Building 1) Filter by Name: As stated in section I, the name is the key feature that will be the same in duplicate profiles. Hence as the first detection step, the users were filtered by their name, whose first name is like the given victim's name and forwarded to the next detection step. This step is essential to reduce the large population into a smaller one.

1) User Clustering -Attribute Weight Calculation:
Before the second detection phase, which is the clustering, the attributes except the name were assigned with a weight according to the importance of them. Weights reflect the effect of each attribute during the process of detection and decision-making. Previous studies [18] have presented some formulas such as rank exponent, rank order centroid, and rank reciprocal. This study has calculated the weights using a method represented in the study [4], which considered the attribute value distribution in the dataset. For a particular attribute, the similarities between the values of each clone and victim pairs were calculated. Finally, the average of those similarities was taken as the weight of that attribute. In other words, this method of calculating the weights of attributes is more adaptive to any situation since this uses the known clone and victim pairs of any given dataset. Table 2 shows the process of estimating the weights briefly using an example. 2) User Clustering -Clustering Optimization: In this study, the clustering technique was used to separate the profiles with similar attributes into the same groups and dissimilar attributes into different groups. The best number of clusters (K) considering the density performance for several clustering algorithms, namely kMeans, kMedoids and Agglomerative were calculated using the filtered candidate lists of each victim of the dataset. Then the average number of clusters for each clustering algorithm was found as in Table  3. Finally, due to the highest distribution performance shown as in Table 4, the Agglomerative clustering with complete Link Distance and corresponding K value was selected to cluster the profiles using nominal distance.

3) Network Similarity
Calculation-Friend and recommended friend network similarity: Similarity is the measure of how much alike two data objects are. Profile similarity measurement is a value calculated to evaluate whether a given profile can become a clone of another account based on their networks. If the network similarity is higher than a predefined threshold value, then one of the two profiles is said to be cloned. This studied the friend network and recommended network information to calculate the network similarity among two profiles.
Based on the Jaccard similarity measurement, the following Equation 1 can be used to measure the similarity between friend lists (F) of two profiles. The Jaccard was selected among various similarity measurements due to its popularity and applicability in finding similarities of webbased applications and binary vectors [25].  Using Equation 3 the overall network similarity can be calculated by aggregating the network similarities between friend networks of victim and clone (S_ff) and the recommended friend list victim and friend list of clones (S_rf).

4) Network Similarity Calculation -Aggregate Network Similarity:
− The similarity between friend lists of two profiles. − The similarity between the friend list of the clone and recommended friend list of the victim.
The importance of those two network similarities is different [4] where the importance of S_ff is higher than S_rf of the overall aggregate network similarity (S_n). Thus, α > β and they were calculated as α=0.9 and β=0.1 by taking the average S_ff and S_rf between all known clone victim pairs.

5) Network Similarity Calculation -Similarity Threshold
Generation: Threshold similarity value was estimated by considering the aggregate network similarity values between the known clones and victim pairs. Instead of taking an average aggregate network similarity, the minimum among all known pairs was taken as the threshold to avoid losing some of the actual clone profiles without being detected by the threshold. An example was given in below Table 5.

6) Network Similarity Calculation -Decision Making:
The minimum threshold taken by the known pairs was 0.93. All the suspect profiles with a similarity value higher than this value were given a percentage indicating how similar they are to the genuine victim. The application should have a real-time validation mechanism to verify the actual clone profiles.

III. RESULTS AND DISCUSSION
This research study refers to user filtration based on clustering and statistical similarity method. However, the clustering technique was a fully unsupervised learning mechanism; some known labels such as the clones and their victims were used in the result evaluation. This technique was used to evaluate how well the clustering matches the gold standard [26] classes of victims and clones. Nevertheless, this gold standard calculates some statistical values such as similarity threshold, attribute weights, and network weights.

A. Testing and Results
After building and training, the detection model was tested on the unknown dataset. The testing dataset was generated artificially with 3000 profiles using a data generating tool. The following figures show the corresponding outputs of a given victim with ID=48 and clone with ID=2079. From 3000 profiles, 94 were filtered (Table 6) by the name of the victim. Among those 94 profiles, only 15 profiles have been grouped with the victim (Fig. 2) from clustering. Then from these 15, only 4 profiles have been selected as the possible clones (Fig.  3).

B. Model Performance
The following Fig. 4 was taken from the above victimclone pair example, and the differences between the similarity percentage of the actual clone and the other predicted clones are distinguishable. Nevertheless, considering all the victimclone pair examples the precision [ TP/ (TP + FP), where TP -examples selected the actual clone as the possible clones with the highest clone percentage and FP -False Positive, examples did not select the actual clone as the possible clones with the highest clone percentage] was 88.75% and it was considered as a good performance. Moreover, the system's performance depends on the clustering technique in which most of the similar profiles will be filtered out from a large sample. Hence the selection of a suitable clustering algorithm and a similarity or distance measurement is crucial. The density-based cluster performance evaluation was used to evaluate the clustering method's performance, and it gave relatively low average within-cluster distance values for most of the examples where it was -50.446 for the above example.

C. Limitations and Challenges
Most previous studies conducted the experiments on OSN deal with getting a realistic dataset with all the required information. Here, due to the modification of the dataset by adding artificial clones and recommended friend lists, the original dataset's natural patterns were changed. Due to anonymous names replaced by integers, similar names with little changes will not be detected as similar. Further, the inability to validate the model on a real OSN platform by verifying the resulted clone profiles was another main limitation of the work.

D. Future Improvements
As future improvements to this work, the model can be tested with more than ten attributes to identify the relationships between different attributes and clone profiles. Further cross-platform detections where the clones are in a different platform, using string matching algorithms to match the actual text of a name when non anonymized features are given, can improve this work. Also, this research study's main future interest is to build a model to detect the actual person behind this clone who created the clone profile by analyzing the behavioral patterns of profiles in OSN.

IV. CONCLUSION
Due to the popularity of the platform and the simplicity of making profiles, the threat of creating clone profiles has been increased on Facebook. With this attack, users' personal information can be misused and can cause damages to their good reputation. This paper introduces a model with three primary stages to detect these clone profiles on Facebook, wherein at each stage, the amount of computation to be done was reduced by filtering profiles in each of the stages. This method was a simple but more effective method that also showed a higher precision. Furthermore, as most of the calculations are done considering the dataset's distribution, this model can be easily adjusted to a different dataset by only finding values for few parameters.