Augmented Session Similarity Based Framework for Measuring Web User Concern from Web Server Logs

In this paper, an augmented session similarity based framework is proposed to measure web user concern from web server logs. The framework uses the best user session similarity between any two sessions, based on the relevance of the pages accessed in a session and the syntactic structure of web URLs. The framework is tested using the K-medoids clustering algorithm with independent and combined similarity measures. The quality of the generated clusters is evaluated by measuring the average intra-cluster and inter-cluster distances. The results demonstrate the superiority of the combined augmented session dissimilarity metric over the independent augmented session dissimilarity measures with respect to these cluster validity measures.

Keywords: augmented web user sessions; user concerns; page relevance; syntactic structure; page stay time; the frequency of page; dissimilarity metric


I. INTRODUCTION
With the rapid growth and continued popularity of the World Wide Web, most institutions now put their data on the web and offer online services [1]. Many users therefore rely on the Web to seek information and gain knowledge by navigating websites. This navigation leaves an enormous amount of information about users' surfing behavior on the hosting web server in the form of weblogs [2]. A challenging aspect of web usage mining is to know in advance a user's habits, interests, and expectations [3] and to deduce this knowledge from logs. This knowledge can be revealed by applying different web usage mining approaches to the weblog repositories, which record the past surfing behavior of users [4], [5]. Clustering has proved very effective in grouping users with similar browsing activities, navigational patterns, and access behavior [6], [7], [8]. However, clustering results depend heavily on the similarity metric used to capture user concerns. Normally, a more precise similarity metric leads to higher-quality clusters of web user sessions, and vice versa [9]. This paper discusses a combined dissimilarity based model to estimate a user's concern from augmented web user sessions. The dissimilarity metric is based on the relevance of the accessed pages and the syntactic structure of page URLs. Page relevance is estimated using the harmonic mean of the duration and frequency of the page, while syntactic similarity is derived from the syntactic structure of URLs.

Effective clustering of web user sessions requires a precise definition of the similarity metric between user sessions. A range of similarity measures has been reported in the literature to capture the similarity among users based on their access behavior. In [10], visitor behavior is combined with visited page content to establish similarity between different page sequences. In [11], the common paths between two sessions are divided into the inward part and the external part of a generalized web session, and a new similarity measure is derived from users' navigation patterns to identify the common paths. In [12], the uncertainty in users' navigation patterns was handled using a belief function based similarity measure and Dempster-Shafer's theory of mass distribution for clustering. In [13], an enhanced similarity based user session clustering algorithm was presented to reduce the time and space complexity of conventional k-means [14] and the ROCK algorithm [15]. A sequence alignment based similarity method and associated measures were used to cluster web user sessions in [16]. In [9], a comparative study of the effectiveness of the most popular similarity measures, including the cosine, Jaccard, and Pearson coefficients, was performed; this work was further extended by [17] and [18]. In [19], [20], the user interest in a page was defined by implicit measures; this was further extended by [21], which proposed implicit measures such as the duration and frequency of a page for assessing a user's interest in a particular accessed page. In [22], the author used an abstract similarity graph and the relative time spent on the longest common subsequence for clustering. In [23], the hierarchical structure of uniform resource locators (URLs) on the site was used to introduce the concept of sequence alignment and clustering. In [24], the relationship due to the concept hierarchy of web pages is quantified and incorporated into a similarity measure using a binary vector representation of web user sessions, assuming an equal degree of user concern. The exact viewing time of visited pages and the URLs of the pages are used to find the similarity between users in [13]. In [25], a website concept hierarchy based similarity scoring system was introduced and further integrated with other similarity measures such as page stay time and page visiting sequence. In [26], the concept of intuitive augmented session similarity is discussed, and its effectiveness was tested in [27] and [28].

The rest of this paper is arranged as follows. The materials and methods used in the present work are discussed in Section II. In Section III, experiments are performed with the combined similarity approach and with the individual measures, and the results are discussed in detail. Lastly, Section IV concludes the study with some suggested future work.

II. MATERIAL AND METHOD
The methodology adopted to carry out the present work is described in the following subsections.

A. Pre-Processing of Web Server Logs
The explicit or implicit web access behavior of a user is recorded in weblogs. Weblogs maintain user information in the form of standard fields. The exact sequence and number of fields vary from one log format to another. However, essential information, including the address and login name of the remote host, the name of the user, the request time, the method, the full path on the server, the protocol used, the status code, the size of the accessed data, and the agent used to access the information, is present in every format [29]. Not all recorded information is useful: entries made by automated software agents [30], default entries for embedded objects, unsuccessful requests, and entries for non-human user methods are of no use here. First, we remove these entries from the log files. Second, user sessions are extracted from the log entries. A user session is the accumulated activity of a user on a web server. Whenever a new IP address appears in the log file, a new session is created, and subsequent requests from the same IP address are appended to that session as long as they fall within a predefined elapsed time. In this work, an elapsed time of 30 minutes is used. Otherwise, the current session is closed and a new session is started. Multiple sessions are possible for any web user because a user may visit several times, with a long time elapsing between consecutive visits [30], [31].
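The session extraction step described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes the log has already been cleaned and parsed into `(ip, timestamp, url)` tuples, and it interprets the 30-minute elapsed time as the maximum gap between two consecutive requests from the same IP address. The function name `sessionize` and the record layout are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # threshold stated in the text


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group cleaned (ip, timestamp, url) records into per-IP sessions.

    Assumption: a session is closed when the gap between two consecutive
    requests from the same IP address exceeds `timeout`.
    """
    sessions = []      # all sessions, in order of creation
    open_session = {}  # ip -> index in `sessions` of its current session
    last_seen = {}     # ip -> timestamp of the latest request from that ip
    for ip, ts, url in sorted(requests, key=lambda r: r[1]):
        if ip not in open_session or ts - last_seen[ip] > timeout:
            # New IP address, or the elapsed time was exceeded: start a new session.
            sessions.append({"ip": ip, "requests": []})
            open_session[ip] = len(sessions) - 1
        sessions[open_session[ip]]["requests"].append((ts, url))
        last_seen[ip] = ts
    return sessions
```

With this reading, two requests from one IP address that are 50 minutes apart end up in two different sessions, which matches the statement that one user may produce multiple sessions.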

B. Session Representation in Vector Space Model
For a particular website, we assume that the usage sessions $S = \{S_1, S_2, \ldots, S_m\}$ have been identified and that $n$ different URLs (pages) $P = \{P_1, P_2, \ldots, P_n\}$ are accessed in some time interval. Each user session may then be represented as a vector, as shown in Eq. (1) and Eq. (2):

$$S_i = \left(S_i^1, S_i^2, \ldots, S_i^n\right) \quad (1)$$

where every $S_i^k$ corresponds to the harmonic mean of the frequency of the page within session $S_i$ and the duration of the page in session $S_i$, derived from the quantities recorded for that page:

$$S_i^k \leftarrow \{\text{page frequency},\ \text{page duration (in seconds)},\ \text{page size (in bytes)}\} \quad (2)$$

C. Computation of Scale of Web User Concern for a Web Page
In [20], the concept of an implicit measure of user interest in a page was introduced; it was further extended by [21], which proposed page stay time and page access frequency to measure the user concern for a page. The following metrics are used to compute the relevance of a page in any user session. This relevance is then used to measure the web user concern for a web page.

1) Duration of Page (DoP):
The time spent by a user on a page is known as the duration of the page. It is measured by the time difference between two consecutive requests for web pages in the session. A higher page stay time implies more concern of the user for the page. However, the small size of a web page may sometimes lead to a swift transition to another page. Therefore, the time spent on the page is normalized by the page size. The result is then normalized again by the maximum page stay time in that session. Eq. (3) is used to measure $\mathrm{DoP}(P_i)$ in session $S_k$. For the last accessed page, however, it is not feasible to compute the difference of request times. Therefore, the average duration of the relevant session may be used.
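A minimal sketch of the duration computation is given below. Since Eq. (3) is not reproduced in this text, the exact form is an assumption: the stay time of each page is divided by the page size, summed per page, and the result is divided by the largest such value in the session so that DoP lies in [0, 1]. The function name `duration_of_page` and the `(page, stay_seconds, size_bytes)` record layout are hypothetical; a stay time of `None` marks the last accessed page.

```python
from collections import defaultdict


def duration_of_page(visits):
    """Return {page: DoP} for one session.

    visits: list of (page, stay_seconds or None, size_bytes) in access order;
    page sizes are assumed to be positive. A missing stay time (last page)
    is replaced by the average known stay time of the session.
    """
    known = [stay for _, stay, _ in visits if stay is not None]
    mean_stay = sum(known) / len(known) if known else 0.0

    per_byte = defaultdict(float)  # page -> accumulated seconds per byte
    for page, stay, size in visits:
        stay = mean_stay if stay is None else stay
        per_byte[page] += stay / size

    top = max(per_byte.values(), default=0.0)
    # Normalize by the session maximum (assumed reading of Eq. (3)).
    return {page: (v / top if top > 0 else 0.0) for page, v in per_byte.items()}
```

For example, a session with stays of 30 s and 10 s on two 1000-byte pages and an unknown stay on a final 2000-byte page gives the first page a DoP of 1 and the other two a DoP of about 0.33.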

2) Frequency of Page (FoP):
The frequency of a page is the count of visits to page $P_i$ in a session. A high value of this count indicates more concern of the user for the page. The frequency of a page is divided by the accumulated frequency in the session. Eq. (4) is used to determine $\mathrm{FoP}(P_i)$ in user session $S_k$, where $0 \le \mathrm{FoP}(P_i) \le 1$. These two metrics are consolidated to measure the page relevance for a user. The harmonic mean of $\mathrm{DoP}(P_i)$ and $\mathrm{FoP}(P_i)$ is computed for estimating user concern because it moderates the impact of large and small outliers [6]. After applying Eq. (1) and (2) to the pre-processed log, intermediate results are generated as shown in Table 1.

3) The Relevance of the Page (RoP):
From Table 1, the page relevance is computed as the harmonic mean of DoP and FoP. Eq. (5) is used to calculate the relevance of a page $P_i$ in user session $S_k$:

$$\mathrm{RoP}(P_i) = \frac{2 \times \mathrm{DoP}(P_i) \times \mathrm{FoP}(P_i)}{\mathrm{DoP}(P_i) + \mathrm{FoP}(P_i)} \quad (5)$$

where $0 \le \mathrm{RoP}(P_i) \le 1$. Eq. (5) shows that high values of both DoP and FoP lead to higher user concern for a particular page in a session.

4) Augmented Web User Sessions:
First, we compute the page relevance matrix $RM_{m \times n}$ of Eq. (6) using Eqs. (3) to (5). The relevance of each page in every session is read from this relevance matrix. A high relevance value suggests more user concern for the page.
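The frequency, relevance, and relevance-matrix steps can be sketched as follows. The harmonic mean follows Eq. (5); the frequency normalization follows the description in the text (count divided by the accumulated frequency in the session). The function names, the dictionary-based inputs, and the choice of 0 for pages that a session never visited are assumptions made for illustration.

```python
from collections import Counter


def frequency_of_page(pages):
    """FoP: visits to each page divided by all page visits in the session."""
    counts = Counter(pages)
    total = sum(counts.values())
    return {page: c / total for page, c in counts.items()}


def relevance_of_page(dop, fop):
    """RoP as the harmonic mean of DoP and FoP, as in Eq. (5)."""
    return 2 * dop * fop / (dop + fop) if dop + fop > 0 else 0.0


def relevance_matrix(session_dops, session_fops, pages):
    """Build an m x n matrix with one row per session and one column per page.

    session_dops / session_fops: lists of {page: value} dicts, one per session.
    Pages absent from a session get relevance 0 (assumption).
    """
    return [
        [relevance_of_page(dop.get(p, 0.0), fop.get(p, 0.0)) for p in pages]
        for dop, fop in zip(session_dops, session_fops)
    ]
```

Because the harmonic mean is dominated by the smaller of its two arguments, a page needs both a long normalized stay and a high visit share to obtain a high relevance.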

5) Augmented Session Similarity Based on Page Relevance:
Eq. (7) shows the cosine similarity measure modified to incorporate the relevance of a page. This is termed the page relevance based augmented session similarity measure, and it is more realistic than the binary cosine session similarity measure [9].
However, the key limitation of this measure remains: it ignores the hierarchical grouping of web URLs, in which related web pages are placed in the same directory [32].
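As an illustration of a relevance-weighted cosine measure, the sketch below computes the ordinary cosine similarity between two rows of the relevance matrix. Eq. (7) is not reproduced in this text, so this is one plausible reading rather than the authors' exact formula; the function name is hypothetical.

```python
import math


def augmented_session_similarity(rel_a, rel_b):
    """Cosine similarity between two sessions' page relevance vectors.

    rel_a, rel_b: equal-length lists of RoP values (one entry per page).
    Returns 0.0 when either session has no relevant page.
    """
    dot = sum(x * y for x, y in zip(rel_a, rel_b))
    norm_a = math.sqrt(sum(x * x for x in rel_a))
    norm_b = math.sqrt(sum(y * y for y in rel_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two sessions that visit disjoint sets of pages score 0 under this measure even if the pages sit in the same directory, which is the limitation noted above.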

D. URL Based Syntactic Similarity between the i-th and j-th Page URLs
In [32], the authors also consider the syntactic structure of URLs and compute their similarity from their respective positions in the web hierarchy using Eq. (8). This measure quantifies the path overlap between different visited pages.

$$USS\left(US_a^{p_i}, US_b^{p_j}\right) = \mathrm{Min}\left(1, \frac{\left|LoP(P_{a,i}) \cap LoP(P_{b,j})\right|}{\mathrm{Max}\left(1, \mathrm{Max}\left(LoP(P_{a,i}), LoP(P_{b,j})\right) - 1\right)}\right) \quad (8)$$

where $LoP(P_{a,i})$ is the length of the URL (the number of edges) of the path from the root node to the node of page $P_i$ in user session $US_a$. By incorporating this syntactic similarity of page URLs, the similarity between two augmented web user sessions $(AS_a^{p_i}, AS_b^{p_j})$ is computed by Eq. (9) and denoted $AUSS(AS_a^{p_i}, AS_b^{p_j})$. Here, a uniform URL based syntactic similarity of 1 is assumed for any node and its parent, as well as for all sibling nodes of any parent.
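The two URL based measures can be sketched as follows. `url_syntactic_similarity` follows Eq. (8), counting shared leading path segments as the overlap, and `augmented_url_similarity` follows the relevance-weighted form of Eq. (9). The helper names, the stripping of query strings, and the rule that identical URLs always score 1 (which covers the root page) are assumptions made for illustration.

```python
def url_path(url):
    """Split a URL path into its segments, ignoring any query string."""
    return [seg for seg in url.split("?")[0].strip("/").split("/") if seg]


def url_syntactic_similarity(url_i, url_j):
    """Eq. (8): shared path prefix length over the longer path length minus one."""
    p_i, p_j = url_path(url_i), url_path(url_j)
    if p_i == p_j:
        return 1.0  # assumption: identical URLs are fully similar
    common = 0
    for seg_i, seg_j in zip(p_i, p_j):
        if seg_i != seg_j:
            break
        common += 1
    return min(1.0, common / max(1, max(len(p_i), len(p_j)) - 1))


def augmented_url_similarity(rel_a, rel_b, urls):
    """Eq. (9): relevance-weighted average of pairwise URL similarities.

    rel_a, rel_b: RoP vectors of two sessions, aligned with the `urls` list.
    """
    num = sum(
        ra * rb * url_syntactic_similarity(urls[i], urls[j])
        for i, ra in enumerate(rel_a)
        for j, rb in enumerate(rel_b)
    )
    den = sum(rel_a) * sum(rel_b)
    return num / den if den > 0 else 0.0
```

Under this sketch, `/a/b/c.html` and `/a/b/d.html` are siblings and score 1, a page and its parent directory also score 1, and pages that share only the first of three path segments score 0.5.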

E. URL Based Syntactic Similarity between the i-th and j-th Page URLs
When the URL based syntactic similarity of a page pair is large, AUSS (augmented URL based syntactic similarity) produces better results. The proposed combined augmented session similarity (CASS) measure utilizes the best characteristics of both individual measures and takes the more favorable of the two to generate better similarities between web user sessions, using Eq. (10):

$$CASS(AS_a, AS_b) = \mathrm{Max}\left\{ASS(AS_a, AS_b),\ AUSS\left(AS_a^{p_i}, AS_b^{p_j}\right)\right\} \quad (10)$$

As required for relational clustering, this combined augmented session similarity is converted to a dissimilarity metric using Eq. (11), which satisfies the necessary conditions [18] to be a Euclidean metric: (i) non-negativity, (ii) self-dissimilarity, and (iii) symmetry. The methodology described above for the proposed framework is summarized in Algorithm 1. This algorithm is used to compute the dissimilarity matrix needed for relational clustering of web user sessions.
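The combination and conversion steps can be sketched as below. The element-wise maximum follows Eq. (10). Eq. (11) is not reproduced in this text, so the conversion used here, one minus the similarity with a zero diagonal, is an assumption chosen only because it satisfies the three listed conditions; the function name is hypothetical.

```python
def combined_dissimilarity(ass, auss):
    """Combine two m x m similarity matrices and convert to a dissimilarity.

    ass, auss: square lists of lists with similarities in [0, 1].
    Only the upper triangle is read, and the result is mirrored, so the
    output is symmetric with a zero diagonal.
    """
    m = len(ass)
    d = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(a + 1, m):
            cass = max(ass[a][b], auss[a][b])  # Eq. (10)
            d[a][b] = d[b][a] = 1.0 - cass     # assumed form of Eq. (11)
    return d
```

Taking the larger similarity is equivalent to taking the smaller of the two individual dissimilarities for each pair of sessions.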

F. K-medoids Clustering
To evaluate the performance of the proposed augmented user session (dis)similarity framework, we apply the most common implementation of the K-medoids algorithm, known as partitioning around medoids (PAM), to user session data with the different session relational measures computed in this framework. K-medoids is preferred for experimentation because it selects actual user sessions as cluster prototypes. In K-means [14], by contrast, the cluster prototypes (known as centroids) are the means of the sessions in each cluster. This fundamental difference makes the K-medoids algorithm more suitable for user session (dis)similarity metrics (relational data) [33].
Given a set of user sessions $S = \{S_1, S_2, \ldots, S_m\}$, where each session is represented by an $n$-dimensional vector $S_i = (S_i^1, S_i^2, \ldots, S_i^n)$, $\forall i = 1, 2, \ldots, m$, the objective of the K-medoids clustering algorithm is to find $k$ representative sessions, known as medoids, such that the total dissimilarity between the sessions and their nearest medoids is minimized. Let $C \leftarrow \{c_1, c_2, \ldots, c_k\}$ be the set of medoids. The K-medoids objective function is given by Eq. (12), where $S_i$ is the $i$-th user session, $\delta_c$ is the medoid of cluster $C_c$, and the dissimilarity term is the dissimilarity between session $S_i$ and the medoid $\delta_c$ of cluster $C_c$. The membership function $\mu_{ci}$ that minimizes $F_{k\text{-medoids}}$ can be derived from Eq. (13). Once the membership matrix $[\mu_{ci}]$ is fixed, the new cluster medoid $\delta_c$ that minimizes $F_{k\text{-medoids}}$ can be derived from Eq. (14).
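The alternation between the membership step and the medoid update step can be sketched as follows. This is a simple alternating variant that works directly on a precomputed dissimilarity matrix; it is not the full PAM swap search, and seeding the medoids with the first k sessions is an assumption made to keep the example deterministic.

```python
def k_medoids(d, k, max_iter=100):
    """Cluster sessions from an m x m dissimilarity matrix `d`.

    Returns (medoids, labels): medoid session indices and, for every
    session, the index of the cluster it belongs to.
    """
    m = len(d)
    medoids = list(range(k))  # assumption: first k sessions seed the medoids

    def assign(meds):
        # Membership step: each session joins its nearest medoid.
        return [min(range(k), key=lambda c: d[i][meds[c]]) for i in range(m)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(m) if labels[i] == c]
            if not members:  # keep the old medoid for an empty cluster
                updated.append(medoids[c])
                continue
            # Update step: the member with the smallest total dissimilarity
            # to the other members becomes the new medoid.
            updated.append(min(members, key=lambda j: sum(d[i][j] for i in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(medoids)
```

Because every prototype is an actual session index, the procedure needs only the dissimilarity matrix and never has to average session vectors.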

3) Computation of the relevance of any web page in user sessions, for each pair ($S_k$, $P_i$) of session $S_k$ and page $P_i$, where $i = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, m$.
   • Compute the duration of web page $P_i$ in user session $S_k$ using Eq. (3).
   • Compute the frequency of web page $P_i$ in user session $S_k$ using Eq. (4).
   • Compute the relevance of web page $P_i$ in user session $S_k$ using Eq. (5).
4) Compute the page relevance based augmented session similarity matrix using Eq. (7).
5) Compute the page relevance and hierarchical URL similarity based augmented session similarity matrix using Eq. (9).
6) Compute the combined augmented session similarity using Eq. (10).
7) Compute the combined augmented web user session (dis)similarity matrix using Eq. (11).

G. Validity of Generated Clusters
Unsupervised cluster assessment techniques are used to validate the quality of the produced clusters. These techniques are based on compactness and separation, which are measured by the intra-cluster (Eq. (12)) and inter-cluster (Eq. (13)) distances [34], [35], [36]. A low intra-cluster distance and a high inter-cluster distance are desirable for good clusters [37].
Here, $d_{kl}^2(AS_k, AS_l)$ is the distance between two sessions in cluster $C_i$, and $|C_i|$ is the number of sessions in $C_i$.
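A rough sketch of the two validity measures is shown below. The defining equations are not reproduced in this text, so the forms used here are assumptions: the intra-cluster value is the mean pairwise dissimilarity inside each cluster, averaged over clusters with at least two sessions, and the inter-cluster value is the mean dissimilarity over all pairs of sessions that lie in different clusters. The function names are hypothetical.

```python
from itertools import combinations


def average_intra_cluster(d, labels):
    """Mean within-cluster pairwise dissimilarity, averaged over clusters."""
    per_cluster = []
    for c in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == c]
        pairs = list(combinations(members, 2))
        if pairs:  # singleton clusters contribute no pairs
            per_cluster.append(sum(d[i][j] for i, j in pairs) / len(pairs))
    return sum(per_cluster) / len(per_cluster) if per_cluster else 0.0


def average_inter_cluster(d, labels):
    """Mean dissimilarity over pairs of sessions in different clusters."""
    pairs = [
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] != labels[j]
    ]
    return sum(d[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0
```

A clustering is preferred when the first value is low and the second is high, in line with the compactness and separation criteria described above.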

III. RESULTS AND DISCUSSION
The combined augmented session similarity framework is evaluated for its effectiveness and efficiency. The experiments are performed on a web server access log recorded from a heavily trafficked global service provider portal. This portal provides information about visas, insurance, and other related services to international travelers to the USA. The proposed framework is implemented and tested using MATLAB [38]. Experiments were performed on an Intel® Core™ i3 (CPU M370, 2.40 GHz) with 4.00 GB RAM running Windows 7 (64-bit). In all experiments, k-medoids clustering was performed with the different dissimilarity measures, and both the average intra-cluster and inter-cluster distances were computed against a varying number of user session clusters.

First, we applied pre-processing [29] to the web log file and generated 468 sessions using a 30-minute threshold. We filtered out 128 web robot sessions because they are generated by autonomous software agents and their access behavior is monotonous [31]. A further 216 very small sessions of length 1 or 2 were discarded because they do not convey any useful information for clustering. After this process, we obtained 124 user sessions. Together, these user sessions accessed more than 150 unique pages. We reduced these to 95 unique pages by removing pages that were visited very infrequently or for a very short duration in these sessions.

We then compute $D_{124 \times 124}$ using Eq. (7) to obtain the augmented session dissimilarity metric $ASS(AS_a, AS_b)$ and run k-medoids clustering for 5, 10, 15, and 20 clusters. The algorithm assigns the 124 sessions to the respective number of clusters. The summarized readings of the experiments are recorded in Table 2. Next, a $95 \times 95$ URL similarity matrix is computed for all referenced unique pages (URLs) using Eq. (8), and from it the $124 \times 124$ URL based syntactic dissimilarity $AUSS(AS_a^{p_i}, AS_b^{p_j})$ is computed using Eq. (9). The same procedure is followed again, and the results are reported in Table 2. Finally, we compute the combined augmented session dissimilarity metric $CASS(AS_a, AS_b)$ using Eq. (10) and (11), which combines the two measures by taking the maximum of the two similarities (Eq. (10)) before converting the result to a dissimilarity (Eq. (11)). The combined dissimilarity measure outperforms the independent measures on both the intra-cluster and inter-cluster evaluation parameters. It is also observed that optimal and stable results are found with ten clusters. In this experiment, we consider a small log to avoid pre-processing overhead, but in future the same experiment can be extended to a large log file.
$$AUSS\left(AS_a^{p_i}, AS_b^{p_j}\right) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} AS_a^{(RoP)_i} \times AS_b^{(RoP)_j} \times USS\left(US_a^{p_i}, US_b^{p_j}\right)}{\sum_{i=1}^{m} AS_a^{(RoP)_i} \times \sum_{j=1}^{m} AS_b^{(RoP)_j}} \quad (9)$$

Page relevance plays the major role in deciding the similarity of two web user sessions. If a web page pair has a small $USS(US_a^{p_i}, US_b^{p_j})$ (URL based syntactic similarity), then $ASS(AS_a, AS_b)$ (page relevance based augmented session similarity) gives good results, whereas for a large value of $USS(US_a^{p_i}, US_b^{p_j})$ the augmented URL based syntactic similarity gives better results.

Algorithm 1: Page relevance and URL based syntactic (dis)similarity metric of web user sessions.
Input: {Log file $\mathcal{L}$ of records, where $L \leftarrow \{r_1, r_2, \ldots, r_n\}$ and $n \ggg 1$; $U_{n \times n}$, the URL based syntactic similarity matrix}
Output: {$D_{m \times m}$, the (dis)similarity matrix}
1) Pre-processing of web server access log
   • Removal of extraneous information: $L_c \leftarrow \{r_1, r_2, \ldots, r_n\}$, the log file after cleaning
   • Identification of web users $U_i = U_1, U_2, \ldots, U_n$ for $i = 1, 2, \ldots, n$
   • Identification of web user sessions $S_i = S_1, S_2, \ldots, S_m$ for $i = 1, 2, \ldots, m$, where $m \ge n$
2) Vector representation of web user sessions

Fig. 1. Average intra-cluster distance vs. number of clusters for different dissimilarity measures

TABLE I. SESSION TABLE WITH PAGE FREQUENCY AND DURATION