I recently worked on Data Mining and Analysis on Twitter as my semester project at EPFL. During the 4 months that we worked on the project, we were able to achieve many goals. In this post, I will be describing how we can cluster a small group of users on Twitter based on different similarity metrics and a brief comparison of several clustering methods that could be used for this.

For this analysis, we used three different user lists on twitter as the ground truth data for a group of about five hundred users. We obtained all the tweets from the users who were listed in the three lists and then tried to obtain clusters by using different similarity metrics using the spectral clustering algorithm. In addition to this, we also explore different connections between users in addition to just the social connections in order to find out other features that affect the users being listed together. We present results of applying spectral clustering algorithm using the modularity matrix and the symmetric normalized Laplacian matrix. We compare the results of these approaches while using several different input matrices formed by different combination of the above similarity measures.

If we have a set of n objects \$\$x_1,x_2,x_3,\dots,x_n\$\$ with a pairwise similarity function defined between them which are symmetric and non-negative. Spectral clustering is the set of methods and techniques that partition the set into clusters by using the eigenvectors of matrices.

The motivation behind using eigenvectors for clustering is that the change of representation induced by the eigenvectors makes the cluster properties of the initial data set much more evident. In this way, spectral clustering is able to separate data points that could not be resolved by applying directly k-means clustering, for instance, as the latter tends to deliver convex sets of points. Since the introduction of spectral methods in "Lower Bounds for the Partitioning of Graphs" there have been several researches where scientists have tried using different matrices for the calculation of eigenvectors followed by applying clustering on the eigenvectors. A complete discussion about the different spectral algorithms and matrices would take a new post, so I wouldn't go into much details here.

Now, I will present the results of applying spectral clustering algorithm using the modularity matrix and the symmetric normalized Laplacian matrix. We compared the results of these approaches while using several different input matrices formed by different combination of the above similarity measures. Before I begin, let's have a look at the spy plot of the user connections. The users have been ordered by the lists that they belong to and therefore, we can immediately observe three communities present in the network by looking at the plot.

We present the results of application of the algorithm on users’ social connections as well as several other individual similarity measures (user mention similarity, description content similarity and tweet content similarity) followed by a simple combination of the different similarity measures. For finding the combined similarity measure, we sum all the different similarity measures. Since the different similarity measures can be on different scales, these similarity measures are normalized before we add them together and apply the clustering algorithms. Therefore, the adjacency matrix that corresponds to the combined similarity measures is the sum of all the individual normalized adjacency matrices.

In order to measure the accuracy of our clustering algorithms, we use several different cluster evaluation objective functions to compare obtained clusters with the ground truth data of clusters which represents the distribution of the users into different lists. I will present the values of Normalized Mutual Information and RandIndex during the analysis. A higher value of these factors means a better correspondence with ground truth. The table at the end of this post summarizes these values for different clustering algorithms and similarity metrics.

I will now show some visualizations of our obtained results. The results have been obtained using Gephi for visualization. The communities are represented by different colours in the visualizations. Note that the arrangement of the nodes in the space doesn’t represent any communities. The arrangement of nodes in the visualizations corresponds to a layout. We keep the same layout for all the visualizations so that it is easy to see the results of the clustering process. The different clusters in the visualizations are represented by different colours. This means that all the nodes in the visualization that have the same colour have been placed into one cluster for that visualization. Figure (a) shows the communities formed using the ground truth data. Figure (b) shows the clusters obtained for the network using the user connections as the only similarity measure using the modularity matrix for spectral clustering. Figure (c) shows the results of community detection using the combination of all similarity measures for the modularity matrix. Finally Figure (d) shows the results for community detection applied on combined similarity measures using the Symmetric Normalized Laplacian matrix.

a b c d We can observe that user’s social connections are the most dominating factors for this group of users while dividing the users into different communities. We can also observe that the high values of benchmark results correspond to the good community detection that corresponds closely to the ground truth data by looking at the visualizations in Figure (a) and Figure (b) which show the ground truth clusters and the clusters obtained by applying community detection using modularity matrix on social connections respectively. The other individual similarity measures like the user mentions, tweet content similarity and description content similarity don’t perform very well when used for community detection using the modularity matrix. We also observe that the results for combined similarity measures, i.e. the sum of connections, mentions, tweet content similarity and description content similarity doesn’t perform as well as the connections. This means that the addition of low information contents like the mentions, tweet content similarity and description content similarity decreases the accuracy of the clustering algorithm even in the presence of the highly informative social connections. A reason for the bad performance of the similarity measures based on the tweets, descriptions and mentions can be that the group of users are similar and generally post similar content on the web. This also means that the user behaviours don’t seem to be consistent with the ground truth data.

However, the detection of communities for the user’s social connections using the Symmetric Normalized Laplacian matrix fails. This is because the Laplacian based methods are known to be quite sensitive to the presence of disconnected nodes in the graph. Therefore, the results of the social connections for the Symmetric Normalized Laplacian are similar to the other individual results which mean that we are not able to reconstruct any valuable cluster information when using the Normalized Laplacian. But the combined matrix performs consistently even when using the Normalized Laplacian for community detection. This is because addition of several different kinds of information to the social connections makes it a connected graph and therefore, we now observe results consistent with the ones obtained while using modularity matrix for community detection.

Therefore, we can conclude from these results that user connections are a very good measure of information and for this model; they give the best clustering results when used with the modularity matrix for spectral clustering. We can also see that adding several other non-informative measures to this layer decreases the accuracy considerably when used with the modularity matrix. However, an interesting observation occurs from our discussion and results regarding the Symmetric Normalized Laplacian matrix that even a highly informative layer (user connections) can prove to be a very bad indication of clustering if not used with the correct clustering algorithm due to the presence of disconnected nodes in the graph. We also found that even adding a non-informative (or slightly informative) layer as in this case can improve the connectivity and therefore improve the clustering results.

 Similarity Matrix Modularity Matrix Symmetric Normalized Laplacian Matrix NMI RI NMI RI User Connections 0.3868 0.7174 0.0077 0.3374 Mention 0.0130 0.3398 0.0077 0.3374 Tweet content similarity 0.0074 0.3371 0.0077 0.3374 Description content Similarity 0.0780 0.5254 0.0088 0.3381 All combined 0.2500 0.6175 0.2931 0.6472

All posts in this series: