Most research in the field of community detection considers only social connections as the similarity measure for obtaining communities in a network. In this post (part of my series on Data Mining and Analysis on Twitter), we discuss several other possible similarity measures between users, and why and how they can be used to cluster users in the network.

  • User Connections: This is the similarity measure most often used in the literature to define a connection between two users. We define a social connection on Twitter as a following or being-followed relationship between two users. As we will see in further sections, it is one of the most dominant factors producing community structure on Twitter, which is why it has been used so extensively in prior research. We define an edge between two users if either one follows the other. The users' social connections matrix (or the user connections matrix, as we will call it from now on) is therefore symmetric, with a link between any two users who are related by following in either direction.
  • User Mentions: This is another form of connection that can be defined between two users. As described before, a mention is the event of naming another user in a tweet. The image below shows an example of a mention on Twitter: the user @EPFLNews has mentioned another user, @SmallRivers. The mention is the result of @EPFLNews saying something about @SmallRivers and wanting to let them know. Mentions have been observed to arise from a discussion or a good relationship between users, so they can serve as a good measure of the relationship between them. Another important motivation for using mentions as a similarity measure is that they correspond closely to user connections but are much more selective. We count the mentions that two users make of each other and set the weight of the link between them to the number of tweets posted by the first user that mention the second plus the number of tweets posted by the second that mention the first.

An example tweet showing a mention on Twitter
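The two connection-based weights described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code used for the analysis: the input formats (a list of (follower, followee) pairs, and a list of (author, mentioned user) pairs with one entry per mentioning tweet), the function names, and the choice of weight 1 for a follow edge are all hypothetical.

```python
from collections import Counter


def follow_edges(follows):
    """Symmetric follow graph: an edge exists if either user follows the other.

    `follows` is an iterable of (follower, followee) pairs (assumed format).
    Each edge is keyed by a frozenset so that direction is ignored.
    """
    edges = {}
    for follower, followee in follows:
        if follower != followee:
            # Weight 1 is an assumption made for this sketch.
            edges[frozenset((follower, followee))] = 1
    return edges


def mention_weights(mentions):
    """Undirected mention weights.

    `mentions` is an iterable of (author, mentioned) pairs, one per tweet in
    which `author` mentions `mentioned` (assumed format). The weight of the
    link {a, b} is the number of tweets by a that mention b plus the number
    of tweets by b that mention a.
    """
    weights = Counter()
    for author, mentioned in mentions:
        if author != mentioned:
            weights[frozenset((author, mentioned))] += 1
    return weights


# Toy data with made-up user names.
follows = [("alice", "bob"), ("bob", "alice"), ("carol", "alice")]
mentions = [("alice", "bob"), ("alice", "bob"), ("bob", "alice")]

print(follow_edges(follows))
print(mention_weights(mentions))
```

Keying each edge by an unordered pair is what makes both structures symmetric: a mutual follow still yields a single edge, while mentions in both directions are added into one weight.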

  • Description Content Similarity: Users on Twitter can post a description of themselves, which is shown on their profile. The image below shows an example of a description on Twitter for the user with screen name @pulkit110. This description can sometimes be used to measure the similarity between users. Since it generally contains keywords about what the user likes to do or where they work or study, it can be a good indicator of user similarity. Therefore, we consider the cosine similarity between users' descriptions as one of the similarity measures for determining clusters of users on Twitter.

Example of user description on Twitter
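For reference, cosine similarity between two users can be written as below. The notation is an assumption introduced here: a and b are term-count (or term-weight) vectors built from the two users' descriptions over a shared vocabulary of n terms. The term weighting itself is not specified in this post.

```latex
\[
\mathrm{sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}
         {\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{k=1}^{n} a_k b_k}
         {\sqrt{\sum_{k=1}^{n} a_k^{2}} \, \sqrt{\sum_{k=1}^{n} b_k^{2}}}
\]
```

With non-negative term counts the value lies between 0 and 1: it is 0 when the two descriptions share no terms and 1 when they use the same terms in the same proportions.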

  • Tweet Content Similarity: Tweets are the central feature of Twitter. They are what most users are interested in, and therefore they can be used as a similarity measure between users. We define the tweet similarity between two users as the cosine similarity between the documents formed by combining all the tweets of each user into one. This text similarity helps us observe whether users are interested in talking about similar topics. If two users talk about the same topic, it is quite possible that they are interested in similar things, which is an indication of good similarity between them.
  • Hashtag Similarity: Hashtags are a distinctive feature of Twitter that allows users to mark important keywords in their tweets by prefixing '#' to a keyword. Hashtags have been used on Twitter to set trending topics, to start chat rooms, and so on. Because hashtags let users specify what they consider an important keyword in a tweet, they can be regarded as a very strong factor for comparing two users. The image below shows an example of the user @diwakarsapan posting a tweet with the hashtag #IEUsers. The hashtag shows that the user wants to emphasize a particular keyword in the tweet. We define the hashtag similarity between two users as the cosine similarity between their collections of hashtags.

Example of a hashtag on Twitter
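The tweet and hashtag similarities described above can be sketched as follows. This is a minimal illustration rather than the implementation behind the analysis: it assumes raw term counts (no stop-word removal or TF-IDF weighting), simple regular-expression tokenization, and case-insensitive matching, and all function names and the sample tweets are hypothetical.

```python
import math
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")  # captures the keyword after '#'
WORD = re.compile(r"\w+")        # crude word tokenizer (assumption)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no terms is similar to nobody
    return dot / (norm_a * norm_b)


def tweet_vector(tweets):
    """Combine all tweets of one user into a single document and count words."""
    document = " ".join(tweets).lower()
    return Counter(WORD.findall(document))


def hashtag_vector(tweets):
    """Count the hashtags used across all tweets of one user."""
    return Counter(
        tag.lower() for tweet in tweets for tag in HASHTAG.findall(tweet)
    )


# Made-up tweets for two users.
user_a = ["Learning #Python today", "More #python and #data"]
user_b = ["I like #data", "#Python rocks"]

print(cosine(tweet_vector(user_a), tweet_vector(user_b)))
print(cosine(hashtag_vector(user_a), hashtag_vector(user_b)))
```

Both measures reuse the same `cosine` function; only the way each user's vector is built differs. Sparse counters keep the computation proportional to the number of distinct terms a user actually uses rather than to the size of the full vocabulary.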

All posts in this series:

  1. Analysis of Fast Modularity Clustering on Twitter
  2. Analysis of Spectral Clustering on Twitter
  3. Predicting future mentions on Twitter
  4. Similarity Metrics on Twitter