Writing scalable recommender system with Hadoop

Lets say we want to implement a very simple “people you might know” recommender algorithm. Obviously, the speed and scalability of such an algorithm is as important as the actual logic behind the algorithm because such algorithms generally run over a “huge” graph and implementing these normally would probably take a lot of time for recommending friends even for just one user. I will show how to implement a very naive recommender algorithm efficiently. The basic idea of the algorithm is that if person A and person B do not know each other but they have a lot of mutual friends, then the system should recommend that they connect with each other. We will do this is hadoop which is an open source software for highly reliable, scalable distributed computing. The main idea behind it is map and reduce jobs. I assume that you are already familiar with these concepts (tutorial). If not, I wouldn’t recommending reading this post further without having a look at what a mapper and reducer is.

In order to recommend such friends, we first need to count the number of mutual friends that each pair of persons have in such a network. For this, we will need a map reduce job that functions similar to the map-reduce job for finding the frequency of words in a file. For every pair of friends in a list, we output the following tuple: <[friend1 friend2 1], 1> . In addition to this, we also output a tuple <[source friend -1], -1> for every pair of persons who are friends. i.e. For an input:

source -> {“friend1”, “friend2”, “friend3”}

the mapper outputs

{[source, friend1, -1], -1}; {[source, friend2, -1], -1};
{[source, friend3, -1], -1}; {[friend1, friend2, 1], 1};
{[friend1, friend3, 1], 1}; {[friend2, friend3, 1], 1}

Now, we need to define our equals() method such that it can looks at the two names in the key to find equal keys. We also need to define our Comparator such that it sorts the keys based on the third attribute in the key. Therefore, the reducer gets as input a key denoting a pair of friends along with the list of their number of common friends. In the reduce function, we can check if the first record has a -1 in the key. If there is such a record, we can ignore that pair of friends because they already have a direct connection between them. Finally, we aggregate the values for all the keys in the reducer and output the tuple <[friend1, friend2], numberOfMutualFriends> for all the pairs of friends which don’t have third attribute as -1 in the key.

function map(input)
  split input to obtain source and friends[]
  for (i = 1:n)
    f1 = friends[i]
    output <[source, f1, -1], -1>
    for (j = i:n)
      f2 = friends[j]
      output <[f1, f2, 1], 1
    end
  end
end

function equals(key1, key2)
  if ((key1[1] == key2[1] && key1[2] == key2[2]) || key1[1] == key2[2] && key1[2] == key2[1])
    return true;
  end
  return false;
end

function compare(key1, key2)
  return (key1[3] < key2[3]);
end

function reduce(key, values)
  if (values[0] == -1)
    return;
  end
  numberOfCommonFriends = 0;
  for (value v in values)
    numberOfCommonFriends += v;
  end
  output <[key[1], key[2]], numberOfCommonFriends>
end

After the first map-reduce job, we obtain a list of pair of persons along with the number of common friends that they have. Our final map-reduce job looks at this list and outputs a list of persons they have maximum number of common friends with. Hence, our map job just outputs<[pair[1], numberOfCommonFriends], pair[2]> and<[pair[2], numberOfCommonFriends], pair[1]>. Our equals() method would look at the person in the key to find equal keys and our comparator would look at the numberOfCommonFriends to sort the keys. This would ensure that the tuples for the same person go to the same reducer and in a sorted order by the number of common friends. Our reducer then just needs to look at the top 10 values and output the list.

function map(input)
  split input to obtain friend1, friend2 and numberOfCommonFriends
  output <[friend1, numberOfCommonFriends], friend2>
  output <[friend2, numberOfCommonFriends], friend1>
end

function equals(key1, key2)
  if (key1[1] == key2[1])
    return true;
  end
  return false;
end

function compare(key1, key2)
  return (key1[2] < key2[2]);
end

function reduce(key, values)
  suggestions = []
  for (i = 1:min(10,size(values)))
    suggestions.push(values[i]);
  end
  output <key[1], suggestions>
end

This is all we need to do to have our small recommender system up and running on Hadoop which will already scale quite well. There are still plenty of things that can be done here both in the algorithm as well as the implementation to improve the results and scalability.