
Matching people based on their comments

I'm on the board of a student group on campus that has run a 'buddy system' for a few years now. Every year, the board gets together to match up interested students on campus in a mentor-mentee setup; students designate whether they want to be a mentor or a mentee (or both), and the board pairs them up based on their major and what they said in their comments. Younger students in BioEngineering are typically paired with older students in BioEngineering so that the older student can pass on their knowledge about UC Berkeley: what they did, what they should have done, when things happen, etc. The pairing process normally takes about a whole day, drawing on all 8 board members and their knowledge of each student (it's a very diverse community and board); we've started at 1pm and ended at 12am before.

As the VP of Technology for this group, I maintain and extend their website (http://calampd.org/), and I'm always looking for ways to use technology to improve their operations. I'm limited by the hours in the day, but recently I found some free time and thought of using some graph theory I learned recently to produce rudimentary pairings between students for the buddy system.

So today I'm posting the code I'll be using. It provides two methods, each taking in one or two lists of dicts and a few options:

  • match_two_sets(users_one,users_two,field='data',callback=lambda o,t: None)
  • match_one_set(users,num_groups=2, group_size_range=None, field='data')

These each fulfill different uses.

match_two_sets()

If you have two mostly disjoint sets of users (mostly, because in my use case mentors could also be mentees if they wanted alumni mentors) that each have some text describing them, call match_two_sets(). The lists of users really only require one field ('data' by default, configurable), but I also check for matching ids, so those need to be defined unless the code is modified:

users_one=[ {'id':1,'data':'data that describes user 1','key':'value'}, ...]

users_two is just the other set. field is the key used on the passed-in user dicts to get the descriptive text; it defaults to 'data'. The callback is a function that takes two dicts, the first from users_one and the second from users_two, indicating a pair. I know, it's not Pythonic, it's actually very Javascripty. I was deep in the implementation details of a webOS app just before I started this, so I had it on the mind.

The function doesn't return anything, so the callback is how you collect results; for example, the callback can append to a list that then serves as the return value.
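Here's a minimal sketch of a call along those lines; the sample data is made up, and match_two_sets is assumed to be importable from wherever you keep the code:

    mentors = [{'id': 1, 'data': 'bioengineering research, enjoys hiking'},
               {'id': 2, 'data': 'pre-med, volunteers at the hospital'}]
    mentees = [{'id': 3, 'data': 'interested in bioengineering labs'},
               {'id': 4, 'data': 'thinking about med school'}]

    pairs = []  # the callback closes over this list to collect results
    match_two_sets(mentors, mentees,
                   field='data',
                   callback=lambda mentor, mentee: pairs.append((mentor['id'], mentee['id'])))

    print(pairs)  # something like [(1, 3), (2, 4)]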

Internals

For each user, I take their descriptive text and extract all words of 4+ letters (I'm generalizing that words of 3 or fewer letters are 'unimportant'). With these words, I use NLTK to find their stems and compare the lists of word stems between users across the two sets. NLTK's stemming functions aren't very good, very often returning stems that are too short, but they're better than comparing the raw words.
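The core of that comparison looks roughly like the sketch below; it's not the exact code, and I'm using NLTK's PorterStemmer as a stand-in for whichever stemmer the real code happens to use:

    import re
    from nltk.stem.porter import PorterStemmer  # stand-in stemmer for this sketch

    stemmer = PorterStemmer()

    def significant_stems(text):
        """Stem every word of 4+ letters; shorter words are treated as unimportant."""
        words = re.findall(r"[A-Za-z]{4,}", text.lower())
        return set(stemmer.stem(w) for w in words)

    def edge_weight(user_a, user_b, field='data'):
        """Weight an edge by how many stems two users' descriptions share."""
        return len(significant_stems(user_a[field]) & significant_stems(user_b[field]))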

Once these stems have been compared between all the users, I have weighted edges from every user in one group to every user in the other. Normally, this would be where bipartite matching is used, but I needed the weights to be taken into account, so I used a more general weighted matching implementation already available online in Python. If anyone knows of a maximum-weight bipartite matching implementation, feel free to comment and I'll look into it.
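With those weights, the pairing itself reduces to a single call into mwmatching.py (linked in the Code section below). This is a sketch rather than the real implementation; it reuses edge_weight() from the previous sketch and assumes maxWeightMatching() takes a list of (i, j, weight) tuples and returns a mate list with -1 for unmatched vertices, which is how I understand that module's interface:

    from mwmatching import maxWeightMatching  # mwmatching.py, see the Code section

    def pair_up(users_one, users_two, field='data'):
        """Sketch: weighted edges across the two sets, then one matching call."""
        offset = len(users_one)  # vertices for users_two are numbered after users_one
        edges = []
        for i, a in enumerate(users_one):
            for j, b in enumerate(users_two):
                w = edge_weight(a, b, field)  # stem-overlap weight from the earlier sketch
                if w > 0:
                    edges.append((i, offset + j, w))
        # maxcardinality=True asks for as many pairs as possible, then maximum weight
        mate = maxWeightMatching(edges, maxcardinality=True)
        return [(users_one[i], users_two[m - offset])
                for i, m in enumerate(mate[:offset]) if m >= offset]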

Speed analysis

This is an O(n^2) implementation on my end, with O(n^3) for the matching. I ran some timing tests and it takes 33 minutes on our 264-active-member database. Most of that time seems to be spent in the stemming code, so I added some caching to my stem() wrapper, but those results haven't come back yet (I'm running the algorithm on sizes from 2 to 264 at increasing increments and averaging 10 trials each, so it will finish after I wake up tomorrow). I'll post them when it's done.
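The caching itself is nothing clever; the idea behind the stem() wrapper is just memoization along these lines:

    _stem_cache = {}

    def stem(word):
        """Memoized stemmer wrapper: each distinct word is only stemmed once."""
        if word not in _stem_cache:
            _stem_cache[word] = stemmer.stem(word)  # stemmer from the earlier sketch
        return _stem_cache[word]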

The regression I calculated in Excel comes out to 0.0003x^2 + 0.0768x - 5.08, which seems to indicate that if I somehow compiled this to C/C++ (Shedskin or Cython) I might get some good speedups. Shedskin doesn't support third-party libraries (yet?) due to its relatively young state, but I see some improvements coming in the pipeline that might open up many more libraries to compilation thanks to closure support.

match_one_set()

This was the most interesting function and the one that used the most graph theory, mainly because it not only works on a non-bipartite graph by default, but also supports splitting into multiple groups. The point of this function is, given a set of people and information about them or their interests, to split them into a desired number of groups. I was inspired to write it when my professor asked all his students to email him what they were interested in doing their final projects on so that he could attempt to match them up according to their wishes. Upon hearing that, and having only recently finished a class on algorithms whose back half was almost entirely about graphs, I set upon writing this function.

Calling this function is basically the same as with two sets, except some arguments have changed. There is obviously only one set, and no callbacks. Instead, a list of lists of users is returned in the normal Pythonic way. Finally, the group_size_range is a two-tuple (or list) that indicates the allowed minimum and maximum size of each group; if not provided, it defaults to the number of users divided by the number of groups plus or minus 25% (an arbitrary number chosen based on hearing many of my teachers say that groups are to be of size 4-5).
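A call looks something like this sketch (the sample emails are made up, and the default-range arithmetic in the comment is just my 25% rule written out):

    emails = ['I want to build a web crawler',
              'interested in machine learning',
              'graph algorithms and visualization',
              'machine learning applied to images',
              'crawling and indexing the web',
              'drawing graphs of social networks']
    students = [{'id': i, 'data': text} for i, text in enumerate(emails)]

    # With no group_size_range, the allowed sizes default to roughly
    # (0.75 * len(students) / num_groups, 1.25 * len(students) / num_groups).
    groups = match_one_set(students, num_groups=2, field='data')
    for group in groups:
        print([member['id'] for member in group])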

Internals

A double loop (triangular, so users aren't paired twice) is run over the provided user set, using the same stem-based edge-creation algorithm as match_two_sets(). Afterward, python-graph/pygraph is pulled in to take the graph of all these edges and run maximum flow (using the two most heavily connected nodes as source and sink) to split the graph into smaller components. The largest remaining graph is split repeatedly until no graph is bigger than the maximum allowed by group_size_range[1].
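Roughly, the edge creation and one splitting step look like the sketch below. It assumes python-graph's graph class and maximum_flow() from pygraph.algorithms.minmax, and that the returned cut dict marks which side of the minimum cut each node falls on; edge_weight() is from the earlier sketch:

    from itertools import combinations
    from pygraph.classes.graph import graph
    from pygraph.algorithms.minmax import maximum_flow

    def build_graph(users, field='data'):
        """Triangular double loop: one weighted edge per unordered pair of users."""
        gr = graph()
        gr.add_nodes([u['id'] for u in users])
        for a, b in combinations(users, 2):
            w = edge_weight(a, b, field)
            if w > 0:
                gr.add_edge((a['id'], b['id']), wt=w)
        return gr

    def split_once(gr, source, sink):
        """Run max flow between two nodes and return the two sides of the cut."""
        _, cut = maximum_flow(gr, source, sink)
        side_a = [n for n in gr.nodes() if cut.get(n, 0) == 0]
        side_b = [n for n in gr.nodes() if cut.get(n, 0) == 1]
        return side_a, side_b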

Now all graphs are either too small or just right, so it's time to combine the small ones. All the too-small graphs are found, every pair of them is checked to find the pair with the most weight to gain from combining, and that pair is merged. This repeats until there are either no more too-small graphs or only one left. That last one is split into single-node graphs, and each node is put into the group that gains the most weight from accepting it (the group that has the most in common with that node).
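The combining step is a plain greedy loop, roughly like this sketch (group_weight() here is my shorthand for 'total edge weight between every cross-pair of two groups', built on the same edge_weight() as before):

    from itertools import combinations

    def group_weight(group_a, group_b, field='data'):
        """Total stem-overlap weight between every cross-pair of the two groups."""
        return sum(edge_weight(a, b, field) for a in group_a for b in group_b)

    def combine_small_groups(groups, min_size, field='data'):
        """Repeatedly merge the pair of too-small groups that gains the most weight."""
        while True:
            small = [g for g in groups if len(g) < min_size]
            if len(small) < 2:
                break  # zero or one too-small group left; a lone one gets split up instead
            a, b = max(combinations(small, 2),
                       key=lambda pair: group_weight(pair[0], pair[1], field))
            groups.remove(a)
            groups.remove(b)
            groups.append(a + b)
        return groups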

There are other ways to handle it, but this, to me, made the most sense in my envisioned use-case of students and groups.

Speed analysis

I haven't run anything on this, and I'm not likely to get very far on it myself since the big-O situation is highly dependent on the input and on the relationships between users that may cause more or less graph splitting. It could be anywhere from O(n^2) to O(n^4) depending on the input. There may be ways to lower the upper bound on complexity, but I'm not motivated to find them since my datasets won't be above the triple-digit range for a while, and I know it's definitely not exponential.

Code

There are several dependencies: python-graph (import as pygraph to check) and python-nltk (import nltk). mwmatching.py is from http://www.xs4all.nl/~rjoris/maximummatching.html. If you use the less useful match_two_equal_sets() function, which only works on two sets of equal size, you'll also need http://www.dix.polytechnique.fr/INF441/INF441b/code/kuhnMunkres.py.

The code I wrote is available online, but make sure to download the prerequisites before trying it out or it will fail. The Python files should be local; python-graph can be installed locally, but nltk must be installed system-wide or added to the PYTHONPATH before using any of the functions.
