A couple months ago, Daniel Tunkelang posted an algorithm on his blog that attempts to emulate PageRank for Twitter. I implemented a toy version I dubbed TunkRank, and then suggested that name on his blog. It got some traction, so I figured what the heck and decided to implement it on TunkRank.com.
Now, there appeared to be a little debate about just whether it is actually emulating PageRank or something else on Daniel’s blog, but I leave it to you to read the comments on his post if you’re interested. There are also plenty of ideas there on the best way to establish a measure of influence. I’ll limit the discussion in this post to the basics.
- The amount of attention you can give is spread out among all those you follow. The more you follow, the less attention you can give each one.
- Your influence depends on the amount of attention your followers can give you.
As a twitterer, your influence does not depend on how many people you follow. However, your usefulness as a follower does. Having higher influence depends on having many followers who follow relatively few people but are followed by many. Followers like that are more likely to pick up on your tweets, act on them, retweet them, whatever. You gain influence through the social graph thanks to their influence.
Therefore, your TunkRank score is a reflection of how much attention your followers can both directly give you and give to you.
I implemented this algorithm in Ruby using Merb, MySQL, Capistrano, nginx, and ActiveRecord (and, of course, Git for version control). While my job involves working on a web app, my role has mostly been on back-end NLP stuff. I’m still quite new to the whole Rails-level-web-app-world. For those who don’t know, Merb is a framework similar Ruby on Rails. So similar they are merging and will become Rails 3. ActiveRecord is an Object-relational Mapping (ORM) that Rails uses. The standard ORM for Merb is DataMapper, but I stuck with something I’m more familiar with to limit the variables in my little project.
There are many aspects of getting a web app up and running that I had only heard about in passing — and many more I’m still lost on. But I figured implementing TunkRank would be an interesting place to start.
Phase I – Data Collection
As I said, I implemented TunkRank as a toy the same night that Daniel posted his algorithm. Things seemed to work out quite nicely and I liked it on theoretical grounds as a measure. When I decided to implement the real version, the task of hammering Twitter millions of times suddenly loomed. I suppose I thought there were maybe about 1 million active accounts on Twitter. I have harvested over 2 million before slowing my harvesting down in favor of other development. I have also collected about 40 million edges in the social graph (user A follows user B is one edge). Of the 2 million users I have encountered, those 40 million edges are for only 25% of them. I still haven’t gotten the followers for the remaining 1.5 million. When I do so, I’m sure I’ll discover another million or three users I haven’t seen yet.
I stopped where I did because I was using Ruby’s marshal functionality to dump the social graph to disk. Each dump was weighing in around 250 MB and it was exceeding Marshal’s ability to function. At this point I threw everything into a MySQL database. Bleh! I can’t even describe the pain in the ass that was. If I were to do that again, I would certainly use PostgreSQL, and may still do so. Better yet, I would use some sort of column store database. But it’s in the MySQL db now and running ok (just ok, not great or even well). MySQL dies quietly and annoyingly at times. I hate it.
Doing the operations I was doing before in memory in ActiveRecord instead is mind-bogglingly slow by comparison, as you’d expect. Twitter just released the ability to pull all follower ids in one request, which would have made my life easier, but I still can benefit from it going forward. Also, I should have been storing more information about users than just the twitter username. Having to go back and collect that was slow and annoying, but it’s done.
Phase II – Implementing the Algorithm
The algorithm is simple to compute. Check out this gist for a version that calculates it using ActiveRecord. I’d post it here, but WordPress.com sucks and I’m stuck with it. The code uses ActiveRecord more than I’d like, so I rewrote it in SQL using twitter ids. The gist for that is here. The #{p}
and #{self.twitter_id}
are Ruby variables.
Phase III – Doing the Web App
The web app itself is both the most important step and the least fun for me. I very much enjoyed putting together the code to collect the Twitter social graph and then computing the TunkRank scores, but all the nuts and bolts of getting a web app up and running are tedious. Some of it is interesting. Merb isn’t so bad, though I feel like the documentation is shitty. There is an open source Merb book that is missing stuff in all the sections I needed the most. The API documentation isn’t bad, but isn’t easy to search for high level things that you would normally find in a tutorial. Nor should it be — it’s API documentation not a tutorial.
Fortunately, most things were easy enough that I could find a solution eventually. The whole deploying step is foreign to me, and I’m an apache noob so when it comes to balancing mongrel instances I’m like wtf? Fortunately, I found a few tutorials I was able to piece together.
So the final product is hosted on my 1.8 GHz dual core Dell laptop with 2 GB RAM running Ubuntu 8.10. If you check it out, hopefully it won’t overtax my pathetic server and bring the site down. My data is becoming a little stale so if your username isn’t found, please be patient. When a new person is encountered, I queue them for processing.
Final Thoughts
You can also follow @tunkrank on Twitter. I originally had that account acting as a bot that tweets scores when it encounters influential users. Also, I was having it auto-follow anyone it grades, but upon reflection, it occurred to me these two things were just plain spammy. I chalk it up to a bad decision in the dead of night. Instead I will just have it follow anyone who follows it. See my twitter philosophy for how the account will be managed. I will post updates there on changes, fixes, and up/downtime.
The TunkRank score itself can grow quite large, especially for users with a high number of followers. I present percentiles as the measure, so everything falls in the interval [0,100]. That does not properly reflect that someone in the 100th percentile can be almost 1000 times more influential than someone in the 99th. I’m open to suggestions about how better to show this information. Neal Richter had a few good ideas, perhaps I’ll try one of those. Still, though, I’m left feeling a little dissatisfied by all of the scoring mechanisms (my own included). As Neal pointed out, his ideas are starting points and I’d like to hear what other people would like to see before proceeding with a different scoring method.
Let me know what you think.