I decided to create a recommendation engine for board games, using data scraped from boardgamegeek.com. My code is available on GitHub, but be warned—I managed to get my IP blocked from BGG while collecting data. In my defense:
My apologies to the Board Game Geek Powers That Be for my insolent behavior. I promise to never do it again. Or I’ll at least do it (even) slower.
Before I was blocked, I managed to scrape the ratings lists from 1,700 users who had rated at least one game. Over 1,000 users had rated at least 10 games. There are almost l2,000 games in the database, although I only used the 1,000 most-rated games for the user similarity measure (described below). All the web scraping code is in Python 3, and I also used Python to parse the XML and write the matrix of ratings to HDF5.
A recommendation engine needs to do two things:
However, the problem can be considered more simply as predicting the rating a user would give to a game they’ve not yet rated. Once you’ve made those predictions, you can simply recommend the games with the highest predicted ratings.
I followed the outline described in this post, and implemented the prediction engine in R. Users are represented as a vector of ratings, and the similarity of two users is the cosine between their vectors. Then, a rating is predicted by taking the average of other users’ ratings, weighted by their similarity with the user in question.
The issue of missing data is not addressed in the post linked above. I wrote a modified cosine function to handle NA values. By default, the cosine function returns NA if any of the inputs are NA, but I wanted it to project onto the largest subspace with no missing data for a given pair of users, and calculate the cosine of the projected vectors.
For testing purposes, my recommendation function predicts ratings for all games for a given user. It ignores the user’s true ratings in the process, which has no effect on making recommendations, since you wouldn’t recommend a game the user has already rated, but allows honest comparisons between true and predicted ratings for testing purposes.
Here are the 20 games with the highest predicted rating for a particular user, along with the mean and standard deviation of their ratings.
> r[1:20,] game predictedRating rating 1 21 8.248609 8.0 2 91 8.132131 9.0 3 6 8.110919 8.5 4 59 8.105064 8.5 5 4 8.026672 9.0 6 7 7.979619 9.0 7 36 7.939304 9.0 8 48 7.935049 9.0 9 52 7.922401 8.0 10 29 7.888088 9.5 11 45 7.859293 9.0 12 62 7.833052 9.0 13 153 7.832915 9.0 14 19 7.832199 7.5 15 492 7.816715 NaN 16 80 7.812988 NaN 17 81 7.763401 9.0 18 8 7.755711 9.0 19 139 7.750971 8.5 20 132 7.749858 NaN > mean(r$rating[!is.na(r$rating)])  6.914315 > var(r$rating[!is.na(r$rating)])^0.5  1.489091
The list is sorted by predicted rating, so it’s satisfying to see high true ratings at the top of the list.
There are a couple things I’d like to do.