Brad DeLong's Weblog Archive Page


July 31, 2007

Tom Slee on Distributed Collaborative Filtering: The Netflix Prize: 300 Days Later

A very interesting experiment, on lots of levels:

Whimsley: The Netflix Prize: 300 Days Later: Online DVD rental outfit Netflix caused a real buzz last October when it announced the competition. If anyone can come up with a recommender system for predicting customer DVD preferences that beats its own algorithm (Cinematch) by a certain amount, Netflix will hand over $1 million. The prize got a lot of attention because it exemplifies the idea of crowdsourcing. Not only does Netflix rely on crowdsourcing of DVD ratings (user ratings of DVD titles) but the competition itself is an attempt to use crowdsourcing to develop the algorithms to make the most of those ratings. Instead of doing the work itself, or hiring specialists, Netflix lets anyone enter its competition and pays the winner...

[...]

As soon as you start looking at the data set it becomes obvious why it is so difficult to get good results. Databases don't have the linear algebra and other mathematical tools for taking a run at the prize but they are convenient for exploring data sets, so I loaded the data into a SQL Anywhere database (The developer edition is a free download, and I'll provide a perl script to load the data if you really want it) and started poking around. Here are a few of the more obvious oddities (all these observations have been posted elsewhere - see the Netflix prize forum for more):

  • Customer 2170930 has rated 1963 titles and given each and every one a rating of one (very bad). You would think they would have cancelled their subscription by now.
  • Five customers have rated over 10,000 of the 17,770 titles selected - and presumably they also have rated some of the others among the 60,000 or so titles Netflix had available when they released the ratings. Are these real people?
  • Customer 305344 had rated 17654 titles. Even though Netflix make it easy to rate titles that you have not rented from them (so they can get a handle on your preferences) can this be real?
  • Customer 1664010 rated 5446 titles in a single day (October 12, 2005).
  • Customer 2270619 has rated 1975 titles. 1931 were given a 5, 31 were given a 4, 10 given a 3, 2 given a 2 (Grumpy Old Men and Sex In Chains) and a single title was given a 1. That title? Gandhi, which has an average rating of over 4 and which less than 2% of those who watch it give a 1.
  • The most often rated movie? Miss Congeniality with ratings by over 232,000 of the 480,000 customers. And which title is most similar to it in terms of ratings (using a slightly weighted Pearson formula)? Bloodfist 5: Human Target.
  • Most highly rated - Lord of the Rings: Return of the King (Extended Edition), with 4.7...
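Checks like the ones in the list above are easy to reproduce outside a database. The following is a minimal Python sketch, not the script used in the original post; it assumes the ratings have already been loaded as `(customer_id, movie_id, rating, date)` tuples, and the function name `find_oddities` and the `min_count` threshold are made up for illustration.

```python
from collections import Counter, defaultdict


def find_oddities(ratings, min_count=3):
    """Flag two kinds of oddity in an iterable of
    (customer_id, movie_id, rating, date) tuples (an assumed layout).

    Returns (constant_raters, busiest_day):
      constant_raters: sorted customer ids with at least min_count
                       ratings that are all the same value
      busiest_day:     ((customer_id, date), count) for the largest
                       number of ratings by one customer in one day,
                       or None if there are no ratings
    """
    values = defaultdict(list)   # customer -> list of ratings given
    per_day = Counter()          # (customer, date) -> ratings that day
    for customer, _movie, rating, date in ratings:
        values[customer].append(rating)
        per_day[(customer, date)] += 1

    constant_raters = sorted(
        c for c, v in values.items()
        if len(v) >= min_count and len(set(v)) == 1
    )
    busiest_day = per_day.most_common(1)[0] if per_day else None
    return constant_raters, busiest_day
```

On the real data set one would raise `min_count` considerably, since a customer with three identical ratings is not unusual while one with 1963 identical ratings is.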

[...]

So what I get from the Netflix prize is that there are probably significant limits to recommender systems. Even the smartest don't do a whole lot better than the simple approaches, and a lot of work is required to eke out even a little more actual information from the morass of data. It seems surprisingly difficult to get reliable, factual information on this important question of how useful they can be. Part of the reason is that they are new - Amazon has only been in business for about ten years after all - and part of the reason is that the behaviour of these systems is often a closely guarded secret despite the aura of openness that web companies cultivate.

This matters because there is a surprising amount riding on the effectiveness of recommender systems. Silicon Valley's new-economy enthusiasts see them as the key to developing a new level of cultural democracy: they see recommender systems as a trebuchet hurling rocks at the castles of the old elite of mainstream media, big publishers with big marketing departments, big-chain book stores and Hollywood sequels. Recommender systems are claimed to embody the "wisdom of crowds". The idea is that everyone just publishes stuff (blogs, wikipedia entries and so on) and amateur readers or viewers decide what has merit by their actions (rating stories, buying and rating books and DVDs and so on). The work of critics is "crowdsourced" to customers, but it is the recommender system that distills these ratings to yield the aforementioned wisdom.

If faith in recommender systems is misplaced, then the new boss may look much like the old boss only with more computer hardware. There is a danger that recommender systems may simply magnify the popularity of whatever is currently hot - that they may just amplify the voice of marketing machines rather than reveal previously-hidden gems. Even worse, their presence may drive out other sources of cultural diversity (small bookstores, independent music labels, libraries) concentrating the rewards of cultural production in fewer hands than ever and leading us to a more homogeneous, winner-take-all culture.

I'm no futurist, but I see little evidence from the first 300 days of the Netflix Prize that recommender systems are the magic ingredient that will reveal the wisdom of crowds.

Comments

I did some basic research on pattern matching algorithms a long time ago. One of the things that emerged is that there are clusters of similarity that can be very small but highly relevant.

In my case I was using scientific research papers, where the audience is small to begin with. For a given subspecialty there may be only 20-50 people interested, but every time a new paper appears in this area they want to see it. The trick is to find the factors that identify the area.

In the entertainment market I expect that a similar situation exists especially for the less popular items. The problem is that the relevant data may be of a nature not usually available by looking at the listing. The obvious items like cast members and director are usually listed, but how does one capture plot type or setting or other hard to quantify measures?

Assuming that people will distill this for you in a simple rating system is unrealistic.

I wonder what Hal Varian would say about this?

If someone can create a decent recommender algorithm and gives it to Netflix for the chance of getting a $1M prize s/he's an idiot.

The descriptions of the customers' recommendations would seem to make that particular set of data pretty much useless. Does the data set include information on the movies that customers actually rented? Well, given that it's Netflix, movies that customers allowed to reach the top of their queue and have delivered? It would seem that an "I watched Gandhi" is more relevant than "I rated Gandhi highly even though I haven't watched it".

How widespread is the case of our household, where there is a single account used by two individuals with radically different taste in movies?

I am unimpressed with both Amazon's and Netflix's recommendations. Taking Amazon, the first thing they recommend is all the other titles by the same author. Well, no duh, I had already thought of that. And next you get stuff in exactly the same genre, even if the writing isn't anywhere near as good and the mood is completely different. And then you get books bought by the same people who bought your book. This is a bit better, but they might like my book for a completely different reason than I do.

I want recommendations based on a bunch of books I like, not just that one book. I want a system that sees that I like Jane Austen and Terry Pratchett and understands that I want humor and irony and excellent writing, not generic regency romances (or "classics") or fantasy.

Brad says,
"There is a danger that recommender systems may simply magnify the popularity of whatever is currently hot"

This reminds me of a retired economist's reflections. Looking back at his and others' economic publications, he proposed that most relied on problematic data, particularly autocorrelations, that couldn't have produced believable results (could only produce more hypotheses).
Hmmm, enough of economists' gloom.
Onto statisticians,
who might say that a nation of 300 million people is an enormous sample size, both for macro-economic data and for song preferences.
One would expect either numerous correct conclusions can be made, or new/difficult statistics must be invoked using a scarce skill.

"magnify the popularity of whatever is currently hot"
would happen with a recommendation system that attended more to quantity than quality.
However, from a Bayesian viewpoint, a larger quantity of ratings for one DVD would reduce the estimate's variability, but not its probability of being a good DVD for a Netflix onscreen recommendation.
Thus, done well, a recommendation system would present new artists (or DVDs) to the world.
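One common way to express this idea in code is a "damped" or shrunk mean: each title starts from a prior average, and its own ratings pull the estimate away from that prior as they accumulate. The sketch below is only an illustration of that general technique, not anything Netflix is known to use; `prior_mean` and `prior_weight` are made-up parameters that would have to be tuned on real data.

```python
def shrunk_mean(ratings, prior_mean=3.6, prior_weight=25):
    """Estimate a title's quality as a weighted blend of a prior
    average and the title's own ratings.

    prior_weight acts like a number of imaginary ratings at
    prior_mean. With few real ratings the estimate stays close to
    the prior; with many it approaches the plain average. The number
    of ratings changes how much we trust the average, not the
    direction in which it points.
    """
    n = len(ratings)
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + n)
```

Under this scheme a little-known title with a handful of excellent ratings is not buried, but it also cannot outrank a widely rated favourite until the evidence for it builds up.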

Presenting new artists could work very well with music, since individuals often make songs while corporations make DVDs.
Presenting new artists and only legal music was the goal of
Irate Radio
whose problems rating songs can be seen at
http://www.goingware.com/iRATE-radio/

Pie-in-the-sky solutions might seem incalculable when you have millions of Netflix users and tens of thousands of DVDs.
In the Naive Bayesian approach (probably the most effective method, mentioned by Paul Graham) to reducing email spam,
not even correlations get calculated.
However, correlations get implicitly included thru the probably best-of-the-Naive-Bayes approaches,
http://www.paulgraham.com/wsy.html
where Paul Graham mentions Bill Yerazunis' CRM114 99.87% accuracy (very few false positives and very few false negatives).
One option with CRM114 includes 5 adjacent "words" at a time as 1 word, so correlations between words get implicitly incorporated.
This alleviates the need to filter documents/email; eg, CRM114 does amazingly well with very few words entered into wordy varied long forms from various automobile manufacturers -- to identify auto safety problems.
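The idea of treating several adjacent words as a single feature can be sketched in a few lines of Python. This is a plain sliding window of word n-grams and is not CRM114's actual feature generator, which is more elaborate; the function name `window_features` is hypothetical.

```python
def window_features(text, size=5):
    """Return every run of `size` adjacent words in `text` as one
    feature string. Texts shorter than `size` words yield a single
    feature containing all of their words; empty text yields none.
    """
    words = text.split()
    if len(words) < size:
        return [" ".join(words)] if words else []
    return [
        " ".join(words[i:i + size])
        for i in range(len(words) - size + 1)
    ]
```

Because each feature carries its neighbouring words along with it, a classifier that counts these features picks up word co-occurrence without ever computing an explicit correlation.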

If one attempted to include 100,000 DVDs in a correlation matrix, one must consider 10 billion entries in a correlation matrix -- too large.
However, if one forms every combination of 100,000 DVDs taken 2 at a time, one still has that large number of items, but most combinations have no actual data (no user has rated both DVDs), so you end up not with a list of 100,000^2 entries,
but with something closer to 100,000*2 or *3 (growing linearly rather than quadratically with the number of DVDs). One can then apply many techniques that would probably produce good results.
Your programming would probably use a "hash" (sometimes called an associative array) to directly access each real pair-combination of DVDs.
It's surprising how few programming languages include hashes (Perl does).
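The hash-based approach described above might look like the following in Python, where a `dict` plays the role of the hash. This is a minimal sketch under the assumption that the data is available as a mapping from customer to the titles that customer rated; the name `co_rating_counts` is invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def co_rating_counts(ratings_by_customer):
    """Count, for each pair of titles, how many customers rated both.

    ratings_by_customer maps a customer id to an iterable of title
    ids (an assumed input layout). Only pairs that actually occur
    together are stored, so the result is sparse. Each pair is keyed
    as a sorted tuple (a, b) with a < b.
    """
    pairs = defaultdict(int)
    for titles in ratings_by_customer.values():
        for a, b in combinations(sorted(set(titles)), 2):
            pairs[(a, b)] += 1
    return dict(pairs)
```

One caveat: a customer who rated k titles contributes k(k-1)/2 pairs, so the handful of customers with thousands of ratings would dominate both the running time and the table size unless they are capped or sampled.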

This is an important problem to solve not for Netflix, but for unknown song artists.
Two years ago, when I presented this problem to a statistics professor at George Washington University,
he considered me daft for considering such a problem.

The real problem Netflix has is simple. There is no incentive for a recommender to be accurate in their recommendations.

They need to take a page from Leslie Fine and some other people at Hewlett-Packard and make a mechanism where recommenders get points for being correct about what other people will say other people will like.

Amazon's book recommendation system works on the book level. Amazon determines what other books were also purchased by the people who bought a particular book. Books that are not widely popular are given more weight. They don't factor in the user rankings that much. People usually like the books that they buy. The overall customer recommendations are just selections from the book level recommendations.

I am pretty happy with Amazon's recommendations, but they don't do anything fancy.

The benefit to the recommender is the naive hope that a more accurate rating will bring them better movie choices.

"I am unimpressed with both Amazon's and Netflix' recommendations. Taking Amazon, the first thing they recommend is all the other titles by the same author. Well, no duh, I had already thought of that. And next you get stuff in exactly the same genre, even if the writing isn't anywhere near as good and the mood is completely different. And then you get books bought by the same people who bought your book. This is a bit better, but they might like my book for a completely different reason than I do."

Hmm.
I have found Amazon's EXPLICIT recommendations (ie the stuff you get when you go to their recommendations page for you) useless. I have also found their recommendations for CDs and movies useless.
I HAVE found the "incidental" recommendations; the stuff that appears as small items peppered all over the page I am currently looking at, and on the page that appears after I order something or add it to my wishlist, to be remarkably useful.
I wonder about the extent to which this reflects the fact that most of what I buy on Amazon is non-fiction, and somewhat minority taste non-fiction (ie I'm not buying some celebrity's autobiography or presidential candidates' plans for how to change the world). Perhaps once items are popular enough, they appeal to a large enough segment of society that their presence is really telling you nothing about what is unique to the tastes of this particular person. Which raises the interesting question of whether you could generate a recommendation service that more people find delightful through the rather counterintuitive method of stripping out from your collection anything that becomes too popular. Hah --- Netflix --- you owe me a million dollars.

It is pretty tough to create a good recommender system from a set of static data. But imagine you could change your recommendation system every hour, try it out on thousands of customers, and see what worked and what didn't.

Trying to solve the recommendation problem based on a dataset is thinking like an economist. A learning system with trial and error feedback is thinking like a computer programmer. ;-)

What would you suggest they do instead, ogmb?

Kevin, such systems are for producing mediocre results when you don't really know what you're doing.

"The descriptions of the customers' recommendations would seem to make that particular set of data pretty much useless. "

Posted by: Michael Cain

No, just hard to work with using classical statistical methods. I wouldn't be surprised if looking at the bulk of customers (i.e., those with no more than a few dozen ratings) would purge a lot of the oddities.

A reference which might be of interest:
http://webuser.bus.umich.edu/feinf/research/Published/Ying,_Feinberg,_Wedel_(2006)_-_Leveraging_Missing_Ratings_to_Improve_Online_Recommendation_Systems.pdf

As Bhauth said, it's useless to try to predict the preferences of outliers-- frivolous people who rate everything as bad, or rate randomly. So any algorithm ought to start by cleaning the data-- throwing out data the algorithm should ignore. (Or, make a second algorithm for that, with, for example, random ratings).

I like the idea of this competition. I'm tempted myself to try to raise that R2.

The biggest problem that I have had with Amazon recommendations is that I already own them (through another vendor). The recommendations are spot on, but I have to click the "I own it" box to get new ones.
