Computer rankings via Google Prediction API

For those who follow me on Twitter, you may have noticed the tweet where I mentioned I had an idea for a computer rankings system that uses the Google Prediction API.

Well, it turns out that it wasn't all that difficult to implement. Using data and results from the last 8 college football seasons as training data, I used Google Prediction to rank last year's teams. Here is the resulting top 10:

  1) Alabama                          0.9914
  2) Louisiana State                  0.9785
  3) Stanford                         0.9744
  4) Oregon                           0.9701
  5) Oklahoma State                   0.9701
  6) Wisconsin                        0.9700
  7) Boise State                      0.9528
  8) Oklahoma                         0.9274
  9) Houston                          0.9270
 10) Arkansas                         0.9142

It had Virginia Tech at 18th. This is a link to an Excel spreadsheet of the full 2011 rankings (with the teams' AP & Coaches poll rankings noted). 2011-rankings.xls

How does this work? The first thing I did was feed the results of the last 8 seasons into Google Prediction. For each game & team, I gave it the final score differential as well as a number of "features". These features include things like points scored per game vs the opponent's points allowed per game, offensive yards per game vs the opponent's defensive yards allowed per game, the turnover margins of each team, and the winning percentage of each team. (This is a simplified explanation.)
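To make that concrete, here's a rough sketch of what one training row might look like: the target (final score differential) followed by the features. The actual scripts were shell and perl, and the exact stat set and ordering here are my assumptions based on the description above.

```python
# Hypothetical sketch of one training example for the model.
# Feature names and ordering are assumptions, not the author's exact schema.

def training_row(game):
    """Build one training example: the target is the final score
    differential (home - away), followed by the feature values."""
    home, away = game["home"], game["away"]
    features = [
        home["points_per_game"],  away["points_allowed_per_game"],
        home["yards_per_game"],   away["yards_allowed_per_game"],
        home["turnover_margin"],  away["turnover_margin"],
        home["win_pct"],          away["win_pct"],
    ]
    target = game["home_score"] - game["away_score"]
    return [target] + features

# Made-up numbers purely for illustration.
game = {
    "home": {"points_per_game": 34.8, "points_allowed_per_game": 8.2,
             "yards_per_game": 429.6, "yards_allowed_per_game": 183.6,
             "turnover_margin": 0.5, "win_pct": 0.92},
    "away": {"points_per_game": 35.7, "points_allowed_per_game": 11.3,
             "yards_per_game": 437.9, "yards_allowed_per_game": 261.5,
             "turnover_margin": 0.6, "win_pct": 0.92},
    "home_score": 9, "away_score": 6,
}
print(training_row(game))  # target of 3, then the 8 feature values
```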

After feeding Google all of that data and letting it train my model, it was then possible to use that model to predict the final score differential of a match-up between any two teams. To make a prediction, I send it the "features" of any two teams and it spits back a score differential.
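For anyone curious what such a prediction request looks like on the wire, the Prediction API takes the feature values as a `csvInstance` and, for a regression model, returns the prediction in `outputValue`. The model id and feature values below are made up for illustration.

```python
# Roughly the shape of a Prediction API predict request for a
# regression model. Feature values here are invented.

def predict_request(home_features, away_features):
    """Build the JSON body for a trainedmodels.predict call."""
    return {"input": {"csvInstance": home_features + away_features}}

body = predict_request([34.8, 8.2, 429.6, 183.6],
                       [35.7, 11.3, 437.9, 261.5])

# With the official Python client this would be sent via something like:
#   service.trainedmodels().predict(id="cfb-rankings", body=body).execute()
# and the predicted score differential read from response["outputValue"].
print(body["input"]["csvInstance"])
```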

In order to come up with rankings, I had Google predict the outcome of over 14 thousand games - as if every team played every other team twice (once at home, once on the road) - and the number next to each team in the rankings above is its winning percentage across those simulated games.
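The round-robin step looks roughly like the sketch below. With ~120 FBS teams, every ordered (home, away) pair gives 120 × 119 = 14,280 simulated games. The `predict_differential` function here is a toy stand-in for the actual Prediction API call, and the team "strengths" are invented.

```python
from itertools import permutations

# Toy stand-in for the Google Prediction call: a single "strength"
# number per team plus a small home-field edge. Values are invented.
strengths = {"Alabama": 30.0, "LSU": 27.0, "Stanford": 24.0, "Kansas": 5.0}

def predict_differential(home, away):
    """Predicted score differential (home - away)."""
    return strengths[home] - strengths[away] + 1.0  # +1.0 = home edge

wins = {t: 0 for t in strengths}
games = 0
for home, away in permutations(strengths, 2):  # every ordered pair
    games += 1
    winner = home if predict_differential(home, away) > 0 else away
    wins[winner] += 1

# Each team plays 2 * (n - 1) games; rank by winning percentage.
per_team = 2 * (len(strengths) - 1)
rankings = sorted(strengths, key=lambda t: wins[t] / per_team, reverse=True)
print(rankings)  # strongest to weakest
```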

The results are not perfect, but they're definitely not terrible, either. I was actually pretty surprised by just how decent they turned out. I think I've spent 10 hours working on this, and I've managed to put together a computer ranking system that's as good as any out there. Thanks Google!

What's the future of this thing? I have a ton of code clean-up to do (this is a total mash up of shell scripts and perl scripts). I'd love to add more to the data model (annual football revenue, for one). I think 4 weeks into the 2012 season I'll start generating rankings, as well as use it to predict outcomes of games.




Nice work.

Big fan of simulations, and I could totally geek out on this. It occurred to me that perhaps you simulated voter bias...not sure if that was your intent or not.

I bet there is a goldilocks zone for the number of seasons entered (you used 8) where the accuracy of predicting pre-season polls peaks. I would be curious about accuracy as a function of seasons entered. That goldilocks number would correspond to the collective memory of voters, and then maybe we would see some outliers like Houston disappear.

Anyway, cool stuff!

Voter bias

I'm not sure how I simulated voter bias (nor was that my intent). I was simply wondering if I could use Google Prediction to rank college teams. Also, the "features" I'm using are mostly performance traits (points scored, points allowed, yards per game, penalty yards, turnover margin, etc), so it should be a pretty unbiased model.

Also, the rankings I attached are purely based on the 2011 season (I did not intend them to have anything to do with 2012), and the AP & Coaches poll rankings in the spreadsheet are from the final polls after last season.

I didn't see where you said

"last year's teams"...I thought you were trying to predict pre-season polls at first.

At any rate, what I meant is that "Past performance does not guarantee future results" is the footnote at the bottom of a mutual fund prospectus, but it should be at the bottom of the pre-season polls too. If you were trying to predict pre-season polls, the most relevant performance data was from last season, with each subsequent season counting less and less in step with player turnover.

Once you get completely beyond the current incoming senior class, say around 4 seasons ago, all that performance data is relevant only in an institutional sense, i.e., coaching staff experience, athletic department revenue, etc. 8 years does seem a reasonable time horizon to factor in some of these things, and looking at last year's final results, I do agree with your approach.

Go out even farther (probably way beyond 8 years) and you are starting to allow performance factors that were created under a vastly different set of rules to disproportionately impact your current poll. So my point was that the human brain does this at a subconscious level in some way, allowing teams that have "tradition" to creep into a pre-season poll by including all those other years. I bet Google could be used with a sufficiently large data set and proper weightings to predict pre-season polls.

Sorry if it was unclear what these rankings actually were. :-)

Interesting ideas about predicting pre-season polls. That's not necessarily what I'm interested in going after, but I don't see why Google Prediction (or even any other machine-learning/decisioning suite) couldn't be used to attempt to predict human polls, as well.

What I'm really interested to see is how accurate it is in predicting games this season after about 4 or 5 weeks worth of games have been played.

Thanks for the comments!

no prob it was my bad

Yeah, can't wait to see how that goes after a few weeks!

Quick follow-up for those interested

I finally revamped all of the work I did and just ran a test to see if this thing is worth pursuing.

Using the data and results from seasons 2004 thru 2010, I simulated the 2011 season and compared it to the actual results. It correctly predicted the winners in 521 out of 680 regular season games (76.6% accuracy), and it correctly predicted the winners in 29 of the 35 bowl games (82.9% accuracy).

I'm looking forward to seeing how this thing does this season!

Just wondering

Why can't you use all of that previous years' info to predict week one?

Are you waiting until week 5ish to gather enough relevant info?

Sounds really interesting, just wondering what to expect

Yup, I can ...

Yeah, I could use the stats from last year to predict the week 1 games, and I think I probably will just to check it out.

I'll probably simulate all of this year's games from the get-go, and start blending in this year's stats in such a way that by week 5 I'll totally be relying on this year's stats.
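The blending described above could be as simple as a week-indexed weight on this year's stats, ramping from 0% in week 1 to 100% by week 5. The linear ramp below is my assumption, not the author's actual formula.

```python
# Hypothetical linear blend of last year's and this year's stats.
# By week 5 the model relies entirely on the current season.

def blend_weight(week, full_weight_week=5):
    """Fraction of this year's stats to use in a given week."""
    return min(1.0, max(0.0, (week - 1) / (full_weight_week - 1)))

def blended_stat(last_year, this_year, week):
    w = blend_weight(week)
    return (1 - w) * last_year + w * this_year

# Week 1 uses only last year's number; week 5+ only this year's.
print(blended_stat(28.0, 36.0, 1))  # 28.0
print(blended_stat(28.0, 36.0, 3))  # 32.0
print(blended_stat(28.0, 36.0, 5))  # 36.0
```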

PS: Just realized it correctly predicted last year's Sugar Bowl

Shit, I just realized that it correctly predicted the Sugar Bowl.

It predicted/simulated a 2-point Michigan win. :-(