In 2010, we had Paul the Octopus. This year, there’s Google Cloud Platform. For the past couple of weeks, we’ve been using Cloud Platform to make predictions for the World Cup: analyzing data, building a statistical model and using machine learning to predict the outcome of each match since the group stage. So far, we’ve gotten 13 out of 14 games correct. With the final ahead this weekend, we’re not only ready to make our prediction, we’re also doing something a little extra for you data geeks out there: giving you the keys to our prediction model so you can build your own model and run your own predictions.
A little background
Using data from Opta covering multiple seasons of professional soccer leagues as well as the group stage of the World Cup, we were able to examine how activity in previous games predicted performance in subsequent ones. We combined this modeling with a power ranking of relative team strength developed by one of our engineers, as well as a metric to stand in for home-team advantage based on fan enthusiasm and the number of fans who had traveled to Brazil. We used a whole bunch of Google Cloud Platform products to build this model, including Google Cloud Dataflow to import all the data and Google BigQuery to analyze it. So far, we’ve only been wrong on one match (we underestimated Germany when they faced France in the quarterfinals).
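For the code-curious, here is a minimal sketch of how signals like these could be folded into a single win probability. It is an illustration only, not the model's actual code: the feature names, the weights, the sample numbers and the choice of a logistic function are all assumptions made for this example.

```python
import math

# Hypothetical weights, picked only for illustration (not taken from the real model).
WEIGHTS = {
    "match_stats": 0.8,      # stand-in for performance in previous games
    "power_rank": 1.5,       # stand-in for relative team strength
    "home_advantage": 0.5,   # stand-in for fan enthusiasm / traveling fans
}


def win_probability(team_a, team_b, weights=WEIGHTS):
    """Return the probability that team_a beats team_b (draws are ignored)."""
    # Weighted sum of the differences between the two teams' features.
    score = sum(w * (team_a[name] - team_b[name]) for name, w in weights.items())
    # The logistic function squashes the score into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-score))


# Made-up feature values for two imaginary teams.
team_a = {"match_stats": 0.62, "power_rank": 0.71, "home_advantage": 0.30}
team_b = {"match_stats": 0.58, "power_rank": 0.66, "home_advantage": 0.35}

p = win_probability(team_a, team_b)
print(f"Team A win probability: {p:.2f}")
```

Because the score is built from differences, swapping the two teams gives the complementary probability, and a team playing itself comes out at exactly 50 percent.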
A narrow win for Germany in the final
Drumroll please… Though we think it’s going to be close, Germany has the edge: our model gives them a 55 percent chance of defeating Argentina. Both teams have had excellent tournaments so far, but the model favors Germany for a number of reasons. Thus far in the tournament, they’ve had better passing in the attacking half of the field, more shots (64 vs. 61) and more goals scored (17 vs. 8).
(Oh, and we think Brazil has a tiny advantage in the third place game. They may have had a disappointing defeat on Tuesday, but their numbers still look good.)
Channel your inner data nerd
Now it’s your turn. We’ve put together a step-by-step guide (warning: code ahead) showing how we built our model and used it for predictions. You could try different statistical techniques or add in your own data, like player salaries or team travel distance. Even though we’ve been right 92.86 percent of the time, we’re sure there’s room for improvement.
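If you want to check that percentage, it is simply 13 correct picks out of 14 games. A quick sketch (the variable names are ours for this example):

```python
# 13 correct predictions out of 14 games, as reported above.
correct, total = 13, 14
accuracy = correct / total
print(f"{accuracy:.2%}")  # prints 92.86%
```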
The model also works for hypothetical matchups, and it includes data going back to the 2006 World Cup, three years of the English Barclays Premier League, two seasons of Spanish La Liga, and two seasons of U.S. MLS. So, you could try modeling how the USA would have done against Argentina if their game against Belgium had gone differently, or pit this year’s German team against the unstoppable Spanish team of 2010. The world (er, dataset) is your oyster.
Ready to kick things off? Read our post on the Cloud Platform blog to learn more (or, if you’re familiar with all the technology, you can jump right over to GitHub and start crunching numbers for yourself).