(Cross-posted on the Google Cloud Platform Blog)
We’ve had a great time giving you our predictions for the World Cup (check out our posts from before the quarter-finals and semi-finals). So far, we’ve gotten 13 of 14 games correct. But this isn’t about us picking winners in World Cup soccer – it’s about what you can do with Google Cloud Platform. Now, we are open-sourcing our prediction model and packaging it up so you can do your own analysis and predictions.
We used Google Cloud Dataflow to ingest raw, touch-by-touch gameplay data from Opta for thousands of soccer matches. The data goes back to the 2006 World Cup and also covers three years of English Barclays Premier League, two seasons of Spanish La Liga, and two seasons of U.S. MLS. We then polished the raw data into predictive statistics using Google BigQuery.
Our prediction for the final
It’s a narrow call, but Germany has the edge: our model gives them a 55% chance of defeating Argentina due to a number of factors. Thus far in the tournament, they’ve had better passing in the attacking half of their field, a higher number of shots (64 vs. 61) and a higher number of goals scored (17 vs. 8).
But 55% is only a small edge. And although we’ve been trumpeting our 13 of 14 record, picking winners isn’t exactly the same as predicting outcomes. If you’d asked us which scenario was more likely, a 7 to 1 win for Germany against Brazil or a 0 to 1 defeat of Germany by Brazil, we wouldn’t have gotten that one quite right.
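To make the difference between picking winners and predicting outcomes concrete, here is a minimal Python sketch that scores the same set of forecasts two ways: by accuracy (did the favored side win?) and by Brier score (how far was the stated probability from what happened?). The forecast numbers are invented purely for illustration and are not outputs of our model, and the Brier score is just one common choice of probability metric, not necessarily the one used in the notebook.

```python
# Hypothetical forecasts: (predicted probability that team A wins,
# 1 if team A actually won, else 0). Invented numbers, for illustration only.
forecasts = [(0.55, 1), (0.60, 1), (0.52, 1), (0.90, 0), (0.70, 1)]


def accuracy(forecasts):
    """Fraction of matches where the favored side (p >= 0.5) actually won."""
    hits = sum(1 for p, won in forecasts if (p >= 0.5) == (won == 1))
    return hits / len(forecasts)


def brier_score(forecasts):
    """Mean squared gap between predicted probability and outcome (lower is better)."""
    return sum((p - won) ** 2 for p, won in forecasts) / len(forecasts)


print(f"accuracy:    {accuracy(forecasts):.2f}")
print(f"Brier score: {brier_score(forecasts):.3f}")
```

In this made-up sample the picks are right four times out of five, yet the Brier score (about 0.30) is worse than the 0.25 that a constant 50% forecast would earn, because of the one confident miss. A good win-loss record and well-calibrated probabilities are separate things.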
(Oh, and we think Brazil has a tiny advantage in the third place game. They may have had a disappointing defeat on Tuesday, but the numbers still look good.)
But don’t take our word for it…
Now it’s your turn to take a stab at predicting. We have provided an IPython notebook that shows exactly how we built our model and used it to predict matches. We had to aggregate the data that we used, so you can’t compute additional statistics from the raw data. However, for the real data geeks, you could try to see how well neural networks can predict the same data or try advanced techniques like principal components analysis. Alternatively, you can try adding your own features like player salaries or team travel distance. We’ve only scratched the surface, and there are lots of other approaches you can take.
You might also try simulating how the USA would have done if they had beaten Belgium. Or how Germany in 2014 would fare against the unstoppable Spanish team of 2010. Or you could figure out whether the USA team is getting better by simulating the 2006 team against the 2010 and 2014 teams.
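If you want a feel for what such an experiment looks like before opening the notebook, here is a rough, standard-library-only Python sketch. It assumes a logistic-regression-style classifier trained on the difference between two teams' aggregated statistics, which may not match what the notebook actually does (the notebook uses pandas and statsmodels). The two features (attacking-half pass completion and shots per game), the training rows, and the team profiles are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.1, epochs=2000, l2=0.01):
    """Fit logistic regression weights by batch gradient descent.

    No intercept is used, so swapping the two teams flips the probability.
    A small L2 penalty keeps the weights finite on tiny, separable data.
    """
    weights = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad[j] += err * xi
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad)]
    return weights


def predict_win_probability(weights, team_a, team_b):
    """Probability that team_a beats team_b, from the difference of their stats."""
    diff = [a - b for a, b in zip(team_a, team_b)]
    return sigmoid(sum(w * d for w, d in zip(weights, diff)))


# Hypothetical training data. Each row is (team A minus team B) for
# [attacking-half pass completion rate, shots per game]; label 1 means A won.
rows = [[0.10, 3.0], [0.05, 1.0], [-0.08, -2.0],
        [-0.02, -4.0], [0.07, 2.0], [-0.10, -1.0]]
labels = [1, 1, 0, 0, 1, 0]

weights = fit_logistic(rows, labels)

# Hypothetical aggregated profiles for a "what if" matchup.
team_a = [0.80, 14.0]
team_b = [0.70, 11.0]
p = predict_win_probability(weights, team_a, team_b)
print(f"P(team A beats team B) = {p:.2f}")
```

To simulate a matchup that never happened, you would swap in the aggregated statistics of the two teams you care about (say, a 2010 squad and a 2014 squad). To try a new feature such as travel distance, you would add one more column to every row and every team profile.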
Here’s how you can do it
We’ve put everything on GitHub. You’ll find the IPython notebook containing all of the code (using pandas and statsmodels) to build the same machine learning models that we’ve used to predict the games so far. We’ve packaged it all up in a Docker container so that you can run your own Google Compute Engine instance to crunch the data. For the most up-to-date step-by-step instructions, check out the readme on GitHub.