Soccer Match prediction

This tutorial explains how to integrate with DataRobot programmatically. Check out this basic soccer match app that was created using this tutorial. You can also explore the code that runs the app.

Prerequisites

Goal

I'm a soccer fan. I love cheering for my favorite team, Werder Bremen, bright and early on Saturday mornings. Since we have a big list of soccer matches, wouldn't it be cool to show them all to an AI, and then ask the AI what the score would be between any two teams? Yes, it would be. So that's what we're going to do.

Our goal?
Predict the score of a game between any two soccer teams.

To achieve our goal, we need to do the following:

  • Identify the features that we want to keep and discard
  • Discard unwanted features
  • Prepare data for training
  • Train our AI
  • Make a prediction

Dealing with the data

What's in the data?

First, make sure you’ve downloaded the latest data from: https://projects.fivethirtyeight.com/soccer-api/club/spi_matches.csv

Let's go through the fields one by one:

Field ("Feature")

Description

date

On what date did the game fall? I don't think the accuracy of this number is super important, so I'm not going to worry about questions like, "What time zone did we measure this date in?" The general trend of dates may be meaningful, but being one day off will probably not affect the modeling.

league_id

A code that represents the league that the game was played in. Some teams can play in multiple leagues, but we have the name of the league as well, so this is duplicative.

league

The human-readable name of the league.

team1

One of the teams that played. I don't think there's any meaning between 1 and 2. Perhaps it might indicate a home field advantage, but since I don’t know, I'm going to ignore that. If we did find that it indicated home field advantage, we would want to add this as a field later when we’re preparing the data for learning.

team2

The other team that played.

spi1

The "soccer power index" (SPI) of team1. This is basically a measure of how good a team is. Note that this is the value of the index at the time of the match; the index changes over time.

spi2

The "soccer power index" of team2.

prob1

Probability that team1 will win. We will be throwing this out because we can make our own predictions, thank you very much.

prob2

Probability that team2 will win. We will also be throwing this value out.

probtie

Probability that the match will end in a tie.

proj_score1

Projected score of team1. Like prob1, we will be throwing this out because we can make our own predictions.

proj_score2

Projected score of team2. We will also be throwing this out.

importance1

This is a measure of the importance of the match for team1. There are certain matches that can be of greater or lesser importance for teams, depending on their standings. For example, if you are really close to being in 3rd place, but are currently in 4th, this may be a very important game for you. I’ll be throwing this out because I don’t have a good way of deriving this for an arbitrary game. You may choose to leave it in if you can come up with a way to estimate it, though.

importance2

The importance of the match for team2. I’ll be removing this as well.

score1

The final score of team1 (the thing we will be trying to predict).

score2

The final score of team2 (the other thing we’re trying to predict).

xg1

Expected goals of team1. This is a measure of the expected number of goals based on the total number of shots, location of the shots, etc. for team1. If, for example, someone was in front of an open goal, and they sliced the ball, this would likely be counted as an expected goal, but of course would not be an actual goal. We are going to throw this data out because we cannot know this before the match. This is called target leakage, and is a very common issue with datasets.

xg2

Expected goals of team2.

nsxg1

Non-shot expected goals. This is a measure of the expected probability of a goal given the state of the field. For example, if the ball is intercepted directly in front of the goal, but a shot is not taken, then a small value will be added here. This is data that we can't know before a match, so I'll be throwing it out.

nsxg2

Non-shot expected goals for team2.

adj_score1

This is the score given to team1 after the game is finished plus style points. We will be throwing this out because we can't measure it before the match has ended.

adj_score2

Adjusted score for team2. So as with adj_score1, I'll be throwing this out.

Pruning the data

As you saw in the previous section, we want to pull out data we don’t need. Let’s look at why.

We’ll get rid of xg1, xg2, nsxg1, nsxg2, adj_score1, and adj_score2 because these fields hold data that we can only know after a match has finished. Since we're trying to predict a match before it happens, it doesn't make sense to teach our AI about them. (In data science terms, including data like this leads to "target leakage." You can read about it here, if interested.)

We’ll also get rid of prob1, prob2, probtie, proj_score1, and proj_score2 because we can derive all these values in our predictions.

We will also be getting rid of importance1 and importance2 because I don't know how I would calculate them for an arbitrary match between two teams.

Finally, we will be removing the league_id field because it duplicates the league field. This isn't strictly necessary, but will save us some typing.

So, the data we have left is:

  • date
  • league
  • team1
  • team2
  • spi1
  • spi2
  • score1
  • score2

For now, save this data as a file named clean_matches.csv. I did this by opening up the soccer match CSV in Excel and deleting all the columns except the ones in the list above. I also removed a couple thousand matches at the end of the file which have not happened yet.
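If you'd rather not do this step by hand in a spreadsheet, here is a minimal Python sketch of the same pruning. It assumes the downloaded file uses exactly the column names listed above and that matches that have not been played yet have empty score fields; the function name prune_matches is my own.

```python
import csv

# Columns we decided to keep, in the order the later steps expect.
KEEP = ["date", "league", "team1", "team2", "spi1", "spi2", "score1", "score2"]


def prune_matches(source, dest):
    """Copy only the KEEP columns from source to dest.

    Rows without a final score are skipped (assumption: unplayed
    matches have empty score1/score2 fields). Returns the number
    of rows written.
    """
    reader = csv.DictReader(source)
    # extrasaction="ignore" silently drops every column not in KEEP.
    writer = csv.DictWriter(dest, fieldnames=KEEP, extrasaction="ignore")
    writer.writeheader()
    kept = 0
    for row in reader:
        if not row["score1"] or not row["score2"]:
            continue
        writer.writerow(row)
        kept += 1
    return kept
```

To use it, open spi_matches.csv for reading and clean_matches.csv for writing (both with newline='') and pass the two file objects to prune_matches.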

What we want to predict is score1 and score2. However, AIs can only predict one thing. They're smart, but very focused. So, we’re going to have to massage the data a bit so that we can ask one question at a time.

Data transformation

The first thing we need to do is get the data into a structure that allows us to predict one thing at a time. Well, predicting the score of a game really only requires us to predict the score of each team and then put them together, so we can treat each game as two data points. We can ask the AI to predict the score for team1 when it is playing against team2, and also predict the score for team2 when it is playing against team1.

In order to accomplish this, we only need to have one score field, but for clarity, let's leave it named score1. We will need to duplicate every row, and reverse the 1 and 2 values on the duplicated rows.

I whipped up a quick script to do that. I happen to like Ruby, so this code is in Ruby, but it could just as easily be written in any other language.

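require 'csv'

# Assumes clean_matches.csv has its columns in this order:
# date, league, team1, team2, spi1, spi2, score1, score2
matches = CSV.read('clean_matches.csv')
header = matches.shift # remove the header row and keep it for the output
header.pop # drop the score2 column name; every output row has only score1

CSV.open('duplicated_matches.csv', 'w') do |csv|
  csv << header
  matches.each do |row|
    # The match from team1's point of view
    csv << [row[0], row[1], row[2], row[3], row[4], row[5], row[6]]
    # The same match from team2's point of view: teams, SPIs, and score swapped
    csv << [row[0], row[1], row[3], row[2], row[5], row[4], row[7]]
  end
end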

Now we have a file called duplicated_matches.csv that has two rows for each game and predicts only one score.
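Since the rest of this tutorial is in Python, here is a rough Python sketch of the same row-duplication step. It assumes each row is a list in the column order date, league, team1, team2, spi1, spi2, score1, score2; the function name duplicate_match is just for illustration.

```python
def duplicate_match(row):
    """Return two training rows for one match, one per team's point of view.

    Each output row has a single score column: the goals scored by
    whichever team is listed first.
    """
    date, league, team1, team2, spi1, spi2, score1, score2 = row
    return [
        [date, league, team1, team2, spi1, spi2, score1],
        # Swap the teams and their SPI values, and use the other score.
        [date, league, team2, team1, spi2, spi1, score2],
    ]
```

Applying this to every data row of clean_matches.csv (and dropping score2 from the header) should produce the same file as the Ruby script.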

Learning on the data

First, let's make sure that we have a valid client with our key. You can obtain your API key as explained here.

Now we can train our AI. Make sure to change the path to the file if it is not in the current directory:

import os
import datarobot as dr

dr.Client(token=os.getenv("DATAROBOT_API_KEY"), endpoint='https://app.datarobot.com/api/v2')

project = dr.Project.start(project_name='Match Predictor',
                           sourcedata='./duplicated_matches.csv',
                           target='score1')

If you navigate to the DataRobot application, you will now be able to see your project in the project dropdown.

You can now watch as DataRobot trains lots of models against this data in order to pick the best one. Now would be a good time to grab a coffee.

In the code above, we're telling the AI to learn how to predict the column score1 using all the fields available to it. This is why it was important for us to get rid of fields that could not be known before the game had started. The AI would learn, for example, that there is a very high correlation between adj_score1 and score1, but this isn't a useful property for the AI to learn. DataRobot has some clever tools for catching errors like this, but it’s better to just make the data as clean as possible.
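If you want a quick sanity check of your own before uploading, one simple idea is to measure how strongly each numeric column correlates with the target. The sketch below is not how DataRobot detects leakage; it is just a plain Pearson correlation, and the 0.9 threshold and function names are arbitrary choices of mine.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers.

    Raises ZeroDivisionError if either list has no variation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def suspicious_columns(columns, target, threshold=0.9):
    """Names of columns whose absolute correlation with target is >= threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]
```

A column that tracks the target almost perfectly is worth a second look: ask yourself whether you could really know its value before the match starts.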

Getting answers from our AI

Now for the fun part. We finally get to ask our AI which teams will win.

Let's make up a game between the two sides of my family. My family is split between Hamburg and Bremen. Unfortunately, the Hamburg team was relegated to the second league last season; given this move, we would expect that Bremen will probably do better than Hamburg. However, Bremen isn’t having a very good season, and may be relegated as well.

model = project.get_models()[0]
prediction_server = dr.PredictionServer.list()[0]

deployment = dr.Deployment.create_from_learning_model(
    model.id, label='Match Predictor Deployment', description='For making predictions of match outcomes',
    default_prediction_server_id=prediction_server.id)

Because we are always making the AIs smarter, your scores may vary a bit from the ones I got, but they should follow the same general trend/pattern.

You may be wondering where I got my spi numbers from. I got these values from another dataset called spi_global_rankings.csv. If you wanted a better prediction, you might want to find the spi for the team once the lineup is known, as this affects the spi.

import os
import requests
import json
from pprint import pprint

matches = [
    {'date': '2019-08-26', 'league': 'German Bundesliga', 'team1': 'Hamburg SV', 'team2': 'Werder Bremen', 'spi1': '45.57', 'spi2': '61.41'},
    {'date': '2019-08-26', 'league': 'German Bundesliga', 'team1': 'Werder Bremen', 'team2': 'Hamburg SV', 'spi1': '61.41', 'spi2': '45.57'}
]

token = os.getenv("DATAROBOT_API_KEY")

prediction_headers = {
    "Authorization": "Bearer {}".format(token),
    "Content-Type": "application/json",
    "datarobot-key": os.getenv("DATAROBOT_SERVER_KEY"),
}

deployment_id = "your-deployment-id"
server_url = "your-server-url"

predictions = requests.post(
    "{server_url}/predApi/v1.0/deployments/{deployment_id}/predictions".format(server_url=server_url, deployment_id=deployment_id),
    headers=prediction_headers,
    data=json.dumps(matches),
)
pprint(predictions.json())

{'data': [{'prediction': 1.0655537286,
           'predictionValues': [{'label': 'score1', 'value': 1.0655537286}],
           'rowId': 0},
          {'prediction': 1.5037374286,
           'predictionValues': [{'label': 'score1', 'value': 1.5037374286}],
           'rowId': 1}]}
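The raw response is a little noisy, so here is a small sketch that pairs each predicted value with the team it belongs to. It assumes the response has the shape printed above, that rowId matches the position of the row in the request, and that each team appears as team1 exactly once; the helper name scoreline is mine.

```python
def scoreline(matches, response):
    """Map each row's team1 to its predicted goals, rounded to one decimal."""
    # Index the predictions by rowId so we don't depend on response order.
    by_row = {item["rowId"]: item["prediction"] for item in response["data"]}
    return {
        match["team1"]: round(by_row[index], 1)
        for index, match in enumerate(matches)
    }
```

Calling scoreline(matches, predictions.json()) with the request and response from this section gives one predicted score per team.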

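Because I want to know how both teams are expected to score, the request includes the match-up twice, once with each team as team1. Row 0 is the number of goals Hamburg is expected to score, and row 1 is the number Bremen is expected to score.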

The scores that I got were 1.5 Bremen - 1.1 Hamburg. This result seems reasonable. If someone told me that Bremen beat Hamburg 2-1, I would say that sounds probable. I suspect there's an issue here with our AI being overly optimistic about goal scoring on the losing team. Because there are no values below 0 in a soccer match, the AI will tend to go with something higher. In this case, I might also expect a 3-0 game (if we just use the goal differential) or a 4-0 game (if the AI is overly optimistic about the losing team). The best way to fix this is to find more data to feed to the AI. For example, if we had a metric for defensive and offensive power of the teams, the AI would be more likely to give lower scores to the losing team if it had poor offensive power. Unfortunately, our match dataset does not have that data. But it is in the spi_global_rankings.csv data, so it could be added if you wanted to improve your results.
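As a starting point for that improvement, here is a rough sketch of attaching offensive and defensive ratings to a match row. I'm assuming you have already loaded the rankings into a dictionary keyed by team name, and the key names "off" and "def" are my guess at how those ratings are labeled, so check them against the actual file. The helper name add_ratings is hypothetical.

```python
def add_ratings(match, ratings):
    """Return a copy of a match row with ratings added for both teams.

    ratings maps a team name to a dict with "off" and "def" entries
    (assumed names; adjust to match your rankings data).
    """
    enriched = dict(match)  # leave the original row untouched
    for side in ("1", "2"):
        team = match["team" + side]
        enriched["off" + side] = ratings[team]["off"]
        enriched["def" + side] = ratings[team]["def"]
    return enriched
```

You would need to add the same columns to the training data, using ratings that were available before each match was played, so that you don't reintroduce target leakage.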

Next steps

Now that we’ve massaged the soccer data into a state that allows us to predict scores, made an AI, and trained it on our data, we have a fully functioning match predictor. This predictor could certainly be better, though, and the best way to make it better is to add data to it. If you find more data, or better ways of modifying the data, we’d love to hear about it! Head over to the community and share your progress.
