This tutorial explains how to integrate with DataRobot programmatically. Check out this basic soccer match app that was created using this tutorial. You can also explore the code that runs the app.
Prerequisites
- An account with DataRobot
- Basic knowledge of the Ruby programming language
- Python and the DataRobot Python Client
- Soccer match data
- Specifically, the CSV with all matches
- And if you want the current SPI for predictions, get the SPI rankings CSV
- A CSV viewer (Google sheets, Excel, LibreOffice, etc.)
Goal
I'm a soccer fan. I love cheering for my favorite team, Werder Bremen, bright and early on Saturday mornings. Since we have a big list of soccer matches, wouldn't it be cool to show them all to an AI, and then ask the AI what the score would be between any two teams? Yes, it would be. So that's what we're going to do.
Our goal?
Predict the score of a game between any two soccer teams.
To achieve our goal, we need to do the following:
- Identify the features that we want to keep and discard
- Discard unwanted features
- Prepare data for training
- Train our AI
- Make a prediction
Dealing with the data
What's in the data?
First, make sure you’ve downloaded the latest data from: https://projects.fivethirtyeight.com/soccer-api/club/spi_matches.csv
Let's go through the fields one by one:
Field ("Feature") | Description |
---|---|
date | On what date did the game fall? I don't think the accuracy of this number is super important, so I'm not going to worry about questions like, "What time zone did we measure this date in?" The general trend of dates may be meaningful, but being one day off will probably not affect the modeling. |
league_id | A code that represents the league that the game was played in. Some teams can play in multiple leagues, but we have the name of the league as well, so this is duplicative. |
league | The human-readable name of the league. |
team1 | One of the teams that played. I don't think there's any meaning between 1 and 2. Perhaps it might indicate a home field advantage, but since I don’t know, I'm going to ignore that. If we did find that it indicated home field advantage, we would want to add this as a field later when we’re preparing the data for learning. |
team2 | The other team that played. |
spi1 | The "soccer power index" of team1. |
spi2 | The "soccer power index" of team2. |
prob1 | Probability that team1 will win. |
prob2 | Probability that team2 will win. |
probtie | Probability that the match will end in a tie. |
proj_score1 | Projected score of team1. |
proj_score2 | Projected score of team2. |
importance1 | This is a measure of the importance of the match for team1. |
importance2 | The importance of the match for team2. |
score1 | The final score of team1 (the thing we will be trying to predict). |
score2 | The final score of team2. |
xg1 | Expected goals of team1. |
xg2 | Expected goals of team2. |
nsxg1 | Non-shot expected goals for team1. This is a measure of the expected probability of a goal given the state of the field. For example, if the ball is intercepted directly in front of the goal, but a shot is not taken, then a small value will be added here. This is data that we can't know before a match, so I'll be throwing it out. |
nsxg2 | Non-shot expected goals for team2. |
adj_score1 | This is the adjusted score given to team1. |
adj_score2 | Adjusted score for team2. |
Pruning the data
As you saw in the previous section, we want to pull out data we don’t need. Let’s look at why.
We’ll get rid of xg1, xg2, nsxg1, nsxg2, adj_score1, and adj_score2 because these fields are data that we can only know after a match has finished. And since we're trying to predict a match before it happens, it doesn't make sense for us to teach our AI about them. (Also, in data science terms including data like this will lead to "target leakage." You can read about it here, if interested.)
We’ll also get rid of prob1, prob2, probtie, proj_score1, and proj_score2 because we can derive all these values in our predictions.
We will also be getting rid of importance1 and importance2 because I don't know how I would calculate them for an arbitrary match between two teams.
Finally, we will be removing the league_id field because it duplicates the league field. This isn't strictly necessary, but will save us some typing.
So, the data we have left is:
date
league
team1
team2
spi1
spi2
score1
score2
For now, save this data as a file named clean_matches.csv. I did this by opening up the soccer match CSV in Excel and deleting all the columns except the ones in the list above. I also removed a couple thousand matches at the end of the file which have not happened yet.
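If you would rather not do this by hand in a spreadsheet, the pruning can be sketched in Python. This is a minimal sketch, not the method used above: the function name prune_matches is made up for illustration, and it assumes that matches which have not been played yet have empty score1 and score2 cells. The sample rows are invented so the example runs without the real file.

```python
import csv
import io

# The columns we decided to keep in the list above.
KEEP = ["date", "league", "team1", "team2", "spi1", "spi2", "score1", "score2"]

def prune_matches(source, dest, keep=KEEP):
    """Copy only the `keep` columns, skipping matches without a final score."""
    reader = csv.DictReader(source)
    # extrasaction="ignore" silently drops every column not listed in `keep`.
    writer = csv.DictWriter(dest, fieldnames=keep, extrasaction="ignore")
    writer.writeheader()
    kept = 0
    for row in reader:
        # Assumption: unplayed matches have empty score cells.
        if not row["score1"] or not row["score2"]:
            continue
        writer.writerow(row)
        kept += 1
    return kept

# Invented sample data standing in for the real match file.
sample = io.StringIO(
    "date,league_id,league,team1,team2,spi1,spi2,prob1,score1,score2,xg1\n"
    "2019-08-24,1845,German Bundesliga,Team A,Team B,60.0,50.0,0.5,2,1,1.7\n"
    "2030-01-01,1845,German Bundesliga,Team A,Team B,60.0,50.0,0.5,,,\n"
)
out = io.StringIO()
kept = prune_matches(sample, out)
print(kept)
print(out.getvalue())
```

To run it on the real data, pass file objects opened with newline="" instead of the in-memory buffers, reading the downloaded match file and writing clean_matches.csv.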
What we want to predict is score1 and score2. However, AIs can only predict one thing. They're smart, but very focused. So, we’re going to have to massage the data a bit so that we can ask one question at a time.
Data transformation
The first thing we need to do is get the data into a structure that allows us to predict one thing at a time. Well, predicting the score of a game really only requires us to predict the score of each team and then put them together, so we can treat each game as two data points. We can ask the AI to predict the score for team1 when it is playing against team2, and also predict the score for team2 when it is playing against team1.
In order to accomplish this, we only need to have one score field, but for clarity, let's leave it named score1. We will need to duplicate every row, and reverse the 1 and 2 values on the duplicated rows.
I whipped up a quick script to do that. I happen to like Ruby, so this code is in Ruby, but it could just as easily be written in any other language.
require 'csv'

matches = CSV.read('clean_matches.csv')
header = matches.shift # remove the header row and keep it
header.pop             # remove score2 column

CSV.open('duplicated_matches.csv', 'w') do |csv|
  csv << header
  matches.each do |row|
    # Columns: date, league, team1, team2, spi1, spi2, score1, score2
    # Original row, from team1's point of view
    csv << [row[0], row[1], row[2], row[3], row[4], row[5], row[6]]
    # Duplicated row with the teams, SPIs, and score swapped
    csv << [row[0], row[1], row[3], row[2], row[5], row[4], row[7]]
  end
end
Now we have a file called duplicated_matches.csv that has two rows for each game and predicts only one score.
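For readers who prefer Python, the same duplication step might look roughly like the sketch below. It is an illustrative port, not the script used to build the app: the function name duplicate_matches is made up, and it looks columns up by header name rather than by position. The sample row is invented.

```python
import csv
import io

# Output columns: everything from clean_matches.csv except score2.
FIELDS = ["date", "league", "team1", "team2", "spi1", "spi2", "score1"]

def duplicate_matches(source, dest):
    """Write two rows per match, one from each team's point of view."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(dest, fieldnames=FIELDS)
    writer.writeheader()
    for row in reader:
        # The match as seen by team1.
        writer.writerow({field: row[field] for field in FIELDS})
        # The same match as seen by team2: swap every 1/2 pair.
        writer.writerow({
            "date": row["date"],
            "league": row["league"],
            "team1": row["team2"],
            "team2": row["team1"],
            "spi1": row["spi2"],
            "spi2": row["spi1"],
            "score1": row["score2"],
        })

# Invented sample row standing in for clean_matches.csv.
sample = io.StringIO(
    "date,league,team1,team2,spi1,spi2,score1,score2\n"
    "2019-08-24,German Bundesliga,Team A,Team B,60.0,50.0,2,1\n"
)
out = io.StringIO()
duplicate_matches(sample, out)
print(out.getvalue())
```

Looking columns up by name means the script keeps working if the columns in clean_matches.csv end up in a different order.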
Learning on the data
First, let's make sure that we have a valid client with our key. You can obtain your API key as explained here.
Now we can train our AI. Make sure to change the directory of the file if it is not in the current directory:
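import os
import datarobot as dr

# Authenticate with the API key stored in the DATAROBOT_API_KEY environment variable.
dr.Client(token=os.getenv("DATAROBOT_API_KEY"), endpoint='https://app.datarobot.com/api/v2')

# Upload the data, set score1 as the target, and start training.
project = dr.Project.start(project_name='Match Predictor',
                           sourcedata='./duplicated_matches.csv',
                           target='score1')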
If you navigate to the DataRobot application, you will now be able to see your project in the project dropdown.
You can now watch as DataRobot trains lots of models against this data in order to pick the best one. Now would be a good time to grab a coffee.
In the code above, we're telling the AI to learn how to predict the column score1 using all the fields available to it. This is why it was important for us to get rid of fields that could not be known before the game had started. Otherwise, the AI would learn that there is a very high correlation, for example, between adj_score1 and score1, but this isn't a useful property for the AI to learn. DataRobot has some clever tools for catching errors like this, but it’s better to just make the data as clean as possible.
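A quick check before uploading can catch a post-match column that slipped through. The sketch below is only an illustration: the helper name leaky_columns is made up, and the set contains just the six post-match fields we removed while pruning.

```python
# Columns the tutorial drops because they are only known after a match.
POST_MATCH_COLUMNS = {"xg1", "xg2", "nsxg1", "nsxg2", "adj_score1", "adj_score2"}

def leaky_columns(header):
    """Return the post-match columns still present in a CSV header, sorted."""
    return sorted(POST_MATCH_COLUMNS.intersection(header))

# A header where adj_score1 was accidentally left in.
header = ["date", "league", "team1", "team2", "spi1", "spi2", "score1", "adj_score1"]
found = leaky_columns(header)
print(found)
```

An empty list means none of the known post-match fields are left in the file.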
Getting answers from our AI
Now for the fun part. We finally get to ask our AI which teams will win.
Let's make up a game between the two sides of my family. My family is split between Hamburg and Bremen. Unfortunately, the Hamburg team was relegated to the second league last season; given this move, we would expect that Bremen will probably do better than Hamburg. However, Bremen isn’t having a very good season, and may be relegated as well.
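Before we can ask for predictions, we need to deploy one of the trained models:
# Take the first model that get_models() returns and deploy it
# so that it can serve predictions.
model = project.get_models()[0]
prediction_server = dr.PredictionServer.list()[0]
deployment = dr.Deployment.create_from_learning_model(
    model.id,
    label='Match Predictor Deployment',
    description='For making predictions of match outcomes',
    default_prediction_server_id=prediction_server.id)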
Because we are always making the AIs smarter, your scores may vary a bit from the ones I got, but they should follow the same general trend/pattern.
You may be wondering where I got my SPI numbers from. I got these values from another dataset called spi_global_rankings.csv. If you wanted a better prediction, you might want to find the SPI for the team once the lineup is known, as this affects the SPI.
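Looking up a team's SPI in the rankings file could be sketched as follows. The column names name and spi are assumptions about the layout of the rankings CSV, so check them against the header of the file you downloaded. The function name load_spi and the sample teams are made up for illustration.

```python
import csv
import io

def load_spi(source, name_column="name", spi_column="spi"):
    """Map each team name to its SPI as a float.

    The default column names are assumptions about the rankings file.
    """
    return {row[name_column]: float(row[spi_column]) for row in csv.DictReader(source)}

# Invented sample data standing in for the rankings file.
sample = io.StringIO(
    "name,league,spi\n"
    "Team A,Example League,61.41\n"
    "Team B,Example League,45.57\n"
)
spi = load_spi(sample)
print(spi["Team A"], spi["Team B"])
```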
import json
import os
from pprint import pprint

import requests

# One row per team: each row asks for team1's score against team2.
# The keys must match the training columns, so the SPI fields are spi1 and spi2.
matches = [
    {'date': '2019-08-26', 'league': 'German Bundesliga', 'team1': 'Hamburg SV', 'team2': 'Werder Bremen', 'spi1': '45.57', 'spi2': '61.41'},
    {'date': '2019-08-26', 'league': 'German Bundesliga', 'team1': 'Werder Bremen', 'team2': 'Hamburg SV', 'spi1': '61.41', 'spi2': '45.57'},
]

token = os.getenv("DATAROBOT_API_KEY")
prediction_headers = {
    "Authorization": "Bearer {}".format(token),
    "Content-Type": "application/json",
    "datarobot-key": os.getenv("DATAROBOT_SERVER_KEY"),
}

# Replace these placeholders with the values for your own deployment.
deployment_id = "your-deployment-id"
server_url = "your-server-url"

predictions = requests.post(
    "{server_url}/predApi/v1.0/deployments/{deployment_id}/predictions".format(
        server_url=server_url, deployment_id=deployment_id),
    headers=prediction_headers,
    data=json.dumps(matches),
)
pprint(predictions.json())
{'data': [{'prediction': 1.0655537286,
'predictionValues': [{'label': 'score1', 'value': 1.0655537286}],
'rowId': 0},
{'prediction': 1.5037374286,
'predictionValues': [{'label': 'score1', 'value': 1.5037374286}],
'rowId': 1}]}
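To turn a response shaped like the one above into something readable, you can pair each prediction with the request row it belongs to. This is a small sketch: the helper name summarize is made up, and it assumes that rowId is the zero-based position of the row in the request, which is consistent with the output shown here.

```python
def summarize(matches, response):
    """Map each row's team1 to its predicted score, rounded to one decimal."""
    by_row = {item["rowId"]: item["prediction"] for item in response["data"]}
    return {match["team1"]: round(by_row[index], 1)
            for index, match in enumerate(matches)}

# The two request rows (only team names are needed here) and the response above.
matches = [
    {"team1": "Hamburg SV", "team2": "Werder Bremen"},
    {"team1": "Werder Bremen", "team2": "Hamburg SV"},
]
response = {"data": [
    {"prediction": 1.0655537286, "rowId": 0},
    {"prediction": 1.5037374286, "rowId": 1},
]}
scores = summarize(matches, response)
print(scores)
```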
Because I want to know how many goals each team is expected to score, the request contains the match twice with the teams reversed: row 0 is Hamburg's predicted score, and row 1 is Bremen's.
The scores that I got were 1.5 Bremen - 1.1 Hamburg. This result seems reasonable. If someone told me that Bremen beat Hamburg 2-1, I would say that sounds probable. I suspect there's an issue here with our AI being overly optimistic about goal scoring on the losing team. Because there are no values below 0 in a soccer match, the AI will tend to go with something higher. In this case, I might also expect a 3-0 game (if we just use the goal differential) or a 4-0 game (if the AI is overly optimistic about the losing team). The best way to fix this is to find more data to feed to the AI. For example, if we had a metric for defensive and offensive power of the teams, the AI would be more likely to give lower scores to the losing team if it had poor offensive power. Unfortunately, our match dataset does not have that data. But it is in the spi_global_rankings.csv data, so it could be added if you wanted to improve your results.
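One possible way to turn the two predicted scores into win and tie probabilities, similar to the prob1, prob2, and probtie fields we discarded, is to treat each predicted score as the mean of an independent Poisson distribution. That modeling choice is an assumption of this sketch, not something the tutorial's model or the dataset is known to use, and the function names are made up for illustration.

```python
import math

def poisson_pmf(goals, mean):
    """Probability of exactly `goals` goals when the expected number is `mean`."""
    return math.exp(-mean) * mean ** goals / math.factorial(goals)

def outcome_probabilities(mean1, mean2, max_goals=10):
    """Return (team1 wins, tie, team2 wins), assuming independent Poisson scores.

    Scorelines above `max_goals` per team are ignored, so the three values
    sum to slightly less than 1.
    """
    win1 = tie = win2 = 0.0
    for goals1 in range(max_goals + 1):
        for goals2 in range(max_goals + 1):
            p = poisson_pmf(goals1, mean1) * poisson_pmf(goals2, mean2)
            if goals1 > goals2:
                win1 += p
            elif goals1 == goals2:
                tie += p
            else:
                win2 += p
    return win1, tie, win2

# Predicted scores from the response: Bremen first, then Hamburg.
bremen_win, tie, hamburg_win = outcome_probabilities(1.5037374286, 1.0655537286)
print(bremen_win, tie, hamburg_win)
```

Because the predicted scores are averages rather than actual scorelines, a view like this shows how much probability still sits on a tie or an upset even when one team has the higher expected score.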
Next steps
Now that we’ve massaged the soccer data into a state that allows us to predict scores, made an AI, and trained it on our data, we have a fully functioning match predictor. This predictor could certainly be better, though, and the best way to make it better is to add data to it. If you find more data, or better ways of modifying the data, we’d love to hear about it! Head over to the community and share your progress.