Finance Finder - Batch prediction

This is the first article in a multipart series about building a realistic machine learning (ML) powered application.

This guide takes the form of a walkthrough of a completed project, looking at its scripts and code snippets. It is not a step-by-step tutorial.

App overview

The app, Harv the Finance Finder, helps you search for stocks, for example when you are building a portfolio. It lets you specify which stock exchanges to include and which sectors to exclude, as well as your threshold for stocks' ESG rating, which is where ML comes in.

You can explore the app in the demo below 👇 To use it, register as a user, create a portfolio, and retrieve the list of stocks based on your criteria.

App demo

You can also open the deployed app directly at harv-the-finance-finder.herokuapp.com.

ESG

ESG is a rating combining Environmental, Social, and corporate Governance scores. A lower ESG score means a particular company (and its stock) is less exposed to risk in those areas. For example, a company that deals with oil extraction might have a very high environmental impact rating, which will in turn increase its overall ESG score.

ESG and Machine Learning
Calculating ESG scores is an extensive process that involves in-depth analysis of a company's publicly available information, as well as data from news sources. As not every company has an ESG score calculated, we decided to use DataRobot's ML technology to score a large number of companies across several different stock exchanges.

Project setup

The app is open source and available on GitHub: https://github.com/datarobot-community/harv-the-finance-finder.

Should you wish to run it yourself, the instructions for preparing your environment and running the app are in the project's README.

Technologies and data sources

The project is written in Python using the Flask web framework. It uses PostgreSQL for persistence. The demo app is deployed on Heroku.

Data sources
Our application uses data from the IEX API. Our stock dataset was created by merging their stock data from various industry sectors into a single dataset. The data was retrieved on 25 May 2020.

We created our own ESG training dataset. ESG scores are provided by various agencies, but not through a publicly accessible API, so for this showcase application we decided to generate synthetic data to train our ML model, based on the sustainability ratings available in Yahoo Finance. The script used in the project is available on GitHub.
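
The snippet below is an illustrative sketch of that idea only, not the project's actual generation script: it assumes the merged stock CSV as input and assigns random categories as a placeholder, whereas the real script derives them from Yahoo Finance sustainability ratings.

# Illustrative sketch only - NOT the project's generation script.
# Assumes data/stock_quotes_all.csv as input and assigns a random ESG category (1-4).
import random
import pandas as pd

quotes = pd.read_csv('data/stock_quotes_all.csv')
quotes['esg_category'] = [random.randint(1, 4) for _ in range(len(quotes))]
quotes.to_csv('data/stock_quotes_esg_train.csv', index=False)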

Predicting in advance

When using ML in applications we are faced with the question of whether to make on-demand predictions, or make them in advance, in batches.

Often, we want to make predictions right there and then, as the data comes in, but in Finance Finder this wouldn't work.

As this app needs to predict a large number of ESG ratings for all the stocks in our dataset, for example when the markets close in the afternoon, we do this in advance. DataRobot makes both techniques possible and straightforward.

Project structure

The three main directories in the project are scripts, app, and data:

  • scripts contains scripts for training a model and making predictions using DataRobot AutoML, as well as for generating the ESG training dataset based on all stocks.
  • app is our main Flask application, with routes, models, and business logic.
  • data contains our datasets: all stock quotes in all sectors and exchanges, the ESG training dataset, and the already-predicted ESG dataset for all the stocks we have.

Training the model

Training dataset
We'll start by creating and training our model. For that, let's have a look at the data we're working with. Our synthesized training dataset is located in data/stock_quotes_esg_train.csv and looks like this:

symbol,companyName,open,close,high,low,latestPrice,latestVolume,previousClose,change,changePercent,avgTotalVolume,marketCap,week52High,week52Low,ytdChange,primaryExchange,previousVolume,volume,peRatio,sector,esg_category
FTEK,"Fuel Tech, Inc.",0.8444,0.82,0.855,0.8001,0.82,117187,0.8597,-0.0397,-0.04618,1706211,20201848,1.8,0.3,-0.19198,NASDAQ,146489,0,-2.18,Industrial Services,4
BSQR,BSQUARE Corp.,1.6,1.57,1.65,1.54,1.57,71273,1.53,0.04,0.02614,39228,20564959,1.73,0.83,0.17214,NASDAQ,74043,0,-2.98,Technology Services,1
IDXG,"Interpace Biosciences, Inc.",5.25,5.18,5.3,4.95,5.18,47291,5.15,0.03,0.00583,111855,20909588,11,3.81,-0.00747,NASDAQ,35027,0,-1.02,Health Services,2
WRLS,Pensare Acquisition Corp.,,,3.95,2.21,2.68,0,2.68,0,0,8817,21260440,11.25,2.1,-0.6958,NASDAQ,131174,0,-5.44,Finance,2
PRPH,"ProPhase Labs, Inc.",,,,,1.84,251,1.94,-0.1,-0.05155,1869,21296160,3.36,1.63,-0.126927,NASDAQ,72,0,-6.78,Consumer Non-Durables,3
JCTCF,Jewett-Cameron Trading Co. Ltd.,6,6.2,6.374,6.05,6.2,1427,6.09,0.11,0.01806,2300,21583192,9.1,5,-0.20504,NASDAQ,267,0,17.44,Distribution Services,3
NEON,"Neonode, Inc.",2.48,2.56,2.569,2.43,2.56,6850,2.52,0.04,0.01587,35898,22556160,3.8,1.09,0.276954,NASDAQ,7508,6850,-5.94,Electronic Technology,1

There are a lot of numbers, as one would expect from financial data, but a few of them are more interesting than others.

  • symbol is the stock ticker symbol. MSFT for Microsoft, V for Visa, GM for General Motors, and so on, as seen in companyName.
  • open, close, high, low, week52Low, and week52High indicate how the stock price has moved, either today or over the last year. All values are in USD.
  • marketCap tells us the total valuation of the company.
  • sector is the primary sector that the company operates in, for example Electronic Technology, Health Services, Transportation, etc.
  • Finally, esg_category is what we'll be training our model on. We lumped our companies into four ESG categories: 1 being the lowest ESG risk (best) and 4 being the highest (worst).

Training script

Now let's look at the script that trains our model. This code is located in scripts/train_esg_model.py.

Setting up the DataRobot client
The most important package to import is the DataRobot Python SDK:

import datarobot as dr

We instantiate the DataRobot client with values from the environment. The application loads them from a .env file using the dotenv package.

You can learn where to get your API key and your endpoint URL in the API Authentication and Guide to different DataRobot endpoints documents.

# Environment setup - values are loaded from .env via the dotenv package
import os
from dotenv import load_dotenv

load_dotenv()

DR_API_KEY = os.environ["DATAROBOT_API_KEY"]
DR_API_URL = os.environ["DATAROBOT_URL"]
ENDPOINT = DR_API_URL + "/api/v2"

dr.Client(token=DR_API_KEY, endpoint=ENDPOINT)

Uploading data and preparing our project for modeling
We create the project with dr.Project.create, passing in the path to our dataset.
This is usually enough to get going, but in our case we also transform the esg_category value, telling DataRobot that it's a categorical feature. As we want our predicted ESG categories to be exactly these values (1-4) and not decimals between them, this makes sense for us. We do this with the project.create_type_transform_feature call.

project = dr.Project.create(sourcedata='data/stock_quotes_esg_train.csv', project_name='Stock ESG ratings - categorical')

# Transform esg_category from integer into categorical feature 
project.create_type_transform_feature(
    "esg_category_categorical",  # new feature name
    "esg_category",       # parent name
    dr.enums.VARIABLE_TYPE_TRANSFORM.CATEGORICAL_INT
)

To start the AutoML process, we call project.set_target(), passing in the target name, the esg_category_categorical feature we created above. We can also pass the mode option, telling DataRobot to run a quick modeling run that builds a limited set of ML models.

This is a longer-running process, and as we're running it in a script, we can call project.wait_for_autopilot(), which prints informative output and blocks the script until the modeling job is finished.

# This kicks off modeling, Quick Autopilot mode
project.set_target(
    target='esg_category_categorical',
    mode=dr.enums.AUTOPILOT_MODE.QUICK
)

# Time for a cup of tea or coffee - this might take ~15 mins
project.wait_for_autopilot()

Getting the recommended model
Once Autopilot has finished its "autopiloting", we have a list of all the models it created in our project, ranked by accuracy. We can get DataRobot's recommendation by calling dr.ModelRecommendation.get(project.id), and retrieve the recommended model from it using get_model().

recommendation = dr.ModelRecommendation.get(project.id)
recommended_model = recommendation.get_model()
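
If you want to inspect the whole leaderboard rather than just the recommended model, the SDK can also list every model in the project. A minimal sketch (not part of the project's script), assuming the default leaderboard ordering:

# Optional: print every model on the project's leaderboard
for model in project.get_models():
    print(model.model_type, model.metrics)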

Deploying the model
Now that we have the recommended model, all we need to do is to deploy it, so we can use it to make predictions.
Models are not deployed to the same server that trains them, so we first need to get the ID of our prediction server, using dr.PredictionServer.list()[0].id.
Then we can pass that to dr.Deployment.create_from_learning_model, which deploys our model.
Note the deployment.id, which we'll use when making our predictions.

prediction_server_id = dr.PredictionServer.list()[0].id

deployment = dr.Deployment.create_from_learning_model(
    model_id=recommended_model.id,
    label='Financial ESG model',
    description='Model for scoring financial quote data',
    default_prediction_server_id=prediction_server_id
)

print(f'Deployment created: {deployment}, deployment id: {deployment.id}')

Predicting ESG categories

Now that our model is created and deployed, let's look at the prediction script, scripts/predict_with_esg_model.py. This will take the input file, stock_quotes_all.csv, and send it to DataRobot's prediction server to get the ESG categories for each entry.

This file's format is the same as the CSV we looked at earlier, but without the esg_category field. The script's output will be another CSV file, data/stocks_esg_scores.csv, containing a table of stock symbols and their ESG categories.

Looking at the script's imports, we have a few more than before. The datarobot package is still there, as is dotenv, but now we also have the requests, json, and csv packages.

import os
import sys
import json
import requests
from dotenv import load_dotenv
import datarobot as dr
import csv

Keys and variables
To make predictions we need a few more pieces. We'll need the prediction server we deployed our model to in the previous section, as well as a datarobot_key, which we'll pass along in our request. Here we use the DataRobot SDK to obtain both.

The only thing we're copying manually from the previous section is the deployment_id, which was printed when we created the deployment.

load_dotenv()  # load the variables from .env

DR_API_KEY = os.environ["DATAROBOT_API_KEY"]
DR_API_URL = os.environ["DATAROBOT_URL"]
ENDPOINT = DR_API_URL + "/api/v2"

deployment_id = 'YOUR_DEPLOYMENT_ID' 

dr.Client(token=DR_API_KEY, endpoint=ENDPOINT)
prediction_server = dr.PredictionServer.list()[0]

prediction_server_url = prediction_server.url
datarobot_key = prediction_server.datarobot_key

This step is included to illustrate the API's capabilities. If you used a single deployed model that didn't change, you could instead store the prediction server URL, deployment ID, and DataRobot key in environment variables and skip the API call.
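
A minimal sketch of that alternative; the environment variable names here are made up:

# Alternative (sketch): read connection details from environment variables
# instead of querying the API. Variable names are hypothetical.
prediction_server_url = os.environ["DATAROBOT_PREDICTION_SERVER_URL"]
datarobot_key = os.environ["DATAROBOT_KEY"]
deployment_id = os.environ["DATAROBOT_DEPLOYMENT_ID"]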

Making the prediction request
Now we have everything we need to make our request. DataRobot's prediction API doesn't come with an SDK, so we need to "handcraft" our API requests using Python's requests library.

We are sending 3 headers:

  • Content-Type in our case is text/plain as we're sending a CSV file. Alternatively the API also accepts application/json for JSON payloads.
  • Authorization takes the same API key we used with the modeling API in the DataRobot Python SDK.
  • datarobot-key is specific to the prediction endpoint in our managed cloud environments.

To predict, we send a POST request to the prediction server at PREDICTION_SERVER/predApi/v1.0/deployments/{DEPLOYMENT_ID}/predictions, with our data as the payload.

The prediction API responds in JSON format, and our predictions will be in the data field.

all_quotes_filename = 'data/stock_quotes_all.csv'

headers = {
    'Content-Type': 'text/plain; charset=UTF-8', 
    'Authorization': f'Bearer {DR_API_KEY}',
    'datarobot-key': datarobot_key
}

url = f'{prediction_server_url}/predApi/v1.0/deployments/{deployment_id}/predictions'
with open(all_quotes_filename, 'rb') as f:
    data = f.read()

predictions_response = requests.post(
    url,
    data=data,
    headers=headers,
)

predictions = predictions_response.json()['data']

Parsing and writing the prediction response

The prediction data payload is a JSON array of objects, one per predicted row, in the same order as the rows we sent. The actual predicted value is in the prediction field. In our example we create a list of symbol/category pairs and fill it by iterating through the predictions alongside a CSV reader over the input file.
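
For reference, each element of predictions is a dictionary. A simplified, illustrative shape for this multiclass deployment (the values are made up, and the exact set of fields can vary):

# {
#     "rowId": 0,
#     "prediction": "1",
#     "predictionValues": [{"label": "1", "value": 0.62}, ...]
# }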

# Read the original quotes in the same order we sent them,
# so each prediction can be matched with its stock symbol
quotes_file = open(all_quotes_filename, newline='')
quotes_reader = csv.reader(quotes_file)
next(quotes_reader)  # skip the header row

esg_categories = [['symbol', 'esg_category']]

for prediction in predictions:
    quote = next(quotes_reader, None)
    symbol = quote[0]

    esg_entry = [symbol, int(prediction['prediction'])]
    esg_categories.append(esg_entry)

The final step is to write the CSV file so that we can incorporate it into our Flask app.

# Write the file
with open('data/stocks_esg_scores.csv', mode='w') as out_csv:
    csv_writer = csv.writer(out_csv)
    csv_writer.writerows(esg_categories)

Using predictions in our app

In this section we will focus on where the ESG values get integrated. Let's first look at the data models we've created, as they contain most of the logic in the app. The model code is located in app/models.py.

Models and business logic

We have data models corresponding to stocks, portfolios, and ESG.

The StockQuote class represents the latest values for each of the stocks in our database. It also calculates each stock's growth over the last year, which we will use to sort the stocks and find the fastest-growing ones.

class StockQuote(db.Model): # This is the "main quote object"
    __tablename__ = 'stockquotes'
    id = db.Column(db.Integer, primary_key=True)
    symbol = db.Column(db.String(12))
    ...
    
    def growth_in_last_year(self):
        return self.latestPrice / self.week52Low

The StockEsg class is simply the combination of a stock ticker and its predicted ESG category.
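
A minimal sketch of what it could look like; the column names match the query further below, while the table name and column types are assumptions:

class StockEsg(db.Model):
    __tablename__ = 'stockesgs'  # assumed table name
    id = db.Column(db.Integer, primary_key=True)
    symbol = db.Column(db.String(12))
    esg_risk_category = db.Column(db.Integer)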

The Portfolio class contains our filters: stock exchanges and excluded sectors, as well as our ESG risk preference. Users create portfolios with their preferences and use them to retrieve their stock suggestions.

class Portfolio(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(100))
    owner_id = db.Column(db.Integer, db.ForeignKey('user.id'))
    ...

The most interesting method of Portfolio is stockSuggestions(), used to retrieve stocks based on the portfolio's rules. It joins the StockQuote and StockEsg models, filters on the user's preferences, and sorts the results by growth.

Because we've made our predictions for all stock quotes in advance, this method is extremely fast - as fast as the underlying database query.

def stockSuggestions(self):
    suggestions = StockQuote.query \
        .join(StockEsg, StockQuote.symbol == StockEsg.symbol) \
        .filter(StockQuote.primaryExchange.in_(self.exchanges())) \
        .filter(StockQuote.sector.notin_(self.excludedSectors())) \
        .filter(StockEsg.esg_risk_category <= self.esg_risk_category) \
        .filter(StockQuote.peRatio != None) \
        .filter(StockQuote.latestPrice != None) \
        .filter(StockQuote.week52Low != None) \
        .filter(StockQuote.week52Low > 0) \
        .order_by(
            desc(
                StockQuote.latestPrice /
                StockQuote.week52Low
            )
        ) \
        .limit(10)
    return suggestions

Loading the data

We've seen how the ML-enriched data is used by the Finance Finder app to find you interesting stocks, but not how we've added that data into the app. There are two more files to look at: app/seed_stocks.py and app/__init__.py.

The seed_stocks.py file contains functions seed_stock_quotes() and seed_esgs(), which read our data from the CSV files in data and import them into our database.
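
As an illustration, a minimal sketch of what seed_esgs() could look like; the real implementation lives in app/seed_stocks.py, and the import path and column mapping here are assumptions:

import csv

from app.models import db, StockEsg  # assumed import path

def seed_esgs():
    # Insert one StockEsg row per stock from the predicted ESG CSV
    with open('data/stocks_esg_scores.csv', newline='') as f:
        for row in csv.DictReader(f):
            db.session.add(StockEsg(symbol=row['symbol'],
                                    esg_risk_category=int(row['esg_category'])))
    db.session.commit()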

In __init__.py, which is the main entry point into our Flask application, we have added a single CLI command, seed_data, that invokes the functions from seed_stocks.py. The result is that we can run flask seed_data to load our data.

@app.cli.command('seed_data')
def seed_data():
    seed_stocks.seed_stock_quotes()
    seed_stocks.seed_esgs()

Right, for the purposes of this demo we only need to run flask seed_data once after setting everything up. In a more realistic scenario we could run it as a cron job every day when the stock markets close and we've predicted new ESG categories for our stocks.

Conclusion

In this tutorial we have looked at a demo application that uses ML to enrich an existing dataset of stock quotes with ESG categories, so that the user can filter stocks according to their own ESG risk threshold.

In our scenario, all the machine learning work is done beforehand in a single call to DataRobot's prediction API, and all predicted values are imported into our database before the app is used. This makes retrieving results practically instantaneous, and is far more efficient than predicting on every request.

In a future tutorial we will look at how to establish machine learning pipelines and integrate predictions into this app with a periodically scheduled job.
