05.07.18 by Giuliano Vesci

Run ML predictions with R on AWS Lambda

5 min read

In this article, I will explain how and why our team here at foodora/foodpanda exposes an API that makes machine learning predictions using R, AWS Lambda and Amazon API Gateway. I will guide you through all the required steps, using the prediction of food preparation time at our restaurants as the running example.

Purpose: calculating food preparation time

foodpanda and foodora deliver thousands of meals every day. An important part of our logistics is sending a rider to the restaurant only when the food is ready. However, the time required to prepare the food varies depending on several factors. In the following steps, I will guide you through the process that helped us improve this crucial part of our operations.

Build a model

First, we built a model that estimates when the food is actually ready, based on historical data. This analysis was performed in R and will not be discussed here, but it is worth saying that its output was a model (specifically, a linear regression).
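To make concrete what such a model computes at prediction time, here is a minimal Python sketch of a one-variable linear regression. The coefficients are invented for illustration and are not the ones from our analysis; in production, the fitted R model does this work through predict().

```python
# Illustrative only: both coefficients are made up, not taken from the real model.
INTERCEPT_MIN = 8.0        # hypothetical baseline preparation time, in minutes
SLOPE_MIN_PER_UNIT = 0.25  # hypothetical extra minutes per unit of order value


def predict_prep_time(total_value):
    """Evaluate a one-variable linear model: intercept + slope * total_value."""
    return INTERCEPT_MIN + SLOPE_MIN_PER_UNIT * float(total_value)
```

With these made-up numbers, an order with a total value of 20 would get a predicted preparation time of 13 minutes.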

The model needs to be stored in a format which can be read quickly. Thus, we decided to go for the RDS format. Here’s an example:

saveRDS(model, "model.rds")

Architectural decisions

For our use case, the architectural direction was simple: we did not want to spend time and resources managing additional servers. AWS Lambda is a great example of serverless architecture. It runs your code in response to events and automatically manages the underlying compute resources for you, so you do not pay for idle compute time. To expose the function to the public, we decided to use another Amazon product, Amazon API Gateway.

Our Lambda function had to access our model, so we decided to upload the model to S3 for easy access from Lambda.
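The upload itself can be scripted with boto3. The sketch below is one way to do it, not a description of our actual tooling: the helper name upload_model is made up for illustration, and the bucket name is a placeholder. The S3 client is passed in as an argument so that the function can be exercised without AWS credentials.

```python
def upload_model(s3_client, local_path, bucket, key="model.rds"):
    """Upload the serialized model to S3 and return its s3:// location."""
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client call
    s3_client.upload_file(local_path, bucket, key)
    return "s3://{}/{}".format(bucket, key)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   upload_model(boto3.client("s3"), "model.rds", "*** BUCKET NAME ***")
```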

Setting up the Lambda function

Once the model is ready, we can start creating the Lambda function. Unfortunately, Lambda does not support R natively; at the time of writing, its supported runtimes include Python, Node.js and Java.

Using R in Lambda requires building a zip package that contains C shared libraries. You can compile these on an Amazon EC2 instance. We are going to use the rpy2 Python package, which can run R code from within Python. We will need to import it in the handler function that we are going to define in Lambda.

Let's go step by step through all the necessary operations:

Step 1: Compile R and all dependencies

For the first step, please refer to this article, which guides you through everything needed to run R code on Lambda using the rpy2 Python package. In particular, steps 1 to 4 of its solution walkthrough consist of the following:

Compilation of R and all dependencies for Amazon Linux
Installation of the rpy2 package
Packaging of R and the rpy2 package for Lambda
Setup of the libraries from the Python virtual environment

Step 2: Create a handler function in Python to estimate preparation time via R

Now that the package is set up, we finally need to define our handler function. The handler function is called whenever a new event triggers Lambda. Let's call the file handler.py and place it in the $HOME/lambda folder.

Before going further, note that we need to load all the shared libraries and set the R environment variables before importing rpy2. Let's move on to the code.

The first part of the function should import the necessary functions and load R:

import ctypes
import json
import os
import boto3
import logging

# use python logging module to log to CloudWatch
# http://docs.aws.amazon.com/lambda/latest/dg/python-logging.html

logging.getLogger().setLevel(logging.DEBUG)

s3 = boto3.client('s3')

################### load R
# must load all shared libraries and set the
# R environment variables before you can import rpy2
# load R shared libraries from lib dir

for file in os.listdir('lib'):
    if os.path.isfile(os.path.join('lib', file)):
        ctypes.cdll.LoadLibrary(os.path.join('lib', file))

# set R environment variables
os.environ["R_HOME"] = os.getcwd()
os.environ["R_LIBS"] = os.path.join(os.getcwd(), 'site-library')

import rpy2
from rpy2 import robjects
from rpy2.robjects import r

################## end of loading R

We then need to define the handler function, which is the entry point of Lambda. In this function, we want to read the input and call the get_prep_time function, which actually runs the R code and gets the prediction. For simplicity, we predict the preparation time based only on the total value of an order.

def lambda_handler(event, context):
    try:
        total_value = event['total_value']

        # call get_prep_time, which predicts the preparation time from the total_value given as input
        prep_time = get_prep_time(total_value)

        res = {}
        res['prep_time'] = prep_time
        return res

    except Exception as e:
        # str(e) works on both Python 2 and Python 3; e.message exists only in Python 2
        message = str(e)
        logging.error('Payload: {0}'.format(event))
        logging.error('Error: {0}'.format(message))

        # generate a JSON error response that API Gateway will parse and associate with an HTTP status code

        error = {}
        error['errorType'] = type(e).__name__
        error['httpStatus'] = 500
        error['request_id'] = context.aws_request_id
        error['message'] = message.replace('\n', ' ') # convert multi-line message into single line
        raise Exception(json.dumps(error))
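The handler above trusts that event['total_value'] is present and numeric. A small validation helper, sketched below, would turn bad input into a clear error before R is involved. The helper name and the rule that the value must not be negative are assumptions made for illustration, not part of our deployed code.

```python
def parse_total_value(event):
    """Extract total_value from the Lambda event as a non-negative float."""
    if 'total_value' not in event:
        raise ValueError('missing required field: total_value')
    try:
        value = float(event['total_value'])
    except (TypeError, ValueError):
        raise ValueError('total_value must be numeric')
    if value < 0:
        raise ValueError('total_value must not be negative')
    return value
```

Inside lambda_handler, the first line of the try block could then read total_value = parse_total_value(event), and any ValueError would flow through the existing error handling.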

We then need to define the get_prep_time function, which is in charge of using R to load our previously defined model and run a prediction for the new input value. The function then returns the calculated preparation time:

# Lambda only allows writes under /tmp, so the model file is cached there
RDS_FILE = '/tmp/model.rds'

def get_prep_time(total_value):
    download_model_from_s3()
    r.assign('total_value', total_value)
    r.assign('model_path', RDS_FILE)

    r('model <- readRDS(model_path)')
    r('df <- data.frame(total_value=as.numeric(total_value))')
    r('prediction <- predict(model, newdata = df)')

    r_pred = robjects.r('prediction')

    # R returns a vector with one element; return that element
    return r_pred[0]

The last missing piece is the download_model_from_s3 function, which is in charge of downloading the file from S3. In the following code, remember to fill in the name of the S3 bucket where the model was uploaded.

def download_model_from_s3():
    # simple caching: skip the download if the model file is already on disk from a previous invocation
    if os.path.isfile(RDS_FILE):
        logging.debug('file already downloaded')
        return

    bucket = '*** BUCKET NAME ***'
    key = 'model.rds'

    logging.debug('attempting to download file')
    try:
        s3.download_file(bucket, key, RDS_FILE)
    except Exception as e:
        logging.error('Error downloading file {} from bucket {}.'.format(key, bucket))
        logging.error(e)
        raise

Step 3: Create the package for Lambda

Once the Lambda function has been created and everything has been set up, we can just zip everything together:

cd $HOME/lambda
zip -r prep-time-api-VERSION.zip *

The zipped file can then be uploaded directly to Lambda, or moved to S3 and loaded from there.
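The direct upload can also be done from a script. The following is a minimal sketch using the boto3 Lambda client, assuming the Lambda function already exists; the helper name deploy_package and the function name in the usage comment are hypothetical. The client is passed in as an argument so the helper can be tried without touching AWS.

```python
def deploy_package(lambda_client, function_name, zip_path):
    """Upload a local zip file as the new code of an existing Lambda function."""
    with open(zip_path, 'rb') as f:
        package_bytes = f.read()
    # update_function_code is the boto3 Lambda client call that replaces the code
    return lambda_client.update_function_code(
        FunctionName=function_name,
        ZipFile=package_bytes,
    )


# Usage (requires boto3 and AWS credentials; 'prep-time-api' is a made-up name):
#   import boto3
#   deploy_package(boto3.client('lambda'), 'prep-time-api', 'prep-time-api-VERSION.zip')
```

For a package stored in S3, update_function_code accepts S3Bucket and S3Key in place of ZipFile.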

Expose Lambda with API Gateway

Exposing a Lambda function through API Gateway is a common task that is very well documented by Amazon. For this reason, we will not cover the topic in this article.
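For orientation only, here is a sketch of what a client call could look like once the API is exposed. It assumes that API Gateway is configured to pass the JSON request body to Lambda as the event, so that the handler sees a total_value field and the response carries prep_time. The URL is a placeholder, and the helper name build_request is made up for this example.

```python
import json
import urllib.request

# Hypothetical endpoint; replace it with the invoke URL of your own API.
API_URL = 'https://example.execute-api.eu-west-1.amazonaws.com/prod/prep-time'


def build_request(total_value, url=API_URL):
    """Build a POST request carrying the JSON body the handler expects."""
    body = json.dumps({'total_value': total_value}).encode('utf-8')
    return urllib.request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


# Usage (performs a real HTTP call):
#   with urllib.request.urlopen(build_request(25.5)) as resp:
#       print(json.loads(resp.read())['prep_time'])
```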

Next steps

We still face the following problems with the current solution:

Loading model.rds in R is slow, which affects the performance of our API. Ideally, we want to load the model once and then cache it inside the R environment.
The .rds format is specific to R. We would like to store the model in a language-agnostic file format, such as feather.
Deployment is not automated yet.
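For the first point, one possible approach relies on Lambda reusing a container across invocations: anything stored at module level survives between calls on a warm container. The sketch below caches the result of a loader function. In our handler, the loader would run readRDS once, and later calls would only run predict. The names are hypothetical, and this is not yet what we run in production.

```python
# Module-level state lives as long as the Lambda container stays warm.
_cache = {}


def load_once(name, loader):
    """Return the cached object for name, calling loader only on first use."""
    if name not in _cache:
        _cache[name] = loader()
    return _cache[name]


# Usage inside the handler module (requires rpy2; the path is illustrative):
#   load_once('model', lambda: r('model <- readRDS("model.rds")'))
```

A cold start still pays the full loading cost, so this would only help requests served by a container that is already warm.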

Conclusions

With this article, I wanted to give a simplified overview of how we run machine learning predictions in production using AWS Lambda. The project is still in its early stages, but we may extend this architecture to other projects and domains.

Giuliano Vesci
Director, Product Management
