In this article, you will learn how to launch a SageMaker Notebook Instance and run your first model on SageMaker. Amazon SageMaker is a fully-managed machine learning platform that enables data scientists and developers to build and train machine learning models and deploy them into production applications.

Building a model in SageMaker and deploying it into production involves the following steps:

  1. Store data files in S3
  2. Specify algorithm and hyper parameters
  3. Configure and run the training job
  4. Deploy the trained model

Table of Contents:

  • SageMaker additional libraries
  • Preprocess Data
  • Upload data to S3
  • Configure algorithm and specify hyper parameters
  • Specify Training, Validation Data Location
  • Data Format
  • Train model
  • Deploy Model
  • Run predictions
  • Conclusion

SageMaker additional libraries

When working in SageMaker, it is good to remember that, in addition to the usual Python libraries

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

you should also import the following:

import boto3
import sagemaker

Preprocess Data

The data that we are going to use for this example come from the Kaggle competition below:

https://www.kaggle.com/c/bike-sharing-demand/data

First Step

Download the data locally and upload them to the SageMaker Jupyter notebook by pressing the 'Upload' button.

Create a new Jupyter notebook, import the libraries, and read the datasets as pandas DataFrames while converting the datetime column to datetime64[ns]:
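A minimal sketch of this step, assuming the Kaggle files were uploaded as train.csv and test.csv (adjust the names if yours differ):

# Read the Kaggle files and parse the datetime column as datetime64[ns]
train_df = pd.read_csv('train.csv', parse_dates=['datetime'])
test_df = pd.read_csv('test.csv', parse_dates=['datetime'])

print(train_df.dtypes)  # datetime should now be datetime64[ns]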

Second Step

Next, we create some additional datetime features and also transform the count into log(count + 1). This is important since the competition metric is RMSLE and the distribution of counts is skewed, as described in this kernel:

https://www.kaggle.com/apapiu/predicting-bike-sharing-with-xgboost
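A minimal sketch of this step (the exact feature set is a choice; here we derive hour, day of week, month and year from datetime and log-transform the target):

# Derive datetime features for both datasets
for df in [train_df, test_df]:
    df['hour'] = df['datetime'].dt.hour
    df['dayofweek'] = df['datetime'].dt.dayofweek
    df['month'] = df['datetime'].dt.month
    df['year'] = df['datetime'].dt.year

# log(count + 1) transform of the target; predictions are converted back later with np.expm1
train_df['count'] = np.log1p(train_df['count'])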

Third Step

In the next step we will split the train set into a training set (a random sample of 70% of the train set) and a validation set (the remaining 30%), and we will save them as CSV files in the Jupyter folder, as sketched below.

Please remember that it is very important to save the training and validation files as follows:

  • The target variable must be the first column, followed by the input features.
  • The training and validation files must not have a column header.
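A minimal sketch of this split-and-save step, assuming the features created above (the column selection and file names are illustrative):

# Target first, followed by the input features
columns = ['count', 'season', 'holiday', 'workingday', 'weather',
           'temp', 'atemp', 'humidity', 'windspeed',
           'hour', 'dayofweek', 'month', 'year']

# Random 70% / 30% split of the train set
np.random.seed(5)
mask = np.random.rand(len(train_df)) < 0.7

# Save without header and without index, as required
train_df[mask][columns].to_csv('bike_train.csv', header=False, index=False)
train_df[~mask][columns].to_csv('bike_validation.csv', header=False, index=False)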

Upload data to S3

We assume that the files are saved in the Jupyter folder and we want to move them into an S3 bucket.

# The example below assumes that the bucket name is bike_data and the file name is
# train.csv. The key represents where exactly inside the S3 bucket to store the file.
# Thus, the file will be saved in: s3://bike_data/biketrain/bike_train.csv

def write_to_s3(filename, bucket, key):
    with open(filename,'rb') as f: # Read in binary mode
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(f)

write_to_s3('train.csv','bike_data','biketrain/bike_train.csv')

This step is very important because when you submit a training job using an AWS algorithm (container) you have to specify the S3 location where the data reside. A training job runs independently of the Jupyter notebook, which is why you need to specify an S3 location for the data (you also specify a separate instance for the training, as we will see later).

Configure algorithm and specify hyper parameters

The AWS team has packaged machine learning algorithms as Docker containers. These containers are stored in a container registry, and each algorithm has a unique entry that we refer to as its container registry path. We need to pass this container registry path to the SageMaker training job to specify which algorithm to use for training. This link provides useful information:

https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

Let me give you a little more information about the link above:

  • The first table lists all the algorithms that are currently available. This table is expected to grow as AWS works to expand the choice of algorithms.
  • The 'Algorithm Name' column, as its name suggests, gives the name of the algorithm.
  • The 'Channel Name' column specifies the datasets that should be provided. For example, for XGBoost the train channel is required while the validation channel is optional.
  • The 'File Type' column displays the data formats that can be used. For example, XGBoost accepts only CSV or LibSVM.
  • The 'Instance Class' column displays the appropriate instance class to use. For example, XGBoost is optimised for CPU on AWS, so make sure to use a CPU-based instance instead of spending money on a GPU instance.
  • The 'Training Image and Inference Image Registry Path' column displays the path that we should use to refer to the container.

Container Path

Let's now see how to construct a container path by going through two examples.

1st Example

Region: eu-central-1
Algorithm: XGBoost
Version Used: latest

=> Training Image and Inference Image Registry Path:

813361260812.dkr.ecr.eu-central-1.amazonaws.com/xgboost:latest

2nd Example

Region: us-east-2
Algorithm: Linear Learner
Version Used: first

=> Training Image and Inference Image Registry Path:

404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:1

I tend to always use the latest version in my projects. P.S. If you want to find your region, run the following command:

print(boto3.Session().region_name)
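If you prefer not to hard-code the registry path, the SageMaker Python SDK (version 1.x, which this article assumes throughout) can resolve it for you; a minimal sketch that also defines the container_path variable used later:

from sagemaker.amazon.amazon_estimator import get_image_uri

# Look up the XGBoost container registry path for the notebook's region
container_path = get_image_uri(boto3.Session().region_name, 'xgboost', repo_version='latest')
print(container_path)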

Get execution role

The next step is to get the execution role. This is the role that we defined when launching the notebook instance to grant the correct permissions to the instance. We need to pass this role to the training job so that it can assume these permissions on our behalf and access the data files and other resources.

from sagemaker import get_execution_role
role = get_execution_role()

Create Instance of Estimators

The next step is to configure the training job. To do this, we first need to establish a SageMaker session:

# Configure training job, establish SageMaker session
sess = sagemaker.Session()

Next, create an instance of estimators:

# Access the appropriate algorithm container image
# Specify how many instances to use for distributed training and what type of machine to use
# Finally, specify where the trained model artifacts need to be stored
# Reference: http://sagemaker.readthedocs.io/en/latest/estimators.html
# Optionally, give a name to the training job using base_job_name

# Create an instance of the Estimator class
estimator = sagemaker.estimator.Estimator(
    container_path,  # depends on algorithm & region
    role,  # pass your role so the job can access your resources
    train_instance_count=1,  # number of instances; take advantage of distributed training support
    train_instance_type='ml.m4.xlarge',  # choose a CPU- or GPU-based instance depending on the algorithm
    output_path=s3_model_output_location,  # S3 location for the trained model artifacts
    sagemaker_session=sess,
    base_job_name='xgboost-biketrain-v1')  # optional name prefix for the training job

To customize the behavior of the algorithm you can specify hyperparameters. A useful reference for the AWS XGBoost hyperparameters is provided in the link below:
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html

# Specify hyper parameters
estimator.set_hyperparameters(max_depth=5,objective="reg:linear",
                              eta=0.1,subsample=0.7,num_round=150)
print(estimator.hyperparameters())

Data Format

The model (estimator) expects the data to follow the convention below:

  • The target variable must be the first column, followed by the input features.
  • The training and validation files must not have a column header.

Specify Training, Validation Data Location

The training and validation files are in S3, and their locations need to be specified using the s3_input config class.

# content type can be libsvm or csv for XGBoost
training_input_config = sagemaker.session.s3_input(s3_data=s3_training_file_location,content_type="csv")
validation_input_config = sagemaker.session.s3_input(s3_data=s3_validation_file_location,content_type="csv")

As shown above, you can specify the content type, e.g. csv. But let's print the configuration first:
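A quick way to do that (assuming SDK v1, where the s3_input object exposes its settings through the config attribute):

print(training_input_config.config)
print(validation_input_config.config)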

It uses some default values, but let's investigate what each of them means with the help of the table below.

Note that when working with a big dataset we would probably avoid replicating the entire dataset to each instance (FullyReplicated) and instead have the dataset divided equally so that each instance gets its own portion (ShardedByS3Key).
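For illustration, this is how the distribution could be switched in the input config (a sketch; in this example we keep the FullyReplicated default):

# Hypothetical: shard the training data across instances instead of replicating it
sharded_input_config = sagemaker.session.s3_input(
    s3_data=s3_training_file_location,
    distribution='ShardedByS3Key',
    content_type='csv')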

Train Model

You can initiate the training phase by calling the fit method and passing as arguments the train and validation S3 input configs defined above, as shown below:

# XGBoost supports "train", "validation" channels
# Reference: Supported channels by algorithm
# https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html
estimator.fit({'train':training_input_config, 'validation':validation_input_config})

Since a new compute instance is launched to run the training job, you will notice a startup delay, usually 5-6 minutes. Once the training job is completed, SageMaker automatically terminates the training compute instance.

You can also check the progress of the training job on the SageMaker console by clicking on Job in the navigation panel as shown below.

After the training job has finished it displays the training and validation error as well as the billable time.

The trained model is stored in the S3 bucket inside the 'model' folder;

If you follow the path, you will find a .tar.gz file; this is the serialized model that the training job created.

Deploy Model

Now that the model is ready, the final step is to deploy it using the estimator, which provides an easy-to-use deploy method (there are other methods as well). You do this by specifying the following (see the sketch after the list):

  • The number of instances for hosting your model
  • The type of instances
  • Optionally you can give a name to the endpoint
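A minimal sketch of the deploy call (the instance type and endpoint name are illustrative):

# Deploy the trained model behind a real-time endpoint
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge',
                             endpoint_name='xgboost-biketrain-v1')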

When you deploy the model, SageMaker launches the instances that you requested and hosts the model on them. These instances run 24/7 and are available to produce predictions.

Once finished, SageMaker creates an endpoint as shown below;

SageMaker makes it easy to scale the deployed model by adding additional instances to support your production load, simply by modifying the Endpoint Configuration;

Please make sure to delete the endpoint once you no longer need the model, otherwise you will be charged for every second that AWS hosts it.
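One way to do this from the notebook is via the predictor returned by the deploy step (a sketch):

# Delete the endpoint to stop incurring hosting charges
predictor.delete_endpoint()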

Run predictions

Finally, since the model is deployed, we can run predictions as follows;
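A minimal sketch, assuming SDK v1's csv_serializer, the predictor from the deploy step, and the feature order used in the training files (remember to invert the log transform on the output):

from sagemaker.predictor import csv_serializer

# Send the features as CSV
predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer

# A hypothetical test row, in the same feature order as the training file (without the target)
sample = [3, 0, 1, 1, 18.86, 22.725, 72, 19.9995, 9, 4, 12, 2012]
result = predictor.predict(sample)

# The model predicts log(count + 1), so convert back to the original scale
predicted_count = np.expm1(float(result.decode('utf-8')))
print(predicted_count)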

Conclusion

In this article we covered the following steps to build and deploy a machine learning model using AWS SageMaker:

  1. Ensure Training, Test and Validation data are in S3 Bucket
  2. Select Algorithm Container Registry Path - Path varies by region
  3. Configure Estimator for training - Specify Algorithm container, instance count, instance type, model output location
  4. Specify algorithm specific hyper parameters
  5. Train model
  6. Deploy model - Specify instance count, instance type and endpoint name
  7. Run Predictions

The complete Jupyter notebook can be found on my GitHub page below;

https://github.com/geodra/AWS-SageMaker/tree/master

AWS XGBoost Hyperparameters resource:

https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html?shortFooter=true

Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Coding.