Managing Python Libraries on Azure Synapse Apache Spark Pools

Azure Synapse Analytics is Microsoft’s one-stop shop that integrates big data analytics and data warehousing. It features Apache Spark pools, which provide scalable and cost-effective compute for Spark workloads and let you process and analyse large volumes of data quickly and easily.

I have been working with the Spark environment on Azure Synapse for a while and thought of documenting my experience with installing external Python libraries for Spark pools on Azure. This guideline may come in handy if you are running your big data analytics experiments with specific Python libraries.

Apache Spark pools on Azure Synapse workspaces come with the Anaconda Python distribution and pip as the package manager. Most of the common Python libraries used in the data analytics space are already installed. If you need additional packages installed and available within your scripts, there are three ways to do it on Synapse.

  • Use a magic command in notebooks to install packages at the session level.
  • Upload the Python package as a workspace package and install it on the Spark pool.
  • Install packages using a pip or conda input file.

01. Use a magic command in notebooks to install packages at the session level.

Installing packages at the session level in notebooks

This is the simplest and most straightforward way of installing a Python package at the Spark session level. You just use the magic command followed by the usual pip or conda install syntax. Though this is easy for prototyping and quick experiments, it’s wasteful to reinstall the same packages every time you start a new Spark session, so it’s better to avoid this method in production environments. It’s good for rapid prototyping, though.
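For example, a notebook cell like the following (using the %pip magic, which I believe Synapse notebooks support alongside %conda; the package and version are just an example) installs a package for the current session only:

%pip install sentence-transformers==2.2.2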

02. Upload the Python package as a workspace package and install it on the Spark pool.

Upload Python libraries as workspace packages

An Azure Synapse workspace lets you upload workspace packages and install them on Spark pools. It accepts Python wheels (.whl files), jar files or tar.gz archives as packages.

After uploading the packages, go to the specific Apache Spark pool and select the packages you want to install on it. The initial installation may take a few minutes. (In my case, it took around 20 minutes to install 3 packages.)

From my experience with different Python packages, I would stick with Python wheels from pip or jars from official package distributions. I tried the sentence-transformers tar.gz file from PyPI (https://pypi.org/project/sentence-transformers/). It gave me an error during the installation process about a package dependency on R (which is confusing).

03. Install packages using a pip or conda input file.

Upload the requirements file

If you are familiar with building conda environments or Docker configurations, keeping a package list as a config file should not be new to you. You can specify either a .txt file or a .yml file with the desired package versions to be installed on the Spark pool. If you want packages pulled from a specific channel, you should use a .yml file.

For example, if you need to install the sentence-transformers and spark-nlp Python packages (both used for NLP on Spark), you would add these two lines to a .txt file and upload it as the input file for the workspace.

sentence-transformers===2.2.2
spark-nlp===4.4.1
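If you need a specific channel (as mentioned above), a .yml file is used instead of the .txt file. Here’s a minimal sketch, assuming the standard conda-style environment file format; the environment name, channel and package pins are only examples:

name: nlp-env
channels:
  - conda-forge
dependencies:
  - pip
  - pip:
    - sentence-transformers==2.2.2
    - spark-nlp==4.4.1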

I find this option the most robust way of installing Python packages on a Spark pool, since it gives you the flexibility to pick the desired package version and the specific channel to use during installation. It also saves you from reinstalling the packages every time the session gets refreshed.

Let me know your experience and the lessons learned during experiments with Apache Spark pools on Azure Synapse Analytics.

Here’s the Azure documentation on this for further reference.

Perspective Transformation of Coordinate Points on Polygons

In previous blog posts I wrote about some of the challenges I faced in my deep learning based experiments and the approaches I used to overcome them. This is going to be another one of those, where I explain a technique I used for calculating the perspective relationship between two different planes in a computer vision application.

Background:

Computer vision is widely used in surveillance applications, object detection and sports analytics. Mapping the imagery/video footage generated from a single camera, or from a set of cameras, to a relative space is one of the major tasks we may have to deal with. Mostly this need comes up when mapping people/object locations.

Use Case –

Imagine a sports analytics application where you capture a soccer game from a fixed camera and run a human detection algorithm on the image to find the player positions. That part is quite straightforward. (You can see it has been done in the following figure.) The tricky part is mapping the player positions, which are in the camera space, to the actual soccer field coordinates and generating a graph of player positions relative to the soccer field (or you may want to normalize the location coordinates). What we need is an output similar to the bottom-left one in the figure.

How to do that?

A soccer field is clearly rectangular in shape. So, if we know the frame-space coordinates of the 4 corners of the field, we can transform any point inside that polygon into a given coordinate space. In geometry this is called a “perspective transformation”. (This is a bit different from an affine transformation, which is the more common application.)

Perspective transformation

If you are interested in digging deeper to see how this mathematical transformation works, I strongly encourage you to follow this link and see the related matrix calculations behind the operation.
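In short, the trick is to build the 3×3 projective (homography) matrix that maps the unit square onto the four corners of the input polygon, and then apply its inverse to any camera-frame point. In the row-vector convention used by the code below, applying such a matrix to a point $(x, y)$ looks like

$$[X \quad Y \quad W] = [x \quad y \quad 1]\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & 1 \end{bmatrix}, \qquad x' = X/W, \quad y' = Y/W,$$

and the division by $W$ is exactly what makes the mapping perspective rather than affine.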

I found a pretty neat JavaScript snippet by Florian Segginger and ported the logic to a Python script.

import numpy as np
from matplotlib import pyplot as plt

resultRect = {
  'p1': {'x': 0, 'y': 0},
  'p2': {'x': 1, 'y': 0},
  'p3': {'x': 1, 'y': 1},
  'p4': {'x': 0, 'y': 1}
}

resultPoint = {'x': 0, 'y': 0}

# # solve function
# First, find the transformation matrix for our deformed inputPolygon
# [a b c]
# [d e f]
# [g h 1]
def perspective_transform(inputPolygon, point):
  x0 = inputPolygon['p1']['x']
  y0 = inputPolygon['p1']['y']
  x1 = inputPolygon['p2']['x']
  y1 = inputPolygon['p2']['y']
  x2 = inputPolygon['p3']['x']
  y2 = inputPolygon['p3']['y']
  x3 = inputPolygon['p4']['x']
  y3 = inputPolygon['p4']['y']

  X0 = resultRect['p1']['x']
  Y0 = resultRect['p1']['y']
  X1 = resultRect['p2']['x']
  Y1 = resultRect['p2']['y']
  X2 = resultRect['p3']['x']
  Y2 = resultRect['p3']['y']
  X3 = resultRect['p4']['x']
  Y3 = resultRect['p4']['y']


  dx1 = x1 - x2
  dx2 = x3 - x2
  dx3 = x0 - x1 + x2 - x3
  dy1 = y1 - y2
  dy2 = y3 - y2
  dy3 = y0 - y1 + y2 - y3

  a13 = (dx3 * dy2 - dy3 * dx2) / (dx1 * dy2 - dy1 * dx2)
  a23 = (dx1 * dy3 - dy1 * dx3) / (dx1 * dy2 - dy1 * dx2)
  a11 = x1 - x0 + a13 * x1
  a21 = x3 - x0 + a23 * x3
  a31 = x0
  a12 = y1 - y0 + a13 * y1
  a22 = y3 - y0 + a23 * y3
  a32 = y0

  transformMatrix = [
  [a11, a12, a13],
  [a21, a22, a23],
  [a31, a32, 1]
  ]

  #find inverse of matrix
  transformMatrix = np.array(transformMatrix)
  inv = np.linalg.inv(transformMatrix)

  #convert point to a matrix
  pointMatrix = np.array([point['x'], point['y'],1])

  #matrix multiplication
  resultMatrix = np.matmul(pointMatrix, inv)

  #result point
  resultPoint['x'] = resultMatrix[0] / resultMatrix[2]
  resultPoint['y'] = resultMatrix[1] / resultMatrix[2]

  return resultPoint

########

#perform transformation with an example

inputPolygon = {
  'p1': {'x': 158, 'y': 2044},
  'p2': {'x': 669, 'y': 573},
  'p3': {'x': 2797, 'y': 594},
  'p4': {'x': 3686, 'y': 2062}
}

point = {'x': 1800, 'y': 900}

resultPoint = perspective_transform(inputPolygon, point)

How to use this?

Pretty easy! You have to know 3 things.

1. Coordinates of the 4 corner points of the polygon to be transformed

2. Coordinates of the point (or points) to be transformed

3. The 4 corner points of the transformed polygon (this can be a rectangle or any 4-point polygon)

The perspective_transform method takes the input polygon coordinates and the point coordinates, and outputs resultPoint relative to the resultRect we have defined. (In this code I’ve used a 1×1 plane to map the points.)
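As a quick usage sketch (the coordinates below are hypothetical detections, not real model output), you can push every detected player position through the same function. Note that the function returns the shared resultPoint dictionary, so take a copy of each result:

# hypothetical player detections in camera-frame pixels
player_positions = [{'x': 1800, 'y': 900}, {'x': 2500, 'y': 1200}]

# copy each result because perspective_transform reuses the same resultPoint dict
field_positions = [dict(perspective_transform(inputPolygon, p)) for p in player_positions]
print(field_positions)  # normalized (0-1) field coordinates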

Feel free to use this method in your applications and let me know your thoughts on this. Cheers!

Using Hierarchical Data Format (HDF5) in Machine Learning

Example of HDF5 file structure : https://www.neonscience.org/resources/learning-hub/tutorials/about-hdf5

Machine learning or deep learning is not all about algorithms and training predictive models on some set of data. It involves a wide range of tools, techniques and computing approaches to handle various steps of the machine learning process pipeline.

Starting from a raw data point and going all the way to exposing the model as a REST API, there are numerous places where we need to pay attention to how data is handled. (Yes! Data is the key component of any ML/DL pipeline.)

In this article I’m bringing up a problem I faced during a deep learning experiment and the approach I took to overcome it. I’m pretty sure you will face similar issues if you use massive amounts of structured/unstructured data for training your deep learning models.

Here’s the issue I faced :

In order to train a computer vision related deep learning model, I had to write a PyTorch custom dataloader for loading a set of annotation data. The data points were stored in JSON format and, believe me, that massive JSON file was nearly 4GB! It was not a simple structure with keys and values, but a mixed set of data structures including lists, single float values and string keys.

As usual I wrote a PyTorch custom dataset class and tried to load the massive JSON file inside __init__. Yep! It crashed! There was not enough memory to handle such a big file. Can’t you move that into __getitem__? No, that’s not practical: loading the file on every call is far too inefficient. I had to think of a solution that doesn’t load the massive file into RAM as a whole, while still allowing the data inside the file to be retrieved by index.

(If you need some tips and tricks on writing PyTorch custom datasets, please refer to this article.)

What I did?

The first dumb idea I had was converting the data into a multidimensional numpy array and saving it to a file, but I figured out that just gives birth to another massive file and doesn’t solve my problem. Following a suggestion from my co-supervisor, I started looking at HDF5, the Hierarchical Data Format. Yes! It was the solution, and it solved my issue nicely.

What is Hierarchical Data Format (HDF5) ?

The Hierarchical Data Format version 5 (HDF5) is an open source file format that supports large, complex, heterogeneous data. It uses a ‘directory-like’ structure to store data. In simpler terms, an HDF5 file can be thought of as a file system (the way files and directories are stored on your computer) packed into a single file.

There are two important terms used in HDF5 format.

  • Groups – folder-like elements within the HDF5 file, which can contain subgroups or datasets.
  • Datasets – the actual data contained within the HDF5 file (numpy arrays etc.).

In simpler terms, if your data is large, complex and heterogeneous, and you need random access, HDF5 is most probably the best option to go forward with.

How to use HDF5?

We all speak Python when it comes to machine learning. Python supports the HDF5 format through the h5py package. Since this is a wrapper around the native HDF5 C API, it provides almost the full functionality.

Create HDF5 file from a JSON array

Here is a very brief code snippet for creating an HDF5 file from a JSON array containing data from the famous iris dataset. This is a sample of the JSON array I used. (You can get the full dataset from here.)

[
    {"sepalLength": 5.1, "sepalWidth": 3.5, "petalLength": 1.4, "petalWidth": 0.2, "species": 0},
    {"sepalLength": 5.7, "sepalWidth": 2.8, "petalLength": 4.5, "petalWidth": 1.3, "species": 1},
    {"sepalLength": 6.9, "sepalWidth": 3.1, "petalLength": 5.4, "petalWidth": 2.1, "species": 2}
]

Here I created a separate group for each entry. (3 JSON objects in the array means 3 groups in the HDF5 file.) The 5 data points in each object are stored as datasets.

import numpy as np
import json
import h5py

hdf5_filename = 'iris_hdf5.hdf5'

#read the iris.json file
with open('iris.json') as jsonfile:
    iris_data = json.load(jsonfile)

#create the HDF5 file
h = h5py.File(hdf5_filename, 'w')

#run a loop through all entries in the JSON array
for index, entry in enumerate(iris_data):
    for k, v in entry.items():
        dataset_name = str(index) + '/' + k  # the '/' separator creates one group per entry
        h.create_dataset(dataset_name, data=np.asarray(v, dtype=np.float32))
h.close()
print('Iris data HDF5 file created.')


#read data from HDF5
h_read = h5py.File(hdf5_filename, 'r')

#read a single entry
 
h_read['0'].keys() 
# output : <KeysViewHDF5 ['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth', 'species']>
np.asarray(h_read['0']['petalLength']) 
# output : array(1.4, dtype=float32)

h_read.close()

Though this is a very simple data structure, you can expand the same idea to complex and large files. You’ll find it pretty easy to use HDF5 instead of keeping huge lists inside __init__ of custom dataloaders. Here’s a rough sketch of the PyTorch custom dataset class I created for the above example.


import numpy as np
import h5py
import torch
from torch.utils.data import Dataset

hdf5_filename = 'iris_hdf5.hdf5'

class MyCustomDataset(Dataset):
    def __init__(self, filename=hdf5_filename):
        # All the data preparation tasks can be defined here.
        # The HDF5 file is referenced here and kept open as a handle.
        self.h_read = h5py.File(filename, 'r')
        # group names are the string indexes used when writing the file
        self.keys = sorted(self.h_read.keys(), key=int)

    def __getitem__(self, index):
        # Returns data (and labels) by indexing into the HDF5 file
        group = self.h_read[self.keys[index]]
        return torch.tensor(np.asarray(group['petalLength']))

    def __len__(self):
        # number of groups (examples) stored in the HDF5 file
        return len(self.keys)
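As a usage sketch (assuming the iris HDF5 file created above is on disk), the dataset plugs straight into a standard PyTorch DataLoader:

from torch.utils.data import DataLoader

dataset = MyCustomDataset()
loader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch in loader:
    print(batch)  # batches of petalLength values read lazily from the HDF5 file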

This is only one way of using the HDF5 file format in machine learning. Share your experiences with HDF5 here too. 🙂

What’s Your Workhorse?

I’m going to break the flow of the Azure Machine Learning blog series and write a bit of a descriptive answer to a frequently asked question.

Here’s that golden question!

What is the development rig and tools you use for deep learning/machine learning development?

We all know doing machine learning experiments comes with a cost in computation power, and when it comes to deep learning it’s very computationally expensive. As a researcher who does most of my experiments in computer vision and deep learning, this is the setup I use to get my jobs done (as of September 2020).

Disclaimer:

Please note that everything I mention here is my personal preference, not any kind of standard. None of the manufacturers or software vendors have sponsored me for this post 😀 (Most of the software tools I’m mentioning here are FOSS.)

Hardware

Choosing the right set of hardware is the most vital part of building a DL workstation. It may be quite expensive to go for the best set that satisfies your needs, but I see that as an investment. Make sure you don’t spend way too much just for the name or the brand; make sure it’s enough to do your job.

The main specifications to look at in a DL workstation are the processor, RAM, storage and the GPU (if you want to do computer vision or NLP kinds of experiments, this is a must!). I’m using a desktop with a 4-core Intel Xeon processor and 16GB of RAM. For storage I have an SSD as the primary drive, with all my software installed on it, and a fairly big (6TB) hard drive. Why such big storage? Since I have to deal with somewhat big image and video datasets, having a lot of storage is a need.

GPU! This is the pinnacle of your build, and probably the most expensive part. I have an NVIDIA GTX 1080, which has 8GB of GPU memory and supports CUDA-based processing. It comes in handy for most of my experiments. (When it’s not enough, I use a remote cluster with 2 NVIDIA 2080Tis for big computations.)

Make sure you have proper cooling for the machine! Believe me, otherwise these computations are going to generate a lot of heat and ruin your hardware.

Laptop or a desktop?

This is a hard choice to make. If you have to go from place to place to do your job, then you may have to choose a laptop. (Many gaming laptops come with GPUs that can be used for CUDA-based processing.) I’m more of a desktop person, and you can easily build a more powerful rig for the same amount of dollars you’d spend on a good enough laptop. (I don’t prefer MacBooks since they don’t come with CUDA-capable NVIDIA GPUs. I just use a light laptop for presentations and to carry around for meetings.)

On top of all these beasts, a two-monitor setup, a mechanical keyboard with a wireless mouse, a RODE USB mic for video conferences and recordings, and a Bose noise-cancelling headset for complete isolation are there to ease things up.

Software and Tools

This is the most interesting part. Even if you have the correct hardware, the wrong choice of tools will make your job hard. (Again, this is completely my choice of software. You may have different preferences.)

I use Ubuntu Linux as my operating system (still on 18.10 😀). Why Ubuntu? It’s so easy to set up and work with, and that’s my choice. Since Python is my primary programming language, it works like magic with all the Python environment dependencies and so on. (Plus OpenCV 😀)

So I said Python… what else? Along with most of the machine learning frameworks (yes, I have an Anaconda-based setup), PyTorch has become my go-to deep learning framework. Easy debugging, Pythonic syntax and wide support made me take this choice.

IDE

VSCode with installed extensions

No programmer is complete without his/her own IDE choice and modifications. Microsoft VSCode (which is open source and free) is the IDE I use. It supports Python, Spark, Scala and almost all the languages I use. I’ve added a few extensions which come in so handy for experiments. Here are some of the extensions I use. See if they suit you.

  • Python – makes sure I get all the indentation and tooltips right.
  • Remote Explorer – since I use a remote cluster to train my models, and sometimes a NAS for storage, this comes in handy for managing my SSH connections. Pretty convenient even for remote debugging.
  • Docker – a pretty easy way to manage your Docker environments.
  • Excel Viewer – just to view CSV files in style.
  • LaTeX Workshop and Code Spell Checker – all because I use LaTeX for scientific writing. Believe me, VSCode is a nice place to do word processing too 😀

With all these tweaks I use the dark theme 😀

That’s all about the hardware and software setup I use for my experiments. In a later post, I’ll talk about some MLOps tips and tricks I use within experiments.

AzureML Python SDK – Installation & Configuration

In the last blog post, we discussed developing a machine learning classifier with Azure Machine Learning service. As mentioned there, we’re going to utilize familiar development tools and frameworks for model development.

Key areas of the SDK include:

  • Explore, prepare and manage the lifecycle of your datasets used in machine learning experiments.
  • Manage cloud resources for monitoring, logging, and organizing your machine learning experiments.
  • Train models either locally or by using cloud resources, including GPU-accelerated model training.
  • Use automated machine learning, which accepts configuration parameters and training data. It automatically iterates through algorithms and hyperparameter settings to find the best model for running predictions.
  • Deploy web services to convert your trained models into RESTful services that can be consumed in any application.

~ Ref : https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py

The AzureML Python SDK acts as the connector between all the resources on the cloud and the dev environment.

Installing Python SDK –

The AzureML SDK can be easily installed on your local computer through pip. Refer to this guide for the installation process. I’d suggest going with the default installation since it’s enough for most of the operations we use in the experiment. It’s a good idea to upgrade the SDK before running an experiment since the package is updated rapidly.

Download config file –

To connect to the AzureML workspace, we need the Azure subscription ID, the resource group in which the workspace was created, and the workspace name. The easiest way to grab these details is to download the config.json file from the Azure portal. Place this file inside the experiment directory.

Downloading config.json from Azure portal
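If you prefer not to download the file, the same three values can also be passed explicitly through the SDK — a small sketch, with placeholder values that you would replace with your own:

from azureml.core import Workspace

# placeholder values – use your own workspace name, subscription ID and resource group
ws = Workspace.get(name='my-aml-workspace',
                   subscription_id='<subscription-id>',
                   resource_group='my-resource-group')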

Connect AzureML Workspace –

Connecting to the AzureML workspace and listing the resources can be done using simple Python syntax with the AzureML SDK (sample code is provided below). Refer to the Python SDK documentation to make modifications to the resources of the AML service.

#!pip install --upgrade azureml-sdk[notebooks]
import azureml.core
from azureml.core import Workspace
from azureml.core import ComputeTarget, Datastore, Dataset

print("Ready to use Azure ML", azureml.core.VERSION)
ws = Workspace.from_config()
print(ws.name, "loaded")

#View resources in the workspace 
print("Compute Targets:")
for compute_name in ws.compute_targets:
    compute = ws.compute_targets[compute_name]
    print("\t", compute.name, ":", compute.type)
    
print("Datastores:")
for datastore in ws.datastores:
    print("\t", datastore)
    
print("Datasets:")
for dataset in ws.datasets:
    print("\t", dataset)

print("Web Services:")
for webservice_name in ws.webservices:
    webservice = ws.webservices[webservice_name]
    print("\t", webservice.name)

In the next blog article, we’ll discuss data loading and experiment creation.

Build a Machine Learning Classifier with Azure Machine Learning Service

Azure Machine Learning Service is becoming the one-stop place for managing all ML-related workloads in the Azure cloud. There are two main advantages of using Azure Machine Learning Service for your ML and data science experiments.

#1 – You can manage the whole machine learning workflow in a single environment. From data wrangling to machine learning service deployment, everything is managed on the cloud with its reliable, scalable and efficient services.

#2 – You can use your familiar open source toolset, languages and frameworks for model development. Being an ML engineer or a data scientist, you may be using Python or R as your main development language. Azure Machine Learning allows you to use any of those languages and frameworks to develop your experiments.

Pima Indians Diabetes classification is one of the most famous machine learning experiments. It’s a binary classification problem which uses a .CSV-based tabular dataset as the input. I’ll walk you through the process I went through to perform my experiment.

Scenario:

  • Diabetes dataset is available as a .CSV file in your local file system.
  • I have to build a binary classifier trained with the dataset and deploy it as a web service with a REST endpoint.

Solution:

As shown in the diagram, I used the services and tools in AMLS along with my typical development environment to build the solution.

  • Step 1: Since the experiment is going to be built on the Azure cloud, I transferred my dataset into Azure blob storage. I used Azure Storage Explorer to upload the dataset to the cloud. (For better performance, make sure the dataset is in a storage blob in the same region as the AMLS experiment.)
  • Step 2: In order to access the data stored in the blob space, it’s registered inside AMLS as a datastore. (A rough code sketch for steps 2 and 3 follows this list.)
  • Step 3: AMLS supports two types of datasets. Since the .CSV file contains tabular data, it’s registered as a tabular dataset. (You can perform basic statistical operations and visualizations after registering it as a tabular dataset.)
  • Step 4: Now it’s time for the real job. Since I’m more familiar with Python and scikit-learn, I used them to develop my model. The whole coding part was done on a Linux machine using my favorite VSCode IDE. 😉 You may wonder how I’m going to connect the code base on my local machine with the cloud… Here’s where the AzureML Python SDK comes to the rescue.
  • Step 5: I don’t have enough computation power to do the model training on my machine. So, I use an Azure compute cluster to perform the computation. (In my experiment I did hyperparameter tuning to select the best parameters; using the compute cluster allowed me to perform parallel training.)
  • Step 6: After model training and getting the desired inference accuracy, I needed to expose the binary classification model as a web service. For that, I used Azure Container Instances (ACI) since this is a small testing experiment. (I may have to go for Azure Kubernetes Service (AKS) if I want a global, large-scale deployment.)
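As promised in steps 2 and 3, here’s a rough sketch of how the datastore and tabular dataset registration look with the AzureML Python SDK — the storage account, container and file names below are placeholders, not the actual ones from my experiment:

from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()

# Step 2: register the blob container holding the CSV as a datastore
blob_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='diabetes_datastore',
    container_name='datasets',
    account_name='<storage-account-name>',
    account_key='<storage-account-key>')

# Step 3: register the CSV as a tabular dataset
tab_dataset = Dataset.Tabular.from_delimited_files(path=[(blob_datastore, 'diabetes.csv')])
tab_dataset = tab_dataset.register(workspace=ws, name='diabetes-dataset')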

Yep! It’s just a simple 6-step process. Complex? Don’t worry, I’m going to walk you through the whole process, assisted with code snippets, in the upcoming blog posts. Stay tuned. Let’s start a real experiment with Azure Machine Learning Service.

Tensorboard with PyTorch


Tensorboard Interface

Training and evaluating deep learning models may take a lot of time. Sometimes it’s worth monitoring how well (or badly) the model is training in real time. It helps you understand, debug and optimize your models without waiting till the model is fully trained to check its performance. The good old method of printing out training losses/accuracy for each epoch works, but it’s a bit hard to compare metrics that way.

A real-time graphical interface that can plot/visualize metrics while a model is training through epochs or iterations would be the best option. Tensorboard is the visualization tool that came out with TensorFlow, and I’m pretty sure almost all TF folks are using it and getting the advantage of that cool tool.

So what about PyTorchians?? Don’t panic. The official PyTorch repository recently added a Tensorboard utility in PyTorch 1.1.0. Still, the code is experimental, and it was not working well for me.

Then I found this awesome open source project, tensorboardX. It’s pretty similar to what the official PyTorch repo has and is easy to work with. TensorboardX supports scalar, image, figure, histogram, audio, text, graph, onnx_graph, embedding, pr_curve and video summaries.

5 simple steps…

  1. Install tensorboardX
  2. Import tensorboardX in your PyTorch code
  3. Create a SummaryWriter object
  4. Define what the SummaryWriter should log (scalars, images etc.)
  5. Use it!

I just did a simple demo of this by adding Tensorboard logs to the famous PyTorch transfer learning tutorial. Here’s the GitHub repo. Just clone it and play around.

Note that in the experiment I’ve used two SummaryWriter objects to create two scalar graphs: one for the training phase and the other for the validation phase.

The log files will be created in the directory you specify when creating the SummaryWriter object. (You can change this directory to wherever you want.)
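A minimal sketch of that pattern looks something like this; the log directories and tag names are just examples, and the loss values are placeholders for whatever your training/validation loop produces:

from tensorboardX import SummaryWriter

train_writer = SummaryWriter('./logs/train')   # one writer per curve
val_writer = SummaryWriter('./logs/val')

for epoch in range(10):
    # placeholders for the losses your training/validation code would compute
    train_loss = 1.0 / (epoch + 1)
    val_loss = 1.2 / (epoch + 1)
    train_writer.add_scalar('loss', train_loss, epoch)
    val_writer.add_scalar('loss', val_loss, epoch)

train_writer.close()
val_writer.close()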

To view the Tensorboard, open a terminal inside the experiment folder. Assume that your log files are inside ‘./logs/’. Use the following command to spin up the Tensorboard server on your local machine.

$ tensorboard --logdir ./logs/

Sometimes you may use a remote server or a VM (it might be an Azure DLVM) for training your deep learning models. Then how do you get this Tensorboard out from there?

SSH tunneling with port forwarding is a good option for this. You just have to spin up the Tensorboard service on your remote machine, then tunnel the server back to your workstation with the ssh command stated below.

$ ssh -N -L 6007:127.0.0.1:6006 <username>@<remote_ip>

127.0.0.1:6006 : Tensorboard server running on the remote server / VM

6007 : local workstation port

You can then view the tensorboard running on the remote machine through your local machine’s browser.

http://localhost:6007

That’s it! Simple and neat. No need to wait a couple of days till the model gets trained. Just monitor it, and stop early if it’s not learning well.

Enjoy Deep Learning!

GPU Accelerated Application Deployment with NVIDIA-Docker

When it comes to deep learning model development and training, for me personally, the majority of the time is spent on data pre-processing and then on setting up the development environment. Cloud-based development environments such as Azure DLVM, Google Colab etc. are very good options when you don’t have much time to spend installing all the required packages on your workstation. But there are times when we want to develop on our own machines and train/deploy elsewhere (maybe on the client’s environment, on a machine with a better GPU for faster training, or on a Kubernetes cluster). Docker comes in handy in these scenarios.

Docker provides both hardware and software encapsulation, allowing portable deployment. If you are a data scientist, machine learning person or deep learning developer, I strongly recommend you give Docker a try, and I’m pretty sure it’ll make your life so much easier.

Alright! That’s Docker. Let’s assume you are now using Docker to deploy your deep learning applications, and you want to use it to ship your deep learning model to a remote computer with a powerful GPU, which lets you use large mini-batch sizes and speed up your training. Though Docker containers solve the problem of framework and platform dependencies, they are also hardware-agnostic. This creates a problem!

Have you ever tried to access the GPU on the host computer from a program running inside a Docker container? Sadly, Docker does not natively support NVIDIA GPUs within containers.

The early workaround was installing the NVIDIA drivers inside the Docker container. It’s a bit of a hassle, as the driver version installed in the container has to match the driver on the host.

To make Docker images that use GPU resources more portable, NVIDIA has introduced nvidia-docker!

nvidia-docker

NVIDIA-Docker plugin enables GPU accelerated application deployment

Nvidia-docker is a wrapper around the docker command that mounts the GPU of the host machine into the Docker container. The only thing you should pay attention to is the CUDA version you want to use.

So, in which scenarios can you use this? In my case, nvidia-docker comes in handy when I’m running my experiments on a cluster with more GPU power. What I do is just containerize all my code into a Docker image and run it on the remote machine with nvidia-docker. (Windows folks… nvidia-docker is still not available for Windows hosts. Not sure whether that is on the development timeline or not 😀)

Here’s the official GitHub repo for nvidia-docker. Just install it, make sure to restart your Docker engine, and set nvidia-docker as the default Docker runtime. The rest is the same as building and running a typical Docker image.

Here’s a simple Dockerfile I wrote for containerizing my PyTorch code. I’ve used CUDA 9.1. You can modify this for your needs.

FROM nvidia/cuda:9.1-base-ubuntu16.04

# Install some basic utilities
RUN apt-get update && apt-get install -y \
curl \
ca-certificates \
sudo \
git \
bzip2 \
libx11-6 \
&& rm -rf /var/lib/apt/lists/*

# Create a working directory
RUN mkdir /app
WORKDIR /app

# Create a non-root user and switch to it
RUN adduser --disabled-password --gecos '' --shell /bin/bash user \
&& chown -R user:user /app
RUN echo "user ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/90-user
USER user

# All users can use /home/user as their home directory
ENV HOME=/home/user
RUN chmod 777 /home/user

# Install Miniconda
RUN curl -so ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-4.5.1-Linux-x86_64.sh \
&& chmod +x ~/miniconda.sh \
&& ~/miniconda.sh -b -p ~/miniconda \
&& rm ~/miniconda.sh
ENV PATH=/home/user/miniconda/bin:$PATH
ENV CONDA_AUTO_UPDATE_CONDA=false

# Create a Python 3.6 environment
RUN /home/user/miniconda/bin/conda install conda-build \
&& /home/user/miniconda/bin/conda create -y --name py36 python=3.6.5 \
&& /home/user/miniconda/bin/conda clean -ya
ENV CONDA_DEFAULT_ENV=py36
ENV CONDA_PREFIX=/home/user/miniconda/envs/$CONDA_DEFAULT_ENV
ENV PATH=$CONDA_PREFIX/bin:$PATH

# Install PyTorch with Cuda 9.1 support
RUN conda install -y -c pytorch \
cuda91=1.0 \
magma-cuda91=2.3.0 \
pytorch=0.4.0 \
torchvision=0.2.1 \
&& conda clean -ya
RUN conda install opencv

# Install other dependencies from pip 
#My requirements.txt file just contains the following packages I used for the code. Change this for your needs.
#numpy==1.14.3
#torch==0.4.0
#torchvision==0.2.1
#matplotlib==2.2.2
#tqdm==4.28.1
COPY requirements.txt .
RUN pip install -r requirements.txt

# Create /data directory so that a container can be run without volumes mounted
RUN sudo mkdir /data && sudo chown user:user /data

# Copy source code into the image
COPY --chown=user:user . /app

# Set the default command to python3
CMD ["python3"]

Here are the bash commands for building the image and running it with the NVIDIA runtime.

# 1. Build the image and tag it
docker build -t <dockerImage> .

# 2. Run the docker image with the NVIDIA runtime
docker run \
--runtime=nvidia -it -d \
--rm <dockerImage> python3 <yourCode.py>

 

Just try it and see how easy your deep learning life becomes! Happy coding! 🙂

Deploy Machine Learning Models in a Production environment as APIs (Python Flask + Visual Studio)

Intelligent application building basically consists of integrating machine learning based predictive components into apps and systems. Mostly, data scientists or AI engineers are accountable for building these machine learning models.

When it comes to integration and deployment in a production environment, the problem occurs with platform dependency. Most data scientists and AI engineers are pretty comfortable with Python or R and develop their models with them, while the rest of the system may be a .NET or Java based application.

One of the best approaches to connect these components is deploying the ML predictive module as a web API and calling the API from the application. When it comes to APIs, any programmer can work with them once they have the API definition.

Flask is a small and powerful web framework for Python. It’s easy to learn and simple to use, enabling you to build your web app in a short amount of time. Visual Studio provides an easy way to create Python Flask web applications through its templates. Here are the steps I went through to deploy the ML experiment as a REST API.

01. Create the machine learning model, train, tune and evaluate it.

What I’ve done here is a simple linear regression for predicting the monthly salary according to the years of experience. The scikit-learn Python library has been used for the regression. The dataset used for the experiment is from SuperDataScience.

The code is available in the GitHub repository .

02. Creating the pickle

When you deploy the predictive model in a production environment, there is no need to train the model with code again and again. Python has a built-in method of persisting objects called pickle. The pickle module can serialize objects or data into a file that we can save and load from. You can just use the pickle as a binary reference for generating the output. scikit-learn has its own model persistence method we will use: joblib. This is more efficient for scikit-learn models because it is better at handling the large numpy arrays that may be stored in the models.
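A minimal sketch of that step is below; the toy training data and the file name are just placeholders standing in for the regression model trained in step 01:

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# toy stand-in for the salary regression from step 01 (made-up numbers)
X = np.array([[1], [2], [3], [4]])            # years of experience
y = np.array([40000, 45000, 50000, 55000])    # monthly salary
regressor = LinearRegression().fit(X, y)

# persist the trained model once...
joblib.dump(regressor, 'salary_model.pkl')

# ...and load it later inside the API code without retraining
regressor = joblib.load('salary_model.pkl')
print(regressor.predict([[5]]))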

03. Create a Python Flask web application.

Simply go to Visual Studio. (I’m using VS2017, which comes with Python by default.) Select a web project. The step-by-step guide is here. I would recommend going with option 2 mentioned in the blog because it reduces a lot of unnecessary overhead.

To be on the safe side, use Python virtual environments. They avoid many of the hassles that occur with library dependencies. I’ve used an Anaconda environment as the base of the virtual environment.


04. Create the API.

Create a new Python file in your project and set it as the startup file. (In my case, MLService.py is the startup file which contains the API code.) The pickle file that contains the model binaries is the only dependency the API needs when it is deployed.

Here the API operates through POST methods which accept the input as JSON.
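A rough sketch of what such an endpoint can look like — this is not the exact MLService.py from the repo; the route, port and JSON field names are assumptions for illustration:

import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load('salary_model.pkl')   # the persisted model from step 02

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()                        # e.g. {"experience": 5}
    years = float(data['experience'])
    prediction = model.predict([[years]])            # scikit-learn expects a 2D array
    return jsonify({'salary': float(prediction[0])})

if __name__ == '__main__':
    app.run(port=5000)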

05. Run & Test

You can run the API and test it by sending POST requests to the URL with a JSON body. Here I’ve used Postman to send a POST request, and it gives me the predicted salary for the entered years of experience.
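The same test can be scripted instead of using Postman — a small sketch, assuming the endpoint from the sketch above is running locally:

import requests

resp = requests.post('http://localhost:5000/predict', json={'experience': 5})
print(resp.json())   # e.g. {"salary": ...}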


You can access the whole code of the project through my GitHub repo here.


    Do comment if you have any suggestions to change the API structure.

Handling Big snakes on Visual Studio

In the last post we discussed setting up a Windows rig for deep learning. If you still haven’t set up your machine, go do that first :D

After getting the so-called big snakes, Python and Anaconda, onto the machine, we need a proper IDE for coding.

There are many good IDEs you can use in a Windows environment to code in Python. PyCharm and Spyder are some popular tools.

If you are familiar with Visual Studio, the so-called father of all IDEs, Python works smoothly with VS. There are a few configurations that need to be done.

No need to purchase Visual Studio Enterprise or Ultimate. The freely available Visual Studio Community edition works fine. In the 2017 version, Python ships alongside the default installation options. For earlier versions you need to install Python Tools for Visual Studio (PTVS) separately.

https://docs.microsoft.com/en-us/visualstudio/python/python-in-visual-studio

Refer to this guide for more details.

The Python environments configured on the machine can be seen in the ‘Python Environments’ pane of Visual Studio. (If it’s not there, go to Tools -> Python -> Python Environments.)


By default, your Anaconda environment and the default Python environment should be there. First, refresh those environments so IntelliSense can pick the installed libraries up into its database.

For our deep learning experiments, we configured a separate Python environment earlier. To add that environment to Visual Studio, follow these steps.

01. Click Custom in ‘Python Environments’

02. Go to your Anaconda environments and activate the environment you pre-configured for deep learning (mine is tensorflow-gpu)


03. Copy the interpreter path of the environment

04. Insert it as the interpreter path and click ‘Auto Detect’. Visual Studio will detect the rest


05. Click Apply

It may take a few minutes to refresh the packages as well as IntelliSense. Make the configured environment your default and open the interactive window. You are good to go 😊