Streamline Your Machine Learning Workflow with Docker: A Beginner’s Guide

Posted on February 27, 2023 by Haritha Thilakarathne

Docker has revolutionized the way applications are developed and deployed by making it easier to build, package, and distribute applications as self-contained units that can run consistently across different environments.

I’ve been using docker as a primary tool with my machine learning experiments for a while. If you interested in reading how sweet the docker + machine learning combo is; you can hop to my previous blog post here.

Docker is important in machine learning for several reasons:

Reproducibility: In machine learning, reproducibility is important to ensure that the results obtained in one environment can be replicated in another environment. Docker makes it easy to create a consistent, reproducible environment by packaging all the necessary software and dependencies in a container. This ensures that the code and dependencies used in development, testing, and deployment are the same, which reduces the chances of errors and improves the reliability of the model.

Portability: Docker containers are lightweight and can be easily moved between different environments, such as between a developer’s laptop and a production server. This makes it easier to deploy machine learning models in production, without worrying about differences in the underlying infrastructure.

Scalability: Docker containers can be easily scaled up or down to handle changes in the demand for machine learning services. This allows machine learning applications to handle large amounts of data and processing requirements without requiring a lot of infrastructure.

Collaboration: Docker makes it easy to share machine learning models and code with other researchers and developers. Docker images can be easily shared and used by others, which promotes collaboration and reduces the amount of time needed to set up a new development environment.

Overall, Docker simplifies the process of building, deploying, and managing machine learning applications, making it an important tool for data scientists and machine learning engineers.

Alright… Docker is important. How can we get started?

I’ll share the procedure I normally follow with my ML experiments. Then demonstrate the way we can containerize a simple ML experiment using docker. You can use that as a template for your ML workloads with some tweaks accordingly.

I use python as my main programming language and for deep learning experiments, I use PyTorch deep learning framework. So that’s a lot of work with CUDA which I need to work a lot with configuring the correct environment for model development and training.

Since most of us use anaconda as the dev framework, you may have the experience with managing different virtual environments for different experiments with different package versions. Yes. That’s one option, but it is not that easy when we are dealing with GPUs and different hardware configurations.

In order to make my life easy with ML experiments, I always use docker and containerize my experiments. It’s clean and leave no unwanted platform conflicts on my development rig and even on my training cluster.

Yes! I use Ubuntu as my OS. I’m pretty sure you can do this on your Mac and also in the Windows PC (with bit of workarounds I guess)

All you need to get started is installing docker runtime in your workstation. Then start containerizing!

Here’s the steps I do follow:

I always try to use the latest (but stable) package versions in my experiments.
After making sure I know all the packages I’m going to use within the experiment, I start listing down those in the environment.yml file. (I use mamba as the package manager – which is similar to conda but bit faster than that)
Keep all the package listing on the environment.yml file (this makes it lot easier to manage)
Keep my data sources on the local (including those in the docker image itself makes it bulky and hard to manage)
Configure my experiments to write its logs/ results to a local directory (In the shared template, that’s the results directory)
Mount the data and results to the docker image. (it allows me to access the results and data even after killing the image)
Use a bash script to build and run the docker container with the required arguments. (In my case I like to keep it as a .sh file in the experiment directory itself)

In the example I’ve shared here, a simple MNIST classification experiment has been containerized and run on a GPU based environment.

Github repo : https://github.com/haritha91/mnist-docker-example

I used a ubuntu 20.04 base image from nvidia with CUDA 11.1 runtime. The package manager used here is mamba with python 3.8.

FROM nvidia/cuda:11.1.1-base-ubuntu20.04

# Remove any third-party apt sources to avoid issues with expiring keys.
RUN rm -f /etc/apt/sources.list.d/*.list

# Setup timezone (for tzdata dependency install)
ENV TZ=Australia/Melbourne
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime

# Install some basic utilities and dependencies.
RUN apt-get update && apt-get install -y \
    curl \
    ca-certificates \
    sudo \
    git \
    bzip2 \
    libx11-6 \
    libgl1 libsm6 libxext6 libglib2.0-0 \
 && rm -rf /var/lib/apt/lists/*

# Create a working directory.
RUN mkdir /app
WORKDIR /app

# Create a non-root user and switch to it.
RUN adduser --disabled-password --gecos '' --shell /bin/bash user \
 && chown -R user:user /app
RUN echo "user ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/90-user
USER user

# All users can use /home/user as their home directory.
ENV HOME=/home/user
RUN chmod 777 /home/user

# Create data directory
RUN sudo mkdir /app/data && sudo chown user:user /app/data

# Create results directory
RUN sudo mkdir /app/results && sudo chown user:user /app/results


# Install Mambaforge and Python 3.8.
ENV CONDA_AUTO_UPDATE_CONDA=false
ENV PATH=/home/user/mambaforge/bin:$PATH
RUN curl -sLo ~/mambaforge.sh https://github.com/conda-forge/miniforge/releases/download/4.9.2-7/Mambaforge-4.9.2-7-Linux-x86_64.sh \
 && chmod +x ~/mambaforge.sh \
 && ~/mambaforge.sh -b -p ~/mambaforge \
 && rm ~/mambaforge.sh \
 && mamba clean -ya

# Install project requirements.
COPY --chown=user:user environment.yml /app/environment.yml
RUN mamba env update -n base -f environment.yml \
 && mamba clean -ya

# Copy source code into the image
COPY --chown=user:user . /app

# Set the default command to python3.
CMD ["python3"]

In the beginning you may see this as an extra burden. Trust me, when your experiments get complex and you start working with different ML projects parallelly, docker is the lifesaver you have and it’s going to save you a lot of unnecessary time you waste on ground level configurations.

Happy coding!

Data Selection for Machine Learning Models

Posted on July 30, 2022 by Haritha Thilakarathne

Data is the key component of machine learning. Thus, high quality training dataset is always the main success factor of a machine learning training process. A good enough dataset leads to more accurate model training, faster convergence as well as it’s the main deciding factor on model’s fairness and unbiases too.
Let’s discuss the dos and don’ts when selecting/ preparing a dataset for training a machine learning model and the factors we should consider when composing the training data. These are valid for structured numerical data as well as for unstructured data types such as images and videos.

What does the dataset’s distribution look like?

This is important mostly with numerical datasets. Calculating and plotting the frequency distribution (how often each value occurs in the dataset) of a dataset leads us to the insights on the problem formation as well as on class distribution. ML engineers tend to have datasets with normal distribution to make sure they are having sufficient data points to train the models.

Though normal distribution is more common in nature and psychology, there’s no need of having a normal distribution on every dataset you use for model training. (Obviously some real-world data collections don’t fit for the noble bell curve).

Does the dataset represent the real world?

We are training machine learning models to solve real world problems. So, the data should be real too. It’s ok to use synthetic data if you are having no other option to collect more data or need to balance the classes, but always make sure to use the real -world data since it makes the model more robust on testing/ production. Please don’t put some random numbers into a machine learning model and expect that to solve your business problem with 90% accuracy 😉

Does the dataset match the context?

Always we must make sure the characteristics of the dataset used for training the model matches with the conditions we have when the model goes live in production. For example, will say we need to train a computer vision model for a mobile application which identify certain types of tree leaves from images captured from the mobile camera. There’s no use of training the model with images only captured in a lab environment. You should have images which are captured in wild (which is closely similar for the real-world use case of the application) in your training set.

Is the data redundant?

Data redundancy or data duplication is important point to pay attention when training ML models. If the dataset contains the same set of data points repeatedly, model overfits for that data points and will not perform well in testing. (mostly underfitting)

Is the dataset biased?

A bias dataset never produces an unbiased trained model. Always the dataset we choose should be balanced and not bias to certain cases.
Let’s get an example of having a supervised computer vision model which identifies the gender of people based on their face. Will assume the model is trained only with images from people from USA and we going to use it in an application which is used world-wide. The model will produce unrealistic predictions since the training context is bias to a certain ethnicity. To get a better outcome, the training set should have images from people from different ethnicities as well as from different age groups.

Which data is there too little/too much of?

“How much data we need to train a model with good accuracy?” – This is a question which comes out quite often when we planning ML projects. The simple answer is – “we don’t know!” 😀
There are no exact numbers on how much of data needed for training a ML model. We know that deep learning models are data hungry. Yes, we need to have large datasets for training deep neural networks since we are using those for representing non-linear complex relationships. Even with traditional machine learning algorithms, we should make sure to have enough data from all the classes even from the edge/ corner cases.
What will happen if we have too much data? – that doesn’t help at all. It only makes the training process lengthy and costly without getting the model into a good accuracy. This may end up producing an overfitted trained model too.

These are only very few points to consider when selecting a dataset for training a machine learning model. Please add your thoughts on dataset selection in comments.

Different Approaches to Perform Machine Learning Experiments on Azure

Posted on May 30, 2022 by Haritha Thilakarathne

We have discussed a lot about Azure Machine Learning Studio; the one-stop portal for all ML related workloads on Azure cloud. AzureML Studio provides different approaches to work on the machine learning experiments based on needs, resources and constraints you have. Selecting the plan to attack is completely your choice.

We all have our own way of performing machine learning experiments. While some prefer working on Jupyter notebooks, some are more into less code environments. Being able to onboard data scientists with their familiar development environment without a big learning overhead is one of the main advantages of AzureML.

In this article, let’s have a discussion on different methods available in AzureML studio and their usage in practical scenarios. We may discuss pros and cons of each approach as well.

Please keep in mind that, these are my personal thoughts based on the experiences I had with ML experiments and this may change in different scenarios.

Automated ML

As the name implies, this is all automated. Automated ML is the easiest way of producing a predictive model just in few minutes. You don’t need to have any coding experience to use Automated ML. Just need to have an idea on machine learning basics and an understanding on the problem you going to solve with ML.

The process is pretty straight forward. You can start with selecting the dataset you want to use for ML model training and specify the ML task you want to perform. (Right now, it supports classification, regression, time series forecasting. Computer vision and NLP tasks are in preview). Then you can specify the algorithms you want to test it with and other optional parameters. Azure does all the hard work for you and provides a deployment ready model which can be exposed as a REST API.

Pros:

Zero code process.
Easy to use and well suited for fast prototyping.
Eliminate the environment setup step in ML model development
Limited knowledge on machine learning is needed to get a production viable result.

Cons:

Limited machine learning capabilities.
Right now, only works with supervised learning problem scenarios.
Works well with relational data, computer vision and NLP are still in preview.
There’s no way of using custom machine learning algorithms in the process.

Azure ML Designer

Azure ML Designer is an upgraded version of the pretty old Azure ML Studio drag and drop version. Azure ML Designer is having a similar drag and drop interface for building machine leaning experiment pipelines. You have a set of prebuilt components which you can connect together in a flowchart like manner to build machine learning experiments. You have the ability to use SQL queries or python/ R scripts if you want in the process too. After training a viable ML model, you can deploy it as a web service with just few clicks.

I personally prefer this for prototyping. Plus, I see a lot f potential on Azure ML designer in educational purposes. It’s really easy to visualize the ML process through the designer pipelines and it increases the interpretability of the operation.

Pros:

Zero code/ Less code environment
Easy to use graphical interface
No need to worry on development/ training environment configurations
Having the ability to expand the capabilities with python/ R scripts
Easy model deployment

Cons:

Less flexibility for complex ML model development.
Less support for deep learning workloads.
Code versioning should handle separately.

Azure ML notebooks

This maybe the most favourite feature on Azure ML for data scientists. I know Jupyter notebooks are like bread and butter for data scientists. Azure ML offers a fully managed notebook experience for them without giving them the hassle of setting up the dev environment on local computers. You just have to connect the notebook with a compute instance and it allows you to do your model development and training on cloud in the same way you did on a local compute or elsewhere on notebooks.

I would recommend this as the to-go option for most of the machine learning experiments since it’s really easy to spin up a notebook instance and get the job done. Most of the ML related libraries are pre-installed on the compute instance and you even have the flexibility to install 3^rd party packaged you need through conda or pip.

Pros:

Familiar notebook experience on cloud.
Option to use different python kernels.
No need to worry about dev environment setup on local compute.
Can use the powerful compute resources on Azure for model training.
Flexibility to install required libraries through package managers.

Cons:

Comes with a price for computation.
No direct support for spark workloads.
Code version control should manage separately.

Developing on local and connect to AzureML service through AML Python SDK.

This is the option I would suggest for more advanced users. Think of a scenario where you have a deep learning based computer vision experiment to run on Azure with a complex code base. If this is the case, I would definitely use AzureML python SDK and connect my prevailing code base with the AzureML service.

In this approach, your code base sits on your local computer and you are using Azure for model training, deployment and monitoring purposes. You have the total flexibility of using the power of cloud for computations as well as the flexibility of using local machine for development.

Pros:

Total flexibility in performing machine learning experiments with our comfortable dev tools.
AzureML python SDK is an open-source library.
Code version controlling can be handled easily.
Whole ML process can be managed using scripts. (Easy for automation)

Cons:

Setting up the local development environment may take some effort.
Some features are still in experimental stage.

Choosing the most convenient approach for your ML experiment is totally based on the need and resources you have. Before getting into the big picture, start small. Start with a prototype, then a workable MVP, gradually you can move forward with expanding it with complex machine learning approaches.

What’s your most preferred way of model development from these options? Please mention in the comments.

Cheers!

MLOps : Let’s start plumbing ML experiments!

Posted on April 22, 2022 by Haritha Thilakarathne

What’s all this hype on MLOps? What’s the difference between machine learning and MLOps? Is MLOps essential? Why we need MLOps? Through this article series we going to start a discussion on MLOps to get a good start with the upcoming trend. The first post is not going to go deep with technicalities, but going to cover up essential concepts behind MLOps.

What is MLOps?

As the name implies, it is obviously having some connection with DevOps. So, will see what DevOps is first.

“A compound of development (Dev) and operations (Ops), DevOps is the union of people, process, and technology to continually provide value to customers.”
Microsoft Docs

This is the formal definition of DevOps. In the simpler terms, DevOps is the approach of streamlining application development life cycle of software development process. It ensures the quality engineering and security of the product while making sure the team collaboration and coordination is managed effectively.

Imagine you are a junior level developer in a software development company who develops a mission critical system for a surveillance application. DevOps process make sure each and every code line you write is tracked, managed and integrated to the final product reliably. It doesn’t stop just by managing the code base. It involves managing all development life cycle steps including the final deployment and monitoring of the final product iteratively too.

That’s DevOps. Machine Learning Operations (MLOps) is influenced by DevOps principles and practices to increase the efficiency of machine learning workflows. Simply, it’s the way of managing ML workflows in a streamlines way to ensure quality, reliability, and interpretability of machine learning experiments.

Is MLOps essential?

We have been playing around with machine learning experiments with different tools, frameworks and techniques for a while. To be honest, most of our experiments didn’t end up in production environments :D. But, that’s the ultimate goal of predictive modeling.

Machine Learning experiment is an iterative process
Source : https://azure.microsoft.com/en-au/resources/gigaom-delivering-on-the-vision-of-mlops/

Building a machine learning model and deploying it is not a single step process. It starts with data collection and goes in an iterative life cycle till monitoring the deployed model in the production environment. MLOps approaches and concepts streamline these steps and interconnect them together.

Answer is Yes! We definitely need MLOps!

Why we need MLOps?

As I said earlier, MLOps interconnect the steps in ML life cycle and streamline the process.

I grabbed these points from Microsoft docs. As it implies, these are the goals of MLOps.

Faster experimentation and development of models

Good MLOps practices leads for more code and component reusability which leads for faster experiments and model development. For an example, without having separate training loops or data loading components for each experiment, we can reuse an abstract set of methods for those tasks and connect them with a machine learning pipeline for running different experiment configurations. That’s make the life easy of the developer a lot!

I do lot of experiments with computer vision. In my case, I usually have a set of abstract python methods that can be used for model training and model evaluation. When performing different experiments, I pass the required parameters to the pipeline and reuse the methods which makes the life easy with less coding hassle.

Faster deployment of models into production

Machine learning model deployment is always a tricky part. Managing the endpoints and making sure the deployment environment is having all the required platform dependencies maybe hard to keep track with manual processes. A streamlines MLOps pipeline helps to manage deployments by enabling us to choose which trained model should go for production etc. by keeping track of a model registry and deployment slots.

Quality assurance and end-to-end lineage tracking

Maintaining good coding practices, version controlling, dataset versioning etc. ensures the quality of your experiments. Good MLOps practices helps you to find out the points where errors are occurring easily rather than breaking down the whole process. Will say your trained model is not performing well with the testing data after sometime from model deployment. That might be caused by data drift happened with time. Correctly configured MLOps pipeline can track such changes in the inference data periodically and make sure to notify such incidents.

Trustworthiness and ethical AI

This is one of the most important use cases of MLOps. It’s crucial to have transparency in machine learning experiments. The developer/ data scientist should be able to interpret each and every decision they took while performing the experiment. Since handling data is the key component of ML model, there should be ways to maintain correct security measures in experiments too. MLOps pipelines ensure these ethical AI principles are met by streamlining the process with a defined set of procedures.

How we gonna do this?

Now we all know MLOps is crucial. Not just having set of python scripts sitting in a notebook it’s all about interconnecting all the steps in a machine learning experiments together in an iterative process pipeline. There are many methods and approaches to go forward with. Some sits on-prem while most of the solutions are having hybrid approach with the cloud. I usually use lot of Azure services in my experiments and Azure machine learning Studio provides a one-stop workbench to handle all these MLOps workloads which comes pretty handy. Let’s start with a real-world scenario and see how we can use Azure Machine Learning Studio in MLOps process to streamline our machine learning experiments.

FAQs on Machine Learning Development – #AskNaadi Part 1

Posted on February 14, 2022 by Haritha Thilakarathne

Happy 2022!

It’s almost 7 years since I started playing with machine learning and related domains. These are some FAQs that comes for me from peers. Just added my thoughts on those. Feel free to any questions or concerns you have on the domain. I’ll try my best to add my thoughts on that. Note that all these answers are my personal opinions and experiences.

01. How to learn the theories behind machine learning?

The first thing I’d suggest would be ‘self-learning’. There are plenty of online resources out there where you can start studying by your own. Most of them are free. Some may need a payment for the certification (That’s totally up to you to pay and get it). I’ve listed down some of the famous places to get a kickstart for learning AI. Just take a look here.

Next would be keep practising. Never stop coding and training models in various domains. Kaggle is a good place to sharpen your skill set. Keep learning and keep practising at the same time.

02. Do we really need mathematics for ML?

Yes. To some extend you should know the theories behind probability and and some from basic mathematics. No need to worry a lot on that. As I said previously, there are plenty of places to catch up your maths too.

03. Is there a difference between data analysis and machine learning?

Yes. There is. Data analysis is about find pattern in the prevailing data and obtain inferences due to those patterns. It may have the data visualization components too. When is comes to machine learning, you train a system to learn those patterns and try to predict the upcoming pattern.

04. Does the trend in AI/ML going to fade out in the near future?

Mmm.. I don’t think so. Can’t exactly say AI is going to be ‘the’ future. Since all these technical advancements going to generate hell a lot of data, there should be a way to understand the patterns of those data and get a value out of that. So, data science and machine learning is going to be the approach to go for.

Right… those are some general questions I frequently get from people. Let’s move into some technicalities.

05. What’s the OS you use on your work rig?

Ubuntu! Yes it’s FOSS and super easy to setup all the dependencies which I need on it. (I did a complete walk through on my setup previously. Here’s it). Sometimes I use Windows too. But if it’s with docker and all, yes.. Ubuntu is the choice I’m going with.

06. What’s your preferred programming language to perform machine learning experiments?

I’m a Python guy! (Have used R a little)

07. Any frameworks/ libraries you use most in your experiments?

Since am more into deep learning and computer vision, I use PyTorch deep learning framework a lot. NumPy, Sci-kit learn, Pandas and all other ML related Python toolkits are in my toolbox always.

08. Machine learning is all about neural networks right?

No it’s not! This is one of the biggest myths! Artificial neural networks (ANNs) are only one family of algorithms which we can perform machine learning. There are plenty of other algorithms which are widely used in performing ML. Decision trees, Support Vector Machines, Naive Bayes are some popular ML algorithms which are not ANNs.

09. Why we need GPUs for training?

You need GPUs when you need to do parallel processing. The normal CPUs we have on our machines are typically having 4-5 cores and limited number of threads can be handled simultaneously. When it comes to a GPU, it’s having thousands of small cores which can handle thousands of computational threads in parallel. (For an example Nvidia 2080Ti is having 4352 CUDA cores in it). In Deep learning, we have to perform millions or calculations to train models. Running these workloads in GPUs is much faster and efficient.

10. When to use/ not to use Deep learning?

This is a tricky questions. Deep learning is always good in understanding the non-linear data. That’s why it’s performing really well in computer vision and natural language processing domains. If you have a such task, or your feature space is really large and having a massive amount of data, I’d suggest you to go with deep learning. If not sticking with traditional machine learning algorithms might be the best case.

11. Do I need to know all complex theories behind AI to develop intelligent applications?

Yes and No. In some cases, you may have to understand the theories behind AI/ML in order to develop a machine learning based applications. Mostly I would say model training and validation phases need this knowledge. Will say you are a software developer who’s very good with .Net/ Java and you are developing an application which is having a component where you have to read some text from a scanned document. You have to do it using computer vision. Fortunately you don’t have to build the component from the scratch. There are plenty of services which can be used as REST endpoints to complete the task. No need to worry on the underlying algorithms at all. Just use the JSON!

12. Should I build all my models from scratch?

This is a Yes/No answer too. This question comes mostly with deep learning model development. In some complex scenarios you may have to develop your models from the scratch. But most of the cases the problem you having can be defined as a object detection/ image classification/ Key phrase extraction from text etc. kinda problem. The best approach to go forward would be something like this.

Use a simple ANN and see your data loading and the related things are working fine.

Use a pre-trained model and see the performance (A widely used SOTA model would be the best choice).

If it’s not working out, do transfer learning and see the accuracy of the trained model. (You should get good results most of the times by this step)

Do some tweaks to the network and see if it’s working.

If none of these are working, then think of building a novel model.

13. Is cloud based machine learning is a good option?

In most of the industrial use cases yes! Since most of the data in prevailing systems are already sitting in the cloud and industries are heavily relying on cloud services these days, cloud based ML is a good approach. Obviously it comes with a price. When it comes to research phases, the price of purchiasing computation power maybe a problem. In those cases, my approach would be doing the research phase on-prem and moving the deployment to the cloud.

14. I’ve huge computer vision datasets to be trained? Shall I move all my stuff to the cloud?

Ehh… As I said previously, if you planning on a research project, which goes for a long time and need a lot of computational hours, I’d suggest to go with a local setup first, finalize the model and then move to the cloud. (If dollars aren’t your problem, no worries at all! Go for the cloud! Obviously it’s easy and more reliable)

15. Which cloud provider to choose?

There’s a lot of cloud providers out there having various services related to ML. Some provides out of the box services where you can just call and API to do the ML tasks (Microsoft Cognitive services etc. ). There are services where you can use your own data to train prevailing models (Custom Vision service by Azure etc.)

If you want end-to-end ML life cycle management, personally I find Azure ML service is a good solution since you can use any of your ML related frameworks and tools and just use cloud to train, manage and deploy the models. I find MLOps features that comes with Azure Machine Learning is pretty useful.

16. I’ve trained and deployed a pretty good machine learning model. I don’t need to touch that again right?

No way! You have to continuously check their performance and the accuracy they are providing for the newest data that comes to the service. The data that comes into the service may skewed. It’s always a good idea to train the model with more data. So better to have a re-training MLOps pipelines to iteratively check your models.

17. My DL models takes a lot of time to train. If I have more computation power the things will speed up?

mm.. Not all the time. I have seen cases where data loading is getting more time than model training. Make sure you are using the correct coding approaches and sufficient memory and process management. Make sure you are not using old libraries which may be the cause for slow processing times. If your code is clean and clear then try adjusting the computation power.

This is just few questions I noted down. If you have any other questions or concerns in the domain of machine learning/ deep learning and data science, just drop a comment below. Will try to add my thoughts there.

How to Streamline Machine Learning/ Data Science Projects?

Posted on December 20, 2020 by Haritha Thilakarathne

When it comes to designing, developing and implementing a project related to data mining/ machine learning or deep learning, it is always better to follow a framework for streamlining the project flow.

It is OK to adapt a software development framework such as scrum, or waterfall method to manage a ML related project but I feel like having more streamlined process which pays attention on data would be an advantage for the success of a such project.

To my understanding there can be two variations of ML related projects.

Solely machine learning/ data science based projects
Software development projects where ML related services are a sub component of the main project.

The step-by step process am explaining can be used in both of these variations with your own additions and modifications.

Basically this is what I do when I get a ML related project to my hand.

I follow the steps of a good old standard process known as Cross-industry standard process for data mining (CRISP-DM) to streamline the project flow. Let’s go step by by step.

Step 1 : Business understanding

First you have to identify what is the problem you going to address with the project. Then you have to be open minded and answer the following questions.

What is the current situation of this project? (whether it is using some conventional algorithm to solve this problem etc. )
Do we really need to use machine learning to solve this problem? ( Using ML or deep learning for solving some problems maybe over engineering. Take a look whether it is essential to use ML to do the project.)
What is the benefit of implementing the project? (ML projects are quite expensive and resource hungry. Make sure you get the sufficient RoI with the implementation.)
What are constraints, limitations and risks? ( It’s always better to do a risk assessment prior the project. The data you have to use may have compliance issues. Look on those aspects for sure!)
What tools and techniques am going to use? ( It maybe bit hard to determine the full tech stack you going to use before dipping your feet into the project. But good have even a rough idea on the tools, platforms and services you going to use to development and implementation. DON’T forget implementation phase. You may end up having a pretty cool development which maybe hard to implement with the desired application. So make sure you know your tool-set first)

Tip : If you feel like you are not having experience with this phase, never hesitate to discuss about it with the peers and experts in the field. They may come-up with easy shortcuts and techniques to make your project a success.

Step 2 : Data understanding

Data is the most vital part of any data science/ ML related project. When it comes to understanding the data, I prefer answering these questions.

How big/small the data is? (Sometimes training deep learning models may need a lot of annotated data which is hard to find)
How credible/ accurate the data is?
What is the distribution of data?
What are the key attributes and what are not-so-important attributes in data?
How the data has been stored? (Data comes in CSVs/JSONs or flat files etc.)
Simple statistical analysis of data?

Before digging into the main problem, you can save a lot of time by taking a closer look on data that you have or that you going to get.

Step 3 : Data preparation

To be honest, this step takes 80% of total project time most of the times. Data that we find in real world are not clean or in the perfect shape. Perfectly cleaned and per-processed data will save a lot of time in later stages. Make sure you follow the correct methodologies for data cleansing. This step may include tasks such as writing dataloaders for your data. Make sure to document the data preparation steps you did to the original dataset. Otherwise you may get confused in later stages.

Step 4 : Modelling

This is the step where you actually get the use of machine learning algorithms and related approaches. What I normally do is accessing the data and try some simple modelling techniques to interpret the data I have. For an example, will say I have a set of images to be classified using a artificial neural network based classifier… I’d first use a simple neural network with one or two hidden layers and see if the problem formation and modelling strategy is making any sense. If that’s successful, I’ll move for more complex approaches.

Tip : NEVER forget documentation! Your project may grow exponentially with thousands of code lines and you may try hundreds of modelling techniques to get the best accuracy. So that keep clear documentation on what you did to make sure you can roll back and see what you have done before.

Step 5 : Evaluation

Evaluating the models we developed is essential to determine whether we have done the right thing. Same as software review processes I prefer having a set framework to evaluate the ML projects. Make sure to select appropriate evaluation matrix. Some may not indicate the real behaviour of the models you build.

When performing a ML model evaluation, I plan ahead and make a set structure for the evaluation report. It makes the process easy to compare it against different parameter changes of the single model.

In most of the cases, we neglect the execution or the inference time when evaluating ML models. These can be vital factors in some applications. So that plan your evaluation wisely.

Step 6 : Deployment & Maintenance

Deployment is everything! If the deployment fails in the production, there’s no value in all the model development workload you did.

You should select the technologies and approaches to deliver the ML services (as REST web services, Kubernetes, container instances etc. ). I personally prefer containerising since it’s neat and clean. The deployed models should be monitored regularly. Predictions can get deviated with time. Sometimes data distribution can be changed. Make sure you create a robust monitoring plan beforehand.

Tip : What about the health of the published web endpoints or the capacity of inference clusters you using?? Yp! Make sure you monitor the infrastructure too.

https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview

This is just a high-level guideline that you can follow for streamlining data science/machine learning related tasks. This is a iterative process. There’s no hard bound rules saying you MUST follow these steps. Microsoft has introduced team data science process (TDSP) adapting and improving this concept with their own tool-sets.

Key takeaway : Please don’t follow cowboy coding for machine learning/ data science projects! Having a streamline process is always better! 🙂

The Story of Deep Pan Pizza :AI Explained for Dummies

Posted on September 14, 2018 by Haritha Thilakarathne

Artificial Intelligence, Machine Learning, Neural Networks, Deep Learning….

Most probably, the words on the top are the widely used and widely discussed buzz words today. Even the big companies use them to make their products appear more futuristic and “market candy” (Like a ‘tech giant’ recently introduced something called a ‘neural engine’)!

Though AI and related buzz words are so much popular, still there are some misconceptions with people on their definitions. One thing that clearly you should know is; AI, machine learning & deep learning is having a huge deviation from the field called “Big Data”. It’s true that some ML & DL experiments are using big data for training… but keep in mind that handling big data and doing operations with big data is a separate discipline.

So, what is Artificial Intelligence?

“Artificial intelligence, sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals.” – Wikipedia

Simple as that. If a system has been developed to perform the tasks that need human intelligence such as visual perception, speech recognition, decision making… these systems can be defined as a intelligent system or an so called AI!

The most famous “Turing Test” developed by Alan Turing (Yes. The Enigma guy in the Imitation Game movie!) proposed a way to evaluate the intelligent behavior of an AI system.

Turing Test

There are two closed rooms… let’s say A & B. in the room A… we have a human while in the room B we have a system. The interrogator; person C is given the task to identify in which room the human is. C is limited to use written questions to make the determination. If C fails to do it- the computer in room A can be defined as an AI! Though this test is not so valid for the intelligent systems we have today, it gives a basic idea on what AI is.

Then Machine Learning?

Machine learning is a sub component of AI, that consists of methods and algorithms allows the computer systems to statistically learn the patterns of data. Isn’t that statistics? No. Machine learning doesn’t rely on rule based programming (It means that a If-Else ladder is not ML 😀 ) where statistical modeling is mostly about formulation of relationships between data in the form of mathematical equations.

There are many machine learning algorithms out there. SVMs, decision trees, unsupervised methods like K-mean clustering and so-called neural networks.

That’s ma boy! Artificial Neural Networks?

Inspired by the neural networks we all have inside our body; artificial neural network systems “learn” to perform tasks by considering many examples. Simply, we show a thousand images of cute cats to a ANN and next time.. when the ANN sees a cat he is gonna yell.. “Hey it seems like a cat!”.

If you wanna know all the math and magic behind that… just Google! Tons of resources there.

Alright… then Deep Learning?

Yes! That’s deep! Imagine the typical vanilla neural networks as thin crust pizza… It’s having the input layer (the crust), one or two hidden layers (the thinly soft part in the middle) and the output layer (the topping). When it comes to Deep Learning or the deep neural networks, that’s DEEP PAN PIZZA!

DNNs are just like Deep Pan Pizzas

Deep Neural Networks consist of many hidden layers between the input layer and the output layer. Not only typical propagation operations, but also some add-ins (like pineapple) in the middle. Pooling layers, activation functions…. MANY!

So, the CNNs… RNNs…

You can have many flavors in Deep Pan Pizzas! Some are good for spicy lovers… some are good for meat lovers. Same with Deep Neural Networks. Many good researchers have found interesting ways of connecting the hidden layers (or baking the yummy middle) of DNNs. Some of them are very good in image interpretation while others are good in predicting values that involves time or the state. Convolutional Neural Networks, Recurrent Neural Networks are most famous flavors of this deep pan pizzas!

These deep pan pizzas have proven that they are able to perform some tasks with close-to-human accuracy and even sometimes with a higher accuracy than humans! deep-learning

Don’t panic! Robots would not invade the world soon…

Image Courtesy : DataScienceCentral | Wikipedia

One-Hot Encoding in Practice

Posted on April 9, 2018 by Haritha Thilakarathne

mtimFxh Data is the king in machine learning. In the process of building machine learning models, data is used as the input features.

Input features comes in all shapes and sizes. For building a predictive model with a better accuracy rate, we should understand the data as well as the logic behind the algorithm we going to use to fit the model.

Data Understanding; as the second step of CRISP-DM, guides for understanding the types and the way the data we get has been represented. We can distinguish three main kinds of data feature.

Quantitative Data – Data with numerical scale (Age of a person in years, Price of a house in dollars etc.)
Ordinal features – Data without a scale but with ordering (Ordered sets/ first, second, third etc.)
Categorical features – Data without a numerical scale neither an ordering. These features don’t allow any statistical summary. (Car manufacturer categories, Civil status, N-grams in NLP etc.)

Most of the machine learning algorithms such as linear regression, logistic regression, neural network, support vector machine works better with numerical features.

Quantitative features come with a numerical value and they can be directly used (Sometimes data preprocessing, normalization may have to use) as the input features of ML algorithms.

Ordinal features can be easily represented in numbers (Ex. First = 1, Second = 2, Third = 3 …). This is called Integer Encoding. Representing ordinal features using numbers makes sense because the dependency between each representation can be notated in a numerical way.

There are some algorithms that can directly deal with joint discrete distribution such as Markov chain / Naive Bayes / Bayesian network, tree based, etc. These algorithms can work with categorical data without any encoding; while we should encode the categorical features in a way to represent in a numerically to use as the input features for other ML algorithms. That means it’s better to change the categorical features to numerical most of the times 😊

There are some special cases too. For an example, while naïve bias classification only really handles categorical features, many geometric models go in the other direction by only handling quantitative features.

How to convert Categorical data for Numerical data?

There are few ways to covert the categorical data to numerical data.

Dummy encoding
One-hot encoding / one-of-K scheme

are the most prominent ways of it.

One hot encoding is the process of converting the categorical features into numerical by performing “binarization” of the category and include it as a feature to train the model.

In mathematics, we can define one-hot encoding as…

One hot encoding transforms:

a single variable with n observations and d distinct values,

to

d binary variables with n observations each. Each observation indicating the presence (1) or absence (0) of the dth binary variable.

Let’s get this clear with an example. Suppose you have ‘flower’ feature which can take values ‘daffodil’, ‘lily’, and ‘rose’. One hot encoding converts ‘flower’ feature to three features, ‘is_daffodil’, ‘is_lily’, and ‘is_rose’ which all are binary.

Capture A common application of OHE is in Natural Language Processing (NLP). It can be used to turn words to vectors so easily. Here comes a con of OHE, where the vector size might get very large with respect to the number of distinct values in the feature column.If there’s only two distinct categories in the feature, no need to construct to additional columns. You can just replace the feature column with one Boolean column.

OHE in word vector representation

You can easily perform One-hot encoding in AzureML Studio by using the ‘Convert to Indicator Values’ module. The purpose of this module is to convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model, which is the same happens in OHE. Let’s look at performing One-Hot encoding using python in next article.

Mission Plan for building a Predictive model

Posted on March 20, 2018 by Haritha Thilakarathne

maxresdefault When it comes to a machine learning or data science related problem, the most difficult part would be finding out the best approach to cope up with the task. Simply to get the idea of where to start!

Cross-industry standard process for data mining, commonly known by its acronym CRISP-DM, is a data mining process model describes commonly used approaches that data mining experts use to tackle problems. This process can be easily adopted for developing machine learning based predictive models as well.

CRISP – DM

No matter what are the tools/IDEs/languages you use for the process. You can adopt your tools according to the requirement you’ve.

Let’s walk through each step of the CRISP-DM model to see how it can be adopted for building machine learning models.

Business Understanding –

This is the step you may need the technical knowhow as well as a little bit of knowledge about the problem domain. You should have a clear idea on what you going to build and what would be the functional value of the prediction you suppose to do through the model. You can use Decision Model & Notation (https://en.wikipedia.org/wiki/Decision_Model_and_Notation) to describe the business need of the predictive model. Sometimes, the business need you are having might be able to solve using simple statistics other than going for a machine learning model.

Identifying the data sources is a task you should do in this step. Should check whether the data sources are reliable, legal and ethical to use in your application.

Data Understanding –

I would suggest you to do the following steps to get to know your data better.

Data Definition – A detailed description on each data field in the data source. The notations of the data points, the units that the data points have been measured would be the cases you should consider about.
Data Visualization – Hundreds or thousands of numerical data points may not give a clear idea for you what the data is about or an idea about the shape of your data. You may able to find interesting subsets of your data after visualizing it. It’s really easy to see the clustering patterns or the trending nature of the data in a visualized plot.
Statistical analysis – Starting from the simple statistical calculations such as mean, median; you can calculate the correlation between each data field and it will help you to get a good idea about the data distribution. Feature engineering to increase the accuracy of the machine learning model. For performing that a descriptive statistical analysis would be a great asset.

For data understanding, The Interactive Data Exploration, Analysis and Reporting tool (IDEAR) can be used without getting the hassle of doing all the coding from the beginning. (Will discuss on IDEAR in a long run soon)

Data Preparation –

Data preparation would take roughly 80% of your time of the process implying it’s the most vital part in building predictive models.

This is the phase where you convert the raw data that you got from the data sources for the final datasets that you use for building the ML models. Most of the data you got from raw sources like IoT sensors or collectives are filled with outliers, contains missing values and disruptions. In the phase of data preparation, you should follow data preprocessing tasks to make those data fields usable in modeling.

Modeling –

Modeling is the part where algorithms comes to the scene. You can train and fit your data to a particular predictive model to perform the deserved prediction. You may need to check the math behind the algorithms sometimes to select the best algorithm that won’t overfit or underfit the model.

Different modeling methods may need data in different forms. So, you may need to revert back for the data preparation phase.

Evaluation –

Evaluation is a must before deploying a model. The objective of evaluating the model is to see whether the predictive model is meeting the business objectives that we’ve figured out in the beginning. The evaluation can be done with many parameter measures such as accuracy, AUC etc.

Evaluation may lead you to adjust the parameters of the model and might have to choose another algorithm that performs better. Don’t expect the machine learning model to be 100% accurate. If it is 100% most probably it would be an over fitted case.

Deployment –

Deployment of the machine learning model is the phase where the client, or the end user going to consume. In most of the cases, the predictive model would be a part of an intelligent application that acts as a service that gets a set of information and give a prediction as an output of that.

I would suggest you to deploy the module as a single component, so that it’s easy to scale as well as to maintain. APIs / Docker environments are some cool technologies that you can adopt for deploying machine learning models.

CRISP-DM won’t do all the magic of getting a perfect model as the output though it would definitely help you not to end up in a dead-end.

Artificial Neural Networks with Net# in Azure ML Studio

Posted on November 8, 2017 by Haritha Thilakarathne

The ideas for neural networks go back to the 1940s. The essential concept is that a network of artificial neurons built out of interconnected threshold switches can learn to recognize patterns in the same way that an animal brain and nervous system does.

Though the name “neural network” gives an idea of a ‘black box’ type predictive operation; ANN is a set of mathematical operations.

VqOpE

As the name implies by itself; neural network is a structural ‘network’. The nodes of the neural network are organized in layers and the nodes are connected with each other with edges. The edges are directional and they are weighted.

Azure Machine Learning Studio comes with pre-built neural network modules that can easily use for predictive analytics.

Pre-built neural networks in AML Studio

Multiclass Neural Network Module –

Used for multiclass classification problems. The number of hidden nodes, the learning date, number of learning iterations and many parameters can be changed easily by changing the module properties.

Two-Class Neural Network –

Ideal for binary classification problems. Same as the Multiclass neural network module, the properties of the neural network can be changed by the module properties.

Neural Network regression –

This is a supervised machine learning method that can be used to predict a numerical value.

These simple pre-built modules can be added to your ML experiment with just a drag and drop and change the parameters by changing the module properties. What you going to do if you want to implement a complex neural network architecture? Or to create a deep neural network with more hidden layers?

AzureML Studio comes handy here with providing you the ability to define the hidden layer/layers of the ANN with a script. Net# scripting language provide the ability to define almost any neural network architecture in an easy to read format.

Net# scripting language is able to

Create hidden layers and control the number of nodes in each layer.
Specify how layers are to be connected to each other.
Define special connectivity structures, such as convolutions and weight sharing bundles.
Specify different activation functions.

In Azure Machine Learning, you can add the Net# scripts by choosing ‘Custom definition script’ in Hidden layer specification property. By default, it would set to the fully connected case.

properties

Net# lexical is more similar to C#. The structure of a Net# script has four main sections.

Constant declaration (Optional) – Define values used elsewhere in the neural network definition
Layer declaration – The input, hidden and output layers are defined with the layer dimensions. The layer declaration for hidden or output layer can include the output function.
Connection declaration – You can define connection bundles (Full, Filtered, Convolutional, Pooling, Response normalization) – Full connection bundle is the default configuration.
Share declaration (Optional) – Defining multiple bundles with shared weights.

This is a simple neural network defined by a Net# script to perform a binary classification. You can customize the number of hidden neurons and the activation functions and see how the accuracy of the model variate.

<!– HTML generated using hilite.me –>

//A simple neural network definition
//auto keyword allows the ANN to automatically include all feature columns in the input examples
//input layer named Data
input Data auto;

//Hidden layer named "H" including 200 nodes
hidden H [200] from Data all;

//output layer named "Out" including 2 nodes (binary classification problem) 
//Sigmoid activation function has been used.
output Out [2] sigmoid from H all;

For more insides here’s the resources – https://docs.microsoft.com/en-us/azure/machine-learning/studio/azure-ml-netsharp-reference-guide#overview