What’s Best for Me? – 5 Data Analytics Service Selection Scenarios Explained

With the extensive usage of cloud-based technologies to perform machine learning and data science related experiments, choosing the right toolset/ platform to perform the operations is a key part for the project success.

Since selecting the perfect toolset for our ML workloads maybe bit tricky, I thought of sharing my thoughts on that by getting a couple of generic use cases. Please keep in mind that the use cases I have chosen and the decisions I’m suggesting are totally my own view on the scenarios and this may differ based on different factors (amount of data, time frame, allocated budget, ability of the developer etc.) you have with your project. Plus, the suggestions I’m pointing out here are from the services comes with Microsoft Azure cloud. This maybe the easily adjusted for other cloud providers too.

Scenario 1:

We are a medium scale micro financing company having our data stored on Microsoft Azure. We have a plan to build a datalake and use that for analytical and reporting tasks. We have a diverse data team with abilities in python, Scala and SQL (most of the data engineers are only familiar with SQL). We need to build a couple of machine learning models for predictions. What would be the best platform to go forward with? Azure Databricks or Azure ML Studio?

Suggestion: Azure Databricks

Reasons:

  • Databricks is more flexible in ETL and datalake related data operations comparing to AzureML Studio.
  • You can perform data curation and machine learning within a single product with Azure Databricks.
  • Databricks can connect with Azure Data Factory pipelines to handle data flow and data curation tasks within the datalake.
  • Since the data engineers are more familiar with SQL, they’ll easily adapt with SparkSQL on Databricks.
  • Data team can develop their machine learning experiments with any language of their choice with Databricks notebooks.
  • Databricks notebooks can be used for analytical and reporting tasks even with a combination of PowerBI.
  • Given that, the company is planning for building a datalake, Databricks is far more flexible in ETL tasks. You can use Azure Data Factory pipelines with Databricks to control the data flow of the datalake.

Scenario 2:

I’m a computer science undergrad. I’m doing a software project to predict several types of wildflowers by capturing images from a mobile phone. I’m planning to build my computer vision model using TensorFlow and Keras and expose the service as a REST API.  Since I’m not having the infrastructure to train the ML models, I’m planning to use Azure for that. Which tool on Azure should I choose?

Suggestion: Azure ML Studio

Reasons:

  • AzureML provides a complete toolset to train, test and deploy a deep learning model using any open-source framework of your choice.
  • You can use the GPU training clusters on AzureML to train your models.
  • It’s easy to log your model training and experiments using AzureML python SDK.
  • AzureML gives you the ability for model management and exposing the trained model as a REST API.
  • Small learning curve and adaptability.

Scenario 3:

I’m the CEO of a retail company. I’m not having a vast experience with computing or programming but having a background in maths and statistics. I have a plan to use machine learning to perform predictive analysis with the data currently having in my company. Most of the data are still in excel! Someone suggested me to use Azure. What product on Azure should I choose?

Suggestion: Azure ML Studio

Reasons:

  • For a beginner in machine learning and data science, Azure ML Studio is a good start.
  • AzureML Studio provides no-code environments (Azure ML designer and AutoML) to develop ML models.   
  • Since, you are mostly in the experimental stage and not going for using bigger datasets, using Databricks would be an overkill.
  • You can easily import your prevailing data and start experimenting and playing around with them without any local environmental setup.

Scenario 4:

I’m the IT manager of a large enterprise who are heavily relying on data assets with our decision-making process. We have to run iterative jobs daily to retrieve data from different external sources and internal systems. Currently we have an on-prem SQL database acting as the data warehouse.  Company has decided to go for cloud. Can Azure serve our needs?   

Suggestion: Yes. Azure can serve your need with different tools in the data & AI domain.

Reasons:

  • You can use Azure Synapse Analytics or Azure Data Factory to build data pipelines and perform ETL operations.
  • The local data warehouse can be easily migrated to Azure cloud.
  • You can use Azure Databricks in-order to perform analytics tasks.
  • Since the enterprise in large and scaling, using Databricks would be better with its Spark based computation abilities.

Scenario 5:

We are an agricultural company growing forward with adopting modern Agri-tech into the business. We collect numerous data values from our plantations and store them in our cloud databases. We have a set of data scientists working on data modelling and building predictive models related to crop fertilizing and harvesting. They are currently using their own laptops to perform analysis and it’s troublesome with the data amount, platform configurations and security. Will Azure ML comes handy in our case?

Suggestion: Yes. Azure ML Studio would be a good choice.

Reasons:

  • AzureML can be easily adaptable as an analytical platform.
  • The cloud databases can be connected to AzureML, and data scientists can straight-up start working on the data assets.
  • AzureML is relatively cheap comparing to Databricks (Given the data amount is manageable in a single computer.)
  • It’s easy to perform prototyping of models using AutoML/ AzureML Designer and then implement the models within a short time frame.  

Generally, these are the factors I would keep in mind when selecting the services for ML/ data related implementations on Azure.

  • Azure ML studio is good when you are training with a limited data, though Azure ML provides training clusters, the data distribution among the nodes is to be handled in the code.
  • AzureML Studio comes handy in prototyping with AzureML designer and Automated ML.
  • Azure Databricks with its RDDs is designed to handle data distributed on multiple nodes which is advantageous when your you have big datasets.
  • When your data size is small and can fit in a scaled up single machine/ you are using a pandas dataframe, then use of Azure Databricks is an overkill.
  • Services like Azure Data Factory and Datalake storage can be easily interconnected for building  

Let me know your thoughts on these scenarios as well. Add your queries in the comments too. I’ll try my best to provide my suggestions for those use cases.

Medallion Data Lakehouse Architecture

“Who rules the data, rules the world”

Business world has realised data is the most important asset for the growth and business success. Since traditional data storage and architectural methods are not capable of handling massive data volumes and advance analytics capabilities, organisations are looking for new approaches for dealing with it.

When it comes to big data stores, Data Warehouses and Data Lakes are well-known and widely used in the industry for a long time. Data Warehouses are used for storing structured data, Data Lakes are occupied when there’s the need for storing unstructured data from variety of formats.

Source : Databricks

As the table shows Data Warehouses and Data Lakes are having their own advantages and disadvantages. Modern data platforms often need the features from both worlds. Data Lakehouse is a highbred approach which combines the features from Data Warehouse and Data Lake.

In short, a Data Lakehouse is an architecture that enables efficient and secure Artificial Intelligence (AI) and Business Intelligence (BI) directly on vast amounts of data stored in Data Lakes.

https://www.databricks.com/blog/2021/08/30/frequently-asked-questions-about-the-data-lakehouse.html

With the increasing need of AI and BI workloads, Data Lakehouse architecture has caught up the attention of data professionals and has become the to-go approach to handle massive amount of data from different source systems and different formats. As the Data Lakehouse is the central location for data in an organisation, it is necessary to follow a multi-layer architecture that ensure consistency, isolation, and durability of data as it passes through multiple processing stages.

The medallion approach consists of 3 data layers. The terms bronze (raw), silver (validated), and gold (enriched) describe the quality of the data at each of these levels.

Let’s see the features and usage of each data layer of the architecture.

The Bronze Layer

This is where the journey starts! The data in various formats coming from different source systems, streaming data etc. are stored in a distributed file store which can hold vast data volumes. Normally the data in the bronze layer is not cleansed or processed (Data is in in their raw format). It can be timestamped or saved as it comes from sources.   

The Silver Layer

Though we have zillions of data in the bronze layer, most of it could be just garbage and not worthy enough for any analytical tasks. When transferring data to the silver layer, transformation rules are applied to cleanse the data, remove null values, and join with lookup tables. The cleansed and validated data in the silver layer can be used for ad-hoc reporting, sophisticated analytics and even for machine learning model training.  

The Gold Layer

As the name implies, this is the most valuable portion or the form of data in the organisation. This layer is accountable for aggregating data from the silver layer for business use. Since the data is enriched with business logic principles, data in the gold layer is mostly used in BI reporting and analytical tasks. Data analysts, data scientists and business analysts highly rely on the data available in the gold layer.

Storing the data with a medallion architecture ensure the data quality and its transparency. Medallion architecture is mainly promoted by Databricks, but can use as a blueprint for structuring Lakehouse in other platforms too.

What’s your choice for storing big data? Sticking with data lake or moving to Lakehouse?

References:

https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion

https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html

Data Selection for Machine Learning Models

Data is the key component of machine learning. Thus, high quality training dataset is always the main success factor of a machine learning training process. A good enough dataset leads to more accurate model training, faster convergence as well as it’s the main deciding factor on model’s fairness and unbiases too.
Let’s discuss the dos and don’ts when selecting/ preparing a dataset for training a machine learning model and the factors we should consider when composing the training data. These are valid for structured numerical data as well as for unstructured data types such as images and videos.

What does the dataset’s distribution look like?

This is important mostly with numerical datasets. Calculating and plotting the frequency distribution (how often each value occurs in the dataset) of a dataset leads us to the insights on the problem formation as well as on class distribution. ML engineers tend to have datasets with normal distribution to make sure they are having sufficient data points to train the models.


Though normal distribution is more common in nature and psychology, there’s no need of having a normal distribution on every dataset you use for model training. (Obviously some real-world data collections don’t fit for the noble bell curve).

Does the dataset represent the real world?

We are training machine learning models to solve real world problems. So, the data should be real too. It’s ok to use synthetic data if you are having no other option to collect more data or need to balance the classes, but always make sure to use the real -world data since it makes the model more robust on testing/ production. Please don’t put some random numbers into a machine learning model and expect that to solve your business problem with 90% accuracy 😉

Does the dataset match the context?

Always we must make sure the characteristics of the dataset used for training the model matches with the conditions we have when the model goes live in production. For example, will say we need to train a computer vision model for a mobile application which identify certain types of tree leaves from images captured from the mobile camera. There’s no use of training the model with images only captured in a lab environment. You should have images which are captured in wild (which is closely similar for the real-world use case of the application) in your training set.

Is the data redundant?

Data redundancy or data duplication is important point to pay attention when training ML models. If the dataset contains the same set of data points repeatedly, model overfits for that data points and will not perform well in testing. (mostly underfitting)

Is the dataset biased?

A bias dataset never produces an unbiased trained model. Always the dataset we choose should be balanced and not bias to certain cases.
Let’s get an example of having a supervised computer vision model which identifies the gender of people based on their face. Will assume the model is trained only with images from people from USA and we going to use it in an application which is used world-wide. The model will produce unrealistic predictions since the training context is bias to a certain ethnicity. To get a better outcome, the training set should have images from people from different ethnicities as well as from different age groups.

Which data is there too little/too much of?

“How much data we need to train a model with good accuracy?” – This is a question which comes out quite often when we planning ML projects. The simple answer is – “we don’t know!” 😀
There are no exact numbers on how much of data needed for training a ML model. We know that deep learning models are data hungry. Yes, we need to have large datasets for training deep neural networks since we are using those for representing non-linear complex relationships. Even with traditional machine learning algorithms, we should make sure to have enough data from all the classes even from the edge/ corner cases.
What will happen if we have too much data? – that doesn’t help at all. It only makes the training process lengthy and costly without getting the model into a good accuracy. This may end up producing an overfitted trained model too.

These are only very few points to consider when selecting a dataset for training a machine learning model. Please add your thoughts on dataset selection in comments.

FAQs on Machine Learning Development – #AskNaadi Part 1

Happy 2022!

It’s almost 7 years since I started playing with machine learning and related domains. These are some FAQs that comes for me from peers. Just added my thoughts on those. Feel free to any questions or concerns you have on the domain. I’ll try my best to add my thoughts on that. Note that all these answers are my personal opinions and experiences.

01. How to learn the theories behind machine learning?

The first thing I’d suggest would be ‘self-learning’. There are plenty of online resources out there where you can start studying by your own. Most of them are free. Some may need a payment for the certification (That’s totally up to you to pay and get it). I’ve listed down some of the famous places to get a kickstart for learning AI. Just take a look here.

Next would be keep practising. Never stop coding and training models in various domains. Kaggle is a good place to sharpen your skill set. Keep learning and keep practising at the same time.

02. Do we really need mathematics for ML?

Yes. To some extend you should know the theories behind probability and and some from basic mathematics. No need to worry a lot on that. As I said previously, there are plenty of places to catch up your maths too.

03. Is there a difference between data analysis and machine learning?

Yes. There is. Data analysis is about find pattern in the prevailing data and obtain inferences due to those patterns. It may have the data visualization components too. When is comes to machine learning, you train a system to learn those patterns and try to predict the upcoming pattern.

04. Does the trend in AI/ML going to fade out in the near future?

Mmm.. I don’t think so. Can’t exactly say AI is going to be ‘the’ future. Since all these technical advancements going to generate hell a lot of data, there should be a way to understand the patterns of those data and get a value out of that. So, data science and machine learning is going to be the approach to go for.

Right… those are some general questions I frequently get from people. Let’s move into some technicalities.

05. What’s the OS you use on your work rig?

Ubuntu! Yes it’s FOSS and super easy to setup all the dependencies which I need on it. (I did a complete walk through on my setup previously. Here’s it). Sometimes I use Windows too. But if it’s with docker and all, yes.. Ubuntu is the choice I’m going with.

06. What’s your preferred programming language to perform machine learning experiments?

I’m a Python guy! (Have used R a little)

07. Any frameworks/ libraries you use most in your experiments?

Since am more into deep learning and computer vision, I use PyTorch deep learning framework a lot. NumPy, Sci-kit learn, Pandas and all other ML related Python toolkits are in my toolbox always.

08. Machine learning is all about neural networks right?

No it’s not! This is one of the biggest myths! Artificial neural networks (ANNs) are only one family of algorithms which we can perform machine learning. There are plenty of other algorithms which are widely used in performing ML. Decision trees, Support Vector Machines, Naive Bayes are some popular ML algorithms which are not ANNs.

09. Why we need GPUs for training?

You need GPUs when you need to do parallel processing. The normal CPUs we have on our machines are typically having 4-5 cores and limited number of threads can be handled simultaneously. When it comes to a GPU, it’s having thousands of small cores which can handle thousands of computational threads in parallel. (For an example Nvidia 2080Ti is having 4352 CUDA cores in it). In Deep learning, we have to perform millions or calculations to train models. Running these workloads in GPUs is much faster and efficient.

10. When to use/ not to use Deep learning?

This is a tricky questions. Deep learning is always good in understanding the non-linear data. That’s why it’s performing really well in computer vision and natural language processing domains. If you have a such task, or your feature space is really large and having a massive amount of data, I’d suggest you to go with deep learning. If not sticking with traditional machine learning algorithms might be the best case.

11. Do I need to know all complex theories behind AI to develop intelligent applications?

Yes and No. In some cases, you may have to understand the theories behind AI/ML in order to develop a machine learning based applications. Mostly I would say model training and validation phases need this knowledge. Will say you are a software developer who’s very good with .Net/ Java and you are developing an application which is having a component where you have to read some text from a scanned document. You have to do it using computer vision. Fortunately you don’t have to build the component from the scratch. There are plenty of services which can be used as REST endpoints to complete the task. No need to worry on the underlying algorithms at all. Just use the JSON!

12. Should I build all my models from scratch?

This is a Yes/No answer too. This question comes mostly with deep learning model development. In some complex scenarios you may have to develop your models from the scratch. But most of the cases the problem you having can be defined as a object detection/ image classification/ Key phrase extraction from text etc. kinda problem. The best approach to go forward would be something like this.

  • Use a simple ANN and see your data loading and the related things are working fine.
  • Use a pre-trained model and see the performance (A widely used SOTA model would be the best choice).
  • If it’s not working out, do transfer learning and see the accuracy of the trained model. (You should get good results most of the times by this step)
  • Do some tweaks to the network and see if it’s working.
  • If none of these are working, then think of building a novel model.

13. Is cloud based machine learning is a good option?

In most of the industrial use cases yes! Since most of the data in prevailing systems are already sitting in the cloud and industries are heavily relying on cloud services these days, cloud based ML is a good approach. Obviously it comes with a price. When it comes to research phases, the price of purchiasing computation power maybe a problem. In those cases, my approach would be doing the research phase on-prem and moving the deployment to the cloud.

14. I’ve huge computer vision datasets to be trained? Shall I move all my stuff to the cloud?

Ehh… As I said previously, if you planning on a research project, which goes for a long time and need a lot of computational hours, I’d suggest to go with a local setup first, finalize the model and then move to the cloud. (If dollars aren’t your problem, no worries at all! Go for the cloud! Obviously it’s easy and more reliable)

15. Which cloud provider to choose?

There’s a lot of cloud providers out there having various services related to ML. Some provides out of the box services where you can just call and API to do the ML tasks (Microsoft Cognitive services etc. ). There are services where you can use your own data to train prevailing models (Custom Vision service by Azure etc.)

If you want end-to-end ML life cycle management, personally I find Azure ML service is a good solution since you can use any of your ML related frameworks and tools and just use cloud to train, manage and deploy the models. I find MLOps features that comes with Azure Machine Learning is pretty useful.

16. I’ve trained and deployed a pretty good machine learning model. I don’t need to touch that again right?

No way! You have to continuously check their performance and the accuracy they are providing for the newest data that comes to the service. The data that comes into the service may skewed. It’s always a good idea to train the model with more data. So better to have a re-training MLOps pipelines to iteratively check your models.

17. My DL models takes a lot of time to train. If I have more computation power the things will speed up?

mm.. Not all the time. I have seen cases where data loading is getting more time than model training. Make sure you are using the correct coding approaches and sufficient memory and process management. Make sure you are not using old libraries which may be the cause for slow processing times. If your code is clean and clear then try adjusting the computation power.

This is just few questions I noted down. If you have any other questions or concerns in the domain of machine learning/ deep learning and data science, just drop a comment below. Will try to add my thoughts there.

Handling Imbalanced Classes with Weighted Loss in PyTorch

When it comes to real world data collections, we don’t have the prestige of having perfectly balanced labelled datasets for training models. Most of the machine learning algorithms are not immune for imbalanced classes and cause less accurate and biased models. There are many approaches that we can follow to tackle imbalanced data problem. Either we have to choose a ML algorithm which is reluctant for imbalanced data or we may have to generate synthetic data in order to make the classes balanced.

Neural networks are trained using backpropagation which treats each class same when calculating the loss. If the data is not balanced, that makes the model bias for one class than another.

A, B, C, D classes are imbalanced

I had to face this issue when experimenting with a computer vision based multi-class classification problem. The data I had was so much skewed and some classes had a very less amount of data compared to the majority class. Model was not performing well at all and need to take some actions to tackle the class imbalance problem.

These were the solutions I thought of try out.

  1. Creating synthetic data –
    Creating new synthetic data points is one of the main methods which is used mostly for numerical data and in some cases in imagery data too with the help of GAN and image augmentations. As in the starting point, I took the decision not to go with synthetic data generation since it may introduce abnormal characteristics to my dataset. So I keep that for a later part.
  2. Sampling the dataset with balanced classes –
    In this approach, what we normally do is, sample the dataset similar number of samples for each data label. For an example, will say we have a dataset which is having 3 classes named A, B & C with 100, 50, 20 data points for each class accordingly. When sampling what we do is randomly selecting 20 samples from each A, B & C classes and get a dataset with 60 data points.

In some cases this approach comes as a better option if we have very large amounts of data for each class (Even for the minority classes) In my case, I was not able to take the cost of loosing a huge portion of my data just by sampling it based on the data points having in the minority class.

Since both methods were not going well for me, I used a weighted loss function for training my neural network. Since this is a multi-class classification problem, I used Cross Entropy Loss in PyTorch as my loss function. (You can follow the similar approach if you using BCELoss for binary classification too)

import torch.nn as nn

#class weights for 6 class multi-class classification
class_weights = [0.5281, 0.8411, 0.9619, 0.8634, 0.8477, 0.9577]

#loss function with class weights
criterion = nn.CrossEntropyLoss(weight = class_weights) 

How I calculated the weight for each class? –

This is so simple. What I did was calculating a manual re-scaling weight for each class and pass it to “weight” parameter in the loss function. Make sure that you have a Tensor with the size of number of classes as the class weights. (In simpler words each class should have a weight).

Hint : If you using GPU for model training, make sure to put your class weights tensor to the GPU too.

Did it worked? Hell yeah! I was able to train my model accurately with less bias and without overfitting for a single class by using this simple trick. Let me know any other trick you use for training neural network models with imbalanced data.

Happy coding 🙂

How to Streamline Machine Learning/ Data Science Projects?

CRISP-DM (Image from wikipedia)

When it comes to designing, developing and implementing a project related to data mining/ machine learning or deep learning, it is always better to follow a framework for streamlining the project flow.

It is OK to adapt a software development framework such as scrum, or waterfall method to manage a ML related project but I feel like having more streamlined process which pays attention on data would be an advantage for the success of a such project.

To my understanding there can be two variations of ML related projects.

  1. Solely machine learning/ data science based projects
  2. Software development projects where ML related services are a sub component of the main project.

The step-by step process am explaining can be used in both of these variations with your own additions and modifications.

Basically this is what I do when I get a ML related project to my hand.

I follow the steps of a good old standard process known as Cross-industry standard process for data mining (CRISP-DM) to streamline the project flow. Let’s go step by by step.

Step 1 : Business understanding

First you have to identify what is the problem you going to address with the project. Then you have to be open minded and answer the following questions.

  1. What is the current situation of this project? (whether it is using some conventional algorithm to solve this problem etc. )
  2. Do we really need to use machine learning to solve this problem? ( Using ML or deep learning for solving some problems maybe over engineering. Take a look whether it is essential to use ML to do the project.)
  3. What is the benefit of implementing the project? (ML projects are quite expensive and resource hungry. Make sure you get the sufficient RoI with the implementation.)
  4. What are constraints, limitations and risks? ( It’s always better to do a risk assessment prior the project. The data you have to use may have compliance issues. Look on those aspects for sure!)
  5. What tools and techniques am going to use? ( It maybe bit hard to determine the full tech stack you going to use before dipping your feet into the project. But good have even a rough idea on the tools, platforms and services you going to use to development and implementation. DON’T forget implementation phase. You may end up having a pretty cool development which maybe hard to implement with the desired application. So make sure you know your tool-set first)

Tip : If you feel like you are not having experience with this phase, never hesitate to discuss about it with the peers and experts in the field. They may come-up with easy shortcuts and techniques to make your project a success.

Step 2 : Data understanding

Data is the most vital part of any data science/ ML related project. When it comes to understanding the data, I prefer answering these questions.

  1. How big/small the data is? (Sometimes training deep learning models may need a lot of annotated data which is hard to find)
  2. How credible/ accurate the data is?
  3. What is the distribution of data?
  4. What are the key attributes and what are not-so-important attributes in data?
  5. How the data has been stored? (Data comes in CSVs/JSONs or flat files etc.)
  6. Simple statistical analysis of data?

Before digging into the main problem, you can save a lot of time by taking a closer look on data that you have or that you going to get.

Step 3 : Data preparation

To be honest, this step takes 80% of total project time most of the times. Data that we find in real world are not clean or in the perfect shape. Perfectly cleaned and per-processed data will save a lot of time in later stages. Make sure you follow the correct methodologies for data cleansing. This step may include tasks such as writing dataloaders for your data. Make sure to document the data preparation steps you did to the original dataset. Otherwise you may get confused in later stages.

Step 4 : Modelling

This is the step where you actually get the use of machine learning algorithms and related approaches. What I normally do is accessing the data and try some simple modelling techniques to interpret the data I have. For an example, will say I have a set of images to be classified using a artificial neural network based classifier… I’d first use a simple neural network with one or two hidden layers and see if the problem formation and modelling strategy is making any sense. If that’s successful, I’ll move for more complex approaches.

Tip : NEVER forget documentation! Your project may grow exponentially with thousands of code lines and you may try hundreds of modelling techniques to get the best accuracy. So that keep clear documentation on what you did to make sure you can roll back and see what you have done before.

Step 5 : Evaluation

Evaluating the models we developed is essential to determine whether we have done the right thing. Same as software review processes I prefer having a set framework to evaluate the ML projects. Make sure to select appropriate evaluation matrix. Some may not indicate the real behaviour of the models you build.

When performing a ML model evaluation, I plan ahead and make a set structure for the evaluation report. It makes the process easy to compare it against different parameter changes of the single model.

In most of the cases, we neglect the execution or the inference time when evaluating ML models. These can be vital factors in some applications. So that plan your evaluation wisely.

Step 6 : Deployment & Maintenance

Deployment is everything! If the deployment fails in the production, there’s no value in all the model development workload you did.

You should select the technologies and approaches to deliver the ML services (as REST web services, Kubernetes, container instances etc. ). I personally prefer containerising since it’s neat and clean. The deployed models should be monitored regularly. Predictions can get deviated with time. Sometimes data distribution can be changed. Make sure you create a robust monitoring plan beforehand.

Tip : What about the health of the published web endpoints or the capacity of inference clusters you using?? Yp! Make sure you monitor the infrastructure too.

https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview

This is just a high-level guideline that you can follow for streamlining data science/machine learning related tasks. This is a iterative process. There’s no hard bound rules saying you MUST follow these steps. Microsoft has introduced team data science process (TDSP) adapting and improving this concept with their own tool-sets.

Key takeaway : Please don’t follow cowboy coding for machine learning/ data science projects! Having a streamline process is always better! 🙂

Handling data sources on Azure Machine Learning

Image source : https://docs.microsoft.com/en-us/azure/machine-learning/concept-data

Being the fundamental and the most vital factor in any machine learning experiment, the way of handling data in your experiments is crucial. Here we going to discuss different ways of managing your data sources inside Azure Machine Learning (AML).

Since the new Azure Machine Learning Service is becoming the one-stop place for managing all ML related workloads in Azure, the functions and methods can be created/managed using the web portal or using AzureML python SDK (You may use AzureML R SDK or the Azure CLI too)

Data comes in all shapes and sizes. In order to tackle these different data scenarios AML offers different options to manage the data. Let’s discuss these options one by one with their usages, pros and cons.

Datastore

Datastore is the place where the data sits in an AML experiment. Your AML workspace can have one or more Datastores connected according to your need.

AML is all about cloud-based machine learning. So that, I would recommend using some sort of Azure based storage to store your data in the first hand. Blob storage, File Share, Data Lake Storage, Azure SQL database, Azure PostgreSQL, Azure DB for MySQL and Databricks file system are currently supported data storage types to create Datastores. (Will say your data is at on-prem SQL database. You can use Azure Data Factory to migrate your data load onto Azure.)

You can see the Datastores registered to your workspace either through AML Studio (ml.azure.com) or through the python SDK. When you create a workspace, by default two Datastores are created : workspacefilestore and workspaceblobstore.

Workspaceblobstore acts as the default Datastore of experiments. You can change it at any time through the SDK. This is the place where all your code and other files you put in the experiment sits.

Is it a good idea to keep your data on the workspaceblobstore?

  • Scenario #1 : You are doing a toy experiment with a small dataset (Eg. a 2 MB CSV file). Dataset is static and no plan on updating that during the experiment. Yes! That’s completely ok to keep the dataset inside workspaceblobstore.
  • Scenario #2 : You are doing a deep learning experiment with a 100,000 images. No! Never use workspaceblobstore to keep your data.

Why not the workspaceblobstore always?

Workspaceblobstore is having a file and storage limitation (300MB and/or 2000 files). So that it’s impossible to use it when we have a large dataset. On the other hand, this directly affects for the docker image size (or the snapshot size) you may create for the experiment. Bulky snapshots or the docker images are not a good thing. Always keep it simple and modularized. So always the workspaceblobstore is a no go!

Datasets

AML datasets is the high-level abstraction of the data you use in experiments. You may create an AML dataset from,

  • A local file / local files
  • Registered datastore (from file(s) sit on a datastore)
  • Web URL
  • Azure Open Datasets

The AML datasets we create may belong to two main types:

  • Tabular datasets – If you have a file/ files that contains data in a tabular format (CSV, JSON line files, Parquet files, Tabular data in SQL databases etc.) creating a tabular dataset would be beneficial as it allows to transform data into a Pandas or Spark DataFrame.    
  • File datasets – Refer to a single or multiple file in your Datastore or on a public URL. File datasets comes handy when you have a scenario like a dataset with 1K images.

AML datasets comes with the advantage of versioning and tracking as well as monitoring. It’s not a hard thing to create perform data drift detection or a simple statistical analysis on data fields of the dataset with a few clicks.  

Microsoft recommends to use AML datasets always in experiments rather than pointing for the datastore directly (which is totally possible). I’ve found out pros and cons in both approaches.

  • AML datasets are easy to version and manage compared to datastore.
  • If you have a tabular dataset, I would always recommend to go for AML tabular datasets.
  • It becomes tricky when you have files. If you use File datasets, you have to use to_path() method to get the list of file paths defined by the dataset. This comes as a flat list! If you are not concerned about the directory structure of the data this is totally fine. But if you wish to create custom dataloaders (Eg. PyTorch custom dataloaders which allows to differentiate classes according to directories) this may not come handy. (You can do a workaround by processing the file path to determine the directory structure though 😀 )  
  • Keep in mind that AML dataset mount() only works for unix-like OS. If you wish to run your experiment on a windows running workstation you may have to download() the dataset.

Will discuss on using these datasets in different model training scenarios in future posts.

There are just some of the experiences I had when playing around with new Azure Machine Learning. The Microsoft Learning GitHub repo (https://github.com/MicrosoftLearning/DP100) for DP100 exam is a really nice place to find some example code on using these functionalities. Let me know your find outs and experiences with AML too 😊

Happy coding!     

Mission Plan for building a Predictive model

maxresdefaultWhen it comes to a machine learning or data science related problem, the most difficult part would be finding out the best approach to cope up with the task. Simply to get the idea of where to start!

Cross-industry standard process for data mining, commonly known by its acronym CRISP-DM, is a data mining process model describes commonly used approaches that data mining experts use to tackle problems. This process can be easily adopted for developing machine learning based predictive models as well.

CRISP-DM_Process_Diagram

CRISP – DM

No matter what are the tools/IDEs/languages you use for the process. You can adopt your tools according to the requirement you’ve.

Let’s walk through each step of the CRISP-DM model to see how it can be adopted for building machine learning models.

Business Understanding –

This is the step you may need the technical knowhow as well as a little bit of knowledge about the problem domain. You should have a clear idea on what you going to build and what would be the functional value of the prediction you suppose to do through the model. You can use Decision Model & Notation (https://en.wikipedia.org/wiki/Decision_Model_and_Notation) to describe the business need of the predictive model. Sometimes, the business need you are having might be able to solve using simple statistics other than going for a machine learning model.

Identifying the data sources is a task you should do in this step. Should check whether the data sources are reliable, legal and ethical to use in your application.

Data Understanding –

I would suggest you to do the following steps to get to know your data better.

  1. Data Definition – A detailed description on each data field in the data source. The notations of the data points, the units that the data points have been measured would be the cases you should consider about.
  2. Data Visualization – Hundreds or thousands of numerical data points may not give a clear idea for you what the data is about or an idea about the shape of your data. You may able to find interesting subsets of your data after visualizing it. It’s really easy to see the clustering patterns or the trending nature of the data in a visualized plot.
  3. Statistical analysis – Starting from the simple statistical calculations such as mean, median; you can calculate the correlation between each data field and it will help you to get a good idea about the data distribution. Feature engineering to increase the accuracy of the machine learning model. For performing that a descriptive statistical analysis would be a great asset.

For data understanding, The Interactive Data Exploration, Analysis and Reporting tool (IDEAR) can be used without getting the hassle of doing all the coding from the beginning. (Will discuss on IDEAR in a long run soon)

Data Preparation –

Data preparation would take roughly 80% of your time of the process implying it’s the most vital part in building predictive models.

This is the phase where you convert the raw data that you got from the data sources for the final datasets that you use for building the ML models. Most of the data you got from raw sources like IoT sensors or collectives are filled with outliers, contains missing values and disruptions. In the phase of data preparation, you should follow data preprocessing tasks to make those data fields usable in modeling.

Modeling –

Modeling is the part where algorithms comes to the scene. You can train and fit your data to a particular predictive model to perform the deserved prediction. You may need to check the math behind the algorithms sometimes to select the best algorithm that won’t overfit or underfit the model.

Different modeling methods may need data in different forms. So, you may need to revert back for the data preparation phase.

Evaluation –

Evaluation is a must before deploying a model. The objective of evaluating the model is to see whether the predictive model is meeting the business objectives that we’ve figured out in the beginning. The evaluation can be done with many parameter measures such as accuracy, AUC etc.

Evaluation may lead you to adjust the parameters of the model and might have to choose another algorithm that performs better. Don’t expect the machine learning model to be 100% accurate. If it is 100% most probably it would be an over fitted case.

Deployment –

Deployment of the machine learning model is the phase where the client, or the end user going to consume. In most of the cases, the predictive model would be a part of an intelligent application that acts as a service that gets a set of information and give a prediction as an output of that.

I would suggest you to deploy the module as a single component, so that it’s easy to scale as well as to maintain. APIs / Docker environments are some cool technologies that you can adopt for deploying machine learning models.

CRISP-DM won’t do all the magic of getting a perfect model as the output though it would definitely help you not to end up in a dead-end.

Democratizing Machine Learning with Cloud

HiRes.jpg.800x600_q96We have already passed the era of gigabytes when it comes to data. World is talking about terabytes of unstructured data and massive amounts of data points generated from IoT devices and sensors in millions per a second. To analyze these heaps of data, obviously, we need large computation power and massive storage. Building workhorse machines to fulfil those tremendous workloads would definitely cost a lot. Cloud computing paradigm comes handy here. The resourcefulness and the scalability of the public cloud can be used to perform the large calculations in machine learning algorithms.

Almost all the major public cloud providers in the market comes up with machine learning services. Cloud machine learning services in Google Cloud Platform provides modern machine learning services, with pre-trained models and a service to generate your own tailored models. Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology. IBM analytics comes up with a machine learning platform with its cloud data services. Azure Machine Learning Studio is a GUI-based integrated development environment for constructing and operationalizing Machine Learning workflow on Azure. We discussed a lot about Azure Machine Learning and its appliances in practical scenarios in the previous posts.

All the mentioned platforms provide machine learning as a service. Most of the platforms offer pre-built ML algorithms in packages. Simple drag and drop user interactions and easy deployment has attracted many developers to use these tools.

But, how would it be if you want to go from the scratch? Either you want to use the power of Graphical Processing Units (GPUs) to process the ML algorithms parallelly? Cloud based Virtual Machines specifically optimized for computation is one of the best solutions that you can consume.

Azure Data Science Virtual Machine (DSVM) –

dsvm

DSVM in Azure Portal

If you already have used Azure virtual machines for your computation, hosting or storage tasks, this would not be a new concept for you. Azure DSVM is specifically optimized for large computations. Azure DSVM comes in two flavors. One with Windows and the other with Linux. You can choose the hardware configurations as you wish. Many development environments, programming IDEs, languages are pre-installed in the VM instances.

dsvm_linuxMy personal favorite here is the Linux DSVM instance. Here I’ve created a Linux DSVM with the basic configurations. For accessing the VM you can use any tool that can do a SSH call. What I normally do is calling the accessing the VM using Ubuntu Bash on Windows 10.

GPUs for machine learning –

GPU_1

GPU_2

Configurations of the Linux VM with Nvidia GPU

Many machine learning algorithms currently available can be executed parallely. Execution parts of those algorithms are embarrassingly parallel. With that parallel programming, you can reduce the execution time of the algorithms drastically. Data scientists in both industry and academia have been using GPUs for machine learning to make groundbreaking improvements across a variety of applications including image classification, video analytics, speech recognition and natural language processing.

google_brain

GPUs Vs. CPU computing

Specially in Deep Learning, parallel processing using GPUs can make a drastic decrease in computation time. Purchasing a deep learning dream machine powered with a CUDA enabled high-end GPU such as Nvidia Tesla K80 would cost nearly 6000 dollars! Rather than spending a lot on a machine like that, the most feasible plan is to provision a virtual machine with the specifications we need and pay as we consume.

VM_size

VM instance price plans

The N-series is a family of Azure Virtual Machines with GPU capabilities that you can use for these kinds of tasks. The N-series will feature the NVIDIA Tesla accelerated platform as well as NVIDIA GRID 2.0 technology, providing the highest-end graphics support available in the cloud today. Through your Azure portal, you can choose a desired price plan with the desired configurations for your tasks when provisioning the VM.

teslaHere’s my Azure VM specifically configured for deep learning exercises. The machine is powered with Tesla K80 GPU which is having 4992 cores in it!! I installed anaconda for that and doing computations using Jupyter notebooks.

Just a hint: stop your VM instance when you are not using it for computation to avoid getting huge unnecessary bills. 😉

No need of huge wallets! The wise decision would be applying cloud technologies for machine learning.

SQL support in R tools for Visual Studio

If you have any kind of interest in data science or machine learning, you’ll probably found out that R language is the ultimate survivor. If you are a developer familiar with Visual Studio, you don’t have to adopt for RStudio again. You can code R inside VS!

R Tools for Visual Studio (RTVS) recently released the 0.5 version. One useful feature comes with the new version is SQL integration. With that you can directly import the data loads in your SQL database to a R environment. SQL queries can help you to fetch the data that you want. You can easily play with the data using R then.

First, you have to have Visual Studio 2015 with update 3. (Visual Studio 2015 Comunity edition is freely available to download) Update your VS if you haven’t done it yet and download RTVS 0.5 from here & install it.
https://aka.ms/rtvs-current

1
In your R project you can add SQL Query item (Right click on solution explorer and “Add new item”) which is created as a *.sql file.

2
On the top of the panel you can connect the database using “connect” icon. There you should configure the server name, server authentication and the database details.

3

Inside the .sql file you can execute the typical SQL queries to fetch data from the SQL database. One main advantage of this is, by enabling the execution plan you can analyze and optimize the SQL query you written.

4

Adding a database connection for the R project –

Go to R tools -> Data -> Add Database Connectionconnection-prop
Provide the authentication details of the database that you want to access. Then test the connection using “Test Connection” button. After clicking ‘ok’, you can see the database connection string is automatically generated inside settings.R file. Within the R code you can access for data inside the particular database as shown in the following example code.

final-screen

The str() output is shown in the R console

The example shows the code used for accessing the data in ‘Iris Data’ table inside ‘DMDatasets’ database placed in the local SQL server. Make sure to install “RODBC” R package to use the database related functions inside R.

#Need RODBC package to extablish the ODBC database iterface
install.packages("RODBC")
require("RODBC")

#Auto-generated Settings.R file should be added as a source
#The connection string contains in this file  
source("Settings.R")
conn <- odbcDriverConnect(connection = dbConnection)

#To get the tables of particualr database
tbls <- sqlTables(conn, tableType = "TABLE")
print(tbls)

#The SQL query is used to fetch data from the table 
sql <- "SELECT * FROM [dbo].[Iris Data]"
df <- sqlQuery(conn, sql)
str(df)
#plotting the dataset
plot(df) 

No need of switching developer environments to handle your coding as well as data analytics tasks. Just keep Visual Studio as your default IDE! 🙂