What’s Best for Me? – 5 Data Analytics Service Selection Scenarios Explained

With the extensive usage of cloud-based technologies to perform machine learning and data science related experiments, choosing the right toolset/ platform to perform the operations is a key part for the project success.

Since selecting the perfect toolset for our ML workloads maybe bit tricky, I thought of sharing my thoughts on that by getting a couple of generic use cases. Please keep in mind that the use cases I have chosen and the decisions I’m suggesting are totally my own view on the scenarios and this may differ based on different factors (amount of data, time frame, allocated budget, ability of the developer etc.) you have with your project. Plus, the suggestions I’m pointing out here are from the services comes with Microsoft Azure cloud. This maybe the easily adjusted for other cloud providers too.

Scenario 1:

We are a medium scale micro financing company having our data stored on Microsoft Azure. We have a plan to build a datalake and use that for analytical and reporting tasks. We have a diverse data team with abilities in python, Scala and SQL (most of the data engineers are only familiar with SQL). We need to build a couple of machine learning models for predictions. What would be the best platform to go forward with? Azure Databricks or Azure ML Studio?

Suggestion: Azure Databricks

Reasons:

  • Databricks is more flexible in ETL and datalake related data operations comparing to AzureML Studio.
  • You can perform data curation and machine learning within a single product with Azure Databricks.
  • Databricks can connect with Azure Data Factory pipelines to handle data flow and data curation tasks within the datalake.
  • Since the data engineers are more familiar with SQL, they’ll easily adapt with SparkSQL on Databricks.
  • Data team can develop their machine learning experiments with any language of their choice with Databricks notebooks.
  • Databricks notebooks can be used for analytical and reporting tasks even with a combination of PowerBI.
  • Given that, the company is planning for building a datalake, Databricks is far more flexible in ETL tasks. You can use Azure Data Factory pipelines with Databricks to control the data flow of the datalake.

Scenario 2:

I’m a computer science undergrad. I’m doing a software project to predict several types of wildflowers by capturing images from a mobile phone. I’m planning to build my computer vision model using TensorFlow and Keras and expose the service as a REST API.  Since I’m not having the infrastructure to train the ML models, I’m planning to use Azure for that. Which tool on Azure should I choose?

Suggestion: Azure ML Studio

Reasons:

  • AzureML provides a complete toolset to train, test and deploy a deep learning model using any open-source framework of your choice.
  • You can use the GPU training clusters on AzureML to train your models.
  • It’s easy to log your model training and experiments using AzureML python SDK.
  • AzureML gives you the ability for model management and exposing the trained model as a REST API.
  • Small learning curve and adaptability.

Scenario 3:

I’m the CEO of a retail company. I’m not having a vast experience with computing or programming but having a background in maths and statistics. I have a plan to use machine learning to perform predictive analysis with the data currently having in my company. Most of the data are still in excel! Someone suggested me to use Azure. What product on Azure should I choose?

Suggestion: Azure ML Studio

Reasons:

  • For a beginner in machine learning and data science, Azure ML Studio is a good start.
  • AzureML Studio provides no-code environments (Azure ML designer and AutoML) to develop ML models.   
  • Since, you are mostly in the experimental stage and not going for using bigger datasets, using Databricks would be an overkill.
  • You can easily import your prevailing data and start experimenting and playing around with them without any local environmental setup.

Scenario 4:

I’m the IT manager of a large enterprise who are heavily relying on data assets with our decision-making process. We have to run iterative jobs daily to retrieve data from different external sources and internal systems. Currently we have an on-prem SQL database acting as the data warehouse.  Company has decided to go for cloud. Can Azure serve our needs?   

Suggestion: Yes. Azure can serve your need with different tools in the data & AI domain.

Reasons:

  • You can use Azure Synapse Analytics or Azure Data Factory to build data pipelines and perform ETL operations.
  • The local data warehouse can be easily migrated to Azure cloud.
  • You can use Azure Databricks in-order to perform analytics tasks.
  • Since the enterprise in large and scaling, using Databricks would be better with its Spark based computation abilities.

Scenario 5:

We are an agricultural company growing forward with adopting modern Agri-tech into the business. We collect numerous data values from our plantations and store them in our cloud databases. We have a set of data scientists working on data modelling and building predictive models related to crop fertilizing and harvesting. They are currently using their own laptops to perform analysis and it’s troublesome with the data amount, platform configurations and security. Will Azure ML comes handy in our case?

Suggestion: Yes. Azure ML Studio would be a good choice.

Reasons:

  • AzureML can be easily adaptable as an analytical platform.
  • The cloud databases can be connected to AzureML, and data scientists can straight-up start working on the data assets.
  • AzureML is relatively cheap comparing to Databricks (Given the data amount is manageable in a single computer.)
  • It’s easy to perform prototyping of models using AutoML/ AzureML Designer and then implement the models within a short time frame.  

Generally, these are the factors I would keep in mind when selecting the services for ML/ data related implementations on Azure.

  • Azure ML studio is good when you are training with a limited data, though Azure ML provides training clusters, the data distribution among the nodes is to be handled in the code.
  • AzureML Studio comes handy in prototyping with AzureML designer and Automated ML.
  • Azure Databricks with its RDDs is designed to handle data distributed on multiple nodes which is advantageous when your you have big datasets.
  • When your data size is small and can fit in a scaled up single machine/ you are using a pandas dataframe, then use of Azure Databricks is an overkill.
  • Services like Azure Data Factory and Datalake storage can be easily interconnected for building  

Let me know your thoughts on these scenarios as well. Add your queries in the comments too. I’ll try my best to provide my suggestions for those use cases.

ml.azure.com – New Face of Azure Machine Learning

Azure Machine Learning Studio (preview) interface

Out of all the public cloud platforms, Microsoft Azure has adopted all most all the steps in machine learning life cycle into cloud. Though the resources and the abilities are there, sometimes finding the correct cloud-based product or service to adopt for your solution might be a problem.

Providing a perfect answer for that issue, Azure has come up with the whole new Azure Machine Learning Studio which is in the preview by the time am blogging this! (Don’t get confused with the AzureML Studio, the drag and drop interface we had before. This is a new thing – ml.azure.com ) There’s no framework dependency or restrictions for using these services. You can easily adapt your open source machine learning code base (may be written with Python, sci-kit learn, TensorFlow, PyTorch, Keras… anything)  

Most awesome feature of this new service is the AzureML python SDK (https://docs.microsoft.com/en-us/python/api/overview/azure/ml/intro) and AzureML R SDK (https://github.com/Azure/azureml-sdk-for-r) . These SDKs allow you to create and manage the ML experiments with your familiar coding style.  

In order to use this one-stop solution in Azure you may have to create an Azure Machine Learning Service from Azure portal. Then it’ll direct you for the new interface. You can either go for the Enterprise pricing tier or for the Basic tier. In basic tier you won’t get the visual designer and Automated ML features.

Launching the Studio through Azure portal

Let’s go through each and every tab we got in the side pane in the latest release and see what can we do with them.

Notebooks –

These are fully managed Jupyter notebook instances on cloud. These notebook servers are running on top of a new VM instance type called “notebookVM”. There notebookVMs are fully configured work environments to do your machine learning and data science tasks. No need to worry about installing all the python packages and its dependencies. All are already there! You have the privilege to change the notebook sizes (Yes GPU enabled VMs are also there) or install new packages through a python package manager too.

Automated ML –

Not available in the free tier though. A process of selecting the best suited algorithm for the dataset you are having. Right now, this is supporting classification, regression and time series forecasting for tabular data formats. Not supported for deep learning based computer vision applications. Deep learning based text analysis is also in preview. The Automated ML process runs a set of machine learning algorithms on top of your provided data and see which one gives the best accuracy metric. Good for building prototypes and even in some cases might be in production.

Designer –

This is the evolution of Azure ML Studio (Old drag and drop thing). Seems like Microsoft is going to end its lifecycle and give the new Designer be its replacement. Here you can build the complete ML workflow by dragging and dropping modules. If you want can integrate R or python scripts in the experiment. The machine learning service endpoint can be exposed through a Azure Kubernetes Service (AKS) deployment.

Datasets –

The place to manage and version your datasets. Datasets can be either tabular or file based. Here you can profile your dataset by performing a basic statistical analysis on your data. If your dataset is sitting on a datastore (which we going to discuss later) this acts as a high-level encapsulation of that data.

Experiments –

You may execute several runs on the same experiment with different configurations. This is the place where you can see all the log files of them and compare the runs with each other.

Pipelines –

Don’t get confused Azure machine learning pipelines with the Azure pipelines.  Azure ML pipelines are specifically designed for MLOps tasks. You can manage the whole experiment process till production using ML pipelines. These pipelines are reusable and help collaborative development of the solution.

Models –

You can register the trained ML models here. Versioning the models, managing which model to go for production are some use cases of this model registry. You can register models that has been trained outside the particular Azure ML workspace too here.

Endpoints –

Endpoint of an azure machine learning experiment can be a web service or an IoT module endpoint. Managing the endpoint keys etc. are performed in this section.

Compute –

In most of the cases, you may use Azure for computations. Here in the Compute section you can create and manage the following compute resource types.

  • Notebook VMs – as we discussed previously in the Notebooks section this is a fully managed ML development environment suits for development and prototyping purposes.
  • Training clusters – You can make a either a CPU based or a GPU based cluster for running your experiments. Note that this would be charged according to the computation hours as well as for the number of nodes you are using. Good thing is there’s no charge when you are not using the cluster for computation.
  • Inference clusters – This talk about AKS clusters where you can deploy the endpoints. Even you can register a prevailing AKS cluster as an inference cluster.
  • Attached compute – If you working with Azure Databricks, Data Lake Analytics or HDInsight you can configure the computation here. In an interesting use case if you want to attach your physical computer (which should be a workstation running Ubuntu) as a compute target it’s also possible through the AML service.

Datastores –

When it comes to machine learning experiments its normal to have large amount of data. These data may sit in your Azure storage. Datastore is the storage abstraction over an Azure storage account which then you can use inside your machine learning experiments.

Data labeling –

A cool new feature for data annotators. Right now, this supports Image classification in multi-label / multi-class and object identification (bounding box) annotations. The annotator should not have to have an Azure subscription. You can easily outsource your tedious annotation workload through this feature.

This is just an overview of the options we are having with new Azure Machine Learning Studio. It’s pretty clear that Azure team is going to get all the ML related services under one umbrella. Let’s discuss some cool use cases and tips on using these services in next blog posts.

Happy coding! 😊