Using Filterable Fields in Azure AI Search for RAG-based GenAI Applications

Posted on August 1, 2024 by Haritha Thilakarathne

With the huge hype Retrieval Augmentation Generation (RAG) in the GenAI applications, vector databases play a vital role in these applications.

Azure AI Search is a leading cloud-based solution for indexing and querying a variety of data sources and a vector database, which is widely used in production-level applications.

Similar to any cloud resource, Azure AI Search has different pricing tiers and limitations. For instance, if you choose the Standard S1 pricing tier, you can create a maximum of 50 indexes.

Azure AI Search Pricing: https://azure.microsoft.com/en-au/pricing/details/search/

Recently, a use case arose where I had to create more than 50 individual indexes in a single AI Search resource. It’s a use case for a RAG-based application for an e-library, where each book should be indexed and have a logical separation between them.

The easiest way forward would be to create a separate index inside the vector database for each book. This was not feasible since the maximum number of indexes in S1 was 50, more than the number of books in the library. Being cost-conscious, there was no room for going for Standard S2 or a higher pricing tier. The actual storage required for all the indexes was not even 100GB. So, S1 was the choice to go with an approach to having a separation between the embeddings of each book.

My approach was to add an additional metadata field for the index and make it a filterable field. Then, the book’s name can be added as a value, which can then be used to filter the particular book when we query it through the API.

The embedding was done using the ada-002 embedding model and GPT-4o was used as the foundational model within the application.

Let’s walk through how that was done with the aid of the LangChain framework.

01. Import required libraries

We use the LangChain Python library to orchestrate LLM-based operations within the application.

import os
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader


from azure.search.documents.indexes.models import (
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SimpleField
)

02. Initiate the embedding model

The ada-002 model has been used as the embedding model of this application. You can use any embedding model of choice here.

embeddings: AzureOpenAIEmbeddings = AzureOpenAIEmbeddings(
    azure_deployment= EMBEDDING_MODEL,
    openai_api_version= AZURE_OPENAI_API_VERSION,
    azure_endpoint= AZURE_OPENAI_ENDPOINT,
    api_key= AZURE_OPENAI_API_KEY,
)
embedding_function = embeddings.embed_query

03. Create the structure of the index within the vector database

We create the structure of the index by configuring an additional metadata field for it. This should be filterable since we will filter the content in the index with the value of it.

index_fields = [
    SimpleField(
        name="id",
        type=SearchFieldDataType.String,
        key=True,
        filterable=True,
    ),
    SearchableField(
        name="content",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=len(embedding_function("Text")),
        vector_search_profile_name="myHnswProfile",
    ),
    SearchableField(
        name="metadata",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    # Additional field to store the name of the book
    SearchableField(
        name="book_name",
        type=SearchFieldDataType.String,
        filterable=True,
    )
]

04. Create the Azure AI Search vector database

We create the Azure AI Search vector database with custom field configurations that were initiated before.

vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint= AZURE_SEARCH_ENDPOINT,
    azure_search_key= AZURE_SEARCH_KEY,
    index_name= "index_of_books",
    embedding_function= embedding_function,
    fields = index_fields
)

05. Loading documents from a local directory

Any loader available in LangChain can be used to load the content. In this context, the pages array contains all the pages from a single book.

loader = DirectoryLoader("../", glob="**/*")
pages = loader.load()

06. Add the additional metadata field for each document object

The content of the book is read as LangChain documents. We should add a value for the custom filterable field that we initiated in the structure. We can assign the book’s name for that field by running through a simple loop. The value of it should be changed when the pages of a new book is loaded through the loader.

for page in pages:
    metadata = page.metadata
    metadata["book_name"] = "Oliver_Twist"
    page.metadata = metadata

07. Adding the embeddings to the vector store

The customised content can be now added to the vector storage

vector_store.add_documents(pages)

The Retrieval Process

As mentioned in the use case, the usage of adding a customised filterable field for the index is to retrieve the required documents only when answering a user query. For instance, in this use case if we only want to get answers from the book “Oliver Twist” we should only read the embeddings from that particular book. This can be done using the filter argument when parsing this through the OpenAI API. Here’s the sample body of the JSON request I sent for the API to get the filtered content. The filtration follows the OData $filter syntax.

  {  
  "data_sources": [
    {
      "type": "azure_search",
      "parameters": {
        "filter": "book_name eq 'Oliver_Twist'",
        "endpoint": "https://<SEARCH_RESOURCE>.search.windows.net",
        "key": "<AZURE_SEARCH_KEY>",
        "index_name": "index_of_books",
        "semantic_configuration": "azureml-default",
        "authentication": {
          "type": "system_assigned_managed_identity",
          "key": null
        },
        "embedding_dependency": null,
        "query_type": "vector_simple_hybrid",
        "in_scope": true,
        "role_information": "You are an AI assistant find information from the books in the library.",
        "strictness": 3,
        "top_n_documents": 4,
        "embedding_endpoint" : "<EMBEDDING_MODEL>",
        "embedding_key": "<AZURE_OPENAI_API_KEY>"
      }
    }
  ],
  "messages": [
    {
      "role": "system",
      "content": "You are an AI assistant find information from the books in the library."
    },
    {
      "role": "user",
      "content": "Please provide me with the summary of the book."
    }
  ],
  "deployment": "gpt-4o",
  "temperature": 0.7,
  "top_p": 0.95,
  "max_tokens": 800,
  "stop": null,
  "stream": false
}

Note that on line 6, we use the filter field to retrieve only the embeddings with the particular book name. Multiple filterable fields are also possible, and they can be used in complex applications. You should remember that using filterable fields makes the search a bit slow but convenient in the use cases where you need a logical separation and filtration capability for the embeddings within the vector store.

Happy to hear about interesting use cases you came up with similar patterns. 🙂

What’s next after ChatGPT?

Posted on January 31, 2024 by Haritha Thilakarathne

A Large Language Model as a Van Gogh Artwork (Dall-E2)

The hype on Generative AI is still there. Everyone is looking for applications of GenAI and investing to get a competitive advantage for businesses with AI-based interventions. These are a couple of questions I came across, as well as my view on LLMs and their future.

Can we achieve everything with LLMs now?

The straightforward answer is NO! LLMs can’t handle all tasks and are best suited as tools for most natural language processing tasks, particularly in information retrieval and conversational applications. Despite their strengths, there are still plenty of simple approaches that find practical use in real-world scenarios. In simpler terms, LLMs have their unique use cases, but they can’t do everything like a wizard!

Are LLMs taking over the tech world?

Is this the end of traditional ML? Not at all. As mentioned earlier, LLMs don’t cover all machine learning tasks. Most data analytical and machine learning use cases involve numerical data often organized in a relational structure, where traditional machine learning algorithms excel. Traditional machine learning techniques are expected to remain relevant for the foreseeable future.

Artificial General Intelligence (AGI)?

Have we reached it? General Artificial Intelligence (General AI) envisions AI systems with human-like abilities across various tasks. However, we are not there yet. While there’s a possibility of achieving some level of AGI, current LLMs, including applications like ChatGPT, should not be confused with AGIs. LLMs are proficient in predicting text frequencies using transformers but struggle with complex analytical tasks where human expertise is crucial.

Are enterprises ready for the AI hype?

Having worked with numerous enterprises, I’ve observed a willingness to invest in AI projects for streamlining business processes. However, many struggle to identify suitable use cases with a considerable Return on Investment (RoI). Some organizations, even if prepared for advanced analytics, face extensive groundwork in their IT and data infrastructure. Despite these challenges, the AI hype has prompted businesses to recognize the potential of leveraging organizational data resources effectively. In the coming months, we anticipate a significant boost not only in LLM-based applications but also in traditional machine learning and deep learning applications across industries.

Ethical AI? What’s happening there?

With the public’s increasing adoption of ChatGPT and large language models, conversations about responsible AI use have gained traction. The European Union has passed pioneering AI legislation, and Australia is actively working on regulating AI systems and establishing ethical AI guidelines. Countries like Australia have introduced AI ethics frameworks and established National AI Centres to promote responsible AI practices and innovation.

Leading companies like Microsoft are contributing to responsible AI by introducing guidelines and toolboxes for transparent machine learning application development. Governments and corporations are moving towards regulating and controlling AI applications, a positive development in ensuring responsible AI use.

There’s no turning back now. We must all adapt to the next wave of AI and prepare to harness its full potential.

Optimising Spark Analytics: The PartitionBy Function

Posted on August 25, 2023 by Haritha Thilakarathne

Apache Spark is an open source, distributed computing framework designed for big data processing and analytics. Spark was created to address the shortcomings of MapReduce in terms of use and versatility.

Recently I was using Spark framework (Through PySpark API) on Azure Synapse Analytics extensively for analysing bulks of retail data. Following better optimising approaches makes a huge difference in runtimes when it comes to Spark queries.

Among several Spark optimisation methods, I thought of discussing the importance of data partitioning in Spark and ways we can partition data.

Most of the enterprises are adapting data Lakehouse architecture today and it’s common that most of the data is stored in parquet format. Parquet is a columnar storage file format which is designed to optimise performance when dealing with large analytical workloads.

With some experiments I did with large datasets, I found “partitionBy” option with parquet in Spark makes a huge difference in query performance in most of the use cases. It specifies how the data should be organized within the storage system based on one or more columns in your dataset.

Using “partitionBy” column values has several advantages.

Improved Query Performance: When you partition data by specific column values, Spark can take advantage of partition pruning during query execution. This means that when you run a query that filters or selects data based on the partitioned column, Spark can skip reading irrelevant partitions, significantly reducing the amount of data that needs to be processed. This results in faster query execution.

Optimized Joins: If you often join tables based on the partitioned column, partitioning by that column can lead to more efficient join operations. Spark can perform joins on data that is collocated in the same partitions, minimizing data shuffling and reducing the overall execution time of the join operations.

Reduced Data Skew: Data skew occurs when certain values in a column are much more frequent than others. By partitioning data, you can distribute the skewed values across multiple partitions, preventing a single partition from becoming a bottleneck during processing.

Simplified Data Management: Each partition represents a distinct subset of your data based on the partitioned column’s values. This organization simplifies tasks such as data backup, archiving, and deletion, as you can target specific partitions rather than working with entire datasets.

Enhanced Parallelism: Different partitions can be processed concurrently by separate tasks or executors, maximizing resource utilization and speeding up data processing tasks.

Facilitates Time-Based Queries: If you have time-series data, partitioning by a timestamp or date column can be especially useful. It allows you to efficiently perform time-based queries, such as retrieving data for a specific date range, by only reading the relevant partitions.

Easier Maintenance: Partitioning can simplify data maintenance tasks, like adding new data or updating existing data. You can easily add new partitions or drop old ones without affecting the entire dataset.

Let’s walk through a practical experience to understand the advantages of using partitionBy column values in Spark.

Scenario: Imagine you’re working with a large dataset of sales transactions from an e-commerce platform. Each transaction has various attributes, including a timestamp representing the transaction date. You want to analyse the data to find out which products sold the most during a specific date range.

Step 1 – Partitioning by Date: First, partition the sales data by the transaction date using the partitionBy command in Spark when writing the data to the storage. Each partition represents a specific date, making it easy to organize and query the data.

# Specify the partition column
partition_column = "transaction_date"

# Write the DataFrame to Parquet with partitionBy
df.write.partitionBy(partition_column).parquet("/path/to/output/directory")

Step 2 – Query for Date Range: Let’s say you want to analyse sales for a specific date range. When you run your PySpark query, it automatically prunes irrelevant partitions and only reads the partitions that contain data for the selected date range.

# Read the partitioned data from Parquet
partitioned_df = spark.read.parquet("/path/to/output/directory")

# Query data for a specific date
specific_date = date(2023, 7, 1)
result = partitioned_df.filter(partitioned_df[partition_column] == specific_date)

result.show()

In summary, using partitionBy column values in Spark can significantly enhance query performance, optimize joins, improve data organization and management, and provide advantages in terms of data skew handling and parallelism. It’s a powerful technique for optimizing Spark jobs, particularly when dealing with large and diverse datasets.

Share your experience and approaches you use when it comes to optimising Spark query performance.

Empowering Conversations with ChatGPT: The Art of Prompt Engineering

Posted on July 22, 2023 by Haritha Thilakarathne

The release of ChatGPT in November 2022 gave a mega boost for the world of AI. With that, everyone started playing around with Large Language Models (LLMs) and explore the possibilities of them.

While opening up interesting possibilities, new approaches and methodologies for interacting with LLMs came into the play. Prompt engineering is one of the major domains even the general public is interested in right now.

To get myself familiar with the context and not to feel left behind, I followed several online resources including the Prompt Engineering for developers from Deeplearning.ai. Here are some interesting points caught my attention. This brief would be helpful for anyone who’s using ChatGPT or any LLM not only in development scenarios, even for general daily chaos.

What’s actually Prompt Engineering?

Without any hesitation, I asked this question from ChatGPT 😀

Prompt engineering refers to the process of designing and refining prompts to elicit desired responses from a language model like ChatGPT. It involves crafting the initial instruction or query in a way that effectively communicates the desired task or information to the model. Prompt engineering can significantly influence the quality and relevance of the model’s responses.
ChatGPT

In simple terms,

Prompt engineering is the methodology of giving clear instructions to LLMs. It helps the language model to understand the instructions clearly and provide a better precise output.

LLMs in the GPT family (GPT-3, GPT3.5, GPT4 etc.) are trained to predict the next word occurrence of a given output. given that, the instructions we provide to the model should be specific and understandable. There are no hard bound rules for prompting the instructions, but the most precise would be better.

There are two key principles we should keep in mind when prompting.

Write clear and specific instructions.
Give the model time to “think”.

Be clear! Be Precise!

There are many tactics which we can follow in order to make out instructions clear and easy to understand for an LLM.

Use delimiters to indicate distinct parts of the input

Let’s get the example of using a GPT model to summarize a particular text. It’s always better to clearly indicate which text parts of the prompt is the instruction and which is the actual text to be summarized. You can use any delimiter you feel comfortable with. Here I’m using double quotes to determine the text to summarised.

Ask for structured outputs

When it comes to LLM aided application development, we use OpenAI APIs to perform several natural language processing tasks. For an example we can use the GPT models to extract key entities, and the sentiment from a set of product reviews. When using the output of the model in a software system, it’s always easy to get the output from a structured format like JSON object or in HTML.

Here’s a prompt which gets feedback for a hotel as the input and gives a structured JSON output. This comes handy in many analytics scenarios and integrating OpenAI APIs in production environments.

It’s always good to go step by step

Even for humans, it’s better to provide instructions in steps to complete a particular task. It works well with LLMs too. Here’s an example of performing a summarization and two translations for a customer review through a single prompt. Observe the structure of the prompt and how it’s guiding the LLM to the desired output. Make sure you construct your prompt as procedural instructions.

Here’s the output in JSON.

{
  "review": "I recently stayed at Sunny Sands Hotel in Galle, Sri Lanka, and it was amazing! The hotel is right by the beautiful beach, and the rooms were clean and comfy. The staff was friendly, and the food was delicious. I also loved the pool and the convenient location near tourist attractions. Highly recommend it for a memorable stay in Sri Lanka!",
  "English_summary": "A delightful stay at Sunny Sands Hotel in Galle, Sri Lanka, with its beautiful beachfront location, clean and comfortable rooms, friendly staff, delicious food, lovely pool, and convenient proximity to tourist attractions.",
  "French_summary": "Un séjour enchanté à l'hôtel Sunny Sands à Galle, Sri Lanka, avec son emplacement magnifique en bord de plage, ses chambres propres et confortables, son personnel amical, sa délicieuse cuisine, sa belle piscine et sa proximité pratique des attractions touristiques.",
  "Spanish_summary": "Una estancia encantadora en el hotel Sunny Sands en Galle, Sri Lanka, con su hermosa ubicación frente a la playa, habitaciones limpias y cómodas, personal amable, comida deliciosa, hermosa piscina y ubicación conveniente cerca de atracciones turísticas."
}

Hallucinations are one of the major limitations LLMs are having. Hallucination occurs when the model produces coherent-sounding, verbose but inaccurate information due to a lack of understanding of cause and effect. Following proper prompt methodologies and tactics can prevent hallucinations to some extend but not 100%.

It’s always the developer’s responsibility to develop AI systems follow the responsible AI principles and accountable for the process.

Feel free to share your experiences with prompt engineering and how you are using LLMs in your development scenarios.

Happy coding!

Do we really need AI?

Posted on June 24, 2023 by Haritha Thilakarathne

Since the launch of ChatGPT in last November, not only the tech community, but also the general public started peeping into the world of AI. As mentioned in my article “AI summer is Here”, organisations are looking for avenues where they can use the power of AI and advance analytics to empower their business processes and gain competitive advantage.

Though everyone is looking forward for using AI, are we really ready? Are we there?

These are my thoughts on the pathway an organisation may follow to adopt AI in their business processes with a sensible return on investment.

First of all, there’s a key concept we should keep in mind. “AI is not a wizard or a magical thing that can do everything.” It’s a man-made concept build upon mathematics, statistics and computer science which we can use as a toolset for certain tasks.

We want to use AI! We want to do something with it! OR We want to do everything with it!

Hold on… Though there’s a ‘trend’ for AI, you should not jump at it without knowing nothing or without analysing your business use cases thoroughly. you should first identify what value the organization is going to gain after using AI or any advance analytics capability. Most likely you can’t do everything with AI (yet). It’s all about identifying the correct use case and correct approach that aligns with your business process.

Let’s not focus on doing something with AI. Let’s focus on doing the right thing with it.

We have a lot of data! So, we are there, right?

Data is the key asset we have when it comes to any analytical use case. Most of the organizations are collecting data with their processes from day 1. The problem lies with the way data is managed and how they maintain data assets and the platform. Some may have a proper data warehouse or lake house architecture which has been properly managed with CI/CD concepts etc, but some may have spread sheets sitting on a local computer which they called their “data”!

The very first thing an organization should do before moving into advance analytics would be streamlining their data platform. Implementing a proper data warehouse architecture or a data lake architecture which follows something similar to Medallion architecture would be essential before moving into any analytics workloads.

If the organization is growing and having a plan to use machine learning and data science within a broader perspective, it is strongly recommended to enable MLOps capabilities within the organization. It would provide a clear platform for model development, maintenance and monitoring.

Having a lot of data doesn’t mean you are right on track. Clearing out the path and streamlining the data management process is essential.

Do we really need to use AI or advance analytics?

This question is real! I have seen many cases where people tend to invest on advance analytics even before getting their business processes align with modern infrastructure needs. It’s not something that you can just enable by a click of a button. Before investing your time and money for AI, first make sure your IT infrastructure, data platforms, work force and IT processes are up to date and ready to expand for future needs.

For an example, will say you are running a retail business which you are planning to use machine learning to perform sales forecasts. If your daily sales data is still sitting on a local SQL server and that’s the same infrastructure you going to use for predictive analytics, definitely that’s going to be a failure. First make sure your data is secured (maybe on a cloud infrastructure) and ready to expose for an analytical platform without hassling with the daily business operations.

ChatGPT can do everything right?

ChatGPT can chat! 😀

As I stated previously, AI is not a wizard. So as ChatGPT. You can use ChatGPT and it’s underlying OpenAI models mostly for natural language processing based tasks and code generation and understanding (Keep in mind that it’s not going to be 100% accurate). If your use case is related to NLP, then GPT-3 models may be an option.

When to use generative AI?

Variational Autoencoders, Generative Adversarial Networks and many more have risen and continue to advance the field of AI. That has given a huge boost for the domain of generative AI. These models are capable of generating new examples that are like examples from a given dataset.

It is being used in diverse fields such as natural language processing (GPT-3, GPT-4), computer vision (DALL-E), speech recognition and many more.

OpenAI is the leading Generative AI service provider in the domain right now and Microsoft Azure offers Azure OpenAI, which is an enterprise level serving of OpenAI services with additional advantages like security and governance from Azure cloud.

If you are thinking about using generative AI with your business use case, strongly recommend going through the following considerations.

If you have said yes to all of the above, then OpenAI may be the right cognitive service to go with.

If it’s not the case, you have to look at other machine learning paradigms and off-the-shelf ML models like cognitive services which may cater better to the scenario.

Ok! What should be the first step?

Take a deep dive for your business processes and identify the gaps in digital transformation. After addressing those ground level issues, then look on the data landscape and analyse the possibilities of performing analytical tasks with the it. Start with a small non-critical use case and then move for the complex scenarios.

If you have any specific use cases in mind, and want to see how AI/ machine learning can help to optimize those processes, I’m more than happy to have a discussion with you.

btw, the image on the top is generated by DALL-E with the prompt of “A cute image of a robot toy sitting near a lake – digital art“

Managing Python Libraries on Azure Synapse Apache Spark Pools

Posted on May 5, 2023 by Haritha Thilakarathne

Azure Synapse Analytics is Microsoft’s one-stop shop that integrates big data analytics and data warehousing. It features Apache Spark pools to provide scalable and cost-effective processing power for Spark workloads which enables you to quickly and easily process and analyse large volumes of data at scale.

I have been working with Spark environment on Azure Synapse for a while and thought of documenting the experience I had with installing external python libraries for Spark pools on Azure. This guideline may come handy for you if you are performing your big data analytics experiments with specific python libraries.

Apache Spark pools on Azure Synapse Workspaces comes with Anaconda python distribution and with pip as a package manager. Most of the native python libraries used in the data analytics space are already installed. If you need any additional packages to be installed and used within your scripts, there are 3 ways you can do it on Synapse.

Use magic command on notebooks to install packages in session level.
Upload the python package as a workspace package and install in Spark pool.
Install packages using PIP or conda input file.

01. Use magic command on notebooks to install packages in session level.

Install packages for session level on Notebooks

This is the most simple and straight forward way of installing a python package in Spark session level. You just have to use the magic command followed up with the usual pythonic way of installing the package through pip or conda. Though this is easy for prototyping and quick experiments, it’s pointless if you are installing it over and over again when you start a new spark session. Better to avoid this method in production environments. Good for rapid prototyping experiments.

02. Upload the python package as a workspace package and install in Spark pool.

Upload python libraries as workspace packages

Azure Synapse workspace allows to have workspace packages uploaded and install on the Spark pools. It accepts python wheels (.whl files), jar files or tar.gz as packages.

After uploading the packages go for specific Apache Spark pool and then select the packages you want to install on it. The initial installation may take few minutes. (In my case, it took around 20 mins to install 3 packages)

With the experience I had with different python packages, I would stick with python wheels from pip or jars from official package distributions. I tried sentence-transformers tar.gz file from pyPI (https://pypi.org/project/sentence-transformers/ ). It gave me an error during the installation process mentioning a package dependency with R (which is confusing)

03. Install packages using PIP or conda input file.

If you are familiar with building conda environments or docker configurations, having a package list as a config file should not be new to you. You can specify either a .txt file or an .yml file with the desired package versions to be installed to the Spark cluster. If you want to specify a specific channel to get libraries, you should use a .yml file.

For an example if you need to install sentence-transformers package and spark-nlp python packages which are used in NLP for the Spark environment, you should add these two line in a .txt file and upload it as the input file for the workspace.

sentence-transformers===2.2.2
spark-nlp===4.4.1

I find this option as the most robust way of installing python packages to a Spark since it gives you the flexibility to select the desired package version and the specific channel to use during installation. It saves the time from installing the packages each time when the session get refreshed too.

Let me know your experience and the lessons learned during experiments with Apache Spark pools on Azure Synapse Analytics.

Here’s the Azure documentation on this for further reference.

Different Approaches to Perform Machine Learning Experiments on Azure

Posted on May 30, 2022 by Haritha Thilakarathne

We have discussed a lot about Azure Machine Learning Studio; the one-stop portal for all ML related workloads on Azure cloud. AzureML Studio provides different approaches to work on the machine learning experiments based on needs, resources and constraints you have. Selecting the plan to attack is completely your choice.

We all have our own way of performing machine learning experiments. While some prefer working on Jupyter notebooks, some are more into less code environments. Being able to onboard data scientists with their familiar development environment without a big learning overhead is one of the main advantages of AzureML.

In this article, let’s have a discussion on different methods available in AzureML studio and their usage in practical scenarios. We may discuss pros and cons of each approach as well.

Please keep in mind that, these are my personal thoughts based on the experiences I had with ML experiments and this may change in different scenarios.

Automated ML

As the name implies, this is all automated. Automated ML is the easiest way of producing a predictive model just in few minutes. You don’t need to have any coding experience to use Automated ML. Just need to have an idea on machine learning basics and an understanding on the problem you going to solve with ML.

The process is pretty straight forward. You can start with selecting the dataset you want to use for ML model training and specify the ML task you want to perform. (Right now, it supports classification, regression, time series forecasting. Computer vision and NLP tasks are in preview). Then you can specify the algorithms you want to test it with and other optional parameters. Azure does all the hard work for you and provides a deployment ready model which can be exposed as a REST API.

Pros:

Zero code process.
Easy to use and well suited for fast prototyping.
Eliminate the environment setup step in ML model development
Limited knowledge on machine learning is needed to get a production viable result.

Cons:

Limited machine learning capabilities.
Right now, only works with supervised learning problem scenarios.
Works well with relational data, computer vision and NLP are still in preview.
There’s no way of using custom machine learning algorithms in the process.

Azure ML Designer

Azure ML Designer is an upgraded version of the pretty old Azure ML Studio drag and drop version. Azure ML Designer is having a similar drag and drop interface for building machine leaning experiment pipelines. You have a set of prebuilt components which you can connect together in a flowchart like manner to build machine learning experiments. You have the ability to use SQL queries or python/ R scripts if you want in the process too. After training a viable ML model, you can deploy it as a web service with just few clicks.

I personally prefer this for prototyping. Plus, I see a lot f potential on Azure ML designer in educational purposes. It’s really easy to visualize the ML process through the designer pipelines and it increases the interpretability of the operation.

Pros:

Zero code/ Less code environment
Easy to use graphical interface
No need to worry on development/ training environment configurations
Having the ability to expand the capabilities with python/ R scripts
Easy model deployment

Cons:

Less flexibility for complex ML model development.
Less support for deep learning workloads.
Code versioning should handle separately.

Azure ML notebooks

This maybe the most favourite feature on Azure ML for data scientists. I know Jupyter notebooks are like bread and butter for data scientists. Azure ML offers a fully managed notebook experience for them without giving them the hassle of setting up the dev environment on local computers. You just have to connect the notebook with a compute instance and it allows you to do your model development and training on cloud in the same way you did on a local compute or elsewhere on notebooks.

I would recommend this as the to-go option for most of the machine learning experiments since it’s really easy to spin up a notebook instance and get the job done. Most of the ML related libraries are pre-installed on the compute instance and you even have the flexibility to install 3^rd party packaged you need through conda or pip.

Pros:

Familiar notebook experience on cloud.
Option to use different python kernels.
No need to worry about dev environment setup on local compute.
Can use the powerful compute resources on Azure for model training.
Flexibility to install required libraries through package managers.

Cons:

Comes with a price for computation.
No direct support for spark workloads.
Code version control should manage separately.

Developing on local and connect to AzureML service through AML Python SDK.

This is the option I would suggest for more advanced users. Think of a scenario where you have a deep learning based computer vision experiment to run on Azure with a complex code base. If this is the case, I would definitely use AzureML python SDK and connect my prevailing code base with the AzureML service.

In this approach, your code base sits on your local computer and you are using Azure for model training, deployment and monitoring purposes. You have the total flexibility of using the power of cloud for computations as well as the flexibility of using local machine for development.

Pros:

Total flexibility in performing machine learning experiments with our comfortable dev tools.
AzureML python SDK is an open-source library.
Code version controlling can be handled easily.
Whole ML process can be managed using scripts. (Easy for automation)

Cons:

Setting up the local development environment may take some effort.
Some features are still in experimental stage.

Choosing the most convenient approach for your ML experiment is totally based on the need and resources you have. Before getting into the big picture, start small. Start with a prototype, then a workable MVP, gradually you can move forward with expanding it with complex machine learning approaches.

What’s your most preferred way of model development from these options? Please mention in the comments.

Cheers!

MLOps : Let’s start plumbing ML experiments!

Posted on April 22, 2022 by Haritha Thilakarathne

What’s all this hype on MLOps? What’s the difference between machine learning and MLOps? Is MLOps essential? Why we need MLOps? Through this article series we going to start a discussion on MLOps to get a good start with the upcoming trend. The first post is not going to go deep with technicalities, but going to cover up essential concepts behind MLOps.

What is MLOps?

As the name implies, it is obviously having some connection with DevOps. So, will see what DevOps is first.

“A compound of development (Dev) and operations (Ops), DevOps is the union of people, process, and technology to continually provide value to customers.”
Microsoft Docs

This is the formal definition of DevOps. In the simpler terms, DevOps is the approach of streamlining application development life cycle of software development process. It ensures the quality engineering and security of the product while making sure the team collaboration and coordination is managed effectively.

Imagine you are a junior level developer in a software development company who develops a mission critical system for a surveillance application. DevOps process make sure each and every code line you write is tracked, managed and integrated to the final product reliably. It doesn’t stop just by managing the code base. It involves managing all development life cycle steps including the final deployment and monitoring of the final product iteratively too.

That’s DevOps. Machine Learning Operations (MLOps) is influenced by DevOps principles and practices to increase the efficiency of machine learning workflows. Simply, it’s the way of managing ML workflows in a streamlines way to ensure quality, reliability, and interpretability of machine learning experiments.

Is MLOps essential?

We have been playing around with machine learning experiments with different tools, frameworks and techniques for a while. To be honest, most of our experiments didn’t end up in production environments :D. But, that’s the ultimate goal of predictive modeling.

Machine Learning experiment is an iterative process
Source : https://azure.microsoft.com/en-au/resources/gigaom-delivering-on-the-vision-of-mlops/

Building a machine learning model and deploying it is not a single step process. It starts with data collection and goes in an iterative life cycle till monitoring the deployed model in the production environment. MLOps approaches and concepts streamline these steps and interconnect them together.

Answer is Yes! We definitely need MLOps!

Why we need MLOps?

As I said earlier, MLOps interconnect the steps in ML life cycle and streamline the process.

I grabbed these points from Microsoft docs. As it implies, these are the goals of MLOps.

Faster experimentation and development of models

Good MLOps practices leads for more code and component reusability which leads for faster experiments and model development. For an example, without having separate training loops or data loading components for each experiment, we can reuse an abstract set of methods for those tasks and connect them with a machine learning pipeline for running different experiment configurations. That’s make the life easy of the developer a lot!

I do lot of experiments with computer vision. In my case, I usually have a set of abstract python methods that can be used for model training and model evaluation. When performing different experiments, I pass the required parameters to the pipeline and reuse the methods which makes the life easy with less coding hassle.

Faster deployment of models into production

Machine learning model deployment is always a tricky part. Managing the endpoints and making sure the deployment environment is having all the required platform dependencies maybe hard to keep track with manual processes. A streamlines MLOps pipeline helps to manage deployments by enabling us to choose which trained model should go for production etc. by keeping track of a model registry and deployment slots.

Quality assurance and end-to-end lineage tracking

Maintaining good coding practices, version controlling, dataset versioning etc. ensures the quality of your experiments. Good MLOps practices helps you to find out the points where errors are occurring easily rather than breaking down the whole process. Will say your trained model is not performing well with the testing data after sometime from model deployment. That might be caused by data drift happened with time. Correctly configured MLOps pipeline can track such changes in the inference data periodically and make sure to notify such incidents.

Trustworthiness and ethical AI

This is one of the most important use cases of MLOps. It’s crucial to have transparency in machine learning experiments. The developer/ data scientist should be able to interpret each and every decision they took while performing the experiment. Since handling data is the key component of ML model, there should be ways to maintain correct security measures in experiments too. MLOps pipelines ensure these ethical AI principles are met by streamlining the process with a defined set of procedures.

How we gonna do this?

Now we all know MLOps is crucial. Not just having set of python scripts sitting in a notebook it’s all about interconnecting all the steps in a machine learning experiments together in an iterative process pipeline. There are many methods and approaches to go forward with. Some sits on-prem while most of the solutions are having hybrid approach with the cloud. I usually use lot of Azure services in my experiments and Azure machine learning Studio provides a one-stop workbench to handle all these MLOps workloads which comes pretty handy. Let’s start with a real-world scenario and see how we can use Azure Machine Learning Studio in MLOps process to streamline our machine learning experiments.

“Stay Hungry Stay Foolish” – Let’s Learn Machine Learning without Code!

Posted on May 20, 2021 by Haritha Thilakarathne

“Stay Hungry Stay Foolish”
Steve Jobs

This famous quote of Steve Jobs is one of the most precious quotes I always keep in my mind. Stepping into the IT industry 11 years back as a teenager, I was always eagerly waiting to push myself beyond the barriers and keep trying new things. That hunger led me to explore Artificial Intelligence, undoubtably the most used buzz word in today’s industry. I always make sure to keep myself foolish and open to learn new things.

When I started exploring data science and related technologies 6 years ago, almost all the new things I experimented in the work life were self-taught from online resources. Even today I really enjoy going through documentations on different technologies and making myself familiar with those.

In the AI space, Microsoft Azure is a dominant player with their vast variety of tools and services. Being working with different AI related tools for years, I’m super thrilled to see the advancements on the Azure data & AI space. When the MVP cloud skills challenge was launched, I had no hesitation to go forward with Data & AI path, since I needed to sharpen up my skills and update myself with the new capabilities Azure AI is providing.

Azure AI is equipped with tools and services for anyone who’s interested in AI no matter in which expertise level the person is in. You can easily use Azure Cognitive services to add AI capabilities for your application by just calling for a REST API. If you want develop an advance machine learning/ deep learning experiment, Azure AI allows you to use your favorite open-source tools and frameworks with and adapt the power of cloud for your development.

What I actually learnt?

The challenge consists learning modules which covers most of the prominent parts in Azure AI domain including Azure machine learning, Azure cognitive services, Azure cognitive search and Bot framework. It has been a long time since I built bots. So, working with the new capabilities and functions of bot framework was a pretty good experience. In addition to that, Azure cognitive search is one of the services I’ve least used in my developments and I always wanted to give it a try. With the simple but well managed learning modules gave me a perfect star to sketch my first cognitive search application.

Here comes the most interesting part!

One of the most common questions I get when doing sessions in the community is “do we actually need to know coding to perform machine learning experiments?”

With no hesitation I say yes because Azure machine learning is offering two powerful tools for zero-code machine learning experiments. Automated machine learning supports training supervised machine learning models for classification, regression and time series forecasting. You can create and publish a machine learning experiment as a REST API with just few clicks!

Azure Machine Learning Designer provides you a simple drag and drop interface where you can create machine learning pipelines and publish those as REST endpoints. If you want advance functionalities, it allows you to add python or R code snippets inside the pipelines too.

I know you are super eager to learn on this zero-code machine learning development tools. Check out the Microsoft Learn collection I created specifically focusing on these two tools. Don’t forget to share your learning experience with me.

Click here for the Machine Learning with Zero-code learning collection!

Happy Learning!

Connecting Azure SQL server with Azure Machine Learning

Posted on January 29, 2021 by Haritha Thilakarathne

Accessing data in different data sources is one of the main tasks in machine learning model development life cycle. Let’s discuss one of the most common data accessing scenarios.

Scenario :

We have to set of relational data points stored in a Azure SQL server to develop a machine learning model using Azure Machine Learning. Let’s see how to leverage data stored in an Azure SQL database in an Azure Machine Learning experiment.

The process contains three main steps.

Set access permissions of Azure SQL database
Connect Azure SQL database to an Azure ML datastore
Register the data in datastore as an Azure ML dataset.

1. Set access permissions of Azure SQL database

Allow Azure services and resources to access this server

By default Azure SQL databases are protected with a firewall which limits outside access for data. Since we going to provide access for the traffic from Ips belongs to Azure resources and services, make sure you allow Azure services to access your SQL server.

2. Connect Azure SQL database to an Azure ML datastore

Azure ML datastores can be defined as the abstraction of data sources for the ML workspace or as the interconnection between the data resource and AzureML workspace.

Go to your Azure Machine Learning Studio (ml.azure.com) and click ‘New datastore’. Provide a datastore name and select ‘Azure SQL database’ as the datastore type. Make sure to authenticate the access with Azure SQL server’s user ID and the password.

3. Register the data in datastore as an Azure ML dataset.

AzureML supports two types of datasets (Take a look here to get an overview on the difference between those). Since we are dealing with a set of relational data, Tabular dataset is the option we have to use for creating the dataset.

Select ‘Create dataset’ from ‘Datasets’ tab on AML Studio and prmopt to ‘From datastore’ option.

Select the datastore we created in the previous step which establish the connection between AML workspace and the data source.

Provide the required SQL query to select the required data from SQL server. Make sure to validate the data before configuring the schema.

All done! Now you have the access to the data in your Azure SQL database from AzureML workspace. You can easily refer this in your experiments.

In the cases where your database is getting updated time to time, what you have to do is refreshing the dataset to fetch the newest data points specified by the SQL query.