Medallion Data Lakehouse Architecture

“Who rules the data, rules the world”

The business world has realised that data is the most important asset for growth and business success. Since traditional data storage and architectural methods cannot handle massive data volumes or support advanced analytics, organisations are looking for new approaches to deal with them.

When it comes to big data stores, Data Warehouses and Data Lakes are well known and have been widely used in the industry for a long time. Data Warehouses are used for storing structured data, while Data Lakes are used when there is a need to store unstructured data in a variety of formats.

[Table comparing Data Warehouses and Data Lakes. Source: Databricks]

As the table shows, Data Warehouses and Data Lakes each have their own advantages and disadvantages. Modern data platforms often need features from both worlds. The Data Lakehouse is a hybrid approach that combines features of the Data Warehouse and the Data Lake.

In short, a Data Lakehouse is an architecture that enables efficient and secure Artificial Intelligence (AI) and Business Intelligence (BI) directly on vast amounts of data stored in Data Lakes.

https://www.databricks.com/blog/2021/08/30/frequently-asked-questions-about-the-data-lakehouse.html

With the growing demand for AI and BI workloads, the Data Lakehouse architecture has caught the attention of data professionals and has become the go-to approach for handling massive amounts of data from different source systems and in different formats. As the Data Lakehouse is the central location for data in an organisation, it is necessary to follow a multi-layer architecture that ensures consistency, isolation, and durability of data as it passes through multiple processing stages.

The medallion approach consists of three data layers. The terms bronze (raw), silver (validated), and gold (enriched) describe the quality of the data at each of these levels.

Let’s see the features and usage of each data layer of the architecture.

The Bronze Layer

This is where the journey starts! Data in various formats, coming from different source systems, streaming sources and so on, is stored in a distributed file store that can hold vast data volumes. Normally the data in the bronze layer is not cleansed or processed (it stays in its raw format). It can be timestamped on arrival or saved exactly as it comes from the sources.
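As a rough illustration of the idea, the sketch below wraps each incoming payload with an ingestion timestamp and a source name while leaving the payload itself untouched. It uses plain Python lists in place of a distributed file store, and the function name `ingest_to_bronze`, the field names and the sample payloads are all hypothetical.

```python
from datetime import datetime, timezone


def ingest_to_bronze(raw_payloads, source_name):
    """Attach ingestion metadata to each payload without parsing or cleaning it."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [
        {"source": source_name, "ingested_at": ingested_at, "raw": payload}
        for payload in raw_payloads
    ]


# Hypothetical input: one well-formed JSON string and one malformed line.
payloads = ['{"order_id": 1, "amount": "19.99"}', "not-json,,,"]
bronze = ingest_to_bronze(payloads, source_name="orders_api")

for record in bronze:
    print(record["source"], record["ingested_at"], record["raw"])
```

Note that the malformed line is kept as well. In this sketch nothing is rejected at the bronze stage, so the original input can always be reprocessed later.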

The Silver Layer

Though we have zillions of records in the bronze layer, most of them could be just garbage and not worth using for any analytical task. When transferring data to the silver layer, transformation rules are applied to cleanse the data, remove null values, and join it with lookup tables. The cleansed and validated data in the silver layer can be used for ad-hoc reporting, sophisticated analytics and even machine learning model training.
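The three kinds of rules mentioned above (cleansing, removing nulls, joining with a lookup table) can be sketched in a few lines of plain Python. This is only a minimal illustration under assumed data: the column names, the `country_lookup` table and the function `to_silver` are hypothetical, and a real Lakehouse would typically run equivalent logic in a distributed engine.

```python
def to_silver(bronze_rows, country_lookup):
    """Validate bronze rows, drop unusable ones, and enrich them from a lookup table."""
    silver = []
    for row in bronze_rows:
        # Remove rows where a required field is null.
        if row.get("order_id") is None or row.get("amount") is None:
            continue
        # Cleanse: the amount must be convertible to a number.
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue
        # Join with the lookup table; unknown codes get a placeholder value.
        country = country_lookup.get(row.get("country_code"), "Unknown")
        silver.append(
            {"order_id": int(row["order_id"]), "amount": amount, "country": country}
        )
    return silver


# Hypothetical bronze rows, already parsed into dictionaries.
bronze_rows = [
    {"order_id": 1, "amount": "19.99", "country_code": "LK"},
    {"order_id": 2, "amount": None, "country_code": "US"},
    {"order_id": None, "amount": "5.00", "country_code": "US"},
    {"order_id": 3, "amount": "abc", "country_code": "LK"},
    {"order_id": 4, "amount": "7.50", "country_code": "ZZ"},
]
country_lookup = {"LK": "Sri Lanka", "US": "United States"}

silver_rows = to_silver(bronze_rows, country_lookup)
print(silver_rows)
```

Only orders 1 and 4 survive here: two rows are dropped for null values and one for a non-numeric amount.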

The Gold Layer

As the name implies, this is the most valuable form of data in the organisation. This layer is responsible for aggregating data from the silver layer for business use. Since the data is enriched with business logic, data in the gold layer is mostly used for BI reporting and analytical tasks. Data analysts, data scientists and business analysts rely heavily on the data available in the gold layer.
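To make the aggregation step concrete, here is a small sketch that rolls validated silver rows up into one summary row per country, the kind of table a BI report might read. The row layout, the choice of metrics and the function name `to_gold` are assumptions made for this example only.

```python
from collections import defaultdict


def to_gold(silver_rows):
    """Aggregate silver rows into order count and revenue per country."""
    totals = defaultdict(lambda: {"order_count": 0, "revenue": 0.0})
    for row in silver_rows:
        summary = totals[row["country"]]
        summary["order_count"] += 1
        summary["revenue"] += row["amount"]
    # Round revenue for reporting and return a plain dictionary.
    return {
        country: {
            "order_count": summary["order_count"],
            "revenue": round(summary["revenue"], 2),
        }
        for country, summary in totals.items()
    }


# Hypothetical silver rows: already cleansed and validated.
silver_rows = [
    {"order_id": 1, "amount": 20.0, "country": "Sri Lanka"},
    {"order_id": 2, "amount": 5.5, "country": "Sri Lanka"},
    {"order_id": 3, "amount": 10.0, "country": "United States"},
]

gold = to_gold(silver_rows)
print(gold)
```

The business logic in this sketch is just a count and a sum. In practice the gold layer would hold whatever aggregates and derived measures the reporting use case calls for.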

Storing data with a medallion architecture helps ensure data quality and transparency. The medallion architecture is mainly promoted by Databricks, but it can be used as a blueprint for structuring a Lakehouse on other platforms too.

What’s your choice for storing big data? Sticking with a data lake or moving to a Lakehouse?

References:

https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion

https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
