Using Hierarchical Data Format (HDF5) in Machine Learning

Example of HDF5 file structure : https://www.neonscience.org/resources/learning-hub/tutorials/about-hdf5

Machine learning or deep learning is not all about algorithms and training predictive models on some set of data. It involves a wide range of tools, techniques and computing approaches to handle various steps of the machine learning process pipeline.

Starting from a raw data point, to the stage of exposing the model as a REST API there are numerous places where we need to pay attention on data handling approaches. (Yes! Data is the key component of any ML/DL pipeline.)

In this article am bringing out a problem I faced when dealing with a deep learning experiment and the approach I took to overcome the problem. I’m pretty sure you may have to face similar kind of issues if you using massive amounts of structured/ unstructured data for training your deep learning models.

Here’s the issue I faced :

In order to train a computer vision related deep learning model I had to write a PyTorch custom dataloader for loading a set of annotation data. The data points were stored in JSON format and believe me, that massive JSON file was nearly 4GB! It was not a simple data structure with keys and values, but had a mixed set of data structures including lists, single float values and keys in String format.

As usual I wrote a PyTorch custom dataset class and tried to load the massive JSON file inside init . Yp! It crashed! Memory was not enough for handling such a big file. Can’t you move that for getitem ? No. It’s not possible. Loading file on call is so inefficient and I had to think of solution which doesn’t load the massive file as a whole for the RAM and with the possibility of retrieving data inside the file with indexes.

(If you need to get some tips and tricks on writing PyTorch custom datasets, please refer this article.)

What I did?

The first dumb idea I got was converting the data into a multidimensional numpy array and save the file, but I figured out that gives the birth for another massive file which doesn’t solve my problem. With a suggestion I got from my co-supervisor, I started looking on HDF5; Hierarchical Data Format. Yes! It was the solution and it nicely solved my issue.

What is Hierarchical Data Format (HDF5) ?

The Hierarchical Data Format version 5 (HDF5), is an open source file format that supports large, complex, heterogeneous data. This uses a ‘directory-like’ structure to store data. In simpler terms, a HDF5 file can be identified as a definition of a file system (the way files and directories are stored in your computer) in a single file.

There are two important terms used in HDF5 format.

Groups – Folder like element within the HDF5 file which can contain subgroups or datasets.
Dataset – Actual data contained within the HDF5 file. (Numpy arrays etc. )

In simpler terms, if your data is large, complex, heterogeneous and need random access most probably HDF5 would be the best option you can go forward with.

How to use HDF5?

We all speak Python when it comes to machine learning. Python supports HDF5 format using h5py package. Since this is a wrapper based on native HDF C API, it provides almost the full functionality.

Create HDF5 file from a JSON array

Here I included a very brief code snippet of creating a HDF5 file from a JSON array which contains the data from famous iris dataset. This is a sample of JSON array I used. (You can get the full dataset from here)

[
    {"sepalLength": 5.1, "sepalWidth": 3.5, "petalLength": 1.4, "petalWidth": 0.2, "species": 0},
    {"sepalLength": 5.7, "sepalWidth": 2.8, "petalLength": 4.5, "petalWidth": 1.3, "species": 1},
    {"sepalLength": 6.9, "sepalWidth": 3.1, "petalLength": 5.4, "petalWidth": 2.1, "species": 2}
]

Here I created a separate group for each entry. (3 JSON objects in the array means 3 groups in HDF5 file.) The 5 datapoints in each object are stored as datasets.

import numpy as np
import json
import h5py
import os

hdf5_filename = 'iris_hdf5.hfd5'

#read iris.json file
with open('iris.json') as jsonfile:
    iris_data = json.load(jsonfile)
    
#create HDF5 file
h = h5py.File(hdf5_filename, 'w')

#running a loop through all entries in the JSON array
index = 0
for entry in iris_data:
    for k, v in entry.items():
        dataset_name = os.path.join(str(index), k) #groups are divided by '/'
        h.create_dataset(dataset_name, data = np.asarray(v, dtype=np.float32))
    index = index +1
h.close()
print('Iris data HDF5 file created.')


#read data from HDF5
h_read = h5py.File(hdf5_filename, 'r')

#read a single entry
 
h_read['0'].keys() 
# output : <KeysViewHDF5 ['petalLength', 'petalWidth', 'sepalLength', 'sepalWidth', 'species']>
np.asarray(h_read['0']['petalLength']) 
# output : array(1.4, dtype=float32)

h_read.close()

Though this is a very simple data structure, you can expand this to complex and large files. You’ll find it pretty easy to use HDF5 instead of using huge lists inside init of custom dataloaders. Here’s a rough sketch of the PyTorch custom dataset class I created for the above example.


import torch
import h5py
from torch.utils.data.dataset import Dataset

hdf5_filename = 'iris_hdf5.hfd5'

class MyCustomDataset(Dataset):
    def __init__(self, ...):
        # # All the data preperation tasks can be defined here
        # - HDF5 file is referenced here.
        h_read = h5py.File(hdf5_filename, 'r')
         
    def __getitem__(self, index):
        # # Returns data and labels
        # - access HDF5 file through indexing
        item = np.asarray(h_read[index]['petalLength'])
        return item
 
    def __len__(self):
        return count # of how many examples you have

This is only one usage of using HDF5 file format in machine learning. Share your experiences with HDF5 here too. 🙂

2 thoughts on “Using Hierarchical Data Format (HDF5) in Machine Learning”

Preetham says:

on December 1, 2022 at 12:19 pm

doesn’t the line “h_read = h5py.File(hdf5_filename, ‘r’)” in the CustomDataset class read the whole file into memory? won’t that defeat the original purpose of avoiding loading the entire file into moemory?

LikeLike

- Haritha Thilakarathne says:
  
  on December 30, 2022 at 11:35 am
  
  No. It’s not. It’s just referring the h5 file in the disk without loading it to the memory.
  
  LikeLike