How to Speed Up Data Loading for Machine Learning Training

Ng Wai Foong
4 min read · Mar 6

Based on HuggingFace’s datasets package


Today’s topic is how to speed up the data loading process using HuggingFace’s datasets package.

The datasets module is part of HuggingFace’s data loading ecosystem. It serves as an easy-to-use and efficient data manipulation tool, and it integrates with the transformers and diffusers packages for training machine learning models.

Let’s proceed to the next section for setup and installation.


It is highly recommended to create a new virtual environment before installing the package.

Activate it and run the following command to install datasets:

pip install datasets
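
To confirm that the installation landed in the active virtual environment, a quick check such as the following can help. This is a minimal sketch using only the Python standard library; the helper name installed_version is made up for illustration and is not part of datasets.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if datasets is installed in this environment.
print(installed_version("datasets") or "datasets is not installed")
```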


Behind the scenes, most training scripts come with their own dataset builder for simplicity. For example, most image-related training scripts use the ImageFolder class, which supports loading image files from a URL or from local folders.

ImageFolder without metadata

Here is an example of an image dataset structure (without metadata) supported by the ImageFolder class:


Simply set the data_dir argument to the corresponding folder and the ImageFolder class will find all the image files in the folder recursively. Alternatively, set the data_files argument to os.path.join('path/to/data', '**') to achieve the same behavior.
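
The two options above can be sketched as follows. This assumes the "imagefolder" builder name accepted by load_dataset; the helper names recursive_pattern and load_image_folder, and the path path/to/data, are placeholders for illustration only.

```python
import os


def recursive_pattern(root):
    """Build a glob pattern that matches files under `root` at any depth."""
    return os.path.join(root, "**")


def load_image_folder(root, use_data_files=False):
    """Load images from `root` via data_dir or an equivalent data_files glob."""
    # Imported lazily so the pattern helper works without datasets installed.
    from datasets import load_dataset

    if use_data_files:
        # Option 2: pass a recursive glob pattern through data_files.
        return load_dataset("imagefolder", data_files=recursive_pattern(root))
    # Option 1: point data_dir at the folder that holds the images.
    return load_dataset("imagefolder", data_dir=root)


# Placeholder path; replace it with the real image folder before loading.
print(recursive_pattern("path/to/data"))
```

Calling load_image_folder("path/to/data") or load_image_folder("path/to/data", use_data_files=True) should then produce the same dataset, provided the folder exists and contains images.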

ImageFolder with metadata

Certain computer vision tasks (such as image captioning and object detection) require metadata. The ImageFolder class supports loading image datasets with metadata via the metadata.jsonl file.


Each row in metadata.jsonl represents a single image. It should be a JSON dict with a file_name key plus any other attributes associated with that image.
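
As an illustration of that row format, the sketch below writes a small metadata.jsonl with the standard json module and reads it back. The file names and the text attribute are hypothetical; only the file_name key comes from the description above.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(folder, rows):
    """Write one JSON object per line to <folder>/metadata.jsonl."""
    for row in rows:
        if "file_name" not in row:
            raise ValueError("every row needs a 'file_name' key")
    path = Path(folder) / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


# Hypothetical rows: one per image, each with file_name plus extra attributes.
rows = [
    {"file_name": "0001.png", "text": "a photo of a cat"},
    {"file_name": "0002.png", "text": "a photo of a dog"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = write_metadata(tmp, rows)
    # Read the file back to show that every line is a standalone JSON dict.
    lines = path.read_text(encoding="utf-8").splitlines()
    parsed = [json.loads(line) for line in lines]

print(parsed)
```

In a real dataset, metadata.jsonl would sit in the same folder as the images it describes rather than in a temporary directory.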
