How to Speed Up Data Loading for Machine Learning Training

Ng Wai Foong
4 min read · Mar 6, 2023

Based on HuggingFace’s datasets package


This article covers how to speed up the data loading process using HuggingFace’s datasets package.

The datasets package is part of HuggingFace’s data loading ecosystem. It provides easy-to-use and efficient data manipulation tools, and it integrates with the transformers and diffusers packages for training machine learning models.

Let’s proceed to the next section for setup and installation.

Setup

It is highly recommended to create a new virtual environment before installing the package.

Activate it and run the following command to install datasets:

pip install datasets
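
To confirm that the installation landed in the active environment, a small check along these lines can help. It is only a sketch that relies on the Python standard library; the helper name installed_version is made up for illustration and is not part of the datasets package.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "datasets") -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("datasets is not installed in this environment")
    else:
        print(f"datasets {version} is installed")
```

If the script reports that the package is missing, the most likely cause is that pip ran outside the virtual environment.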

Implementation

Behind the scenes, most training scripts come with their own dataset builder for simplicity. For example, most image-related training scripts use the ImageFolder class, which supports loading image files from a URL or from local folders.
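
As a rough sketch of how the two kinds of source can be handled, the snippet below passes a local folder through the data_dir argument and a remote archive through data_files, both via load_dataset with the "imagefolder" builder name. The helper names imagefolder_kwargs and load_images are hypothetical and exist only for this illustration; the datasets import is done lazily so the first helper can be tried without the package installed.

```python
from typing import Any, Dict


def imagefolder_kwargs(source: str) -> Dict[str, Any]:
    """Pick the load_dataset keyword that matches the kind of source."""
    if source.startswith(("http://", "https://")):
        # Remote archive (for example a .zip file of images)
        return {"data_files": source}
    # Local directory containing the image files
    return {"data_dir": source}


def load_images(source: str, split: str = "train"):
    """Load images through the built-in "imagefolder" builder."""
    # Imported lazily so the helper above works without the package installed
    from datasets import load_dataset

    return load_dataset("imagefolder", split=split, **imagefolder_kwargs(source))
```

Calling load_images with a placeholder path such as "path/to/images" would then return the train split of that folder, assuming the folder exists and contains image files.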

ImageFolder without metadata

Here is an example of the image dataset structure (without metadata) supported by the ImageFolder class:
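
As a minimal sketch of such a layout, the snippet below builds a small tree in which every image sits in a folder named after its label, following the pattern root/train/<label>/<file>.png. The label names (cat, dog), the file names, and the helper functions are all hypothetical and chosen only for illustration. The images are 1x1 placeholders written with the standard library, and infer_labels merely mimics the idea of reading labels from folder names; it is not the library's own logic.

```python
import struct
import tempfile
import zlib
from pathlib import Path
from typing import List


def _png_chunk(tag: bytes, data: bytes) -> bytes:
    """Encode one PNG chunk: length, tag, data, CRC."""
    crc = zlib.crc32(tag + data)
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def tiny_png(red: int, green: int, blue: int) -> bytes:
    """Return the bytes of a valid 1x1 RGB PNG image."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0)
    pixel_row = b"\x00" + bytes((red, green, blue))  # filter byte + one pixel
    return (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", ihdr)
        + _png_chunk(b"IDAT", zlib.compress(pixel_row))
        + _png_chunk(b"IEND", b"")
    )


def build_example_tree(root: Path) -> List[Path]:
    """Create root/train/<label>/<label>_<n>.png and return the file paths."""
    created = []
    for label, color in (("cat", (200, 120, 40)), ("dog", (40, 120, 200))):
        label_dir = root / "train" / label
        label_dir.mkdir(parents=True, exist_ok=True)
        for index in range(2):
            path = label_dir / f"{label}_{index}.png"
            path.write_bytes(tiny_png(*color))
            created.append(path)
    return created


def infer_labels(root: Path) -> List[str]:
    """List label names from the folder names (illustrative, not library code)."""
    return sorted(p.name for p in (root / "train").iterdir() if p.is_dir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in build_example_tree(Path(tmp)):
            print(path.relative_to(tmp))
        print(infer_labels(Path(tmp)))
```

Pointing load_dataset("imagefolder", data_dir=...) at a folder organized this way should produce a dataset with an image column and a label column derived from the folder names.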


Ng Wai Foong

Senior AI Engineer@Yoozoo | Content Writer #NLP #datascience #programming #machinelearning | Linkedin: https://www.linkedin.com/in/wai-foong-ng-694619185/