How to Speed Up Data Loading for Machine Learning Training
--
Based on HuggingFace’s datasets package
The topic for today is how to speed up the data loading process using HuggingFace’s datasets package.
For your information, the datasets module is part of HuggingFace’s data loader ecosystem. It serves as an easy-to-use and efficient data manipulation tool, and it integrates with the transformers and diffusers packages for training machine learning models.
Let’s proceed to the next section for setup and installation.
Setup
It is highly recommended to create a new virtual environment before the package installation.
Activate it and run the following command to install datasets:
pip install datasets
Implementation
Behind the scenes, most training scripts come with their own dataset builders for simplicity. For example, most image-related training scripts use the ImageFolder class, which supports loading image files from URLs or from local folders.
ImageFolder without metadata
Here is an example of an image dataset structure (without metadata) supported by the ImageFolder class:
data/dog/golden_retriever.png
data/dog/german_shepherd.png
...
data/cat/maine_coon.png
data/cat/birman.png
Simply set the data_dir argument to the corresponding folder and the ImageFolder class will find all the image files in the folder recursively. Alternatively, set the data_files argument to a glob pattern such as os.path.join('path/to/data', '**') to achieve the same behavior, as shown in the sketch below.
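Here is a minimal sketch of both options, assuming the data folder above sits in the current working directory:
import os
from datasets import load_dataset

# Option 1: point data_dir at the root folder; image files are discovered recursively
dataset = load_dataset("imagefolder", data_dir="data")

# Option 2: pass a glob pattern via data_files to achieve the same behavior
dataset = load_dataset("imagefolder", data_files=os.path.join("data", "**"))

# Labels are inferred from the subfolder names (cat, dog)
print(dataset["train"][0])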
ImageFolder with metadata
Certain computer vision tasks, such as image captioning and object detection, require metadata. The ImageFolder builder supports loading image datasets with metadata via a metadata.jsonl file.
data/metadata.jsonl
data/dog/golden_retriever.png
data/dog/german_shepherd.png
data/cat/maine_coon.png
Each row in metadata.jsonl represents a single image. It should be a JSON dict with a file_name key and any other attributes associated with the dataset.
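For instance, a metadata.jsonl for an image captioning dataset could look like the following (the caption text and the text column name are illustrative assumptions; file_name is relative to the metadata file):
{"file_name": "dog/golden_retriever.png", "text": "A golden retriever resting on the lawn."}
{"file_name": "dog/german_shepherd.png", "text": "A german shepherd standing in a field."}
{"file_name": "cat/maine_coon.png", "text": "A maine coon sitting on a windowsill."}
Loading the folder is the same as before; the metadata columns are attached to each image automatically:
from datasets import load_dataset

# metadata.jsonl is picked up automatically when it sits alongside the image folders
dataset = load_dataset("imagefolder", data_dir="data")
print(dataset["train"][0])  # {'image': <PIL.Image ...>, 'text': 'A golden retriever ...'}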