Based on HuggingFace’s datasets package
Today’s topic is how to speed up the data loading process using HuggingFace’s datasets package.
For context, the datasets module is part of HuggingFace’s data loading ecosystem. It is an easy-to-use and efficient data manipulation tool, and it integrates with the diffusers package for training machine learning models.
Let’s proceed to the next section for setup and installation.
It is highly recommended to create a new virtual environment before installing the package.
Activate it and run the following command to install the package:
pip install datasets
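To confirm that the installation landed in the active virtual environment, a small check like the one below can help. It is only a sketch: `installed_version` is a hypothetical helper name, and it relies on the standard library's `importlib.metadata` rather than anything specific to datasets.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "datasets") -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed in this environment,
# otherwise prints None.
print(installed_version())
```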
Behind the scenes, most training scripts come with their own dataset builder for simplicity. For example, most image-related training scripts use the ImageFolder class, which supports loading image files from a URL or from local folders.
ImageFolder without metadata
The ImageFolder class supports image datasets without metadata, where the image files sit in a folder or in any of its subfolders.
Simply set the data_dir argument to the corresponding folder and the ImageFolder class will find all the image files in the folder recursively. Alternatively, set the data_files argument to os.path.join('path/to/data', '**') to achieve the same behavior.
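The two options described above can be sketched as follows. This is a minimal illustration, assuming the images live under the placeholder folder `path/to/data`; the helper names `image_glob` and `load_image_folder` are hypothetical and not part of the datasets package. Only `load_dataset` with the `imagefolder` builder and its `data_dir` and `data_files` arguments come from the library.

```python
import os


def image_glob(data_root: str) -> str:
    """Build the recursive glob pattern that is passed as `data_files`."""
    return os.path.join(data_root, "**")


def load_image_folder(data_root: str, use_glob: bool = False):
    """Load all images under `data_root` with the `imagefolder` builder."""
    # Imported here so that image_glob() stays usable even when the
    # datasets package is not installed.
    from datasets import load_dataset

    if use_glob:
        # Option 2: point data_files at a recursive glob pattern.
        return load_dataset("imagefolder", data_files=image_glob(data_root))
    # Option 1: point data_dir at the folder itself.
    return load_dataset("imagefolder", data_dir=data_root)
```

Calling `load_image_folder("path/to/data")` or `load_image_folder("path/to/data", use_glob=True)` should pick up the same image files; which form to use is mostly a matter of how the surrounding training script passes its arguments.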
ImageFolder with metadata
Certain computer vision tasks, such as image captioning and object detection, require metadata. The ImageFolder class supports loading image datasets with metadata via a metadata.jsonl file.
Each row in metadata.jsonl represents a single image. It should be a JSON dict with a file_name key and any other attributes associated with that image.
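The row format described above can be produced with the standard library alone. The sketch below is illustrative: the file names, the `text` attribute, and the helper names `to_jsonl` and `write_metadata` are assumptions made for the example, and only the `file_name` key is required by the format described in this article.

```python
import json
from pathlib import Path


def to_jsonl(rows) -> str:
    """Serialize rows as JSON Lines: one JSON dict per line."""
    lines = []
    for row in rows:
        if "file_name" not in row:
            raise ValueError("each row needs a 'file_name' key")
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n"


def write_metadata(folder: str, rows) -> Path:
    """Write metadata.jsonl into the folder that holds the images."""
    path = Path(folder) / "metadata.jsonl"
    path.write_text(to_jsonl(rows), encoding="utf-8")
    return path


# Hypothetical rows for an image captioning dataset (illustration only).
example_rows = [
    {"file_name": "0001.png", "text": "a photo of a cat"},
    {"file_name": "0002.png", "text": "a photo of a dog"},
]

print(to_jsonl(example_rows), end="")
```

Each printed line is a standalone JSON dict, so rows can be appended or streamed without re-parsing the whole file.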