How to Speed Up Data Loading for Machine Learning Training
Based on HuggingFace’s datasets package
Today’s topic is how to speed up the data loading process using HuggingFace’s datasets package.
For your information, the datasets package is part of HuggingFace’s data loading ecosystem. It serves as an easy-to-use and efficient data manipulation tool, and it integrates with the transformers or diffusers package for training machine learning models.
Let’s proceed to the next section for setup and installation.
Setup
It is highly recommended to create a new virtual environment before installing the package.
Activate it and run the following command to install datasets:
pip install datasets
Implementation
Behind the scenes, most training scripts come with their own dataset builder for simplicity. For example, most image-related training scripts use the ImageFolder class, which supports loading image files from a URL or from local folders.
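As a rough illustration of the local-folder case, the sketch below wraps the generic "imagefolder" builder exposed through load_dataset. The helper name load_image_folder and the path in the usage comment are hypothetical and are not taken from any particular training script. The explicit folder check is an added convenience, not something the library requires.

```python
from pathlib import Path


def load_image_folder(data_dir: str, split: str = "train"):
    """Load images from a local folder using the generic `imagefolder` builder.

    `load_image_folder` is a hypothetical helper name used for illustration.
    """
    # Fail early with a clear message if the folder does not exist.
    if not Path(data_dir).is_dir():
        raise FileNotFoundError(f"No such image folder: {data_dir}")

    # Imported here so a missing `datasets` install only surfaces at call time.
    from datasets import load_dataset

    # `data_dir` points at the root folder that contains the image files.
    return load_dataset("imagefolder", data_dir=data_dir, split=split)


# Example usage (the path is a placeholder, replace it with a real folder):
# dataset = load_image_folder("path/to/images")
# print(dataset)
```

The returned dataset should expose the decoded images in an image column. If the folder is organized into one subfolder per class, the builder can also infer a label column from the subfolder names.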
ImageFolder without metadata
Here is an example of an image dataset structure (without metadata) supported by the ImageFolder class: