Building a Custom Dataloader¶

Through DagsHub’s streaming client, you can build custom data loaders and train your model on subsets of your data.

This has a few implications for DagsHub users:

Training works as if the data is already present on the local machine
We do not need to write a script or manually copy the necessary data to our training machine
Our training code can decide on-the-fly which data should be included in the current training run

Let’s take a look at an example of how to do this.

Configuration¶

Authentication¶

Manually¶

If we're streaming data from a private repo, we need to authenticate to DagsHub. When training in Colab or in an interactive script, the DagsHub client automatically redirects to OAuth whenever authentication is required.

Automatically¶

However, if our training is not interactive, we need to authenticate in our code. We recommend getting a DagsHub Access Token and following the steps below:

Set an environment variable (the command below is for Mac and Linux shells; on Windows, use the equivalent for your shell):
export DAGSHUB_TOKEN="<your-token>"

Then, in our training script, read that environment variable:

import os
DAGSHUB_TOKEN = os.environ.get('DAGSHUB_TOKEN', None)

Next, we authenticate to DagsHub:

import dagshub
dagshub.auth.add_app_token(DAGSHUB_TOKEN)
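
In a non-interactive job, a missing token is easier to diagnose if the script stops before training starts. The following is a minimal sketch of that check; `require_dagshub_token` is a hypothetical helper name, not part of the DagsHub client.

```python
import os


def require_dagshub_token(environ=os.environ):
    """Return the DagsHub token from the environment, or fail early.

    Hypothetical helper for illustration; not part of the DagsHub client.
    """
    token = environ.get("DAGSHUB_TOKEN")
    if not token:
        raise RuntimeError(
            "DAGSHUB_TOKEN is not set; export it before starting training."
        )
    return token


# In the training script, the returned value would then be passed to
# dagshub.auth.add_app_token(...), as in the snippet above.
```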

Install hooks¶

We'll hook into the repo we want to stream data from, using the DagsHub client:

from dagshub.streaming import install_hooks
install_hooks(project_root='.', repo_url='https://dagshub.com/DagsHub-Datasets/LAION-Aesthetics-V2-6.5plus', branch='main')

What are DagsHub hooks?

These hooks are installed for most file IO and file system operations, like open() or os.listdir(). Each hook first checks whether the requested file or directory exists locally; if it does not, the hook looks for it in the repository specified by repo_url. In doing this, we’re essentially treating the DagsHub repo as if it were on the local machine.
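
To make the "local first, then remote" idea concrete, here is a conceptual sketch. It is not how the DagsHub client is implemented: `make_fallback_open` and the `fetch_remote` callable are hypothetical names, and the sketch wraps a single function instead of patching open() itself.

```python
import os


def make_fallback_open(fetch_remote):
    """Build an open()-like function that falls back to a remote source.

    `fetch_remote` is a hypothetical callable that takes a path and
    returns the file contents as bytes.
    """

    def open_with_fallback(path, mode="r"):
        if not os.path.exists(path):
            # Not available locally: download once and store it on disk,
            # so later calls find the local copy.
            data = fetch_remote(path)
            parent = os.path.dirname(path)
            if parent:
                os.makedirs(parent, exist_ok=True)
            with open(path, "wb") as f:
                f.write(data)
        return open(path, mode)

    return open_with_fallback
```

Calling the returned function twice with the same path triggers `fetch_remote` only on the first call; the second call reads the copy written to disk.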

How to access or stream a subset of a dataset?¶

As an example, let’s consider the LAION-Aesthetics-V2-6.5plus repo.

This repo includes about 540,000 images in the data/ folder. The same folder contains a labels.tsv file, which lists each image along with its caption, an aesthetics score, and the original URL it came from.

If we wanted to look at images that had the word squirrel in the caption, we could do something like this:

squirrel_files = []

with open('data/labels.tsv') as f:
    for row in f.readlines():
        image_name, caption = row.split('\t')[:2]
        if 'squirrel' in caption.lower():
            squirrel_files.append(image_name)

len(squirrel_files)

When we run this code, we see there are 115 images that mention squirrels.
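
The same filter can be wrapped in a reusable function. This is a sketch under the same assumption as the snippet above, namely that the first two tab-separated columns are the image name and the caption; `find_images_by_keyword` is a hypothetical name.

```python
import csv


def find_images_by_keyword(lines, keyword):
    """Return image names whose caption contains `keyword` (case-insensitive).

    `lines` can be any iterable of tab-separated rows, such as an open file.
    """
    keyword = keyword.lower()
    matches = []
    # QUOTE_NONE keeps quotation marks inside captions as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        image_name, caption = row[0], row[1]
        if keyword in caption.lower():
            matches.append(image_name)
    return matches
```

Passing the open file object for data/labels.tsv as `lines`, with 'squirrel' as the keyword, would build the same list as the loop above, while skipping rows that have fewer than two columns.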

We could even iterate over those images with the following code:

import os

from PIL import Image

for squirrel_file in squirrel_files:
    full_path = os.path.join('data', squirrel_file)
    image = Image.open(full_path).convert('RGB')

    # Do something with the image

How to create a DataLoader in PyTorch that streams data?¶

Luckily, thanks to the hooks that were installed, it’s very easy to create a PyTorch DataLoader using this streamed data. The first thing we need, however, is a PyTorch Dataset.

Take a look at the LAIONAestheticsDataset example, which comes from the AestheticPredictor repo.

First the __init__ method:

class LAIONAestheticsDataset(Dataset):
    def __init__(self, annotations_file, img_dir, feature_extractor: EfficientNetFeatureExtractor, limit=None):

        # Set up class properties
        self.feature_extractor = feature_extractor
        self.img_path = img_dir
        self.img_files = []
        self.scores = []
        self.embeddings = {}

        # Open the annotations file (in this case data/labels.tsv)
        with open(annotations_file) as f:
            # Loop through each row
            for i, row in enumerate(f.readlines()):

                # Check if we're limiting the number of images
                if limit is not None and i >= limit:
                    break

                # Get the image file name and its aesthetic score
                img_name, _, aesthetic_score = row.split('\t')[:3]

                # Add the image file name to our list of files
                self.img_files.append(img_name)

                # Add the aesthetic score as a PyTorch tensor to our list of scores
                self.scores.append(torch.tensor([float(aesthetic_score)]))

A Dataset needs two other methods, __len__ and __getitem__:

class LAIONAestheticsDataset(Dataset):
    ...

    def __len__(self):
        # The size of the dataset is the number of files in it
        return len(self.img_files)

    def __getitem__(self, idx):
        # Make sure we're not dealing with tensors as index values
        if torch.is_tensor(idx):
            idx = idx.tolist()

        # Get the stored aesthetic score for the index
        score = self.scores[idx]

        # See if the embedding for the image is cached
        embedding = self.embeddings.get(idx, None)

        if embedding is None:
            # If it's not cached, calculate it
            img_path = os.path.join(self.img_path, self.img_files[idx])
            embedding = self.feature_extractor.extract(img_path)

            # Then cache it
            self.embeddings[idx] = embedding

        # Return the embedding and the aesthetic score
        return embedding, score
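
The caching pattern in __getitem__ means each image is read (and therefore streamed) only on first access. The sketch below isolates that pattern in plain Python, without PyTorch; `CachedFeatureDataset` and the `extract` callable are hypothetical stand-ins for the dataset and for feature_extractor.extract.

```python
class CachedFeatureDataset:
    """Map-style dataset that computes each item once and caches it.

    `extract` is a hypothetical callable that receives a file name and
    returns the features for that file.
    """

    def __init__(self, file_names, scores, extract):
        self.file_names = list(file_names)
        self.scores = list(scores)
        self.extract = extract
        self.cache = {}

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, idx):
        if idx not in self.cache:
            # First access: this is where the file would be read or streamed.
            self.cache[idx] = self.extract(self.file_names[idx])
        return self.cache[idx], self.scores[idx]
```

Note that a cache like this lives in the memory of the process that owns the dataset object. If a DataLoader is configured with worker processes, each worker fills its own copy.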

To turn the Dataset into a DataLoader, we pass it to PyTorch’s built-in DataLoader class:

train_dataloader = DataLoader(train_dataset, 
                              batch_size=32, 
                              shuffle=True, 
                              num_workers=0)

Now we can iterate through this DataLoader as usual to train our model!

To complete the picture, we also need to look at the EfficientNetFeatureExtractor, which is defined in the same file.

class EfficientNetFeatureExtractor:
    ...

    def extract(self, image_path):
        with torch.no_grad():
            image = Image.open(image_path).convert('RGB')
            X = self.preprocess(image)
            X = X.unsqueeze(0)
            embedding = self.model(X)[0, :, 0, 0]
        return embedding

The important line to pay attention to is:

image = Image.open(image_path).convert('RGB')

Normally, this would open a local image file found at image_path. However, thanks to the hooks we installed earlier, the Image.open() function can also stream data from the DagsHub repo.

The first time we run training with this DataLoader, it streams the images and extracts their features. The images are also cached on the local machine.

How to create a Data Generator in TensorFlow that streams data?¶

The hooks we installed can also help us create a DataGenerator in TensorFlow. To do so, we’re going to subclass keras.utils.Sequence.
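
A keras.utils.Sequence subclass implements __len__, which returns the number of batches per epoch, and __getitem__, which returns one batch. The plain-Python sketch below shows only the batching arithmetic behind those two methods. `BatchIndexSequence` is a hypothetical class, and dropping a trailing partial batch is an assumption made here to keep every batch the same size.

```python
class BatchIndexSequence:
    """Plain-Python sketch of the two methods a Keras Sequence provides.

    Hypothetical class for illustration; a real generator would subclass
    keras.utils.Sequence and return arrays of features and targets.
    """

    def __init__(self, n_samples, batch_size):
        self.indices = list(range(n_samples))
        self.batch_size = batch_size

    def __len__(self):
        # Number of full batches per epoch (a trailing partial batch is dropped).
        return len(self.indices) // self.batch_size

    def __getitem__(self, batch_idx):
        # Sample indices that make up batch number `batch_idx`.
        start = batch_idx * self.batch_size
        return self.indices[start:start + self.batch_size]
```

In a real generator, the list of indices returned here would be handed to a method that loads or computes the data for each index.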

Take a look at the LAIONAestheticsDataGenerator example.

Interestingly, there are no file IO methods called directly from the data generator. Well, there is one, but it’s hidden in this method:

    def __data_generation(self, idxs):
        embeddings = np.empty((self.batch_size, self.feature_extractor.feature_dims))
        scores = np.empty((self.batch_size))

        # Generate data
        for i, idx in enumerate(idxs):
            # Store sample
            embedding = self.embeddings.get(idx, None)
            if embedding is None:
                img_path = os.path.join(self.img_path, self.img_files[idx])

                #=======================
                # SECRET HIDDEN FILE IO
                #=======================
                embedding = self.feature_extractor.extract(img_path)

                self.embeddings[idx] = embedding

            embeddings[i,] = embedding
            scores[i] = self.scores[idx]

This comes from the EfficientNetFeatureExtractor, which is defined at the top of the same file:

class EfficientNetFeatureExtractor:
    ...

    def extract(self, image_path):
        image = Image.open(image_path).convert('RGB')
        image = image.resize((480, 480))  # Resize the image to match EfficientNet's input size
        image = tf.keras.preprocessing.image.img_to_array(image)
        image = self.preprocess(image)
        image = tf.expand_dims(image, axis=0)
        embedding = self.model(image)[0]
        return embedding

The installed hooks allow Image.open() to either stream data from a DagsHub repo or open local image files, if they exist.

The labels.tsv file is also streamed in the train_valid_split function. This function takes care of determining how the data generators are initialized.
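
The train_valid_split function itself is not reproduced here. As a rough illustration of what splitting the streamed rows could look like, the following sketch shuffles rows reproducibly and divides them into two lists; `split_rows`, its parameters, and the default 80/20 ratio are all hypothetical.

```python
import random


def split_rows(rows, valid_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into train and validation lists.

    Hypothetical helper; the project's train_valid_split may work differently.
    """
    rows = list(rows)
    # A seeded Random instance gives the same split on every run.
    random.Random(seed).shuffle(rows)
    n_valid = int(len(rows) * valid_fraction)
    return rows[n_valid:], rows[:n_valid]
```

Each of the two lists could then back its own data generator, one for training and one for validation.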

See the project on DagsHub