Map-style Dataset vs Iterable-style Dataset
Overview
This article explores the two types of datasets offered in PyTorch - map-style and iterable-style - along with the PyTorch classes used to implement them. We compare the structure of map-style and iterable-style datasets and walk through code examples of each.
Introduction
Data loading, the step that precedes model training, is one of the major components of any deep learning pipeline. If not done right, it can become a bottleneck for the whole pipeline.
Moreover, data could also come from various sources and be present in different forms (audio, text, images, etc.). For example - one might have all the files stored on the system disk, or one might be crawling websites on the fly.
To this end, PyTorch provides separate classes for accessing, storing, and loading data, including two dataset classes (Dataset and IterableDataset) that handle data from different kinds of sources.
This article mainly focuses on Map-style vs. iterable datasets to understand their PyTorch API support.
Datasets in PyTorch
PyTorch provides two primitive classes to efficiently load our datasets for training deep neural networks. One of them is torch.utils.data.DataLoader and the other is torch.utils.data.Dataset (or torch.utils.data.IterableDataset).
The Dataset class deals with accessing/fetching and storing the data samples. The DataLoader, in contrast, deals with loading those samples efficiently in batches using parallel worker processes, so that data loading does not become a bottleneck and the model training in the main process does not have to wait for the next batch of data to become available.
To create a custom dataset class for almost any data, we can inherit from either of the base classes, Dataset or IterableDataset, and override the required functionality.
Broadly speaking, PyTorch offers support for two types of datasets:
- Map style dataset via the torch.utils.data.Dataset class, and
- Iterable style dataset via the torch.utils.data.IterableDataset class
We'll look at them one by one.
Map-style Dataset
A map-style dataset is one that implements the __getitem__() and __len__() protocols, and it represents a map from (possibly non-integral) indices or keys to data samples.
For example, when accessed with dataset[index], such a dataset could read the index-th element and its corresponding label from the data source, which could be a folder located on the disk.
The torch.utils.data.Dataset class is the base class providing support for Map-style datasets.
Any custom dataset class inherits from this base class and overrides the required functionality.
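A minimal skeleton (the class name and attributes here are placeholders) might look like this:

```python
import torch
from torch.utils.data import Dataset

class CustomMapDataset(Dataset):
    def __init__(self, data, labels):
        # store (or point to) the underlying data source
        self.data = data
        self.labels = labels

    def __len__(self):
        # return the total number of samples
        return len(self.data)

    def __getitem__(self, idx):
        # fetch the idx-th sample and its corresponding label
        return self.data[idx], self.labels[idx]
```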
The __len__ method simply returns the length of the dataset. While it is not strictly an abstract method, it is recommended that every subclass of Dataset implement it, because many implementations of torch.utils.data.Sampler and the default options of torch.utils.data.DataLoader expect it to return the size of the dataset.
__getitem__ is an abstract method of the Dataset class, and hence any subclass of torch.utils.data.Dataset must implement it.
This is a dunder method and deals with returning the input features and the corresponding label of a particular data point at a specified index.
Let us now implement an example of a map-style dataset.
We will first construct a pandas dataframe out of california_housing_train.csv, a dataset that comes pre-loaded in Google Colab.
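A possible snippet (the path assumes Colab's default sample_data folder):

```python
import pandas as pd

# california_housing_train.csv ships with Google Colab under /content/sample_data
df = pd.read_csv("sample_data/california_housing_train.csv")
df.head()
```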
which gives
index | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
---|---|---|---|---|---|---|---|---|---|
0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |
As can be seen, there are a total of 9 columns in this dataset, out of which 8 are the input features, and the last one, median_house_value, is the target column to be predicted from the other 8 features.
Now, we will create a custom dataset class suited to accessing samples from our dataset.
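One possible implementation (the class name and the column handling below are our own choices, not a fixed API):

```python
import torch
from torch.utils.data import Dataset

class CaliforniaHousingDataset(Dataset):
    def __init__(self, dataframe):
        # the first 8 columns are the input features, the last one is the target
        self.features = torch.tensor(dataframe.iloc[:, :-1].values, dtype=torch.float32)
        self.labels = torch.tensor(dataframe.iloc[:, -1].values, dtype=torch.float32)

    def __len__(self):
        # number of samples in the dataset
        return len(self.labels)

    def __getitem__(self, idx):
        # return the features and label of the idx-th sample
        return self.features[idx], self.labels[idx]

dataset = CaliforniaHousingDataset(df)
print(len(dataset))
print(dataset[0])
```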
Output :
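With the sketch above, the output would be along the following lines (the Colab training split contains 17,000 rows; exact tensor formatting may differ):

```
17000
(tensor([-1.1431e+02,  3.4190e+01,  1.5000e+01,  5.6120e+03,  1.2830e+03,
         1.0150e+03,  4.7200e+02,  1.4936e+00]), tensor(66900.))
```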
We can now wrap our dataset instance in the DataLoader class to load data in batches.
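For instance, assuming a batch size of 64:

```python
from torch.utils.data import DataLoader

# wrap the dataset to get shuffled mini-batches
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

# fetch one batch and inspect its shape
features, labels = next(iter(dataloader))
print(features.shape, labels.shape)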
Output :
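With a batch size of 64 and 8 input features, this prints:

```
torch.Size([64, 8]) torch.Size([64])
```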
That covers map-style datasets in PyTorch, how they are implemented, and how they are used with the DataLoader class to load batches of data samples.
We will now look at the other type of dataset, the iterable-style dataset, and understand how it differs from the map-style datasets we just looked at.
Iterable-style Dataset
Iterable-style datasets are primarily useful when the data arrives as part of a stream and each data point cannot readily be mapped to an index.
An iterable-style dataset can be created as an instance of a subclass of the class IterableDataset that implements the __iter__() protocol and represents an iterable over the data samples. This dataset type is particularly suitable for cases where random data reads are costly or even improbable and where the batch size depends on the fetched data rather than being pre-defined/configurable using a dataloader.
For example, when iter(dataset) is called, such a dataset could return a stream of data read from a database, from network calls to a remote server, or even from logs generated in real time.
Unlike map-style datasets, where the data loading order can be specified via the DataLoader, the data loading order for iterable-style datasets is entirely controlled by the user-defined iterable. This also means that iterable-style datasets easily allow for dynamic batch sizes, since the __iter__ protocol makes it possible to yield a whole batch of data at a time rather than a single data point.
torch.utils.data.IterableDataset class is the PyTorch primitive concerned with supporting iterable style datasets.
Let us now code an example in which we implement a custom class for an iterable-style dataset.
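A minimal sketch, modelled on the pattern used in the PyTorch documentation - the dataset simply streams the integers in [start, end) so that the worker-sharding logic stays easy to follow:

```python
import math
import torch
from torch.utils.data import IterableDataset

class CustomIterableDataset(IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:
            # single-process data loading: yield the full range
            iter_start, iter_end = self.start, self.end
        else:
            # multi-process data loading: give each worker a disjoint shard
            per_worker = int(math.ceil((self.end - self.start) / worker_info.num_workers))
            iter_start = self.start + worker_info.id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        return iter(range(iter_start, iter_end))

dataset = CustomIterableDataset(start=0, end=10)
```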
Notice the use of torch.utils.data.get_worker_info() - it is called inside a worker process to get information about that worker when an IterableDataset is used with multi-process data loading.
When num_workers > 0, the same dataset object is replicated in each worker process, so each replica must be configured differently (here, using the worker id) to avoid the different worker processes returning duplicate data.
As can be seen, the __iter__ method returns a Python iterator. An instance of this class can now be fed into the DataLoader class, and each item in the dataset will be yielded from the DataLoader iterator.
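For example, with two worker processes and a batch size of 4 (the numbers are illustrative):

```python
from torch.utils.data import DataLoader

# each of the two workers holds its own replica of the dataset;
# the sharding in __iter__ keeps the yielded items disjoint
loader = DataLoader(dataset, batch_size=4, num_workers=2)

for batch in loader:
    print(batch)
```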
Output:
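With the sketch above (worker 0 streams 0-4, worker 1 streams 5-9), the output is along these lines; the interleaving of worker batches may vary:

```
tensor([0, 1, 2, 3])
tensor([5, 6, 7, 8])
tensor([4])
tensor([9])
```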
Map-style vs Iterable Datasets
So far, we have learned about map-style and iterable-style datasets individually while understanding how each works mechanically.
However, a brief yet clear distinction between the two is crucial for a complete understanding. Therefore, we list below some points of difference that conceptually distinguish map-style datasets from iterable-style datasets.
- Map-style datasets are useful when reading the data randomly through calls like dataset[idx] is viable and the length of the full dataset is known beforehand. In contrast, iterable-style datasets are used when the data comes sequentially from a stream and the total size of the dataset may be unknown.
- With map-style datasets, a single item is loaded at a time, whereas with iterable-style datasets it is possible to yield a whole batch (or a single item) at a time; both access patterns are sketched in the snippet after this list.
- The data loading order for map-style datasets is controlled by the DataLoader through a Sampler class: the DataLoader samples the indices of the items to be loaded (this is also where the __len__ method implemented in custom dataset classes is used). On the other hand, the data loading order for iterable-style datasets is completely determined by the user implementation, and the DataLoader has nothing to do with it.
- In the case of map-style datasets, the DataLoader applies a fixed batch size and takes care of collation (possibly with a user-defined custom collate_fn). In contrast, with iterable-style datasets, batching can be implemented inside the __iter__ protocol itself.
- We must handle multiprocessing carefully with iterable-style datasets, as shown above using the worker info. With map-style datasets, on the other hand, the DataLoader internally shards the sampled indices across the different worker processes, so no extra handling is needed.
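To make the second point concrete, here is a small, self-contained sketch contrasting the two access patterns (all names and numbers are illustrative):

```python
import torch
from torch.utils.data import Dataset, IterableDataset

# Map-style: items are fetched one index at a time via __getitem__.
class SquaresMapDataset(Dataset):
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        return idx ** 2

map_ds = SquaresMapDataset(100)
print(map_ds[42])              # a single item, loaded on demand -> 1764

# Iterable-style: __iter__ itself may yield whole batches, so the batch size
# can depend on the incoming data rather than on a DataLoader argument.
class SquaresStreamDataset(IterableDataset):
    def __init__(self, n, batch_size):
        self.n = n
        self.batch_size = batch_size
    def __iter__(self):
        batch = []
        for i in range(self.n):
            batch.append(i ** 2)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:                # flush the final, possibly smaller batch
            yield batch

stream_ds = SquaresStreamDataset(10, batch_size=4)
print(next(iter(stream_ds)))   # a whole batch at a time -> [0, 1, 4, 9]
```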
Conclusion
- In this article, we started by looking at PyTorch's support for data loading using its primitive classes.
- We looked into the two broad types of datasets provided by PyTorch: map-style and iterable-style datasets.
- We learned about the distinction between Map-style vs iterable datasets while understanding each one through code examples using PyTorch.
- We also understood the use cases the different datasets are appropriate for.