@fsolgui I'm surprised there doesn't appear to be any overlap, but at the same time it's a very simple solution so I wouldn't necessarily expect a large degree of overlap either... is it the same with the dataloader memory pinned / unpinned? I'm curious to try the GIL-free Python builds, looks like torch should be supporting that now / very soon... see if that unlocks any dataloader contention.

One thing to note: if your dataloading is really lagging with images, one of the best things you can do is install pillow-simd (https://github.com/uploadcare/pillow-simd), since timm uses Pillow-based pipelines like many torch codebases. It's a bit of a pain because you constantly have to check if the simd package has been stomped over by the normal package (they have the same name, so the pip dep resolver will always install the original). I tend to keep separate, stable train envs that I don't touch for this reason.

```sh
pip uninstall pillow
CC="cc -mavx2" pip install -U --force-reinstall pillow-simd
```
**Update on Prefetching Behavior After Additional Profiling**

Following up with more profiling results and observations. My initial profiling was done using […]. Results on […]
I don't think significantly overlapping the computation kernels on GPUs is realistic; kernels tend to saturate the SMs by design. The data transfer and computation kernels are supposed to overlap, though, and the prefetcher was supposed to leverage that so it could kick off the transfer of the next batch from CPU -> GPU memory while the previous batch's computation was finishing.
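For reference, a minimal sketch of that pattern (in the spirit of timm's PrefetchLoader, not its exact code; assumes the loader yields `(input, target)` tensors and was built with `pin_memory=True`):

```python
import torch

class DataPrefetcher:
    """Side-stream prefetcher sketch: copy batch N+1 host->device while
    batch N's compute is still running on the default stream."""

    def __init__(self, loader, device="cuda"):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self._preload()

    def _preload(self):
        try:
            nxt_in, nxt_tgt = next(self.loader)
        except StopIteration:
            self.next_input = self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            # Async H2D copies: with pinned host memory these run concurrently
            # with whatever compute is still in flight on the default stream.
            self.next_input = nxt_in.to(self.device, non_blocking=True)
            self.next_target = nxt_tgt.to(self.device, non_blocking=True)

    def __iter__(self):
        while self.next_input is not None:
            # Compute must not touch this batch before its copy has finished.
            torch.cuda.current_stream().wait_stream(self.stream)
            inp, tgt = self.next_input, self.next_target
            # Tell the caching allocator these tensors are in use on the
            # compute stream, so their memory isn't recycled prematurely.
            inp.record_stream(torch.cuda.current_stream())
            tgt.record_stream(torch.cuda.current_stream())
            self._preload()  # kick off the next batch's copy right away
            yield inp, tgt
```

Usage is just `for inp, tgt in DataPrefetcher(loader): ...`; the overlap only materializes if the copies come from pinned host memory, otherwise `non_blocking=True` silently degrades to a synchronous copy.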
The long pauses are possibly backups caused by excessive IO, memory swapping, or other system events that hold up the dataloader worker processes or the main process for a moment. Having too many dataloader worker processes can make that worse.
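If the stalls do trace back to the loader, the usual knobs are worth a pass. A sketch with illustrative values (the dataset and every number here are placeholders to make it runnable; tune per machine):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset just so the snippet runs; substitute your own.
dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                        torch.randint(0, 1000, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,            # too many workers oversubscribes CPU/IO -> stalls
    pin_memory=True,          # needed for non_blocking H2D copies to overlap
    persistent_workers=True,  # skip worker respawn between epochs
    prefetch_factor=2,        # batches buffered per worker (the default)
)
```

Lowering `num_workers` is counterintuitive but often the fix when the pauses come from IO contention or swapping rather than raw decode throughput.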