@fsolgui I'm surprised there doesn't appear to be any overlap, but at the same time it's a very simple solution so I wouldn't necessarily expect a large degree of overlap either... is it the same with the dataloader memory pinned / unpinned? I'm curious to try the GIL-free Python builds, looks like torch should be supporting that now / very soon... see if that unlocks any dataloader contention.

One thing to note: if your dataloading is really lagging with images, one of the best things you can do is install pillow-simd (https://github.com/uploadcare/pillow-simd), since timm uses Pillow-based pipelines like many torch codebases. It's a bit of a pain because you constantly have to check if the simd package has been stomped over by the normal package (they have the same name, so the pip dep resolver will always install the original). I tend to keep separate, stable train envs that I don't touch for this reason.

```sh
pip uninstall pillow
CC="cc -mavx2" pip install -U --force-reinstall pillow-simd
```
**Update on Prefetching Behavior After Additional Profiling**

Following up with more profiling results and observations. My initial profiling was done using […]. Results on […]
I don't think significantly overlapping the computation kernels on GPUs is realistic; kernels tend to saturate the SMs by design. The data transfer and computation kernels are supposed to overlap, though, and the prefetcher was supposed to leverage that so it could kick off the transfer of the next batch from CPU -> GPU memory while the previous batch's computation was finishing.
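For reference, a minimal sketch of that pattern (in the spirit of timm's PrefetchLoader, not its exact code; assumes the loader yields `(input, target)` tensors and was built with `pin_memory=True`):

```python
import torch

class DataPrefetcher:
    """Side-stream prefetcher sketch: copy batch N+1 host->device while
    batch N's compute is still running on the default stream."""

    def __init__(self, loader, device="cuda"):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self._preload()

    def _preload(self):
        try:
            nxt_in, nxt_tgt = next(self.loader)
        except StopIteration:
            self.next_input = self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            # Async H2D copies: with pinned host memory these run concurrently
            # with whatever compute is still in flight on the default stream.
            self.next_input = nxt_in.to(self.device, non_blocking=True)
            self.next_target = nxt_tgt.to(self.device, non_blocking=True)

    def __iter__(self):
        while self.next_input is not None:
            # Compute must not touch this batch before its copy has finished.
            torch.cuda.current_stream().wait_stream(self.stream)
            inp, tgt = self.next_input, self.next_target
            # Tell the caching allocator these tensors are in use on the
            # compute stream, so their memory isn't recycled prematurely.
            inp.record_stream(torch.cuda.current_stream())
            tgt.record_stream(torch.cuda.current_stream())
            self._preload()  # kick off the next batch's copy right away
            yield inp, tgt
```

Usage is just `for inp, tgt in DataPrefetcher(loader): ...`; the overlap only materializes if the copies come from pinned host memory, otherwise `non_blocking=True` silently degrades to a synchronous copy.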
The long pauses are possibly backups caused by excessive IO, memory swapping, or other system events that hold up the dataloader worker processes or the main process for a moment. Having too many dataloader worker processes can make that worse.
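If the stalls do trace back to the loader, the usual knobs are worth a pass. A sketch with illustrative values (the dataset and every number here are placeholders to make it runnable; tune per machine):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset just so the snippet runs; substitute your own.
dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                        torch.randint(0, 1000, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,            # too many workers oversubscribes CPU/IO -> stalls
    pin_memory=True,          # needed for non_blocking H2D copies to overlap
    persistent_workers=True,  # skip worker respawn between epochs
    prefetch_factor=2,        # batches buffered per worker (the default)
)
```

Lowering `num_workers` is counterintuitive but often the fix when the pauses come from IO contention or swapping rather than raw decode throughput.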