This repository was archived by the owner on Sep 7, 2023. It is now read-only.

Speed up data loading #24

Closed · 2 tasks done
JackKelly opened this issue May 14, 2021 · 11 comments
Labels: data (Data processing, loading, or analysis), enhancement (New feature or request)

Comments

@JackKelly commented May 14, 2021

JackKelly added the enhancement (New feature or request) and data (Data processing, loading, or analysis) labels on May 14, 2021
@JackKelly commented May 14, 2021

The latest code (as of the evening of Fri 2021-05-14) can get ~50 it/s and only 24 GB of RAM usage without NWP loading (with 4 workers).

@JackKelly commented

NWPDataInMem.get_sample takes about 70 ms per sample, so with 8 samples per batch it takes over half a second per batch. That's probably the issue.

The interpolation (even linear) takes a while. Replacing the linear interpolation with ffill decreases the runtime of NWPDataInMem.get_sample from 70 ms to 15 ms, and increases the training speed from about 5 it/s to 15 it/s.
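For illustration, a minimal runnable sketch of the interpolation-vs-ffill trade-off, assuming xarray-style data with a hypothetical 'target_time' coordinate (not the repo's exact code):

```python
# Toy hourly NWP values; 'target_time' is an assumed coordinate name.
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2021-05-14", periods=6, freq="h")
nwp_hourly = xr.DataArray(
    np.arange(6.0), coords={"target_time": times}, dims="target_time")

# Linear interpolation: smooth, but slow when run for every sample.
linear = nwp_hourly.resample(target_time="5min").interpolate("linear")

# Forward-fill: each 5-minute slot re-uses the last hourly value. Much faster.
ffilled = nwp_hourly.resample(target_time="5min").ffill()
```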

@JackKelly commented May 14, 2021

Hmmm, maybe the issue is that get_nwp_example resamples the entire NWP field (a big image!). Some options to speed it up:

  1. Resample to 5-minutely in NWPDataLoader.load_single_chunk().
  2. Only resample the data we need. e.g. NWPDataInMem.get_sample() would return hourly data, from start.floor('h') to end.ceil('h'), and then it'd be up to the Transform to resample after selecting what we need (a sketch follows this list). I like this idea.
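A hedged sketch of what option 2 might look like; all names here (get_sample, transform, target_time, x/y dims) are assumptions for illustration, not the repo's API:

```python
import pandas as pd

def get_sample(nwp, start, end):
    """Return the *hourly* NWP data covering [start, end] (hypothetical helper)."""
    return nwp.sel(target_time=slice(pd.Timestamp(start).floor("h"),
                                     pd.Timestamp(end).ceil("h")))

def transform(hourly_sample, x_slice, y_slice):
    """Crop spatially first, then resample only the small cutout to 5-minutely."""
    cropped = hourly_sample.isel(x=x_slice, y=y_slice)
    return cropped.resample(target_time="5min").ffill()
```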

JackKelly added a commit that referenced this issue May 14, 2021
…Ps are still a little slow though (#24).  About to try resampling 'step' in load_single_chunk()
@JackKelly commented

I've implemented option 2, and it's helped a lot! NWPDataInMem.get_sample() now takes only 7.26 ms, and the system trains at 25 it/s, with GPU usage hovering around 15%.

@JackKelly commented May 14, 2021

More things to try:

  • Limit the spatial extent of the satellite imagery. DONE: Reduces the size of nwp_in_mem to 14 MB (from 37 MB), and reduces the runtime of get_sample to 5.78 ms (from 7.26 ms). Doesn't seem to speed up training much, or reduce memory during training much (with 5 workers, uses 53 GB RAM, and does about 20 it/s).
  • Run get_sample() from the 3 AsyncDataLoaders in parallel. Try both threads and processes. Thoughts: can't spawn child processes from daemonic worker processes, and not sure multiple threads will help because get_sample() is CPU-bound.
  • A VM with more RAM, and then add more workers. (10 workers, 8-bit NWP, 32-bit PV uses 78 GB, and gets about 30 it/s, GPU usage of max 22%. 12 workers = 33 it/s, 99.5 GB RAM.)
  • Use a minimal data type for NWP (uint8 for temperature; a quantisation sketch follows this list). DONE: reduces the size of nwp_in_mem to 3.6 MB (from 14 MB) and reduces the runtime of nwp_in_mem.get_sample() to 4.6 ms, down from 5.78 ms. Uses about 44 GB RAM during training with 5 workers.
  • Try again without NWP data, to see the memory usage, the training speed (it/s), and GPU usage. DONE: without NWP, and with 12 workers, uses 71 GB RAM. Achieves 71 it/s and max GPU utilisation of 40%.
  • Is the PV data using lots of memory? If so, use a minimal data type for PV? Share data between processes?!
  • Try loading a complete batch at once.
  • Try using different processes for each data source: can't spawn child processes from daemonic worker processes!
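For the "minimal data type" bullet, a hedged sketch of linear uint8 quantisation for temperature; the Kelvin range below is an assumed placeholder, not a value from the repo:

```python
import numpy as np

T_MIN, T_MAX = 250.0, 320.0  # assumed plausible temperature range, in Kelvin

def to_uint8(temperature_k: np.ndarray) -> np.ndarray:
    """Map [T_MIN, T_MAX] onto [0, 255] and cast to uint8 (lossy)."""
    scaled = (temperature_k - T_MIN) / (T_MAX - T_MIN)
    return np.clip(scaled * 255, 0, 255).astype(np.uint8)

def from_uint8(quantised: np.ndarray) -> np.ndarray:
    """Approximate inverse: recover float32 Kelvin from the uint8 codes."""
    return quantised.astype(np.float32) / 255 * (T_MAX - T_MIN) + T_MIN
```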

@JackKelly commented

So we know that including NWP data slows training down by a factor of more than 2x.

get_sample takes 4.3 ms for NWP and 1.13 ms for satellite data. So maybe we need to speed up the NWP get_sample?

@JackKelly commented

Profiling each line in get_nwp_example:

0.179 ms: date_range
1.686 ms: nwp.sel(init_time=target_times_hourly, method='ffill')
0.157 ms: init_time_future
0.043 ms: init_times[target_times_hourly > t0_hourly]
0.216 ms: steps = target_times_hourly - init_times
0.360 ms: init_time_indexer
0.103 ms: step_indexer
1.526 ms: nwp.sel(init_time=init_time_indexer, step=step_indexer)
CPU times: user 7.57 ms, sys: 0 ns, total: 7.57 ms
Wall time: 6.46 ms
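For context, a runnable toy version of the kind of selection being profiled above: pairing each target time with the most recent init_time and the matching forecast step via xarray's pointwise (vectorised) indexing. All shapes and names are illustrative assumptions:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy NWP array: forecasts initialised every 3 hours, with hourly steps.
init_time = pd.date_range("2021-05-14", periods=4, freq="3h")
step = pd.timedelta_range("0h", "6h", freq="h")
nwp = xr.DataArray(
    np.random.rand(len(init_time), len(step)),
    coords={"init_time": init_time, "step": step},
    dims=("init_time", "step"),
)

# For each target time, take the most recent init_time (ffill) and the
# step that lands on that target time.
target_times_hourly = pd.date_range("2021-05-14 06:00", periods=4, freq="h")
init_times = nwp.sel(init_time=target_times_hourly, method="ffill").init_time.values
steps = target_times_hourly.values - init_times

# Indexers sharing a dim give pointwise selection: one value per target time.
init_time_indexer = xr.DataArray(init_times, dims="target_time")
step_indexer = xr.DataArray(steps, dims="target_time")
selected = nwp.sel(init_time=init_time_indexer, step=step_indexer)
```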

@JackKelly commented May 17, 2021

Oooh... it looks like it's possible to significantly speed up the selection based on 'step' by first transposing so that 'step' is the first dimension. This gets the runtime down to 1.73 ms if always using the first init_time. Need to see if this speed-up holds when using multiple init times based on t0.
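Continuing the toy arrays from the sketch above, the transpose experiment might look like this (a sketch, not the repo's code; the memory-layout explanation is an assumption about why it's faster):

```python
# Make 'step' the leading dimension so step-based selection reads
# contiguous memory.
nwp_t = nwp.transpose("step", ...)

# The fast case reported above: fix init_time to the first one,
# then select by step only.
selected = nwp_t.isel(init_time=0).sel(step=step_indexer)
```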

@JackKelly commented May 17, 2021

Nope, doesn't look like transposing gives us the same performance increase when selecting multiple init times.

But, better news: I noticed that, when using NWPs, the code is almost constantly loading from disk when min_n_samples_per_disk_load = 1000 and max_n_samples_per_disk_load = 2000. Increasing these to 4,000 and 8,000, respectively, gets us up to 50 it/s after 30,000 iterations (yay!) with NWPs and 12 workers.

To really speed things up, I think we perhaps need to re-create the NWP Zarr, so the data is stored more efficiently on disk (#26).
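A hedged sketch of what re-creating the Zarr could look like; the paths and chunk sizes are assumptions, and #26 tracks the real plan:

```python
import xarray as xr

nwp = xr.open_zarr("nwp.zarr")                 # hypothetical path
nwp = nwp.chunk({"init_time": 1, "step": -1})  # -1 = one chunk for that dim
for name in nwp.variables:
    nwp[name].encoding.pop("chunks", None)     # drop stale chunk encodings
nwp.to_zarr("nwp_rechunked.zarr", mode="w")
```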

@JackKelly commented

Swapping back to the 'old', more thorough way of getting NWPs gives us 47.8 it/s.

@JackKelly commented

Can't launch sub-processes from the worker processes: daemonic child processes aren't allowed to have children :)
