BUG: InvalidIndexError #42
Comments
Doesn't seem to happen when not using multiprocessing? Also can't seem to replicate it in testing_NWPDataSource. Maybe try older versions of Pandas?!? |
Huh, it does happen with just 1 worker (with Pandas 1.2.5):
(after 1025 iterations, whilst validating I think???) |
Possibly related? pandas-dev/pandas#39882 |
Try disabling the multi-threaded loop in |
Error does still occur with a
|
It also happens with a This is a known issue: pandas-dev/pandas#21150 And here's the output of some debugging (which is only possible when using
|
I re-wrote
init_times = nwp_ds.data.sel(init_time=target_times_hourly, method='ffill').init_time.values
in NumPy:
indexes = np.searchsorted(self.data.init_time, target_times_hourly, side='right')
indexes -= 1  # Because searchsorted returns the index _after_ the index we want.
init_times = self.data.init_time.values[indexes]
Which works! But now we're hitting the
Some possible fixes:
|
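For anyone following along, here is a minimal, self-contained sketch of the searchsorted trick from the NumPy rewrite above, using toy timestamps. The function name `ffill_lookup` and the sample data are made up for illustration and are not from the repo. It assumes the init times are sorted ascending. One caveat it guards against: if a target is earlier than the first init time, `indexes - 1` becomes -1, which would silently wrap around to the last element.

```python
import numpy as np


def ffill_lookup(sorted_times, targets):
    """Return, for each target, the latest entry of sorted_times that is <= target.

    Hypothetical helper: mimics .sel(..., method='ffill') on a sorted 1-D array.
    """
    # side='right' gives the insertion point *after* any equal entries,
    # so subtracting 1 lands on the last entry that is <= target.
    indexes = np.searchsorted(sorted_times, targets, side='right') - 1
    if (indexes < 0).any():
        # Without this check, -1 would index the *last* element.
        raise KeyError("some targets are earlier than the first init_time")
    return sorted_times[indexes]


# Toy data: NWP init times every 3 hours, targets every hour or so.
init_time = np.array(
    ['2020-01-01T00', '2020-01-01T03', '2020-01-01T06'], dtype='datetime64[h]')
target_times_hourly = np.array(
    ['2020-01-01T00', '2020-01-01T02', '2020-01-01T05', '2020-01-01T07'],
    dtype='datetime64[h]')

init_times = ffill_lookup(init_time, target_times_hourly)
```

Because this works on plain arrays, it never goes through the Pandas index machinery at all.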
Try (5); and use the new NWP Zarr file, to see if that speeds things up. If not, try further shrinking the NWP Zarr. |
With no threading, and loading NWPs, PV & Sat, getting 1.7 secs per iteration (which is horrible. Before adding NWPs, we were getting more like 20 it/s!) Trying (5) (use one thread per DataSource)... UPDATE: Hmm, still gets same performance. I guess it's being swamped by reading enormous volumes of NWP data! |
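Roughly what "one thread per DataSource" means here, as a sketch only. The loader functions and `load_all_sources` are hypothetical stand-ins, not the repo's actual DataSource API; the real classes would do I/O where these just return tuples.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for the per-source loading calls.
def load_nwp(t0):
    return ("nwp", t0)


def load_pv(t0):
    return ("pv", t0)


def load_sat(t0):
    return ("sat", t0)


def load_all_sources(loaders, t0):
    """Run each source's loader in its own thread and collect results by name."""
    with ThreadPoolExecutor(max_workers=len(loaders)) as pool:
        futures = {name: pool.submit(fn, t0) for name, fn in loaders.items()}
        # .result() blocks until that source has finished (and re-raises its errors).
        return {name: fut.result() for name, fut in futures.items()}


example = load_all_sources({"nwp": load_nwp, "pv": load_pv, "sat": load_sat}, t0=5)
```

This only overlaps the sources with each other, so if one source dominates the wall time (as NWP seems to here), total time per example barely changes.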
Huh. Using smaller NWP Zarr doesn't help much. OK. Let's go for (7). Then, in NWPDataSource, instead of 'manually' creating threads, we can lazily build the batch, and then call dask.compute() at the end of the batch. And, for SatelliteDataSource, we can go back to using threads for each example. |
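A rough sketch of the "lazily build the batch, then call dask.compute() once" idea. `load_nwp_example` and `build_batch` are hypothetical names, and the loader just fabricates a small array; in the real NWPDataSource the lazy objects would presumably come from the Dask-backed xarray selection rather than from `dask.delayed`.

```python
import dask
import numpy as np


@dask.delayed
def load_nwp_example(t0):
    # Hypothetical stand-in for selecting one example's NWP data.
    # Nothing runs here until dask.compute() is called.
    return np.full((2, 2), t0, dtype=np.float32)


def build_batch(t0s):
    # Build the whole batch lazily: this is just a list of task graphs.
    lazy_examples = [load_nwp_example(t0) for t0 in t0s]
    # One compute() call for the whole batch, so Dask schedules the loads
    # itself instead of us creating threads by hand.
    examples = dask.compute(*lazy_examples, scheduler="threads")
    return np.stack(examples)


batch = build_batch([0, 1, 2, 3])
```

`dask.compute` returns a tuple with one result per lazy argument, in the same order, so stacking it gives a batch with the examples along the first axis.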