Excessive memory usage when printing multi-file Dataset #1481
Hi @hadfieldnz -- I believe this issue could be related to #1396, which was fixed in dask/dask#2364. Could you let us know what versions of xarray and dask you are using?

import xarray
import dask
print(xarray.__version__)
print(dask.__version__)
xarray 0.9.6
dask 0.14.3
dask 0.14.3 pre-dates the fix dask/dask#2364 mentioned above: can you try updating dask?
I ran "conda update dask", which upgraded me from 0.14.3 to 0.15.0. Short report: No this has not eliminated the problem. Long report: Today (Friday) I am on my home machine, which has only 6 GiB RAM. I confirmed earlier today with dask 0.14.3 that I can open and print the dataset with 25 files. And with 10 files IPython halts with a memory error reporting that 85% of the memory is being used. After the upgrade to 0.15.0, running the test script with 10 files, it exhausted all the RAM on my machine and locked it up within a few seconds. I will not be able to investigate this further until I get back on my work machine on Monday. |
Can you try calling open_mfdataset with decode_cf=False?
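A minimal sketch of that suggestion, assuming the xarray 0.9-era open_mfdataset API; the file pattern and concatenation dimension are taken from the original report and are assumptions here:

```python
import xarray

# Open the multi-file dataset without CF decoding, so scale_factor/add_offset
# and time units are left unapplied; concat_dim is an assumption.
ds = xarray.open_mfdataset('roms_avg_*.nc', concat_dim='ocean_time',
                           decode_cf=False)
print(ds)
```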
Our formatting logic pulls out the first few values of arrays to print them in the repr. It appears that this is failing spectacularly in this case, though I'm not sure why. Can you share a quick preview of what a single one of your constituent netCDF files looks like? More broadly: maybe we should disable automatically printing a preview of the contents of dask-backed arrays.
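One low-cost way to produce such a preview, assuming the single-file name given later in the thread:

```python
import xarray

# Open one constituent file on its own; only metadata and a few values
# are read when the summary is printed.
single = xarray.open_dataset('roms_avg_0001.nc')
print(single)   # dimensions, coordinates, variables (with short value previews) and attributes
single.close()
```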
Back at work and able to check things out more thoroughly on a machine with more RAM...

A good number of files to trigger the problem is 10. As reported before, upgrading dask from 0.14.3 to 0.15.0 did not fix the problem. It seemed to speed up the handling of multi-file datasets generally, therefore causing my PC to crash faster when it crashes.

Ryan, calling open_mfdataset with decode_cf=False does allow me to open and print the 10-file dataset, though this still seems to use an uncomfortably large amount of RAM: about 7 GiB in the Python kernel process, vs only a few hundred MiB for the 25-file dataset.

Stephan, although I discovered this problem when dealing with a 25-file sequence, I boiled it down to a test case involving one file opened multiple times before reporting it here. There is a copy of the file (2.27 GiB) in a publicly accessible location here:

ftp://ftp.niwa.co.nz/incoming/hadfield/roms_avg_0001.nc

and here is the output of ncdump -h:

netcdf roms_avg_0001 {
// global attributes:
In response to your comment, Stephan: speaking rather selfishly, as someone who is quite good at finding bugs in scientific software but not much use in fixing them, my worry is that the bugs no longer uncovered by printing the dataset preview would come back to bite me some other way.
@hadfieldnz - I think this was just fixed in #1532. Keep an eye out for the 0.10 release. Feel free to reopen if you feel there's more to do here.
I have a dataset comprising 25 output files from the ROMS ocean model. They are netCDF files ("averages" files in ROMS jargon) containing a number of variables, but most of the storage is devoted to a few time-varying oceanographic variables, either 2D or 3D in space. I have post-processed the files by packing the oceanographic variables to int32 form using the netCDF add_offset and scale_factor attributes. Each file has 100 records in the unlimited dimension (ocean_time) so the complete dataset has 2500 records. The 25 files total 56.8 GiB so would expand to roughly 230 GiB in float64 form.
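For context, the CF packing convention referred to here stores each value as a small integer and reconstructs the physical value as stored * scale_factor + add_offset when the file is decoded; a minimal illustration with made-up numbers:

```python
import numpy as np

# Hypothetical packing parameters; the real files define their own per variable.
scale_factor = 0.001
add_offset = 10.0

packed = np.array([12345, 12350, 12360], dtype=np.int16)   # as stored on disk
unpacked = packed * scale_factor + add_offset               # float64 after decoding
print(unpacked)   # approximately [22.345, 22.35, 22.36]
```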
I open the 25 files with xarray.open_mfdataset, concatenating along the unlimited dimension. This takes a few seconds. I then print() the resulting xarray.Dataset. This takes a few seconds more. All good so far.
But when I vary the number of these files, n, that I include in my xarray.Dataset, I get surprising and inconvenient results. All works as expected in reasonable time with n <= 8 and with n >= 19. But with 9 <= n <= 18, the interpreter that's processing the code (pythonw.exe via IPython) consumes steadily more memory until the 12-14 GiB that's available on my machine is exhausted.
The attached script exposes the problem. In this case the file sequence consists of one file name repeated n times. The value of n currently hard-coded into the script is 10. With this value, the final statement in the script--printing the dataset--will exhaust the memory on my PC in about 10 seconds, if I fail to kill the process first.
I have put a copy of the ROMS output file here:
ftp://ftp.niwa.co.nz/incoming/hadfield/roms_avg_0001.nc
mgh_example_test_mfdataset.py.txt:
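A minimal sketch of a script along these lines, based on the description above; the local filename and the concatenation dimension used here are assumptions:

```python
# Sketch of a test script along the lines of mgh_example_test_mfdataset.py
import xarray

n = 10   # 9 <= n <= 18 triggers the problem on the reporter's machine
file_list = ['roms_avg_0001.nc'] * n   # one file name repeated n times

# Concatenate along the unlimited time dimension (takes a few seconds).
ds = xarray.open_mfdataset(file_list, concat_dim='ocean_time')

# Printing the dataset is the step that exhausts memory.
print(ds)
```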