
running out of memory trying to write SQL #1874

Closed
lstillwe opened this issue Jan 31, 2018 · 3 comments

Comments

@lstillwe

Python version: 3
xarray version: 0.9.6

I am using xarray to read very large NetCDF files (~47G).

Then I need to write the data to a postgres DB. I have tried parsing the array and using an INSERT for every row, but this is taking a very long time (weeks).

I have read that bulk insert would be a lot faster, so I am looking for a solution along those lines.

I also saw that Pandas has a DataFrame.to_sql() function and xarray has Dataset.to_dataframe() function, so I was trying out this approach. However, when trying to convert my xarray Dataset to a Pandas Dataframe, I ran out of memory quickly.
Is this expected behavior? If so, can you suggest another solution to this problem?

@max-sixty
Collaborator

There's no xarray->SQL connector, unfortunately.

I don't have that much experience here, so I'll let others chime in. You could try chunking to pandas and then to Postgres (but with pandas you'll always be limited by memory). If there's a NetCDF -> tabular connector, that would let you operate beyond memory.
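
For illustration, a minimal sketch of what the chunked/lazy open can look like on the xarray side, assuming dask is installed and the file has a `time` dimension (both are assumptions about your data):

```python
import xarray as xr

# Open lazily with dask chunks -- nothing is read into memory yet.
# The chunk size and the "time" dimension name are placeholders.
ds = xr.open_dataset("big_file.nc", chunks={"time": 1000})
print(ds)  # inspect variables and dimensions without loading the data
```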

@shoyer
Member

shoyer commented Feb 4, 2018

Then I need to write the data to a postgres DB. I have tried parsing the array and using an INSERT for every row, but this is taking a very long time (weeks).

I'm not a postgres expert, but I suspect it does indeed have some sort of bulk insert facility.
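
For example, Postgres's COPY command is its usual bulk-load path; a rough sketch using psycopg2, where the connection string, table name, and helper are all placeholders:

```python
import io

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection

def copy_dataframe(df, table):
    """Bulk-load a pandas DataFrame into Postgres via COPY (far faster than per-row INSERTs)."""
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()
```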

However, when trying to convert my xarray Dataset to a Pandas Dataframe, I ran out of memory quickly.

If you're working with a 47GB netCDF file, you probably don't have a lot of memory to spare. Often pandas.DataFrame objects can use significantly more memory than xarray.Dataset, especially keeping in mind that an xarray Dataset can lazily reference data on disk but a DataFrame is always in memory. The best strategy is probably to slice the Dataset into small pieces and to individually convert those.
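
A sketch of that slicing strategy, where the `time` dimension, connection URL, table name, and slice size are all placeholders to adapt:

```python
import xarray as xr
from sqlalchemy import create_engine

engine = create_engine("postgresql://me@localhost/mydb")  # placeholder URL
ds = xr.open_dataset("big_file.nc")  # lazy: data stays on disk until sliced

step = 1000  # number of "time" steps per piece; tune to available memory
for start in range(0, ds.dims["time"], step):
    piece = ds.isel(time=slice(start, start + step))
    df = piece.to_dataframe()  # only this piece is loaded into memory
    df.to_sql("my_table", engine, if_exists="append")
```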

@jhamman
Member

jhamman commented Feb 4, 2019

This can be done using xarray -> zarr -> SQL (zarr-developers/zarr-python#368). Additional databases are also available as stores in zarr.
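
Roughly, assuming zarr 2.x (which ships a SQLiteStore among its database-backed stores; file names here are placeholders):

```python
import xarray as xr
import zarr

ds = xr.open_dataset("big_file.nc")

# Write the dataset into a SQLite database through zarr's store interface.
store = zarr.storage.SQLiteStore("data.sqlite")
ds.to_zarr(store=store)
store.close()
```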

@jhamman jhamman closed this as completed Feb 4, 2019