Skip to content

Thread-safety in xarray #9836

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
max-sixty opened this issue Nov 28, 2024 · 6 comments
Open

Thread-safety in xarray #9836

max-sixty opened this issue Nov 28, 2024 · 6 comments

Comments

@max-sixty
Copy link
Collaborator

max-sixty commented Nov 28, 2024

What is your issue?

There's a process that can't use xarray because it's a multithreaded context, and xarray objects aren't thread-safe, because pandas indexes aren't thread-safe pandas-dev/pandas#2728

I did some quite exploration on where the issue was. I don't have a clear answer but thought it would be worthwhile to post some results.

Here's a function that tests a few operations:

import traceback
import pandas as pd
import xarray as xr
import numpy as np
import concurrent.futures
import copy

# Dict of test cases - just function and its string representation
tests = {
    'x.reindex(dim_0=x.dim_0)': lambda s, x, idx: x.reindex(dim_0=x.dim_0),
    'x.copy().reindex(dim_0=idx.copy())': lambda s, x, idx: x.copy().reindex(dim_0=idx.copy()),
    'copy.deepcopy(idx)': lambda s, x, idx: copy.deepcopy(idx),
    'x.reindex(dim_0=idx.copy())': lambda s, x, idx: x.reindex(dim_0=idx.copy()),
    'x.reindex(dim_0=idx)': lambda s, x, idx: x.reindex(dim_0=idx),
    'x.reindex(dim_0=x.dim_0); x.reindex(dim_0=idx)': 
        lambda s, x, idx: (x.reindex(dim_0=x.dim_0), x.reindex(dim_0=idx)),
    'x.sel(dim_0=idx); x.reindex(dim_0=idx.copy())':
        lambda s, x, idx: (x.sel(dim_0=idx), x.reindex(dim_0=idx.copy()))
}

def run_test(test_fn, n=1000):
    try:
        a = np.arange(0, 30000)
        def gen_args():
            for i in range(n):
                if i % 1000 == 0:
                    s = pd.Series(data=a, index=a)
                    x = xr.DataArray(a, dims=['dim_0'], coords={'dim_0': a})
                    test_fn(s, x, a)
                yield s, x, np.arange(0, 1000)
                
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
            list(executor.map(lambda args: test_fn(*args), gen_args()))
        return True
    except Exception as e:
        return False, str(e)

# Run all tests and collect results
results = {}
for test_str, fn in tests.items():
    print(f"Running: {test_str}")
    results[test_str] = run_test(fn)

# Print summary
print("\nResults Summary:")
print("=" * 50)
for test_str, result in results.items():
    status = "✓ PASS" if result is True else "✗ FAIL"
    print(f"{status} - {test_str}")

The results:


Results Summary:
==================================================
✓ PASS - x.reindex(dim_0=x.dim_0)
✓ PASS - x.copy().reindex(dim_0=idx.copy())
✓ PASS - copy.deepcopy(idx)
✗ FAIL - x.reindex(dim_0=idx.copy())
✗ FAIL - x.reindex(dim_0=idx)
✗ FAIL - x.reindex(dim_0=x.dim_0); x.reindex(dim_0=idx)
✓ PASS - x.sel(dim_0=idx); x.reindex(dim_0=idx.copy())

A couple of things to note:

  • Most .sel operations seemed to pass, most .reindex operations seem to fail
  • Running x.reindex(dim_0=idx.copy()) fails, but running x.sel(dim_0=idx) beforehand makes the initial .reindex pass (?!)
  • Reindexing with x.reindex(dim_0=x.dim_0) works, but not with an index that's passed in, x.reindex(dim_0=idx). (Could we claim that when xarray objects are accessed by different threads, they're safe? I'd be surprised if that were the case, but couldn't immediately see a case where that was falsified...)

I don't think there are any easy answers to this; if we wanted xarray / parts of xarray to be threadsafe. I think it's a bunch of bad options:

  • Remove pandas dependency (could there be a slower dumber indexing option as a plugin?)
  • Protect every indexing with a lock
  • ?
@max-sixty max-sixty added needs triage Issue that has not been reviewed by xarray team member topic-internals and removed needs triage Issue that has not been reviewed by xarray team member labels Nov 28, 2024
@shoyer
Copy link
Member

shoyer commented Feb 12, 2025

My hope would be that pandas can fix this thread-safety issue upstream. Hopefully they are more amenable to doing so now (in 2025) than they were in 2013.

@alippai
Copy link

alippai commented May 25, 2025

In a thread where x1 and x2 is shared I write:

x3 = xr.merge(x1, x2)

why do I hit the above issue?
At least I get data duplication error randomly, so I guess I'm hitting this.
I assumed since I don't modify x1 and x2, I shouldn't see threadsafety issues.

The actual message I get: ValueError: cannot reindex or align along dimension '***' because the (pandas) index has duplicate values

@max-sixty
Copy link
Collaborator Author

if that's the whole program, and there are no other threads, you shouldn't hit an issue

if that's running in a thread pool or similar, then you can get that sort of problem... (and iirc that is a message you would get...)

@alippai
Copy link

alippai commented May 25, 2025

Yes, this is within threadpool.starmap, x1 and x2 passed as parameters

@alippai
Copy link

alippai commented May 25, 2025

What's the workaround? x3 = xr.merge(x1.copy(deep=True), x2.copy(deep=True))?
My understanding is that simple .copy() doesn't work. Or maybe I can reuse the data, but deep copy the index?

@max-sixty
Copy link
Collaborator Author

those sound like reasonable things to try! I'm not confident on the solution unfortunately

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants