"can't pickle thread.lock objects" when working with published dataframe #1556
I'm not able to reproduce with this simple example:

In [1]: import dask.dataframe as dd
In [2]: from distributed import Client
In [3]: client = Client()
In [4]: df = dd.read_csv("s3://dask-data/airline-data/1987.csv", storage_options={'anon': True})
In [5]: client.publish_dataset(ds_name=df)
In [6]: ds = client.get_dataset('ds_name')
In [7]: ds.compute()

Could you try adapting that example until you reproduce the failure?
|
You can reproduce it on a local cluster, but you need to load CSV data from S3. |
Thanks, that's valuable information. I've updated the example. |
Same issue here. Any update? |
It would be useful to see a full traceback from a minimal example. |
A simple example that reads from an S3 file and persists it on the cluster. We use server-side encryption for the S3 bucket.

import dask.dataframe as dd
from distributed import Client
from botocore import session  # assuming botocore's session module here

# get_dask_scheduler_file() and configure_session() are internal helpers that
# locate our scheduler file and attach our credentials to the session
client = Client(scheduler_file=get_dask_scheduler_file())
boto_session = session.Session()
boto_session = configure_session(boto_session, credential)
df = dd.read_csv('s3://data.csv', storage_options={'botocore_session': boto_session})
df = client.persist(df)
|
My first guess is that the boto session object is what's failing to pickle. |
I can try other ways to pass the credentials, but this is the recommended way inside our company. Do you think dask's SerializableLock could help? I see you have solved similar issues for read_hdf().
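For reference, a minimal illustration of what I have in mind (not from our codebase): a plain threading lock can't be pickled, but dask's SerializableLock can.

import pickle
import threading
from dask.utils import SerializableLock

lock = threading.Lock()
# pickle.dumps(lock)  # raises TypeError: can't pickle thread.lock objects

slock = SerializableLock()
slock2 = pickle.loads(pickle.dumps(slock))  # round-trips fine
with slock2:
    pass  # behaves like a regular lock; instances sharing a token share one underlying lock
|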
My guess is that the boto library has a lock in it somewhere that we're not going to be able to touch. You could ask them upstream, but they'll probably say "why would you want to pass around session objects? This may be unsafe."

Alternatively, what I often see in production is that some other mechanism is used to manage security, so that when workers go to grab the default credentials they already have them automatically. For example, maybe environment variables or .boto files are pre-populated. Moving credentials around within a computational framework (like dask, hadoop, spark, ...) is sometimes considered unsafe.
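For example, something along these lines (bucket name illustrative), relying on each worker already holding default credentials via environment variables or a pre-populated config file:

# no session object is shipped; s3fs/botocore resolve the default credential
# chain (environment variables, ~/.aws/credentials, ...) on every worker
import dask.dataframe as dd
from distributed import Client

client = Client(scheduler_file=get_dask_scheduler_file())  # your helper, as above
df = dd.read_csv('s3://your-bucket/data.csv')  # no storage_options needed
df = client.persist(df)
|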
Thanks for the suggestions! We're still at an early proof-of-concept stage, so we'll think about other ways to handle the credentials. |
I seem to encounter the same problem with a published dataset. |
We're using the dask distributed scheduler with multiprocessing workers on an EC2 cluster.
dask 0.15.4 and distributed 1.19.3
I'm trying to publish a named dataset (a dataframe), then retrieve it and continue working with it.
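A minimal sketch of that workflow (the dataset name, scheduler file, and S3 path are placeholders):

import dask.dataframe as dd
from distributed import Client

client = Client(scheduler_file='scheduler.json')  # connect to the cluster
df = dd.read_csv('s3://bucket/data.csv')

client.publish_dataset(my_dataset=df)   # publish under a name on the scheduler

df2 = client.get_dataset('my_dataset')  # retrieve it, possibly from another client
df2.compute()                           # this is where we hit the TypeError below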
This results in a "TypeError: can't pickle thread.lock objects" error. I suppose this might be related to:
#780
dask/dask#1683
#539
I don't know how to work around this, because read_csv() doesn't seem to accept a lock argument.
Full traceback: traceback.txt