Not all Pandas dataframes are shared in a multiprocessing list #20792

Closed
freezas opened this issue Apr 23, 2018 · 6 comments

freezas commented Apr 23, 2018

Hello,

I tried to get an answer to this question on StackOverflow first, but I hope some of you can explain this and hopefully lead us to a solution.

The StackOverflow question is here: https://stackoverflow.com/questions/49942878/not-all-pandas-dataframes-are-shared-in-a-multiprocessing-list

I've also added an error callback and managed to get an error:

RemoteError('Traceback (most recent call last):
  File "lib\multiprocessing\managers.py", line 228, in serve_client
    request = recv()
  File "lib\multiprocessing\connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
AttributeError: Can't get attribute 'DataFrame' on <module 'pandas.core.frame' from 'lib\site-packages\pandas\core\frame.py'>')

I've looked through the GitHub tracker and found an issue that looks a lot like mine: #2440. However, there are a few differences:

  • I'm using multiprocessing instead of threading. Because of this, we can use a multiprocessing.Pool and a manager-backed list to share objects between processes.
  • In our example, we don't actually change the dataframe in the different processes. We're only adding it to the list of shared objects.
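For context, a minimal sketch of the kind of setup described in the two points above: workers append a DataFrame to a manager-backed list without modifying it. The function and variable names here are illustrative, not the exact code from the StackOverflow post.

```python
import multiprocessing as mp

import pandas as pd


def append_frame(shared_list):
    # Build a small DataFrame in the worker and append it to the
    # manager-backed list; the proxy pickles the frame on the way over.
    df = pd.DataFrame({"a": [1, 2, 3]})
    shared_list.append(df)


if __name__ == "__main__":
    with mp.Manager() as manager:
        shared = manager.list()
        with mp.Pool(processes=4) as pool:
            pool.map(append_frame, [shared] * 4)
        # With 4 tasks, 4 appends are attempted; fewer entries in the
        # list would be the symptom reported in this issue.
        print(len(shared))
```

The manager list lives in a separate server process, so every append goes through pickling and a socket/pipe round-trip, which is where the AttributeError in the traceback above is raised.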

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2 (I've also tested this with pandas version 0.22.0, which I believe was the latest)
nose: 1.3.7
pip: 10.0.0
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.13.1
scipy: 1.0.1
statsmodels: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
matplotlib: 2.1.1
openpyxl: 2.4.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
httplib2: None
apiclient: None
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
boto: 2.48.0
pandas_datareader: None

If you need anything else, let me know. We appreciate all the work you've done!


jreback commented Apr 23, 2018

There is not enough detail here to even guess at what is wrong. This is likely not a pandas problem, but rather a usage issue with multiprocessing.


jorisvandenbossche commented Apr 23, 2018

@jreback did you see the full reproducible example on StackOverflow?
(Not that I can tell from it what is going on, but at least it gives the question some detail. If there is still not enough detail, please request clarification or changes to the reproducible example.)


KhaledTo commented Apr 24, 2018

I was able to reproduce this, but I also had the same issue when sharing lists or ints (not only pandas DataFrames).

My setup:

macOS High Sierra
1.3 GHz Intel Core i5
python: 3.6.5
numpy: 1.14.2
pandas: 0.22.0


freezas commented Apr 24, 2018

@KhaledTo How did you create the lists or ints to reproduce this? It bothers me that I couldn't reproduce it with lists or ints myself... Thanks for trying to reproduce it!

@KhaledTo

Hi @freezas, yes, it's better if you check whether what I did makes sense.

I added this to my_function.py:

def share_random_pandas_dataframe(shared_list):
    list_int = [1, 2, 3]
    shared_list.append(list_int)

In multiprocessing_example.py I then set processes_count to 19:

processes_count = 19

My pleasure.
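Put together, the reproduction attempt described above amounts to something like the following. This is a sketch under my reading of the comment: the worker function appends a plain list of ints, and the rest of the multiprocessing scaffolding (Manager, Pool, map) is assumed rather than quoted from the original files.

```python
import multiprocessing as mp


def share_ints(shared_list):
    # Same shape as share_random_pandas_dataframe above, but appending a
    # plain list of ints, to check whether the problem is pandas-specific.
    shared_list.append([1, 2, 3])


if __name__ == "__main__":
    processes_count = 19  # the count from the comment above
    with mp.Manager() as manager:
        shared = manager.list()
        with mp.Pool(processes=processes_count) as pool:
            pool.map(share_ints, [shared] * processes_count)
        # 19 appends are attempted; anything less than 19 entries here
        # would reproduce the "not all objects are shared" symptom.
        print(len(shared))
```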


freezas commented Apr 25, 2018

@KhaledTo Weird, that still doesn't seem to raise the same problem on my computer. Maybe it behaves differently on different operating systems?

If it also happens with other data structures/types, it's probably a multiprocessing issue rather than a pandas issue. I'll close this issue. Thank you all!
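For what it's worth, a DataFrame round-trips through pickle without trouble in a single process, which supports the conclusion that the failure lies in the multiprocessing machinery (module lookup in the worker's unpickler) rather than in pandas itself. A quick check, assuming pandas is importable:

```python
import pickle

import pandas as pd

# A DataFrame pickles and unpickles cleanly in one process, so the
# AttributeError in the traceback above points at how the worker
# process resolves 'pandas.core.frame' during unpickling, not at
# pandas' pickle support.
df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
restored = pickle.loads(pickle.dumps(df))
assert restored.equals(df)
```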

@freezas freezas closed this as completed Apr 25, 2018