Closed
Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): OSX 11.6
- Modin version (modin.__version__): 0.11.3
- Python version: 3.8.11
- Code we can use to reproduce:
The following code gives different results between pandas and Modin:
import pandas as pd
# import ray
# ray.init()
# import modin.pandas as pd
import uuid
df = pd.DataFrame(
data=[dict(id=str(uuid.uuid4())) for _ in range(10000)]
)
df.set_index('id', inplace=True, drop=True)
df['status'] = pd.Series([""]*len(df), dtype="category", index=df.index)
df.status = df.status.cat.add_categories('occupied')
df.loc[list(df.index)[:1000], 'status'] = 'occupied'
print(df)
I use df.status = df.status.cat.add_categories(...) rather than the inplace param, as that is being deprecated in pandas.
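For reference, here is a minimal sketch of that non-inplace pattern in plain pandas, on a small stand-in Series rather than the 10,000-row frame above. The variable names `status` and `updated` are only illustrative. It shows that `add_categories` returns a new object, and that assigning the new category then works in pandas:

```python
import pandas as pd

# Small stand-in for the 'status' column from the report.
status = pd.Series([""] * 5, dtype="category")

# add_categories returns a new Series; the original keeps its categories.
updated = status.cat.add_categories("occupied")
print(list(status.cat.categories))
print(list(updated.cat.categories))

# Assigning the newly added category to the first two rows.
updated.iloc[:2] = "occupied"
print(updated.tolist())
```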
Replace the pandas import with the Modin imports (the commented-out lines above) and you get the following error:
UserWarning: Distributing <class 'list'> object. This may take some time.
UserWarning: Distributing <class 'list'> object. This may take some time.
UserWarning: `Series.<lambda>` defaulting to pandas implementation.
To request implementation, send an email to [email protected].
FutureWarning: The `inplace` parameter in pandas.Categorical.add_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
Traceback (most recent call last):
File "test.py", line 15, in <module>
print(df)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/pandas/base.py", line 3054, in __str__
return repr(self)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/pandas/dataframe.py", line 210, in __repr__
result = repr(self._build_repr_df(num_rows, num_cols))
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/pandas/base.py", line 197, in _build_repr_df
return self.iloc[indexer]._query_compiler.to_pandas()
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 255, in to_pandas
return self._modin_frame.to_pandas()
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 2254, in to_pandas
df = self._partition_mgr_cls.to_pandas(self._partitions)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in to_pandas
retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in <listcomp>
retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in <listcomp>
retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py", line 193, in to_pandas
dataframe = self.get()
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py", line 75, in get
return ray.get(self.oid)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/ray/worker.py", line 1625, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::apply_list_of_funcs() (pid=38681, ip=127.0.0.1)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 3080, in iloc_mut
partition.iloc[row_internal_indices, col_internal_indices] = item
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/indexing.py", line 723, in __setitem__
iloc._setitem_with_indexer(indexer, value, self.name)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1732, in _setitem_with_indexer
self._setitem_single_block(indexer, value, name)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1968, in _setitem_single_block
self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 355, in setitem
return self.apply("setitem", indexer=indexer, value=value)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 327, in apply
applied = getattr(b, f)(**kwargs)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 1491, in setitem
self.values[indexer] = value
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/arrays/_mixins.py", line 182, in __setitem__
value = self._validate_setitem_value(value)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/arrays/categorical.py", line 2044, in _validate_setitem_value
raise ValueError(
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
During handling of the above exception, another exception occurred:
ray::apply_list_of_funcs() (pid=38681, ip=127.0.0.1)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py", line 484, in apply_list_of_funcs
partition = func(partition.copy(), *args, **kwargs)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 3080, in iloc_mut
partition.iloc[row_internal_indices, col_internal_indices] = item
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/indexing.py", line 723, in __setitem__
iloc._setitem_with_indexer(indexer, value, self.name)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1732, in _setitem_with_indexer
self._setitem_single_block(indexer, value, name)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1968, in _setitem_single_block
self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 355, in setitem
return self.apply("setitem", indexer=indexer, value=value)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 327, in apply
applied = getattr(b, f)(**kwargs)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 1491, in setitem
self.values[indexer] = value
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/arrays/_mixins.py", line 182, in __setitem__
value = self._validate_setitem_value(value)
File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/arrays/categorical.py", line 2044, in _validate_setitem_value
raise ValueError(
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
Interestingly, when I run the code below and only update a single row, I don't get the error:
import ray
ray.init()
import modin.pandas as pd
import uuid
df = pd.DataFrame(
data=[dict(id=str(uuid.uuid4())) for _ in range(10000)]
)
df.set_index('id', inplace=True, drop=True)
df['status'] = pd.Series([""]*len(df), dtype="category", index=df.index)
df.status = df.status.cat.add_categories('occupied')
df.loc[list(df.index)[:1], 'status'] = 'occupied'
print(df)
Also, the error doesn't occur when the categorical value is set (df.loc... = 'occupied'), but only when the dataframe itself is printed.
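One possible reading of this, based only on the traceback above (the ValueError is raised inside apply_list_of_funcs and surfaces through ray.get during to_pandas), is that the write is queued and only executed when the frame is materialized. The toy sketch below illustrates that general pattern in plain Python. It is not Modin code: `DeferredFrame`, `set_rows`, and `materialize` are hypothetical names invented for illustration, and this is an assumption about the behaviour, not a confirmed diagnosis.

```python
class DeferredFrame:
    """Hypothetical toy, not Modin code: queues writes and runs them on read."""

    def __init__(self, values, allowed):
        self.values = list(values)
        self.allowed = set(allowed)
        self._queue = []  # pending write operations

    def set_rows(self, rows, value):
        # Only record the write; nothing is validated at this point.
        self._queue.append((rows, value))

    def materialize(self):
        # Pending writes run here, so an invalid write fails only now.
        for rows, value in self._queue:
            if value not in self.allowed:
                raise ValueError(f"{value!r} is not an allowed category")
            for i in rows:
                self.values[i] = value
        self._queue.clear()
        return self.values


frame = DeferredFrame([""] * 4, allowed={""})
frame.set_rows([0, 1], "occupied")  # no error at assignment time

try:
    frame.materialize()  # the error surfaces when the data is read
    failed_on_read = False
except ValueError:
    failed_on_read = True

print(failed_on_read)
```

In this sketch the assignment call returns cleanly and the failure appears only on the later read, which matches the symptom of the error showing up at print(df) rather than at the df.loc line.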
Appreciate any help you can provide!