Inconsistent behavior with Categorical.add_categories #3736

Closed
@chrisgoddard

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): OSX 11.6
  • Modin version (modin.__version__): 0.11.3
  • Python version: 3.8.11
  • Code we can use to reproduce:

The following code gives different results between pandas and modin:

import pandas as pd
# import ray
# ray.init()
# import modin.pandas as pd
import uuid

df = pd.DataFrame(
    data=[dict(id=str(uuid.uuid4())) for _ in range(10000)]
)
df.set_index('id', inplace=True, drop=True)
df['status'] = pd.Series([""]*len(df), dtype="category", index=df.index)
df.status = df.status.cat.add_categories('occupied')
df.loc[list(df.index)[:1000], 'status'] = 'occupied'

print(df)

I use df.status = df.status.cat.add_categories(...) rather than the inplace param, because inplace is being deprecated in pandas.
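
For comparison, here is a minimal sketch in plain pandas that declares both categories when the column is created, so no later add_categories call is needed. The four-row index and the name status_dtype are illustrative assumptions, and whether this avoids the modin behavior reported here is untested.

```python
import pandas as pd

# Declare every category the column will hold at creation time
# (status_dtype is an illustrative name, not from the original report).
status_dtype = pd.CategoricalDtype(categories=["", "occupied"])

df = pd.DataFrame({"id": ["a", "b", "c", "d"]}).set_index("id")
df["status"] = pd.Series([""] * len(df), dtype=status_dtype, index=df.index)

# 'occupied' is already a known category, so this assignment is valid in pandas.
df.loc[list(df.index)[:2], "status"] = "occupied"

print(df["status"].tolist())
print(list(df["status"].cat.categories))
```

In plain pandas the column keeps its categorical dtype and the first two rows become 'occupied'.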

Replace the pandas import with modin and you see the following error:

UserWarning: Distributing <class 'list'> object. This may take some time.
UserWarning: Distributing <class 'list'> object. This may take some time.
UserWarning: `Series.<lambda>` defaulting to pandas implementation.
To request implementation, send an email to [email protected].
FutureWarning: The `inplace` parameter in pandas.Categorical.add_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
Traceback (most recent call last):
  File "test.py", line 15, in <module>
    print(df)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/pandas/base.py", line 3054, in __str__
    return repr(self)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/pandas/dataframe.py", line 210, in __repr__
    result = repr(self._build_repr_df(num_rows, num_cols))
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/pandas/base.py", line 197, in _build_repr_df
    return self.iloc[indexer]._query_compiler.to_pandas()
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 255, in to_pandas
    return self._modin_frame.to_pandas()
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 2254, in to_pandas
    df = self._partition_mgr_cls.to_pandas(self._partitions)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in to_pandas
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in <listcomp>
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in <listcomp>
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py", line 193, in to_pandas
    dataframe = self.get()
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py", line 75, in get
    return ray.get(self.oid)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/ray/worker.py", line 1625, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::apply_list_of_funcs() (pid=38681, ip=127.0.0.1)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 3080, in iloc_mut
    partition.iloc[row_internal_indices, col_internal_indices] = item
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/indexing.py", line 723, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1732, in _setitem_with_indexer
    self._setitem_single_block(indexer, value, name)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1968, in _setitem_single_block
    self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 355, in setitem
    return self.apply("setitem", indexer=indexer, value=value)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 327, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 1491, in setitem
    self.values[indexer] = value
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/arrays/_mixins.py", line 182, in __setitem__
    value = self._validate_setitem_value(value)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/arrays/categorical.py", line 2044, in _validate_setitem_value
    raise ValueError(
ValueError: Cannot setitem on a Categorical with a new category, set the categories first

During handling of the above exception, another exception occurred:

ray::apply_list_of_funcs() (pid=38681, ip=127.0.0.1)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py", line 484, in apply_list_of_funcs
    partition = func(partition.copy(), *args, **kwargs)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 3080, in iloc_mut
    partition.iloc[row_internal_indices, col_internal_indices] = item
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/indexing.py", line 723, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1732, in _setitem_with_indexer
    self._setitem_single_block(indexer, value, name)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1968, in _setitem_single_block
    self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 355, in setitem
    return self.apply("setitem", indexer=indexer, value=value)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 327, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 1491, in setitem
    self.values[indexer] = value
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/arrays/_mixins.py", line 182, in __setitem__
    value = self._validate_setitem_value(value)
  File "/Users/chrisgoddard/Code/modin-test/.env/lib/python3.8/site-packages/pandas/core/arrays/categorical.py", line 2044, in _validate_setitem_value
    raise ValueError(
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
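
The final error in the traceback is raised by pandas itself: assigning a value to a Categorical is only allowed if that value is already one of that particular object's categories. The sketch below reproduces the rule on a bare pd.Categorical. It only illustrates the rule behind the message and does not show why a modin partition would be missing the category. The exception type may differ between pandas versions, so the sketch catches both ValueError and TypeError.

```python
import pandas as pd

# A Categorical whose only known category is the empty string.
cat = pd.Categorical([""] * 4)

try:
    cat[0] = "occupied"  # 'occupied' is not a category of `cat`
    rejected = False
except (ValueError, TypeError) as exc:
    rejected = True
    print(f"rejected: {exc}")

# add_categories returns a new Categorical; `cat` itself is unchanged.
widened = cat.add_categories(["occupied"])
widened[0] = "occupied"  # allowed: the category now exists on `widened`

print(list(cat.categories))
print(list(widened.categories))
print(list(widened))
```

Because add_categories returns a new object, any copy of the data that still carries the old categories will reject the assignment in the same way.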

Interestingly, when I run the code below and only update a single row, I don't get the error:

import ray
ray.init()
import modin.pandas as pd
import uuid

df = pd.DataFrame(
    data=[dict(id=str(uuid.uuid4())) for _ in range(10000)]
)
df.set_index('id', inplace=True, drop=True)
df['status'] = pd.Series([""]*len(df), dtype="category", index=df.index)
df.status = df.status.cat.add_categories('occupied')
df.loc[list(df.index)[:1], 'status'] = 'occupied'

print(df)

Also, the error doesn't occur when the categorical value is set (df.loc... = 'occupied'), but only when the dataframe itself is printed.

Appreciate any help you can provide!

Labels

  • Backport 🔙: Issues that need to be backported to previous release(s)
  • External: Pull requests and issues from people who do not regularly contribute to modin
  • bug 🦗: Something isn't working
  • pandas concordance 🐼: Functionality that does not match pandas
