PERF: potential room for optimizing aggregation (like sum, count, and mean) on df.groupby #56728

Closed
2 of 3 tasks
Alexia-I opened this issue Jan 4, 2024 · 3 comments
Labels: Closing Candidate (may be closeable, needs more eyeballs), Groupby, Needs Triage (issue that has not been reviewed by a pandas team member), Performance (memory or execution speed performance)

Comments


Alexia-I commented Jan 4, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Hello, I am writing to suggest a potential improvement in the memory efficiency of Pandas, specifically when performing aggregation operations like sum, count, and mean following a groupby operation.

Currently, after performing such operations, the resulting DataFrame or Series often retains a materialized integer Index (int64 dtype), even when the labels form a contiguous integer sequence. This seems suboptimal, especially in cases where a RangeIndex could represent the same labels with far less memory.

import pandas as pd
import sys
lista = list(range(100000))
# create a DataFrame
df = pd.DataFrame({'a': lista, 'b': range(100000, 200000), 'c': range(200000, 300000)})

# use aggregation func like size after groupby
df_grouped = df.groupby('a').size()
print(df_grouped.index)

# compare the memory footprint of the int64 Index vs a RangeIndex
original_size = sys.getsizeof(df_grouped)
df_grouped.index = pd.RangeIndex(start=df_grouped.index.min(), stop=df_grouped.index.max() + 1)
print(df_grouped.index)
new_size = sys.getsizeof(df_grouped)

print("original memory size:", original_size)
print("memory size after :", new_size)

printed result:

Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,     9,
       ...
       99990, 99991, 99992, 99993, 99994, 99995, 99996, 99997, 99998, 99999],
      dtype='int64', name='a', length=100000)
RangeIndex(start=0, stop=100000, step=1)
original memory size: 1600016
memory size after : 800144

In the above example, changing the index to RangeIndex after a groupby operation significantly reduces the memory cost.
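For what it's worth, `Index.memory_usage` gives a more direct reading than `sys.getsizeof`, since it reports only the bytes held by the index buffer itself; a minimal sketch (the column names are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': range(100000), 'b': range(100000, 200000)})
sized = df.groupby('a').size()

# A materialized int64 Index stores 8 bytes per label...
int_index_bytes = sized.index.memory_usage()

# ...while a RangeIndex stores only start/stop/step, regardless of length.
range_index_bytes = pd.RangeIndex(len(sized)).memory_usage()

print(int_index_bytes, range_index_bytes)
```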

Installed Versions

INSTALLED VERSIONS

commit : a671b5a
python : 3.9.18.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-88-generic
Version : #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.4
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.18.1
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None

Prior Performance

No response

@Alexia-I added the Needs Triage and Performance labels Jan 4, 2024
@rhshadrach
Member

rhshadrach commented Jan 4, 2024

Thanks for the suggestion. We would need to check whether the groups do indeed form a contiguous sequence. In addition, it would mean the type of index in the result is dependent on the values being grouped, which can make it hard to predict for users. For these reasons, I do not think we should be swapping out the index.
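A small illustration of the value-dependence described above: whether the result's labels happen to form a contiguous range depends entirely on the grouped values, so the result's index type would too. A minimal sketch (column names are arbitrary):

```python
import pandas as pd

# Contiguous keys: the result's labels happen to look like a range.
contiguous = pd.DataFrame({'a': [0, 1, 2, 2], 'b': 1}).groupby('a').size()
print(contiguous.index.tolist())   # [0, 1, 2]

# Non-contiguous keys: the same operation, but the labels cannot be a RangeIndex.
gappy = pd.DataFrame({'a': [0, 2, 4, 4], 'b': 1}).groupby('a').size()
print(gappy.index.tolist())        # [0, 2, 4]
```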

@rhshadrach added the Groupby and Closing Candidate labels Jan 4, 2024
@Alexia-I
Author

Alexia-I commented Jan 4, 2024

@rhshadrach Thanks for the quick response and for considering the suggestion. I believe the proposed optimization would be particularly useful when users group by an integer column whose values form a contiguous sequence: a materialized integer index consumes far more memory than a RangeIndex, and this is a common scenario in large datasets, where the choice of index type can significantly affect memory usage.
Besides, most users interact with the DataFrame's contents rather than its index type, and most operations do not distinguish between an int64 Index and a RangeIndex, so the change would barely affect users. I also anticipate that implementing it would be relatively straightforward, so this seems like an easy and useful fix.

@phofl
Member

phofl commented Jan 4, 2024

I agree with @rhshadrach: we don't want value-dependent behaviour if we can avoid it, and the improvement isn't worth the hassle in this case. You can cast the Index yourself if necessary.
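For anyone landing here, casting the index yourself is straightforward when you know the labels are contiguous. A hedged sketch of a guarded version that only swaps when the swap is safe (`to_range_index` is a hypothetical helper, not a pandas API):

```python
import pandas as pd

def to_range_index(obj):
    """Replace the index with a RangeIndex if the existing integer
    labels already form a contiguous step-1 sequence.

    Hypothetical helper for illustration, not part of the pandas API.
    """
    idx = obj.index
    if len(idx) and pd.api.types.is_integer_dtype(idx):
        start = int(idx[0])
        expected = pd.RangeIndex(start=start, stop=start + len(idx))
        if idx.equals(expected):
            obj = obj.copy()
            obj.index = expected
    return obj

sized = pd.DataFrame({'a': range(1000), 'b': 1}).groupby('a').size()
sized = to_range_index(sized)
print(type(sized.index))
```

This keeps the value-dependent decision in user code, where the caller knows whether contiguous labels are expected, rather than baking it into groupby itself.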

@phofl phofl closed this as completed Jan 4, 2024