PERF: DateFrame.sort_values(by=[x,y], inplace=True) speed improvement? #17111

Closed
mficek opened this issue Jul 29, 2017 · 9 comments
Labels
Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@mficek

mficek commented Jul 29, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

cnt = 10000000
np.random.seed(1234)
d_orig = pd.DataFrame({'timestamp': np.random.randint(0,np.iinfo(np.int64).max, cnt), 'user_id': np.random.randint(0,np.iinfo(np.int64).max, cnt)})


d = d_orig.copy()
%timeit d.sort_values(by=['user_id', 'timestamp'], inplace=True)

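def sort_one_by_one(d, col1, col2):
    """
    Equivalent to d.sort_values(by=[col1, col2], inplace=True), but faster.
    """
    # Sort by the secondary key first; the sort algorithm does not matter here.
    d.sort_values(by=[col2], inplace=True)
    # A stable sort on the primary key keeps the col2 order within equal col1 values.
    d.sort_values(by=[col1], kind='mergesort', inplace=True)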


d = d_orig.copy()
%timeit sort_one_by_one(d, 'user_id', 'timestamp')

Problem description

I have a timestamped dataset with user ids and other information. I need to process the dataset sequentially (with numba), and for this I need it sorted by user_id and then by timestamp within each user_id.

The first and obvious approach is:

data.sort_values(by=['user_id', 'timestamp'], inplace=True)

I'm using inplace because the dataset is HUGE (it fits into RAM but occupies approx. 1/3 of the computer's RAM), and I hope that this way memory use won't explode during processing.
The thing is, this direct approach is slow. I noticed that sorting first by one column and then by the other with a stable sort (mergesort) is much faster. Depending on the data, I have seen up to 4x shorter processing times; on the random seed 1234 data it is 3x.

I think my solution works (I checked that the dataset is non-decreasing in user_id and non-decreasing in timestamp within each user_id).
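The check described above can be sketched on a small frame. This is a minimal illustration, not the code from the original report: the helper `is_sorted_by` and the small random frame are assumptions made for the example, and `sort_one_by_one` is rewritten here to return a new frame instead of sorting in place.

```python
import numpy as np
import pandas as pd


def sort_one_by_one(df, col1, col2):
    """Two-pass sort: secondary key first, then a stable sort on the primary key."""
    out = df.sort_values(by=[col2])
    return out.sort_values(by=[col1], kind="mergesort")


def is_sorted_by(df, col1, col2):
    """True if df is non-decreasing in col1 and, within equal col1, in col2."""
    a = df[col1].values
    b = df[col2].values
    primary_ok = np.all(a[1:] >= a[:-1])
    # Only compare col2 between neighbours that share the same col1 value.
    same_key = a[1:] == a[:-1]
    secondary_ok = np.all(b[1:][same_key] >= b[:-1][same_key])
    return bool(primary_ok and secondary_ok)


# Small hypothetical data set with many repeated keys.
rng = np.random.RandomState(1234)
small = pd.DataFrame({
    "timestamp": rng.randint(0, 50, 1000),
    "user_id": rng.randint(0, 20, 1000),
})

two_pass = sort_one_by_one(small, "user_id", "timestamp")
direct = small.sort_values(by=["user_id", "timestamp"])

keys = ["user_id", "timestamp"]
print(is_sorted_by(two_pass, "user_id", "timestamp"))
print(np.array_equal(two_pass[keys].values, direct[keys].values))
```

The comparison looks only at the key columns. Because the first pass uses the default (non-stable) sort, rows that tie on both keys may end up in a different relative order than in the direct sort.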

Am I missing something? Will this method work worse somewhere or in some situation? It behaves similarly on both small and big data (raise or lower the cnt variable).
Would you consider it an enhancement and a performance speedup? (very easy to implement ;)

Output

1 loop, best of 3: 14.9 s per loop
1 loop, best of 3: 4.96 s per loop

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.6.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 3.1.3
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Jul 29, 2017

we use this same type of 'trick' in computing is_monotonic on a MultiIndex. so yes would take this (IOW, you make the change and if it passes the test suite, then would be good).

note that in general inplace=True doesn't do anything for perf or memory usage (it copies under the hood and just re-assigns the pointer), except in 1 instance IIRC.
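As a small illustration of the point about inplace, the two spellings below produce the same sorted frame. This sketch does not measure memory or speed; it only shows that `inplace=True` returns `None` and leaves the same result that plain reassignment would. The frame and its size are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1234)
d_orig = pd.DataFrame({
    "timestamp": rng.randint(0, 1000, 10000),
    "user_id": rng.randint(0, 100, 10000),
})

# Spelling 1: inplace=True mutates d and returns None.
d = d_orig.copy()
ret = d.sort_values(by=["user_id", "timestamp"], inplace=True)

# Spelling 2: take the returned frame and rebind the name.
d2 = d_orig.sort_values(by=["user_id", "timestamp"])

print(ret is None)
print(d.equals(d2))
```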

@jreback jreback changed the title DateFrame.sort_values(by=[x,y], inplace=True) speed improvement? PERF: DateFrame.sort_values(by=[x,y], inplace=True) speed improvement? Jul 29, 2017
@jreback
Contributor

jreback commented Jul 29, 2017

note that inplace doesn't matter here per se, it's really a multi-column sort that would benefit from this.
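For a general multi-column sort, the same idea can be sketched as one stable pass per key, applied from the least significant key to the most significant one. This is only an illustration of the technique, not what pandas does internally; `sort_by_passes` is a hypothetical helper, and `np.lexsort` is used purely as a cross-check.

```python
import numpy as np
import pandas as pd


def sort_by_passes(df, by):
    """Sort by several columns using one stable single-column sort per key."""
    out = df
    # Least significant key first; each stable pass keeps the earlier order on ties.
    for col in reversed(by):
        out = out.sort_values(by=[col], kind="mergesort")
    return out


rng = np.random.RandomState(0)
frame = pd.DataFrame({
    "a": rng.randint(0, 5, 500),
    "b": rng.randint(0, 5, 500),
    "c": rng.randint(0, 5, 500),
})
by = ["a", "b", "c"]

passes = sort_by_passes(frame, by)

# np.lexsort treats the LAST key in the sequence as the primary key.
order = np.lexsort([frame[col].values for col in reversed(by)])
lexsorted = frame.iloc[order]

print(np.array_equal(passes[by].values, lexsorted[by].values))
```

Whether N single-column passes beat one combined sort depends on the data, so this shape is worth timing on real inputs before relying on it.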

@jreback jreback added Difficulty Intermediate Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jul 29, 2017
@jreback jreback added this to the 0.21.0 milestone Jul 29, 2017
@mficek
Author

mficek commented Jul 29, 2017

Thanks @jreback for letting me know about the inplace behavior. I had read about it somewhere already, but for some reason I was hoping that for sorting it really could operate on the original array :)

I'll take the issue and try to prepare a PR.

@mficek
Author

mficek commented Jul 31, 2017

Guys, I feel lame, but here I am: I cloned the pandas repo, created my virtual env, etc., closely following the steps described in http://pandas.pydata.org/pandas-docs/stable/contributing.html, but there is no way I can run pytest pandas or any of the testing scripts.

I get the following error:

Traceback (most recent call last):
  File "/home/mficek/anaconda2/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py", line 336, in _getconftestmodules
    return self._path2confmods[path]
KeyError: local('/home/mficek/repos/pandas-mficek/pandas')

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/mficek/anaconda2/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py", line 367, in _importconftest
    return self._conftestpath2mod[conftestpath]
KeyError: local('/home/mficek/repos/pandas-mficek/conftest.py')

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/mficek/anaconda2/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py", line 373, in _importconftest
    mod = conftestpath.pyimport()
  File "/home/mficek/anaconda2/envs/pandas_dev/lib/python3.6/site-packages/py/_path/local.py", line 662, in pyimport
    __import__(modname)
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
  File "/home/mficek/anaconda2/envs/pandas_dev/lib/python3.6/site-packages/_pytest/assertion/rewrite.py", line 211, in load_module
    py.builtin.exec_(co, mod.__dict__)
  File "/home/mficek/repos/pandas-mficek/conftest.py", line 4, in <module>
    import pandas
  File "/home/mficek/repos/pandas-mficek/pandas/__init__.py", line 42, in <module>
    from pandas.core.api import *
  File "/home/mficek/repos/pandas-mficek/pandas/core/api.py", line 10, in <module>
    from pandas.core.groupby import Grouper
  File "/home/mficek/repos/pandas-mficek/pandas/core/groupby.py", line 46, in <module>
    from pandas.core.index import (Index, MultiIndex,
  File "/home/mficek/repos/pandas-mficek/pandas/core/index.py", line 2, in <module>
    from pandas.core.indexes.api import *
  File "/home/mficek/repos/pandas-mficek/pandas/core/indexes/api.py", line 1, in <module>
    from pandas.core.indexes.base import (Index, _new_Index,  # noqa
  File "/home/mficek/repos/pandas-mficek/pandas/core/indexes/base.py", line 4006, in <module>
    Index._add_numeric_methods_disabled()
NameError: name 'Index' is not defined
ERROR: could not load /home/mficek/repos/pandas-mficek/conftest.py
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mficek/repos/pandas-mficek/pandas/__init__.py", line 42, in <module>
    from pandas.core.api import *
  File "/home/mficek/repos/pandas-mficek/pandas/core/api.py", line 10, in <module>
    from pandas.core.groupby import Grouper
  File "/home/mficek/repos/pandas-mficek/pandas/core/groupby.py", line 46, in <module>
    from pandas.core.index import (Index, MultiIndex,
  File "/home/mficek/repos/pandas-mficek/pandas/core/index.py", line 2, in <module>
    from pandas.core.indexes.api import *
  File "/home/mficek/repos/pandas-mficek/pandas/core/indexes/api.py", line 1, in <module>
    from pandas.core.indexes.base import (Index, _new_Index,  # noqa
  File "/home/mficek/repos/pandas-mficek/pandas/core/indexes/base.py", line 4006, in <module>
    Index._add_numeric_methods_disabled()
NameError: name 'Index' is not defined

I tried python==2.7 and python==3.6, but nothing worked. I don't like posting questions like this, but I spent a non-trivial amount of time trying to find out why pandas does not import, without success. Any hints, please?

I'd like to work on a PR for the issue mentioned above, but without tests running, I don't know :/

@gfyoung
Member

gfyoung commented Jul 31, 2017

Not entirely sure yet why you can't run tests, but try this for now: open your Python interpreter (in your dev environment) and run this:

import pandas as pd
pd.__version__ # to confirm proper installation
pd.test()

@mficek
Author

mficek commented Jul 31, 2017

This is the result when I run

import pandas as pd
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mficek/repos/pandas-mficek/pandas/__init__.py", line 42, in <module>
    from pandas.core.api import *
  File "/home/mficek/repos/pandas-mficek/pandas/core/api.py", line 10, in <module>
    from pandas.core.groupby import Grouper
  File "/home/mficek/repos/pandas-mficek/pandas/core/groupby.py", line 46, in <module>
    from pandas.core.index import (Index, MultiIndex,
  File "/home/mficek/repos/pandas-mficek/pandas/core/index.py", line 2, in <module>
    from pandas.core.indexes.api import *
  File "/home/mficek/repos/pandas-mficek/pandas/core/indexes/api.py", line 1, in <module>
    from pandas.core.indexes.base import (Index, _new_Index,  # noqa
  File "/home/mficek/repos/pandas-mficek/pandas/core/indexes/base.py", line 4006, in <module>
    Index._add_numeric_methods_disabled()
NameError: name 'Index' is not defined

@mficek
Author

mficek commented Jul 31, 2017

Ok, my bad. It seems I didn't run

python setup.py build_ext --inplace

and ran only

python setup.py develop

From the documentation I now understand that the two methods are complementary and both of them should be run. For tag 0.20.3 it works; now I'm trying upstream/master.

@mficek
Author

mficek commented Mar 25, 2018

It seems that the speedup depends too much on the data (sortedness, cardinality). It could work for domain-specific tasks, but I cannot make it general enough to be part of pandas.
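One way to see this data dependence is to time both approaches while varying the number of distinct user_id values. The harness below is a rough sketch under made-up sizes and cardinalities; `time_call` and `compare` are hypothetical helpers, and the timings will differ by machine and pandas version, so no expected numbers are given.

```python
import time

import numpy as np
import pandas as pd


def time_call(fn, repeat=3):
    """Return the best wall-clock time of fn() over a few repeats."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def compare(n_rows, n_users, seed=1234):
    """Time the direct two-column sort against the two-pass stable sort."""
    rng = np.random.RandomState(seed)
    df = pd.DataFrame({
        "user_id": rng.randint(0, n_users, n_rows),
        "timestamp": rng.randint(0, 10**9, n_rows),
    })
    direct = time_call(lambda: df.sort_values(by=["user_id", "timestamp"]))
    two_pass = time_call(
        lambda: df.sort_values(by=["timestamp"])
                  .sort_values(by=["user_id"], kind="mergesort")
    )
    return direct, two_pass


# Low, medium and high cardinality of user_id for the same number of rows.
timings = {n_users: compare(200000, n_users) for n_users in (10, 1000, 200000)}
for n_users, (direct, two_pass) in timings.items():
    print("users=%d direct=%.4fs two_pass=%.4fs" % (n_users, direct, two_pass))
```

Neither call sorts in place, so every repeat starts from the same unsorted frame. Pre-sorted or partially sorted inputs could be added as further cases to probe the sortedness effect.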

@mficek mficek closed this as completed Mar 25, 2018
@jreback
Contributor

jreback commented Mar 25, 2018

@mficek ok fair enough. If you want to continue to make this work even in a limited scenario (which is detectable), pls ping / re-open.
