PERF: DateFrame.sort_values(by=[x,y], inplace=True) speed improvement? #17111

mficek · 2017-07-29T16:12:10Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

cnt = 10000000
np.random.seed(1234)
d_orig = pd.DataFrame({
    'timestamp': np.random.randint(0, np.iinfo(np.int64).max, cnt),
    'user_id': np.random.randint(0, np.iinfo(np.int64).max, cnt),
})


d = d_orig.copy()
%timeit d.sort_values(by=['user_id', 'timestamp'], inplace=True)

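def sort_one_by_one(d, col1, col2):
    """
    Equivalent to d.sort_values(by=[col1, col2]), but faster.
    """
    # Sort by the secondary key first; any sort algorithm works for this pass.
    d.sort_values(by=[col2], inplace=True)
    # A stable sort on the primary key keeps the col2 order within equal col1 values.
    d.sort_values(by=[col1], kind='mergesort', inplace=True)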


d = d_orig.copy()
%timeit sort_one_by_one(d, 'user_id', 'timestamp')

Problem description

I have a timestamped dataset with user ids and other information. I need to process the dataset sequentially (with numba), and for this I need it sorted by user_id and then by timestamp within each user_id.

The first and obvious approach is:

data.sort_values(by=['user_id', 'timestamp'], inplace=True)

I'm using inplace because the dataset is HUGE (it fits into RAM but occupies approx. 1/3 of the computer's RAM), and I hope that this way memory usage won't explode much during processing.
The thing is, this direct approach is slow. I noticed that sorting first by one column and then by the other (with a stable sort = mergesort) is much faster. Depending on the data used, I saw processing times up to 4x shorter, but on the random seed 1234 data it is 3x.

I think my solution works (I checked that the dataset is non-decreasing in user_id and non-decreasing in timestamp within each user_id).
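
The check described above could be sketched roughly as follows. This is a minimal illustration, not the code actually used for the issue: the small key ranges, the row count, and the non-inplace variant of sort_one_by_one are assumptions chosen so that both columns contain many ties.

```python
import numpy as np
import pandas as pd


def sort_one_by_one(df, col1, col2):
    # Sort by the secondary key first; any algorithm works for this pass.
    out = df.sort_values(by=[col2])
    # A stable sort on the primary key keeps the col2 order within ties.
    return out.sort_values(by=[col1], kind="mergesort")


rng = np.random.RandomState(1234)
n = 100000
# Small key ranges (an assumption for this check) force many ties.
df = pd.DataFrame({
    "user_id": rng.randint(0, 1000, n),
    "timestamp": rng.randint(0, 10000, n),
})

direct = df.sort_values(by=["user_id", "timestamp"])
two_pass = sort_one_by_one(df, "user_id", "timestamp")

# Compare key values only: rows with identical (user_id, timestamp)
# pairs may carry different original index labels in the two results.
same_keys = np.array_equal(direct.values, two_pass.values)

# The monotonicity check from the description.
user_sorted = two_pass["user_id"].is_monotonic_increasing
ts_sorted = (
    two_pass.groupby("user_id")["timestamp"]
    .apply(lambda s: s.is_monotonic_increasing)
    .all()
)
print(same_keys, user_sorted, ts_sorted)
```

The comparison is done on values rather than with DataFrame.equals because the two approaches are only guaranteed to agree on the sorted keys, not on how fully duplicated rows are ordered among themselves.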

Am I missing something? Will this method perform worse somewhere or in some situation? On both small and big data (raise or lower the cnt variable) it behaves similarly.
Would you consider it an enhancement and a performance speedup? (very easy to implement ;)

Output

1 loop, best of 3: 14.9 s per loop
1 loop, best of 3: 4.96 s per loop

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.6.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 3.1.3
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None


jreback · 2017-07-29T16:38:51Z

we use this same type of 'trick' in computing is_monotonic on a MultiIndex. so yes would take this (IOW, you make the change and if it passes the test suite, then would be good).

note that in general inplace=True doesn't do anything for perf or memory usage (it copies under the hood and just re-assigns the pointer), except in 1 instance IIRC.

jreback · 2017-07-29T16:39:44Z

note that inplace doesn't matter here per se, it's really a multi-column sort that would benefit from this.
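
As a small illustration of the two calling styles being compared here (a sketch on a hypothetical four-row frame, which says nothing about memory behaviour by itself): with inplace=True the call returns None and the sorted rows end up in the original variable, while the reassignment form yields the same rows.

```python
import pandas as pd

# Hypothetical toy data, only to show the two calling styles.
df = pd.DataFrame({"user_id": [2, 1, 2, 1], "timestamp": [4, 3, 1, 2]})

# inplace=True returns None; the sorted data is attached to `a` itself.
a = df.copy()
result = a.sort_values(by=["user_id", "timestamp"], inplace=True)

# The reassignment form produces the same rows in the same order.
b = df.copy()
b = b.sort_values(by=["user_id", "timestamp"])

identical = a.equals(b)
print(result, identical)
```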

mficek · 2017-07-29T17:49:34Z

Thanks @jreback for letting me know about the inplace behavior. I read it somewhere already, but for some reason was hoping that for sorting it really could operate on the original array :)

I'll take the issue and try to prepare a PR.

mficek · 2017-07-31T15:26:18Z

Guys, I feel lame, but here I am: I cloned the pandas repo, created my virtual env, etc., closely following the steps described in http://pandas.pydata.org/pandas-docs/stable/contributing.html but there is no way I can run pytest pandas or any of the testing scripts.

I get the following error:

Traceback (most recent call last):
  File "/home/mficek/anaconda2/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py", line 336, in _getconftestmodules
    return self._path2confmods[path]
KeyError: local('/home/mficek/repos/pandas-mficek/pandas')

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/mficek/anaconda2/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py", line 367, in _importconftest
    return self._conftestpath2mod[conftestpath]
KeyError: local('/home/mficek/repos/pandas-mficek/conftest.py')

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/mficek/anaconda2/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py", line 373, in _importconftest
    mod = conftestpath.pyimport()
  File "/home/mficek/anaconda2/envs/pandas_dev/lib/python3.6/site-packages/py/_path/local.py", line 662, in pyimport
    __import__(modname)
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
  File "/home/mficek/anaconda2/envs/pandas_dev/lib/python3.6/site-packages/_pytest/assertion/rewrite.py", line 211, in load_module
    py.builtin.exec_(co, mod.__dict__)
  File "/home/mficek/repos/pandas-mficek/conftest.py", line 4, in <module>
    import pandas
  File "/home/mficek/repos/pandas-mficek/pandas/__init__.py", line 42, in <module>
    from pandas.core.api import *
  File "/home/mficek/repos/pandas-mficek/pandas/core/api.py", line 10, in <module>
    from pandas.core.groupby import Grouper
  File "/home/mficek/repos/pandas-mficek/pandas/core/groupby.py", line 46, in <module>
    from pandas.core.index import (Index, MultiIndex,
  File "/home/mficek/repos/pandas-mficek/pandas/core/index.py", line 2, in <module>
    from pandas.core.indexes.api import *
  File "/home/mficek/repos/pandas-mficek/pandas/core/indexes/api.py", line 1, in <module>
    from pandas.core.indexes.base import (Index, _new_Index,  # noqa
  File "/home/mficek/repos/pandas-mficek/pandas/core/indexes/base.py", line 4006, in <module>
    Index._add_numeric_methods_disabled()
NameError: name 'Index' is not defined
ERROR: could not load /home/mficek/repos/pandas-mficek/conftest.py

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mficek/repos/pandas-mficek/pandas/__init__.py", line 42, in <module>
    from pandas.core.api import *
  File "/home/mficek/repos/pandas-mficek/pandas/core/api.py", line 10, in <module>
    from pandas.core.groupby import Grouper
  File "/home/mficek/repos/pandas-mficek/pandas/core/groupby.py", line 46, in <module>
    from pandas.core.index import (Index, MultiIndex,
  File "/home/mficek/repos/pandas-mficek/pandas/core/index.py", line 2, in <module>
    from pandas.core.indexes.api import *
  File "/home/mficek/repos/pandas-mficek/pandas/core/indexes/api.py", line 1, in <module>
    from pandas.core.indexes.base import (Index, _new_Index,  # noqa
  File "/home/mficek/repos/pandas-mficek/pandas/core/indexes/base.py", line 4006, in <module>
    Index._add_numeric_methods_disabled()
NameError: name 'Index' is not defined

I tried python==2.7 and python==3.6, but no luck. I don't like posting questions like this, but I spent a non-trivial amount of time trying to find out why pandas does not import, without success. Any hints, please?

I'd like to work on a PR for the issue mentioned above, but without tests running, I don't know :/

gfyoung · 2017-07-31T15:32:29Z

Not entirely sure yet why you can't run the tests, but try this for now: open your Python interpreter (for the dev environment), and run this:

import pandas as pd
pd.__version__ # to confirm proper installation
pd.test()

mficek · 2017-07-31T15:42:40Z

This is the result when I run

import pandas as pd

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mficek/repos/pandas-mficek/pandas/__init__.py", line 42, in <module>
    from pandas.core.api import *
  File "/home/mficek/repos/pandas-mficek/pandas/core/api.py", line 10, in <module>
    from pandas.core.groupby import Grouper
  File "/home/mficek/repos/pandas-mficek/pandas/core/groupby.py", line 46, in <module>
    from pandas.core.index import (Index, MultiIndex,
  File "/home/mficek/repos/pandas-mficek/pandas/core/index.py", line 2, in <module>
    from pandas.core.indexes.api import *
  File "/home/mficek/repos/pandas-mficek/pandas/core/indexes/api.py", line 1, in <module>
    from pandas.core.indexes.base import (Index, _new_Index,  # noqa
  File "/home/mficek/repos/pandas-mficek/pandas/core/indexes/base.py", line 4006, in <module>
    Index._add_numeric_methods_disabled()
NameError: name 'Index' is not defined

mficek · 2017-07-31T15:51:15Z

Ok, my bad. It seems I didn't run

python setup.py build_ext --inplace

and ran only

python setup.py develop

From the documentation I understood that these are two complementary methods, but both of them should be run. For tag 0.20.3 it works, now I'm trying upstream/master.

mficek · 2018-03-25T11:39:57Z

It seems that the issue is too dependent on the data (sortedness, cardinality). It could work for domain-specific tasks, but I cannot make it general enough to be a part of pandas.
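
One way to probe this data dependence is a small timing harness along the following lines. It is only a sketch: the helper name time_sorts, the row count, and the chosen user_id cardinalities are assumptions, and no particular outcome is implied. Non-inplace sorts are used so that every repeat starts from the same unsorted frame.

```python
import time

import numpy as np
import pandas as pd


def time_sorts(df, repeats=3):
    """Return best-of-`repeats` seconds for (direct, two-pass) sorting."""
    def best(fn):
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            times.append(time.perf_counter() - start)
        return min(times)

    direct = best(lambda: df.sort_values(by=["user_id", "timestamp"]))
    two_pass = best(
        lambda: df.sort_values(by=["timestamp"])
        .sort_values(by=["user_id"], kind="mergesort")
    )
    return direct, two_pass


rng = np.random.RandomState(1234)
n = 200000
results = {}
# Hypothetical cardinalities: few users, many users, nearly unique ids.
for n_users in (10, 10000, n):
    df = pd.DataFrame({
        "user_id": rng.randint(0, n_users, n),
        "timestamp": rng.randint(0, np.iinfo(np.int64).max, n,
                                 dtype=np.int64),
    })
    results[n_users] = time_sorts(df)
    print(n_users, results[n_users])
```

Pre-sorting one of the columns before calling time_sorts would extend the same harness to the sortedness dimension.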

jreback · 2018-03-25T13:50:17Z

@mficek ok fair enough. If you want to continue to make this work even in a limited scenario (which is detectable), pls ping / re-open.

jreback changed the title ~~DateFrame.sort_values(by=[x,y], inplace=True) speed improvement?~~ PERF: DateFrame.sort_values(by=[x,y], inplace=True) speed improvement? Jul 29, 2017

jreback added the Difficulty Intermediate, Performance (memory or execution speed performance), and Reshaping (concat, merge/join, stack/unstack, explode) labels Jul 29, 2017

jreback added this to the 0.21.0 milestone Jul 29, 2017

This was referenced Aug 1, 2017

BUG: NaT in Timestamp ignored by sort_values with na_position='last' #17138

Closed

PERF: multi-column sort_values speedup #17141

Closed

jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017

mficek closed this as completed Mar 25, 2018

nils-braun mentioned this issue Jul 30, 2021

Consider using default sorting algorithm with partition.sort_values function dask-contrib/dask-sql#204

Closed
