PERF: DataFrame.sort_values(by=[x,y], inplace=True) speed improvement? #17111
Comments
we use this same type of 'trick' in computing [...]; not that in general [...]
note that [...]
Thanks @jreback for letting me know about the [...]. I'll take the issue and try to prepare the PR.
Guys, I feel lame, but here I am: I cloned the pandas repo, created my virtual env, etc., closely following the steps described in http://pandas.pydata.org/pandas-docs/stable/contributing.html, but there is no way I can run [...]; I get the following error:
I tried python==2.7 and python==3.6, but nothing. I don't like posting questions like this, but I have spent a non-trivial amount of time trying to find out why pandas does not import, without success. Any hint, please? I'd like to work on a PR for the issue mentioned above, but without the tests running, I don't know :/
Not entirely sure yet why you can't run the tests, but try this for now: open the Python interpreter in your dev environment and run:

import pandas as pd
pd.__version__  # to confirm proper installation
pd.test()
This is the result when I run [...]:
OK, my bad. It seems I didn't run [...] and ran only [...].
From the documentation I understand that these are two complementary steps, but both of them should be run. For tag 0.20.3 it works; now I'm trying upstream/master.
It seems that the issue is too dependent on the data (sortedness, cardinality). The approach could work for domain-specific tasks, but I cannot make it general enough to be part of pandas.
@mficek ok, fair enough. If you want to continue to make this work even in a limited scenario (which is detectable), pls ping / re-open.
Code Sample, a copy-pastable example if possible
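The copy-pastable sample itself is not shown here; what follows is a minimal sketch of the kind of setup the problem description below implies. The column names (user_id, timestamp), the data distribution, and the cnt variable are assumptions, not the author's exact code:

```python
import numpy as np
import pandas as pd

np.random.seed(1234)   # the seed mentioned in the description
cnt = 10**7            # number of rows; raise or lower to test other sizes

# assumed schema: an integer user id column and a timestamp-like column
df = pd.DataFrame({
    'user_id': np.random.randint(0, cnt // 100, size=cnt),
    'timestamp': np.random.randint(0, 10**9, size=cnt),
})
```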
Problem description
I have a timestamped dataset with user ids and other information. I need to process the dataset sequentially (with numba), and for this I need it sorted by user_id and then by timestamp within each user_id.
The first and obvious approach is:
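The snippet that originally followed is missing; presumably it was the single multi-column sort, roughly like this (a sketch using the column names assumed above):

```python
# sort by user_id, then by timestamp within each user_id, in one call
df.sort_values(by=['user_id', 'timestamp'], inplace=True)
```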
I'm using inplace because the dataset is HUGE (it still fits into RAM, occupying approx. 1/3 of the computer's RAM), and by this I hope memory usage won't explode much during processing.
The thing is, this direct approach is slow. I noticed that sorting first by one column and then by the other with a stable sort (mergesort) is much faster. Depending on the data, I saw up to a 4x shorter processing time; on data generated with random seed 1234 it is 3x.
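The two-step variant described above would look roughly like this (a sketch, not the author's exact code): sort by the secondary key first, then run a stable sort on the primary key so rows with equal user_id keep their timestamp order:

```python
# step 1: sort by the secondary key (timestamp); any algorithm works here
df.sort_values(by='timestamp', inplace=True)

# step 2: stable sort (mergesort) by the primary key; ties keep their
# relative order, so timestamps stay sorted within each user_id
df.sort_values(by='user_id', kind='mergesort', inplace=True)
```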
I think my solution works (I verified that the dataset is non-decreasing in user_id and non-decreasing in timestamp within each user_id).
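That check could be expressed like this (again a sketch with the assumed column names):

```python
# user_id must be globally non-decreasing after the sort
assert df['user_id'].is_monotonic_increasing

# timestamp must be non-decreasing within each user_id
assert (df.groupby('user_id')['timestamp']
          .apply(lambda s: s.is_monotonic_increasing)
          .all())
```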
Am I missing something? Will this method work worse somewhere, in some situation? It behaves similarly on both small and big data (raise or lower the cnt variable).
Would you consider it an enhancement and performance speedup? (very easy to implement ;)
Output
1 loop, best of 3: 14.9 s per loop
1 loop, best of 3: 4.96 s per loop
Output of pd.show_versions():
pandas: 0.20.3
pytest: 3.1.3
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None