EHN: consider multiple skiplist classes for different datatypes #12141
Why don't we adapt the algorithms from
Yeah, that's what I was thinking of. This one and only commit comment doesn't look that encouraging though :) https://github.com/kwgoodman/bottleneck/commits/5a9ecb818b63ffcb8d6e787e51a236a109302c6a/bottleneck/src/auto_pyx/csrc/move_median.c
wow
@kawochen can you post any benchmarks? As an aside, we still need to keep these routines as
When benchmarking, please remember to benchmark across different window sizes and different axes. When testing median implementations, I came across cases where running times differed a lot depending on window_size because of the algorithms being used. E.g. if window sizes are small, maintaining multiple heaps hurts performance compared to running quickselect multiple times, whereas with big window sizes the asymptotic (big-O) cost starts to matter a lot and multi-heap implementations start to win. Also, with bigger matrices, iterating along different axes can make cache effects matter a bit too. So when you benchmark, please be careful about these effects.
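To make that concrete, here is a rough benchmarking sketch of my own (the array shape and window sizes are illustrative, not from the thread, and the `axis` argument to `rolling` has since been deprecated in newer pandas, so treat this as a sketch rather than a recipe):

```python
import numpy as np
import pandas as pd
from timeit import default_timer as timer

data = pd.DataFrame(np.random.randn(10_000, 1_000))

for window in (5, 100, 1_000):   # small vs. large windows favour different algorithms
    for axis in (0, 1):          # rolling along rows vs. columns exposes cache effects
        start = timer()
        # note: axis= is deprecated in recent pandas; shown here to match the advice above
        data.rolling(window, axis=axis).median()
        print(f"window={window} axis={axis}: {timer() - start:.3f}s")
```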
Will work on that :)
What is the pandas policy regarding this? E.g. numexpr evals automatically use multiple cores (unless explicitly told to use only one core), but bottleneck uses 1 core by default. If it is OK to use multiple threads, should I just create a
@jreback @wesm any feedback on the pandas policy? I would like to avoid doing unnecessary work if possible; in this case, I am referring to writing a threaded implementation and then having the PR in pandas get rejected.

I have upgraded the move_median algorithm in bottleneck to accept NaNs, and I benchmarked it in different scenarios; I am consistently getting ~3.5 - 5.5x the performance of pandas master (except for the trivial case of window_size=1, which bn optimizes). So I am guessing that pandas' rolling median is using an algorithm with a

The double-heap implementation (contributed by someone else) in bottleneck is pretty efficient. Also, bottleneck generates code for double, float, int32, and int64 data types at compile time. In case you're interested, the move median implementation is at https://github.com/jennolsen84/bottleneck/tree/movemedian (I still have to clean up the function names a little and add more documentation).

I have tested the implementation thoroughly, so it should be good. Test cases were a bit tough to generate -- I could use advice on this. I ended up hammering the implementation with random datasets, and that helped shake out (hopefully all of?) the bugs.
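The actual tests are not in the thread, but a minimal sketch of that kind of randomized cross-check might look as follows; pandas' own rolling median stands in here for the routine under test (in practice you would swap in the implementation being developed, e.g. a bottleneck move_median build), and the slow reference is just `np.nanmedian` over each trailing window:

```python
import numpy as np
import pandas as pd

def reference_rolling_median(a, window):
    # Slow but obviously correct: nanmedian over each trailing window,
    # with min_periods=1 semantics (a value as soon as one observation is valid).
    out = np.full(a.shape, np.nan)
    for i in range(len(a)):
        win = a[max(0, i - window + 1) : i + 1]
        if np.isnan(win).all():
            continue                      # leave NaN when the window has no valid data
        out[i] = np.nanmedian(win)
    return out

rng = np.random.default_rng(0)
for trial in range(200):
    n = int(rng.integers(2, 200))
    window = int(rng.integers(1, n + 1))
    a = rng.standard_normal(n)
    a[rng.random(n) < 0.3] = np.nan       # the implementation must tolerate NaNs
    # Stand-in candidate: pandas' own rolling median.
    result = pd.Series(a).rolling(window, min_periods=1).median().to_numpy()
    np.testing.assert_allclose(result, reference_rolling_median(a, window), equal_nan=True)
```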
here is the timing data (I stopped it after a while, as it wasn't that interesting):
@jennolsen84 is it possible to handle the multithreading using pthreads in C rather than via Python threads?
@jennolsen84 so in general, to play nice, you can:
note that:
@jreback yep, I sent in the PR to numexpr for that lock :) pydata/numexpr#200

@wesm yes, I can definitely use pthreads. How many threads should be used? I was thinking of peeking at the numexpr settings, but numexpr is not required by pandas.

On a bigger-picture note, I think the best option might be to expose something that returns a list of functions, which can then be scheduled however the user pleases. This lets the user reuse an existing threadpool (one that might be partially in use, so this avoids oversaturating the CPU, etc.). At the moment it is not possible to implement this efficiently on top of pandas due to the function signatures in pandas. (The

Thoughts? I can move the discussion of the last paragraph to a separate issue, if you would like.
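As a purely hypothetical illustration of that last idea (nothing here is an existing pandas API; `make_rolling_median_tasks` and the per-column split are invented for this sketch), the caller-owned threadpool pattern could look like:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def make_rolling_median_tasks(df, window):
    """Return one zero-argument callable per column; the caller decides
    when and where (which pool, how many workers) to run them."""
    def task(col):
        return col, df[col].rolling(window).median()
    return [lambda c=c: task(c) for c in df.columns]

df = pd.DataFrame(np.random.randn(100_000, 8))
tasks = make_rolling_median_tasks(df, window=50)

# The caller owns the pool, so it can share it with other work and cap CPU use.
# Whether the tasks truly run in parallel depends on the underlying kernel
# releasing the GIL.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(t) for t in tasks]
    results = dict(f.result() for f in futures)

out = pd.DataFrame(results)
```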
bottleneck 1.1 now has a good implementation of move_median. It can handle NaNs, etc. There is talk about implementing parallel algorithms in bottleneck as well. Perhaps pandas can just use those.
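For reference, a minimal example of the bottleneck API being discussed (assuming bottleneck >= 1.1 is installed; `min_count` plays roughly the role of pandas' `min_periods`):

```python
import bottleneck as bn
import numpy as np

a = np.array([1.0, np.nan, 3.0, 4.0, 2.0, np.nan, 5.0], dtype=np.float32)

# NaN-aware rolling median; entries whose window holds fewer than min_count
# valid observations come back as NaN.
print(bn.move_median(a, window=3, min_count=1))
```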
Seems like the general feeling is that we want to move towards using bottleneck (or maybe even numba) in the future, so I guess extending the skiplist to handle more types is not really a priority. Closing.
Right now, when we calculate something like a rolling max, a float32 dataset that is passed in gets upcast to float64. This increases the memory requirements for a big dataset considerably.
This is because the skiplist implementation only works with doubles. Would you be opposed to my adding instantiations for int32, int64, float32, uint64, and uint32?
That way we can avoid copies of big datasets, which cost a lot of memory and time.
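A minimal illustration of the upcast described above (behavior as of the pandas versions under discussion; newer releases may handle dtypes differently):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(1_000_000).astype("float32"))

print(s.dtype)                     # float32
print(s.rolling(100).max().dtype)  # float64 -- the window kernels operate on doubles,
                                   # so the float32 input is copied and upcast first
```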
cc @kawochen
xref #3007