Update to the BufferHandler #469
… filling for both electron and pulse
In Tutorial 4, I get the following error when trying to load the processor:
Pull Request Test Coverage Report for Build 10064447099 (Coveralls)
I tested this out now. The electron df seems to work correctly, but the computation of the normalization histograms takes a lot longer now (~33s vs. +24s on main). Also, the normalization does not look as good as before:
main:
fix-459:
It could of course be that the earlier normalization effectively only normalizes by the total number of electrons, and not by time...
Edited: Sorting the index does not improve speed when using dask. I find that the current approach is the fastest.
With respect to what you posted earlier here: I think it does not matter if the sub-indices are unsorted. For dask, there is only one main index, which should ideally be gapless and sorted, no?
LGTM now.
I was thinking that setting the index to trainId would help, since otherwise there is no monotonic index for dask to work with (it is only monotonic within a file and then restarts). But weirdly, the operations take longer with the index set. I find that rather unexpected behavior.
I think the natural dask index is per partition, as every partition is like a separate pandas df. How would you use trainId as a unique index? Anyway, what would you need a continuous index across partitions for?
Yes, that's true, but if the index exists and is sorted, dask can calculate divisions. This is supposed to avoid expensive data shuffling (see https://docs.dask.org/en/stable/dataframe-best-practices.html#use-the-index).
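To make the point about divisions concrete, here is a small pure-Python toy model. It is only a sketch of the idea, not dask's implementation: the function names and the sample index values are made up for illustration. It shows that a per-file index that restarts cannot be described by sorted boundaries, while a globally sorted index such as trainId allows a lookup to go straight to one partition.

```python
from bisect import bisect_right


def is_globally_sorted(partitions):
    """Return True if index values never decrease across all partitions."""
    flat = [value for part in partitions for value in part]
    return all(a <= b for a, b in zip(flat, flat[1:]))


def compute_divisions(partitions):
    """Boundaries in the style of dask divisions (hypothetical helper):
    the first index value of every partition plus the last value overall."""
    return [part[0] for part in partitions] + [partitions[-1][-1]]


def locate_partition(divs, value):
    """Find the partition holding `value` by binary search on the boundaries."""
    if value < divs[0] or value > divs[-1]:
        return None  # outside the indexed range
    n_partitions = len(divs) - 1
    # The last boundary is inclusive, so clamp to the final partition.
    return min(bisect_right(divs, value) - 1, n_partitions - 1)


# Assumed sample data: a per-file index restarts in every partition ...
per_file_index = [[0, 1, 2], [0, 1, 2], [0, 1]]
# ... while a trainId-like index keeps increasing across files.
train_id_index = [[100, 101, 102], [103, 104], [105, 106, 107]]

print(is_globally_sorted(per_file_index))   # no usable divisions
print(is_globally_sorted(train_id_index))   # divisions can be computed

divs = compute_divisions(train_id_index)
print(divs, locate_partition(divs, 104))
```

With known boundaries, a lookup touches a single partition instead of scanning all of them. The toy ignores the case where the same index value spans two partitions, which a non-unique trainId would have to handle.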
Flash minor changes (Merge to #469)
@zain-sohail I think this PR and #459 contain the relevant information for reverting to the old behavior.
Fixes the issue in #459 by correctly filling the pulse-resolved channels. To this end and more, this PR has the following changes:
… data_raw_dir and data_parquet_dir to raw and processed, as these config parameters already exist under paths and hence can be used by other methods in sed, such as saving of binned data.
Benchmarks (62 files and binning over X and Y):
62 raw files to buffer and binning over X and Y (using 20 cores):
Old: 37.6 s ± 839 ms
New: 8.26 s ± 223 ms
62 buffer files to binning:
Old: 2.8 s ± 44.2 ms
New: 2.64 s ± 44.9 ms
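As a quick reading aid, the mean timings above can be turned into speedup factors. The snippet below is a minimal sketch that uses only the mean values quoted in the benchmarks and ignores the reported spreads; the dictionary labels are just descriptive names chosen here.

```python
# Mean wall times in seconds (old, new), taken from the benchmarks above.
timings = {
    "raw files -> buffer -> binning": (37.6, 8.26),
    "buffer files -> binning": (2.8, 2.64),
}


def speedup(old, new):
    """Ratio of old to new runtime; values above 1 mean the new code is faster."""
    return old / new


for label, (old, new) in timings.items():
    print(f"{label}: {speedup(old, new):.2f}x")
```

The gain is concentrated in buffer creation (about 4.5x), while binning from existing buffer files is essentially unchanged (about 6% faster).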