Update to the BufferHandler #469


Merged — zain-sohail merged 30 commits from fix-459 into v1_feature_branch on Jul 23, 2024
Conversation

zain-sohail
Member

Fixes the issue in #459 by correctly filling the pulse-resolved channels. To this end (and more), this PR makes the following changes (a minimal sketch of the fill/split logic follows the list):

  • Separation of buffer files for the timed and electron dataframes
    • Zero values are removed from the pulse-resolved channels (TBD whether this makes sense), because bam contains invalid pulses and stores them as 0s
    • concat is used to combine dataframes instead of join, as it is faster and gives the same results
    • After the combined dataframe is obtained, it is filled on the pulse and train channels
    • For the electron dataframe, all rows that do not contain an electron are dropped
    • For the timed dataframe, the electron rows are dropped, so we get pulse-resolved data
  • A class for buffer file paths, which stores a list of dicts with the raw, electron buffer, and timed buffer filenames
  • The schema check happens on both buffer types
  • The config parameters data_raw_dir and data_parquet_dir are renamed to raw and processed; these parameters already exist under paths and can hence be used by other methods in sed, such as saving of binned data
  • Updates to metadata and tests
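
A minimal sketch of the fill/split logic described above (not the actual BufferHandler code; the fill-channel list and the electron channel name dldPosX are hypothetical placeholders):

```python
import dask.dataframe as dd

def combine_and_split(dfs, fill_channels, electron_channel="dldPosX"):
    """Combine per-file dataframes, forward fill, and split into electron/timed parts."""
    # concat instead of join: faster and gives the same result
    df = dd.concat(dfs)
    # forward fill the pulse- and train-resolved channels so every row
    # carries the machine parameters of its pulse/train
    for channel in fill_channels:
        df[channel] = df[channel].ffill()
    # electron dataframe: keep only rows that actually contain an electron
    df_electron = df.dropna(subset=[electron_channel])
    # timed dataframe: drop the electron rows to obtain pulse-resolved data
    df_timed = df[df[electron_channel].isna()]
    return df_electron, df_timed
```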

Benchmarks (62 files, binning over X and Y, using 20 cores):

Raw files to buffer plus binning:
Old: 37.6 s ± 839 ms
New: 8.26 s ± 223 ms

Buffer files to binning:
Old: 2.8 s ± 44.2 ms
New: 2.64 s ± 44.9 ms
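
For reference, a hedged sketch of how the binning step could be reproduced (the buffer layout, column names, bin counts, and ranges are assumptions, not sed's actual defaults):

```python
import numpy as np
import dask
import dask.dataframe as dd

# hypothetical buffer location and detector columns
df = dd.read_parquet("processed/buffer/electron/*.parquet")
bins, ranges = (100, 100), ((0, 2048), (0, 2048))

def hist2d(partition):
    # 2D histogram of one partition over the detector X/Y coordinates
    return np.histogram2d(
        partition["dldPosX"], partition["dldPosY"], bins=bins, range=ranges
    )[0]

# bin each partition independently, then sum the partial histograms
partial = [dask.delayed(hist2d)(p) for p in df.to_delayed()]
hist = sum(dask.compute(*partial))
```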

@rettigl
Member

rettigl commented Jul 8, 2024

In Tutorial 4, I get the following error when trying to load the processor:

	"name": "FileNotFoundError",
	"message": "No files found for run 44762 in directory ['tests/data/loader/flash']",
	"stack": "---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[15], line 1
----> 1 sp = SedProcessor(runs=[44762], config=config_override, system_config=config_file, collect_metadata=False, force_recreate=True)
      2 # You can set collect_metadata=True if the scicat_url and scicat_token are defined

File /mnt/pcshare/users/Laurenz/AreaB/sed/sed/sed/core/processor.py:164, in SedProcessor.__init__(self, metadata, config, dataframe, files, folder, runs, collect_metadata, verbose, **kwds)
    162 # Load data if provided:
    163 if dataframe is not None or files is not None or folder is not None or runs is not None:
--> 164     self.load(
    165         dataframe=dataframe,
    166         metadata=metadata,
    167         files=files,
    168         folder=folder,
    169         runs=runs,
    170         collect_metadata=collect_metadata,
    171         **kwds,
    172     )

File /mnt/pcshare/users/Laurenz/AreaB/sed/sed/sed/core/processor.py:410, in SedProcessor.load(self, dataframe, metadata, files, folder, runs, collect_metadata, **kwds)
    402         dataframe, timed_dataframe, metadata = self.loader.read_dataframe(
    403             folders=cast(str, self.cpy(folder)),
    404             runs=runs,
   (...)
    407             **kwds,
    408         )
    409     else:
--> 410         dataframe, timed_dataframe, metadata = self.loader.read_dataframe(
    411             runs=runs,
    412             metadata=metadata,
    413             collect_metadata=collect_metadata,
    414             **kwds,
    415         )
    417 elif folder is not None:
    418     dataframe, timed_dataframe, metadata = self.loader.read_dataframe(
    419         folders=cast(str, self.cpy(folder)),
    420         metadata=metadata,
    421         collect_metadata=collect_metadata,
    422         **kwds,
    423     )

File /mnt/pcshare/users/Laurenz/AreaB/sed/sed/sed/loader/flash/loader.py:303, in FlashLoader.read_dataframe(self, files, folders, runs, ftype, metadata, collect_metadata, detector, force_recreate, processed_dir, debug, **kwds)
    301 runs_ = [str(runs)] if isinstance(runs, (str, int)) else list(map(str, runs))
    302 for run in runs_:
--> 303     run_files = self.get_files_from_run_id(
    304         run_id=run,
    305         folders=self.raw_dir,
    306     )
    307     files.extend(run_files)
    308 self.runs = runs_

File /mnt/pcshare/users/Laurenz/AreaB/sed/sed/sed/loader/flash/loader.py:170, in FlashLoader.get_files_from_run_id(self, run_id, folders, extension)
    168 # Check if any files are found
    169 if not files:
--> 170     raise FileNotFoundError(
    171         f"No files found for run {run_id} in directory {str(folders)}",
    172     )
    174 # Return the list of found files
    175 return [str(file.resolve()) for file in files]

FileNotFoundError: No files found for run 44762 in directory ['tests/data/loader/flash']"

@coveralls
Collaborator

coveralls commented Jul 9, 2024

Pull Request Test Coverage Report for Build 10064447099

Details

  • 234 of 245 (95.51%) changed or added relevant lines in 13 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.01%) to 92.688%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| sed/loader/flash/buffer_handler.py | 69 | 70 | 98.57% |
| sed/loader/flash/utils.py | 18 | 19 | 94.74% |
| tests/loader/test_loaders.py | 14 | 18 | 77.78% |
| sed/loader/flash/dataframe.py | 22 | 27 | 81.48% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| sed/loader/flash/dataframe.py | 1 | 91.09% |

Totals:
Change from base Build 9987723124: -0.01%
Covered Lines: 7073
Relevant Lines: 7631

💛 - Coveralls

@rettigl
Member

rettigl left a comment

I tested this out now. The electron df seems to work correctly, but the computation of the normalization histograms takes a lot longer now (~33 s vs. +24 s on main). Also, the normalization does not look as good as before:
main: [screenshot]
fix-459: [screenshot]

It could of course be that the earlier normalization effectively only normalizes by the total number of electrons, and not by time...

@zain-sohail
Member Author

zain-sohail commented Jul 16, 2024

Edited: Index sorting does not improve speed when using dask. I find that the current approach is the fastest.

@rettigl
Member

rettigl commented Jul 18, 2024

> Edited: Index sorting does not improve speed when using dask. I find that the current approach is the fastest.

With respect to what you posted earlier here: I think it does not matter if the sub-indices are unsorted. For dask, there is only one main index, which should ideally be gapless and sorted, no?

@rettigl
Member

rettigl commented Jul 18, 2024

Indeed, I think the current approach produces the correct normalization, and before it was just normalizing to total counts and not time because of the missing entries in the timed_dataframe.
These are the normalization histograms:
old: [screenshot]
new: [screenshot]

@rettigl
Member

rettigl commented Jul 18, 2024

Histogram of timeStamp (should be flat):
Old: [screenshot]
New: [screenshot]

@rettigl
Member

rettigl left a comment

LGTM now.

@zain-sohail
Member Author

> With respect to what you posted earlier here: I think it does not matter if the sub-indices are unsorted. For dask, there is only one main index, which should ideally be gapless and sorted, no?

I was thinking setting the index to trainId would help, since otherwise there is no monotonic index for dask to work with (it is only monotonic within a file and then restarts). But weirdly, the operations take longer with an index. I find that rather unexpected behavior.

@rettigl
Member

rettigl commented Jul 18, 2024

> > With respect to what you posted earlier here: I think it does not matter if the sub-indices are unsorted. For dask, there is only one main index, which should ideally be gapless and sorted, no?
>
> I was thinking setting the index to trainId would help, since otherwise there is no monotonic index for dask to work with (it is only monotonic within a file and then restarts). But weirdly, the operations take longer with an index. I find that rather unexpected behavior.

I think the natural dask index is per partition, as every partition is like a separate pandas df. How would you use trainId as a unique index? Anyway, what would you need a continuous index across partitions for?

@zain-sohail
Member Author

zain-sohail commented Jul 18, 2024

> I think the natural dask index is per partition, as every partition is like a separate pandas df. How would you use trainId as a unique index? Anyway, what would you need a continuous index across partitions for?

Yes, that's true, but if the index exists and is sorted, dask can calculate divisions, which is supposed to avoid expensive data shuffling (see https://docs.dask.org/en/stable/dataframe-best-practices.html#use-the-index).
But I believe that, since we mostly work with shared memory (not entirely sure), it might not make an impact, and may even cause a slowdown.
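
A minimal sketch of the trade-off discussed here (the parquet path is hypothetical; this is not sed code):

```python
import dask.dataframe as dd

# hypothetical buffer location
ddf = dd.read_parquet("processed/buffer/*.parquet")
print(ddf.known_divisions)  # False: each partition only has its own local RangeIndex

# Setting the index to trainId gives dask sorted divisions, so index-based
# selections can skip irrelevant partitions. But set_index itself requires a
# full shuffle (trainId restarts in every file), which can cost more than the
# later operations save -- consistent with the slowdown observed here.
ddf_indexed = ddf.set_index("trainId")
print(ddf_indexed.known_divisions)  # True: divisions are now known
```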

@zain-sohail zain-sohail merged commit 89ba09e into v1_feature_branch Jul 23, 2024
2 checks passed
@zain-sohail zain-sohail deleted the fix-459 branch July 23, 2024 18:39
@rettigl
Member

rettigl commented Dec 20, 2024

@zain-sohail I think this PR and #459 contain the relevant information for reverting to the old behavior
