Update to the BufferHandler #469


Merged — zain-sohail merged 30 commits from fix-459 into v1_feature_branch on Jul 23, 2024
Conversation

zain-sohail
Member

Fixes the issue in #459 by correctly filling the pulse-resolved channels. To this end (and more), this PR makes the following changes (a minimal sketch of the fill/split logic follows the list):

  • Separation of buffer files for the timed and electron dataframes
    • Zero values are removed from the pulse-resolved channels (TBD whether this makes sense), because bam contains invalid pulses and stores them as 0s
    • concat is used to combine dataframes instead of join, as it is faster and gives the same results
    • After the combined dataframe is obtained, it is filled on the pulse and train channels
    • For the electron dataframe, all rows that do not contain an electron are dropped
    • For the timed dataframe, the electron rows are dropped, so we get pulse-resolved data
  • A class for buffer file paths, which stores a list of dicts with the raw, electron buffer, and timed buffer filenames
  • The schema check happens on both buffer types
  • The config parameters data_raw_dir and data_parquet_dir are renamed to raw and processed; these parameters already exist under paths and can hence be used by other methods in sed, such as saving of binned data
  • Updates to metadata and tests
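
A minimal sketch of the fill/split logic described above (not the actual BufferHandler code; the fill-channel list and the electron channel name dldPosX are hypothetical placeholders):

```python
import dask.dataframe as dd

def combine_and_split(dfs, fill_channels, electron_channel="dldPosX"):
    """Combine per-file dataframes, forward fill, and split into electron/timed parts."""
    # concat instead of join: faster and gives the same result
    df = dd.concat(dfs)
    # forward fill the pulse- and train-resolved channels so every row
    # carries the machine parameters of its pulse/train
    for channel in fill_channels:
        df[channel] = df[channel].ffill()
    # electron dataframe: keep only rows that actually contain an electron
    df_electron = df.dropna(subset=[electron_channel])
    # timed dataframe: drop the electron rows to obtain pulse-resolved data
    df_timed = df[df[electron_channel].isna()]
    return df_electron, df_timed
```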

Benchmarks (62 files, binning over X and Y, using 20 cores):

Raw files to buffer plus binning:
Old: 37.6 s ± 839 ms
New: 8.26 s ± 223 ms

Buffer files to binning:
Old: 2.8 s ± 44.2 ms
New: 2.64 s ± 44.9 ms
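
For reference, a hedged sketch of how the binning step could be reproduced (the buffer layout, column names, bin counts, and ranges are assumptions, not sed's actual defaults):

```python
import numpy as np
import dask
import dask.dataframe as dd

# hypothetical buffer location and detector columns
df = dd.read_parquet("processed/buffer/electron/*.parquet")
bins, ranges = (100, 100), ((0, 2048), (0, 2048))

def hist2d(partition):
    # 2D histogram of one partition over the detector X/Y coordinates
    return np.histogram2d(
        partition["dldPosX"], partition["dldPosY"], bins=bins, range=ranges
    )[0]

# bin each partition independently, then sum the partial histograms
partial = [dask.delayed(hist2d)(p) for p in df.to_delayed()]
hist = sum(dask.compute(*partial))
```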

@rettigl
Member

rettigl commented Jul 8, 2024

In Tutorial 4, I get the following error when trying to load the processor:

	"name": "FileNotFoundError",
	"message": "No files found for run 44762 in directory ['tests/data/loader/flash']",
	"stack": "---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[15], line 1
----> 1 sp = SedProcessor(runs=[44762], config=config_override, system_config=config_file, collect_metadata=False, force_recreate=True)
      2 # You can set collect_metadata=True if the scicat_url and scicat_token are defined

File /mnt/pcshare/users/Laurenz/AreaB/sed/sed/sed/core/processor.py:164, in SedProcessor.__init__(self, metadata, config, dataframe, files, folder, runs, collect_metadata, verbose, **kwds)
    162 # Load data if provided:
    163 if dataframe is not None or files is not None or folder is not None or runs is not None:
--> 164     self.load(
    165         dataframe=dataframe,
    166         metadata=metadata,
    167         files=files,
    168         folder=folder,
    169         runs=runs,
    170         collect_metadata=collect_metadata,
    171         **kwds,
    172     )

File /mnt/pcshare/users/Laurenz/AreaB/sed/sed/sed/core/processor.py:410, in SedProcessor.load(self, dataframe, metadata, files, folder, runs, collect_metadata, **kwds)
    402         dataframe, timed_dataframe, metadata = self.loader.read_dataframe(
    403             folders=cast(str, self.cpy(folder)),
    404             runs=runs,
   (...)
    407             **kwds,
    408         )
    409     else:
--> 410         dataframe, timed_dataframe, metadata = self.loader.read_dataframe(
    411             runs=runs,
    412             metadata=metadata,
    413             collect_metadata=collect_metadata,
    414             **kwds,
    415         )
    417 elif folder is not None:
    418     dataframe, timed_dataframe, metadata = self.loader.read_dataframe(
    419         folders=cast(str, self.cpy(folder)),
    420         metadata=metadata,
    421         collect_metadata=collect_metadata,
    422         **kwds,
    423     )

File /mnt/pcshare/users/Laurenz/AreaB/sed/sed/sed/loader/flash/loader.py:303, in FlashLoader.read_dataframe(self, files, folders, runs, ftype, metadata, collect_metadata, detector, force_recreate, processed_dir, debug, **kwds)
    301 runs_ = [str(runs)] if isinstance(runs, (str, int)) else list(map(str, runs))
    302 for run in runs_:
--> 303     run_files = self.get_files_from_run_id(
    304         run_id=run,
    305         folders=self.raw_dir,
    306     )
    307     files.extend(run_files)
    308 self.runs = runs_

File /mnt/pcshare/users/Laurenz/AreaB/sed/sed/sed/loader/flash/loader.py:170, in FlashLoader.get_files_from_run_id(self, run_id, folders, extension)
    168 # Check if any files are found
    169 if not files:
--> 170     raise FileNotFoundError(
    171         f"No files found for run {run_id} in directory {str(folders)}",
    172     )
    174 # Return the list of found files
    175 return [str(file.resolve()) for file in files]

FileNotFoundError: No files found for run 44762 in directory ['tests/data/loader/flash']"

@coveralls
Collaborator

coveralls commented Jul 9, 2024

Pull Request Test Coverage Report for Build 10064447099

Details

  • 234 of 245 (95.51%) changed or added relevant lines in 13 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.01%) to 92.688%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| sed/loader/flash/buffer_handler.py | 69 | 70 | 98.57% |
| sed/loader/flash/utils.py | 18 | 19 | 94.74% |
| tests/loader/test_loaders.py | 14 | 18 | 77.78% |
| sed/loader/flash/dataframe.py | 22 | 27 | 81.48% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| sed/loader/flash/dataframe.py | 1 | 91.09% |

Totals:
Change from base Build 9987723124: -0.01%
Covered Lines: 7073
Relevant Lines: 7631

💛 - Coveralls

@rettigl
Member

rettigl left a comment

I tested this out now. The electron df seems to work correctly, but the computation of the normalization histograms takes a lot longer now (~33 s vs. +24 s on main). Also, the normalization does not look as good as before:
main: [screenshot]
fix-459: [screenshot]

It could of course be that the earlier normalization effectively only normalizes by the total number of electrons, and not by time...

@zain-sohail
Member Author

zain-sohail commented Jul 16, 2024

Edited: Index sorting does not improve speed when using dask. I find that the current approach is the fastest.

@rettigl
Member

rettigl commented Jul 18, 2024

> Edited: Index sorting does not improve speed when using dask. I find that the current approach is the fastest.

With respect to what you posted earlier here: I think it does not matter if the sub-indices are unsorted. For dask, there is only one main index, which should ideally be gapless and sorted, no?

@rettigl
Member

rettigl commented Jul 18, 2024

Indeed, I think the current approach produces the correct normalization, and before it was just normalizing to total counts and not time because of the missing entries in the timed_dataframe.
These are the normalization histograms:
old: [screenshot]
new: [screenshot]

@rettigl
Member

rettigl commented Jul 18, 2024

Histogram of timeStamp (should be flat):
Old: [screenshot]
New: [screenshot]

@rettigl
Member

rettigl left a comment

LGTM now.

@zain-sohail
Member Author

> With respect to what you posted earlier here: I think it does not matter if the sub-indices are unsorted. For dask, there is only one main index, which should ideally be gapless and sorted, no?

I was thinking setting the index to trainId would help, since otherwise there is no monotonic index for dask to work with (it is only monotonic within a file and then restarts). But weirdly, the operations take longer with an index. I find that rather unexpected behavior.

@rettigl
Member

rettigl commented Jul 18, 2024

> > With respect to what you posted earlier here: I think it does not matter if the sub-indices are unsorted. For dask, there is only one main index, which should ideally be gapless and sorted, no?
>
> I was thinking setting the index to trainId would help, since otherwise there is no monotonic index for dask to work with (it is only monotonic within a file and then restarts). But weirdly, the operations take longer with an index. I find that rather unexpected behavior.

I think the natural dask index is per partition, as every partition is like a separate pandas df. How would you use trainId as a unique index? Anyway, what would you need a continuous index across partitions for?

@zain-sohail
Member Author

zain-sohail commented Jul 18, 2024

> I think the natural dask index is per partition, as every partition is like a separate pandas df. How would you use trainId as a unique index? Anyway, what would you need a continuous index across partitions for?

Yes, that's true, but if the index exists and is sorted, dask can calculate divisions, which is supposed to avoid expensive data shuffling (see https://docs.dask.org/en/stable/dataframe-best-practices.html#use-the-index).
But I believe that, since we mostly work with shared memory (not entirely sure), it might not make an impact, and may even cause a slowdown.
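
A minimal sketch of the trade-off discussed here (the parquet path is hypothetical; this is not sed code):

```python
import dask.dataframe as dd

# hypothetical buffer location
ddf = dd.read_parquet("processed/buffer/*.parquet")
print(ddf.known_divisions)  # False: each partition only has its own local RangeIndex

# Setting the index to trainId gives dask sorted divisions, so index-based
# selections can skip irrelevant partitions. But set_index itself requires a
# full shuffle (trainId restarts in every file), which can cost more than the
# later operations save -- consistent with the slowdown observed here.
ddf_indexed = ddf.set_index("trainId")
print(ddf_indexed.known_divisions)  # True: divisions are now known
```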

@zain-sohail zain-sohail merged commit 89ba09e into v1_feature_branch Jul 23, 2024
2 checks passed
@zain-sohail zain-sohail deleted the fix-459 branch July 23, 2024 18:39
@rettigl
Member

rettigl commented Dec 20, 2024

@zain-sohail I think this PR and #459 contain the relevant information for reverting to the old behavior
