Runs loader option #127

rettigl · 2023-06-09T23:49:21Z

I address some of the issues raised by adding the runs option as a general option to the loader interface in a separate PR, to be merged with flash-new first.
I overhauled the general implementation in the base loader and processor, and reworked your implementation in the flash loader. Some things are still preliminary.
In particular, I am not sure if you really need several source folders in which to look for your runs. Your docstrings and typing information are very inconsistent there. Also, whether runs should be int or string, a list or just one entry, is rather unclear. I expect a sequence of strings for now... We can expand it to all kinds of Unions if you want, but let's be clear and consistent.

…flash still intermediate

rettigl · 2023-06-10T00:06:40Z

@zainsohail04 Have a look and see if you agree with the changes. We should also still develop tests for the new functionality. Then this could be merged into the flash branch and finished there. I did not look into the remaining issues in metadata and flash-utils, but they are rather trivial; just look at the mypy and pylint messages.
Flash loader also still needs usable test-data (that's the one failing test)...
I'd also suggest integrating the code in flash-utils into the flash reader again; then you can use the class attributes and, e.g., put all the settings into the config, with no need for an extra config file.

fix remaining issues with linting and tests

rettigl · 2023-06-12T00:15:25Z

I have now added support for multiple source folders for all loaders, and fixed all linting and test issues. The flash test also runs if you provide test data (I have not added it to the repo yet).

zain-sohail · 2023-06-12T07:06:46Z

sed/loader/flash/loader.py

+        for the specified data acquisition (daq).
+
+        Args:
+            run_id (str): The run identifier to locate.


I don't understand the switch from runs to run_id, nor making it a str, when it is always an int and would preferably be a sequence of ints.

Well, your original function had run_number here. Your implementation was inconsistent between the base loader and the flash loader: the flash loader defined runs as Sequence[int], and the base loader as Sequence[str]. I considered the second the more flexible, as it also allows, e.g., alphanumeric hashes as run identifiers. I changed this throughout and made it consistent. If you want, you can rename run_id here to run, but then do it in all the loaders. As this function is called sequentially for each element in runs, it takes only one element of runs.

Yes I realize the inconsistency you mention, but it was purposefully overloaded, as there is no point for flash users to always put quotes for runs. But a type hint is just a type hint and a sequence of ints should also work if we just include a line with str(runs)

Mypy checks typing definitions for consistency, and pylint also checks the consistency of subclass definitions. If you want your code to be consistent and to pass these tools' checks, you need to be consistent in your definitions.
In practice, Python does not check types at execution, and your code (f-string) also works with int. I have now added an option so that you can also use a single int or a list of ints, but adding these to the Union as well would make it look bulkier.
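The runtime handling described above (accepting a single int or str, or a sequence of either, and working with strings internally) could look roughly like the sketch below. The function name `normalize_runs` and the `RunId` alias are hypothetical and not taken from the PR; the actual implementation may differ.

```python
from typing import List, Sequence, Union

# Hypothetical alias: a run identifier may be given as int or str.
RunId = Union[str, int]


def normalize_runs(runs: Union[RunId, Sequence[RunId]]) -> List[str]:
    """Return the run identifiers as a list of strings.

    Accepts a single int or str, or a sequence of ints and/or strs.
    """
    # A bare str or int is one run; a str must not be iterated character-wise.
    if isinstance(runs, (str, int)):
        runs = [runs]
    return [str(run) for run in runs]
```

With such a helper, the declared type of the loader argument can stay narrow while users are still free to pass unquoted run numbers.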

zain-sohail · 2023-06-12T07:08:06Z

sed/loader/flash/utils.py

I don't think moving these to loader is a good idea. Separating some functionality into other files makes it more readable and allows for a more black-box approach. I was even planning to refactor loader.py further into other submodules eventually.

My motivation for moving these two functions into the loader was to i) give them access to the class attributes of the loader, and ii) derive, in particular, the run resolution function from a template function of the class. If we want to stick with this, there would be only one function left in the file, which I think does not make sense. It does not necessarily need to be a class function, though; one could pass the config dict as an argument as well.
Additionally, I find it quite unfortunate to have two files with the same name and different content very close by; I think this is not good practice. The readers are a different case, as they implement the same template, and there it's by design.

Moving functions into other files here could make sense if they are reused somewhere else outside of the flash loader. If this is the case, then put them into the loader/utils.py file, I would suggest...
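As an illustration of the alternative mentioned above (a standalone run resolution function that receives the config dict as an argument instead of reading class attributes), here is a minimal sketch. The function name `find_run_files`, the config keys `daq` and `file_pattern`, and the file naming scheme are all assumptions made for this example; they are not the flash loader's actual API or naming convention.

```python
import fnmatch
import os
from typing import Any, Dict, List, Sequence


def find_run_files(
    run_id: str,
    folders: Sequence[str],
    config: Dict[str, Any],
) -> List[str]:
    """Return the paths of all files in `folders` that belong to `run_id`."""
    # Hypothetical config keys: "daq" names the data acquisition, and
    # "file_pattern" is a glob template with {daq} and {run_id} fields.
    pattern = config["file_pattern"].format(daq=config["daq"], run_id=run_id)

    matches: List[str] = []
    for folder in folders:
        for name in sorted(os.listdir(folder)):
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(folder, name))

    if not matches:
        raise FileNotFoundError(f"No files found for run {run_id} in {list(folders)}")
    return matches
```

Because everything it needs is passed in, a function like this could live in the loader module or in a shared utils file without changing its signature.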

I can change the utils in the flash loader to flash_utils. I am trying to refactor the flash loader itself, because we have the wespe instrument and also the lab system at desy. They all share a very similar set of functionality.
The refactoring would, e.g., separate file conversion, multiindex creation, dataframe creation, etc.
Regarding just having them all in one class, it is not a problem but it just makes it harder for someone new to understand what is happening.

You can move them into a different file again later if you want, I would suggest. For now, let's finish the first version. I find code more difficult to read if functions are spread out over multiple files, rather than bundled in one file, but that's probably a matter of taste.

zain-sohail · 2023-06-12T07:10:01Z

sed/loader/flash/identifiers.json

These identifiers always stay consistent. I'd really prefer that users not have to put the default values into the config file every time.
They were hardcoded earlier for this reason, but I did agree that separating them from the code made sense.

This appears to me to be quite beamline/user/instrument specific, and I would expect that it may well change with time and system. To me, this strongly feels like something that belongs in the user domain and not in the code domain, which the user cannot change easily if the package is installed, e.g., from PyPI. What if you want to add another instrument, or mount the chamber at a different port?
If you want, you can introduce into the config a reference to another config file that contains this information, but I don't see a reason not to do it as I suggest here.
The default values could be part of the default config, for instance.

What you say makes sense, but I believe a default config should then be provided if we are providing the loaders in the package. Otherwise, they'd remain non-functional. And only the sed folder is published; anything in tests isn't, and doesn't need to be either.
If the user wants to alter the default config to include a new instrument, then it'd make sense for them to make a PR, no?

I think the intention is to have all readers we develop now available as part of the code package, with their default configuration. However, my thought was that users could add to or modify their local config for a given instrument, without changing upstream or the default config at all. That only works if the configuration lives in user space, and it will make the whole package very versatile and easy to use in different configurations.
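The idea of shipping a default configuration with the package and letting a local user config override parts of it can be sketched as a recursive dictionary merge. This is only an illustration under assumed names: `merge_config` and all keys in the example dictionaries are hypothetical and do not reflect the package's actual config handling.

```python
import copy
from typing import Any, Dict


def merge_config(default: Dict[str, Any], user: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `default` in which values from `user` take precedence."""
    merged = copy.deepcopy(default)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical keys, for illustration only.
default_config = {"loader": {"identifier_a": "default_a", "identifier_b": "default_b"}}
user_config = {"loader": {"identifier_b": "local_b"}}
config = merge_config(default_config, user_config)
```

In this scheme the packaged defaults stay untouched, and a user only has to specify the entries that differ for their instrument.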

zain-sohail · 2023-06-12T07:11:19Z

sed/loader/flash/metadata.py

Thanks. These are useful points.

Be aware that I could not test any of these changes...

zain-sohail · 2023-06-12T07:12:05Z

I think the overarching theme is if we are sticking to keeping all the options of files, folders and runs. I don't think it's necessary. Just additional burden on development for each instrument for no apparent reason.

I address some of the issues raised by adding the runs option as a general option to the loader interface in a separate PR, to be merged with flash-new first. I overhauled the general implementation in the base loader and processor, and reworked your implementation in the flash loader. Some things are still preliminary. In particular, I am not sure if you really need several source folders in which to look for your runs. Your docstrings and typing information are very inconsistent there. Also, whether runs should be int or string, a list or just one entry, is rather unclear. I expect a sequence of strings for now... We can expand it to all kinds of Unions if you want, but let's be clear and consistent.

I think there is something wrong because I removed all Unions in the flash-new branch and also a few things that you have in the PR seem old? There is nothing about source folders

zain-sohail · 2023-06-12T07:19:27Z

Flash loader also still needs usable test-data (that's the one failing test)
Should we just put the test data I provided in the repo?

rettigl · 2023-06-12T19:21:48Z

Should we just put the test data I provided in the repo?

I would be happier with a smaller data set, let's say 2-3 MB. The one you provided i) takes up more than the current size of the whole repository, and ii) takes an awfully long time to process, almost doubling the time the tests take. So, I strongly encourage either finding a shorter file, or extracting only a few shots out of this one (not sure, though, how this works with the flash data structure).

rettigl · 2023-06-12T19:23:34Z

I think the overarching theme is if we are sticking to keeping all the options of files, folders and runs. I don't think it's necessary. Just additional burden on development for each instrument for no apparent reason.

The code in the base loader essentially does that for you. The current version already works for files, folders, and runs; check it out. At the end of the day, it all boils down to collecting the right list of files and loading it...

rettigl · 2023-06-12T19:27:28Z

I think there is something wrong because I removed all Unions in the flash-new branch and also a few things that you have in the PR seem old? There is nothing about source folders

You commented on the initial text I wrote before implementing the current version, which defines all three of files, folders and runs as Union[str, Sequence[str]]. Not sure if you missed what I pushed afterwards. The call to the base loader function (super().read_data_frame()) takes care of creating a consistent list of files (except for runs), which each loader can work with.
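To make the "consistent list of files" step concrete, here is a minimal sketch of how arguments typed as Union[str, Sequence[str]] could be turned into one flat file list. The function name `collect_files` and its `extension` parameter are assumptions for this example; how the real base loader combines or prioritizes files, folders and runs is not shown in this thread.

```python
import os
from typing import List, Optional, Sequence, Union

StrOrSeq = Union[str, Sequence[str]]


def _as_list(value: Optional[StrOrSeq]) -> List[str]:
    """Turn None, a single string, or a sequence of strings into a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def collect_files(
    files: Optional[StrOrSeq] = None,
    folders: Optional[StrOrSeq] = None,
    extension: str = "h5",
) -> List[str]:
    """Build one flat list of file paths from explicit files and/or folders."""
    # Explicitly given files are kept in the order provided.
    collected = _as_list(files)
    # Files found in folders are appended, sorted by name within each folder.
    for folder in _as_list(folders):
        for name in sorted(os.listdir(folder)):
            if name.endswith("." + extension):
                collected.append(os.path.join(folder, name))
    return collected
```

Each loader can then operate on the returned list regardless of whether the user passed a single path, several paths, or one or more folders.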

rettigl · 2023-06-12T19:31:27Z

Independent of the flash loader, the loader tests also need to be updated to cover the new functionality. I can look into this in the next few days (this can also be done after merging with flash_new).

…r/files to overwrite run data.

zain-sohail · 2023-06-15T13:55:34Z

sed/loader/flash/loader.py

@@ -37,7 +36,7 @@ class FlashLoader(BaseLoader):

    __name__ = "flash"

-    supported_file_types = ["h5", "parquet"]


How come parquet is removed as supported? The loader definitely supports reading such files.

Ah, sorry, I forgot to mention this in my comments. I removed parquet as an explicitly supported file type for the following reason. Technically, the reader works with parquet files, but as far as I can tell, it does not actively provide a way to load files with the .parquet extension. This is what the supported file types mean: that the reader allows loading file sets of the respective extension...
The test function iterates over the supported file types and tries to test all of them, and I don't know how to get that to work for .parquet. Anyway, there is the generic loader for loading sets of standalone .parquet files.

zain-sohail · 2023-06-15T14:15:24Z

Should we just put the test data I provided in the repo?

I would be happier with a smaller data set, let's say 2-3 MB. The one you provided i) takes up more than the current size of the whole repository, and ii) takes an awfully long time to process, almost doubling the time the tests take. So, I strongly encourage either finding a shorter file, or extracting only a few shots out of this one (not sure, though, how this works with the flash data structure).

That's challenging since I can't myself replicate the data structure without altering it. Maybe @kutnyakhov can find a solution to providing a very small test file.

zain-sohail · 2023-06-15T14:16:44Z

I think there is something wrong because I removed all Unions in the flash-new branch and also a few things that you have in the PR seem old? There is nothing about source folders

You commented on the initial text I wrote before implementing the current version, which defines all three of files, folders and runs as Union[str, Sequence[str]]. Not sure if you missed what I pushed afterwards. The call to the base loader function (super().read_data_frame()) takes care of creating a consistent list of files (except for runs), which each loader can work with.

I now understand your idea. Other than minor feedback I provided, this is good to merge with flash-new

…b actions

rettigl · 2023-06-15T20:47:27Z

That's challenging since I can't myself replicate the data structure without altering it. Maybe @kutnyakhov can find a solution to providing a very small test file.

I have now stripped down the dataset to 50 macrobunches and uploaded it. It seems to work, but one of the tests still fails (it works locally). Try to figure that one out.

rettigl · 2023-06-15T21:28:04Z

I'll merge and then we can finalize flash loader and move on

rettigl added 4 commits June 10, 2023 00:38

modified base loader and processor to support runs as loader option, …

67de4dc

…flash still intermediate

working flash loader

5d694b4

linting and bugfixes

a4c4091

add dummy function to mpes and generic loaders

7ec4390

add option for multiple folders to general loader infrastructure

f9eb60e

fix remaining issues with linting and tests

rettigl force-pushed the runs_loader_option branch from f764fa7 to f9eb60e Compare June 12, 2023 00:08

rettigl requested a review from zain-sohail June 12, 2023 00:15

zain-sohail reviewed Jun 12, 2023

View reviewed changes

rettigl added 2 commits June 13, 2023 23:16

Bugfix to allow again single run w/o list, and prevent provided folde…

75b7d81

…r/files to overwrite run data.

add parametrized tests for loaders covering files, folders and runs

7d5301d

zain-sohail reviewed Jun 15, 2023

View reviewed changes

zain-sohail self-requested a review June 15, 2023 14:17

zain-sohail approved these changes Jun 15, 2023

View reviewed changes

rettigl added 2 commits June 15, 2023 22:14

add stripped down test data for FLASH reader

c1b1af1

add option for runs as int on runtime, and debug flash tests in guthu…

fcf2c92

…b actions

rettigl added 2 commits June 15, 2023 22:58

debug github actions

edef597

fix bug for single provided file, and remove debug info

be19cbc

rettigl merged commit 43a8782 into flash-new Jun 15, 2023

rettigl deleted the runs_loader_option branch June 15, 2023 21:28
