Skip to content

Runs loader option #127

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jun 15, 2023
Merged

Runs loader option #127

merged 11 commits into from
Jun 15, 2023

Conversation

rettigl
Copy link
Member

@rettigl rettigl commented Jun 9, 2023

I address some of the issues raised with adding the runs options as general option to the loader interface in a separate PR, to be merged with flash-new first.
I overhuled the general implementation in the base-loader and processor, and reworked your implementation in the flash loader. Some things are yet preliminary.
In particularly, I am not sure if you really need several source folders where to look for your runs. Your doc strings and typing information is very inconsistent there. Also, if runs should be int or string, a list or just one entry is rather unclear. I expect it a sequence of strings for now... We can explode it to all kinds of Unions if you want, but let's be clear and consistent.

@rettigl
Copy link
Member Author

rettigl commented Jun 10, 2023

@zainsohail04 Have a look and see if you agree to the changes. We should also still develop tests for the new functionalities. Then this could be merged to the flash branch and finished there. The remaining issues in metadata and flash-utils I did not look into but are rather trivial, just look at the mypy and pylint messages.
Flash loader also still needs usable test-data (that's the one failing test)...
I'd also suggest to integrate the code in flash-utils again into the flash reader, then you can use the class attributes, and e.g. put all the stuff into the config, and no need for an extra config file.

fix remaining issues with linting and tests
@rettigl rettigl force-pushed the runs_loader_option branch from f764fa7 to f9eb60e Compare June 12, 2023 00:08
@rettigl
Copy link
Member Author

rettigl commented Jun 12, 2023

I added support for multiple source folders now for all loaders, and fixed all linting and tests. Flash test also runs if you provide test data (I did not add them yet to the repo).

@rettigl rettigl requested a review from zain-sohail June 12, 2023 00:15
for the specified data acquisition (daq).

Args:
run_id (str): The run identifier to locate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont understand the switch from runs to run_id, and also making it a str, when it is always a int and would preferably be a sequence of ints

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, your original function had run_number here. Your implementation between base loader and flash loader was insofar inconsistent, as flash loader defined runs as Sequence[int], and base loader as Sequence[str]. I thought the second one as the more flexible, as it allows, e.g., also to put alphanumeric hashes or so as run identifiers. I changed this throughout and made it consistent. If you want, you can rename run_id here to run, but then do it in all the loaders. As this function is called sequentially for all elements in runs, it takes only one element of the content of runs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I realize the inconsistency you mention, but it was purposefully overloaded, as there is no point for flash users to always put quotes for runs. But a type hint is just a type hint and a sequence of ints should also work if we just include a line with str(runs)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mypy checks typing definitions for consistency, pylint also the consistency of subclass defitinitions. If you want your code to be consistent and these tools to pass the tests, you need to be consistent in your definitions.
Practically, Python does not check the type on execution, and your code (f-string) also works with int. I added an option now such that you also can use a single int or list of ints now, but adding these also to the Union will make the thing look more bulky.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think moving these to loader is a good idea. Seperating some funtionality to other files makes it more readable and allows for a more blackbox approach. I was even planning to refactor the loader.py further into other submodules eventually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My motiviation to move these two functions into the loader was to i) give them access to the class attributes of the loader, and ii) derive in particular the run resolution function a template function of the class. If we want to stick with with, there would be only one function left in the file, which I think does not make sense. It does not necessary need to be a class function, though, one could pass the confic dict as argument as well.
Additionally, I find it quite unfortunate to have two files with the same name and different content very close by, I think this is not good practice. The readers are a different case, as they implement the same template, and here it's by design.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving functions into other files here could make sense if they are reused somewhere else outside of the flash loader. If this is the case, then put them into the loader/utils.py file, I would suggest...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change the utils in the flash loader to flash_utils. I am trying to refactor the flash loader itself, because we have the wespe instrument and also the lab system at desy. They all share a very similar set of functionality.
The refactoring would e.g. seperate, file conversion, the multiindex creation, dataframe creation etc.
Regarding just having them all in one class, it is not a problem but it just makes it harder for someone new to understand what is happening.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can move them into a differnt file again later if you want, I would suggest. For now, let's finish the first version. I find code more difficult to read if functions are spread out over multiple files, rather than bundled in one file, but that's probably a matter of taste.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These identifiers always stay consistent. I'd really like for the default values to not be necessary for the user to always put in the config file.
It was earlier hardcoded for this reason, but I did agree that seperating it from code made sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to me to be quite beamline/user/instrument specific, and I would expect it to be not unlikely to also change with time and system. This feels to me strongly as something belonging into the user domain, and not into the code domain, which the user cannot change easily if you install the package e.g. from pypi. How about if you want to add another instrument, or mou the chamber at a different port?
If you want, you can introduce into the config a reference to another config file that contains these information, but I don't see a reason not to do it like I suggest here.
The default values could be part of the default config, for instance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you say makes sense but I believe then a default config should be provided if we are providing the loaders in the package. Otherwise, they'd remain unfunctional.. And only the sed folder is published. Anything in tests isn't, and doesn't need to be either.
If the user wants to alter the default config to include a new instrument, then it'd make sense for them to make a PR, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the intention is to have all readers we develop now as part of the code package available, with their default configuration. However, what I thought is that users could add modify their local config for a given instrument or so, without even changing upstream or the default config. That's only working if the configuration is living in the user space, and will make the whole package very versatile, and easy to use in different configurations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. These are useful points.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be aware that I could not test any of these changes...

@zain-sohail
Copy link
Member

I think the overarching theme is if we are sticking to keeping all the options of files, folders and runs. I don't think it's necessary. Just additional burden on development for each instrument for no apparent reason.

I address some of the issues raised with adding the runs options as general option to the loader interface in a separate PR, to be merged with flash-new first. I overhuled the general implementation in the base-loader and processor, and reworked your implementation in the flash loader. Some things are yet preliminary. In particularly, I am not sure if you really need several source folders where to look for your runs. Your doc strings and typing information is very inconsistent there. Also, if runs should be int or string, a list or just one entry is rather unclear. I expect it a sequence of strings for now... We can explode it to all kinds of Unions if you want, but let's be clear and consistent.

I think there is something wrong because I removed all Unions in the flash-new branch and also a few things that you have in the PR seem old? There is nothing about source folders

@zain-sohail
Copy link
Member

Flash loader also still needs usable test-data (that's the one failing test)
Should we just put the test data I provided in the repo?

@rettigl
Copy link
Member Author

rettigl commented Jun 12, 2023

Should we just put the test data I provided in the repo?

I would be more happy with a smaller data set, let's say 2-3 mbyte. The one you provided i) takes up more than the current size of the whole repository. ii) It takes an awful long time to process, almost doubling the time tests will take. So, I strongly encourage to either find a shorter file, or extract only a few shots out of this one (not sure though how this works with the flash data structure).

@rettigl
Copy link
Member Author

rettigl commented Jun 12, 2023

I think the overarching theme is if we are sticking to keeping all the options of files, folders and runs. I don't think it's necessary. Just additional burden on development for each instrument for no apparent reason.

The code in the base loader essentially does that for you. The current version works already for files, folders, and runs, check it out. At the end of the day, it all boils down collecting the right list of files and loading it...

@rettigl
Copy link
Member Author

rettigl commented Jun 12, 2023

I think there is something wrong because I removed all Unions in the flash-new branch and also a few things that you have in the PR seem old? There is nothing about source folders

You commented on the initial text I wrote before implementing the current version which defines all three files, folders and runs as Union[str, Sequence[str]]. Not sure if you missed what I pushed afterwards. The call to the base loader function (super().read_data_frame() ) takes care to create a consistent list of files (except for runs), which each loader can work with.

@rettigl
Copy link
Member Author

rettigl commented Jun 12, 2023

Independent of the flash loader, loader tests also need to be updated to cover the new functionalities. I can look into this the next days (can also be done after merging with flash_new).

@@ -37,7 +36,7 @@ class FlashLoader(BaseLoader):

__name__ = "flash"

supported_file_types = ["h5", "parquet"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come parquet is removed as supported? The loader definitely supports reading such files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry I forgot to mention this in my comments. I removed parquet as explicitly supported file type for the following reason. Technically, the reader works with parquet files, but as far as I can tell, it does not actively provide a way to load files with .parquet extension. This is what the supported file types mean, that the reader alllows loading file sets of the respective extension...
The test function iterates over the supported file types, and tries to test all supported types, which I don't know how to get it to work for .parquet. Anyways, there is the generic loader to load sets of standalone .parquet files.

@zain-sohail
Copy link
Member

Should we just put the test data I provided in the repo?

I would be more happy with a smaller data set, let's say 2-3 mbyte. The one you provided i) takes up more than the current size of the whole repository. ii) It takes an awful long time to process, almost doubling the time tests will take. So, I strongly encourage to either find a shorter file, or extract only a few shots out of this one (not sure though how this works with the flash data structure).

That's challenging since I can't myself replicate the data structure without altering it. Maybe @kutnyakhov can find a solution to providing a very small test file.

@zain-sohail
Copy link
Member

I think there is something wrong because I removed all Unions in the flash-new branch and also a few things that you have in the PR seem old? There is nothing about source folders

You commented on the initial text I wrote before implementing the current version which defines all three files, folders and runs as Union[str, Sequence[str]]. Not sure if you missed what I pushed afterwards. The call to the base loader function (super().read_data_frame() ) takes care to create a consistent list of files (except for runs), which each loader can work with.

I now understand your idea. Other than minor feedback I provided, this is good to merge with flash-new

@zain-sohail zain-sohail self-requested a review June 15, 2023 14:17
@rettigl
Copy link
Member Author

rettigl commented Jun 15, 2023

That's challenging since I can't myself replicate the data structure without altering it. Maybe @kutnyakhov can find a solution to providing a very small test file.

I stripped down the dataset now to 50 macrobunches, and uploaded that. Seems to work, but still one of the tests tails (works locally). Try to figure that one out.

@rettigl
Copy link
Member Author

rettigl commented Jun 15, 2023

I'll merge and then we can finalize flash loader and move on

@rettigl rettigl merged commit 43a8782 into flash-new Jun 15, 2023
@rettigl rettigl deleted the runs_loader_option branch June 15, 2023 21:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants