-
Notifications
You must be signed in to change notification settings - Fork 2
Runs loader option #127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Runs loader option #127
Conversation
@zainsohail04 Have a look and see if you agree to the changes. We should also still develop tests for the new functionalities. Then this could be merged to the flash branch and finished there. The remaining issues in metadata and flash-utils I did not look into but are rather trivial, just look at the mypy and pylint messages. |
fix remaining issues with linting and tests
f764fa7
to
f9eb60e
Compare
I added support for multiple source folders now for all loaders, and fixed all linting and tests. Flash test also runs if you provide test data (I did not add them yet to the repo). |
for the specified data acquisition (daq). | ||
|
||
Args: | ||
run_id (str): The run identifier to locate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont understand the switch from runs to run_id, and also making it a str, when it is always a int and would preferably be a sequence of ints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, your original function had run_number here. Your implementation between base loader and flash loader was insofar inconsistent, as flash loader defined runs as Sequence[int], and base loader as Sequence[str]. I thought the second one as the more flexible, as it allows, e.g., also to put alphanumeric hashes or so as run identifiers. I changed this throughout and made it consistent. If you want, you can rename run_id here to run, but then do it in all the loaders. As this function is called sequentially for all elements in runs, it takes only one element of the content of runs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I realize the inconsistency you mention, but it was purposefully overloaded, as there is no point for flash users to always put quotes for runs. But a type hint is just a type hint and a sequence of ints should also work if we just include a line with str(runs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mypy checks typing definitions for consistency, pylint also the consistency of subclass defitinitions. If you want your code to be consistent and these tools to pass the tests, you need to be consistent in your definitions.
Practically, Python does not check the type on execution, and your code (f-string) also works with int. I added an option now such that you also can use a single int or list of ints now, but adding these also to the Union will make the thing look more bulky.
sed/loader/flash/utils.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think moving these to loader is a good idea. Seperating some funtionality to other files makes it more readable and allows for a more blackbox approach. I was even planning to refactor the loader.py further into other submodules eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My motiviation to move these two functions into the loader was to i) give them access to the class attributes of the loader, and ii) derive in particular the run resolution function a template function of the class. If we want to stick with with, there would be only one function left in the file, which I think does not make sense. It does not necessary need to be a class function, though, one could pass the confic dict as argument as well.
Additionally, I find it quite unfortunate to have two files with the same name and different content very close by, I think this is not good practice. The readers are a different case, as they implement the same template, and here it's by design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving functions into other files here could make sense if they are reused somewhere else outside of the flash loader. If this is the case, then put them into the loader/utils.py file, I would suggest...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change the utils in the flash loader to flash_utils. I am trying to refactor the flash loader itself, because we have the wespe instrument and also the lab system at desy. They all share a very similar set of functionality.
The refactoring would e.g. seperate, file conversion, the multiindex creation, dataframe creation etc.
Regarding just having them all in one class, it is not a problem but it just makes it harder for someone new to understand what is happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can move them into a differnt file again later if you want, I would suggest. For now, let's finish the first version. I find code more difficult to read if functions are spread out over multiple files, rather than bundled in one file, but that's probably a matter of taste.
sed/loader/flash/identifiers.json
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These identifiers always stay consistent. I'd really like for the default values to not be necessary for the user to always put in the config file.
It was earlier hardcoded for this reason, but I did agree that seperating it from code made sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to me to be quite beamline/user/instrument specific, and I would expect it to be not unlikely to also change with time and system. This feels to me strongly as something belonging into the user domain, and not into the code domain, which the user cannot change easily if you install the package e.g. from pypi. How about if you want to add another instrument, or mou the chamber at a different port?
If you want, you can introduce into the config a reference to another config file that contains these information, but I don't see a reason not to do it like I suggest here.
The default values could be part of the default config, for instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you say makes sense but I believe then a default config should be provided if we are providing the loaders in the package. Otherwise, they'd remain unfunctional.. And only the sed folder is published. Anything in tests isn't, and doesn't need to be either.
If the user wants to alter the default config to include a new instrument, then it'd make sense for them to make a PR, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the intention is to have all readers we develop now as part of the code package available, with their default configuration. However, what I thought is that users could add modify their local config for a given instrument or so, without even changing upstream or the default config. That's only working if the configuration is living in the user space, and will make the whole package very versatile, and easy to use in different configurations.
sed/loader/flash/metadata.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. These are useful points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be aware that I could not test any of these changes...
I think the overarching theme is if we are sticking to keeping all the options of files, folders and runs. I don't think it's necessary. Just additional burden on development for each instrument for no apparent reason.
I think there is something wrong because I removed all Unions in the flash-new branch and also a few things that you have in the PR seem old? There is nothing about source folders |
|
I would be more happy with a smaller data set, let's say 2-3 mbyte. The one you provided i) takes up more than the current size of the whole repository. ii) It takes an awful long time to process, almost doubling the time tests will take. So, I strongly encourage to either find a shorter file, or extract only a few shots out of this one (not sure though how this works with the flash data structure). |
The code in the base loader essentially does that for you. The current version works already for files, folders, and runs, check it out. At the end of the day, it all boils down collecting the right list of files and loading it... |
You commented on the initial text I wrote before implementing the current version which defines all three files, folders and runs as Union[str, Sequence[str]]. Not sure if you missed what I pushed afterwards. The call to the base loader function (super().read_data_frame() ) takes care to create a consistent list of files (except for runs), which each loader can work with. |
Independent of the flash loader, loader tests also need to be updated to cover the new functionalities. I can look into this the next days (can also be done after merging with flash_new). |
@@ -37,7 +36,7 @@ class FlashLoader(BaseLoader): | |||
|
|||
__name__ = "flash" | |||
|
|||
supported_file_types = ["h5", "parquet"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come parquet is removed as supported? The loader definitely supports reading such files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry I forgot to mention this in my comments. I removed parquet as explicitly supported file type for the following reason. Technically, the reader works with parquet files, but as far as I can tell, it does not actively provide a way to load files with .parquet extension. This is what the supported file types mean, that the reader alllows loading file sets of the respective extension...
The test function iterates over the supported file types, and tries to test all supported types, which I don't know how to get it to work for .parquet. Anyways, there is the generic loader to load sets of standalone .parquet files.
That's challenging since I can't myself replicate the data structure without altering it. Maybe @kutnyakhov can find a solution to providing a very small test file. |
I now understand your idea. Other than minor feedback I provided, this is good to merge with flash-new |
I stripped down the dataset now to 50 macrobunches, and uploaded that. Seems to work, but still one of the tests tails (works locally). Try to figure that one out. |
I'll merge and then we can finalize flash loader and move on |
I address some of the issues raised with adding the runs options as general option to the loader interface in a separate PR, to be merged with flash-new first.
I overhuled the general implementation in the base-loader and processor, and reworked your implementation in the flash loader. Some things are yet preliminary.
In particularly, I am not sure if you really need several source folders where to look for your runs. Your doc strings and typing information is very inconsistent there. Also, if runs should be int or string, a list or just one entry is rather unclear. I expect it a sequence of strings for now... We can explode it to all kinds of Unions if you want, but let's be clear and consistent.