-
Notifications
You must be signed in to change notification settings - Fork 25
feat(dataset): copy/pull data from external storage #3066
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
eb3918d
to
2dabe13
Compare
2dabe13
to
c45993f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really nice, I have some suggestions/questions but the functionality is all good
renku/core/util/dispatcher.py
Outdated
@@ -52,6 +53,16 @@ def get_database_helper(database_dispatcher: IDatabaseDispatcher): | |||
return get_database_helper() | |||
|
|||
|
|||
def get_dataset_gateway() -> "DatasetGateway": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm honestly not sure I'm a big fan of these types of methods getting added everywhere. It might make the method body more readable, but it does lose the big plus where in testing you can just pass in a mocked instance of an injected class directly to test the method in isolation. Whereas this makes it so we must always set up fake injection when testing single methods.
I quite like being able to do https://github.com/SwissDataScienceCenter/renku-python/blob/develop/tests/core/commands/test_graph.py#L318-L323
Especially if the mocked instances have to be changed between calls, updating injection/replacing existing entries is a nightmare compared to just passing another instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, this is the first test that I see which doesn't use injection. Testing like this is possible only if the method doesn't call other method that use injection. Otherwise, we still need to set up a fake injection.
I've changed its usage in my code since it doesn't call other injected method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, this is the first test that I see which doesn't use injection.
How much of that is because we couldn't do that in the past?
And for smaller, contained methods/functions we should probably just pass things along instead of injecting again and again, in some places I think I went a bit overboard with injecting everything. But your point is valid, it wouldn't work for all cases.
Still not sure hiding the injection behind get_*
methods makes things cleaner, at least with the way things are now, you see directly what is needed in the signature and I don't think it makes a significant difference with number of characters typed, except for the dispatchers with the client = client_dispatcher.current_client
uglyness. And the dispatcher stuff hopefully becomes easier when we remove LocalClient
(I've written an epic for that, maybe you could take a look at it: #2254).
But I think the biggest pain-point is our DI in general. The reason I like just passing in classes over faking injection is that faking injection is cumbersome. the helpers like with client_database_injection_manager(client):
helps, but it's an ugly hack, having to write it for every call in a test is noise, if you need to mock something else, like ProjectGateway
you need to add some new fixture or patch it into the injection somehow, which is also annoying.
So this discussion just gave me an idea. Ideally, I'd like to have everything essentially work in tests with mocks/Dummy implementations, but with an easy way to override something, as a fixture.
I imagine something like
def my_test(dummy_injections):
assert x() == 1 # normal call that works with default setup
dummy_injections.replace_project(MagicMock(spec=Project, name="other project"))
assert y() == 1 # now IProjectGateway and IClientDispatcher.current_client.project both return the above project
dummy_injections
has everything, dummy gateways, dummy database, dummy dispatchers etc. set up, but essentially empty/with sane defaults. So it can be used directly in tests without having to do anything, but you can just add stuff to if when needed. Essentially it'd be one manager class that has methods for adding datasets, activities, project, etc. with default for the required ones. And then dummy implementations of our interfaces that just return what was set on the manager.
class InjectionManager():
_datasets = []
def add_dataset(dataset: Dataset):
self._datasets.append(dataset)
class DummyDatasetGateway(IDatasetGateway):
_injection_manager = ...
def get_by_id(self, id)->Dataset:
return self._injection_manager._datasets[0]
def get_all_active_datasets(self) -> List["Dataset"]:
return self._injection_manager._datasets
Just to illustrate the concept. The dummy_injections
fixture
- creates the Manager with some default values
- creates all the Dummy interface implementations and sets the manager on them
- sets up injection with those Dummy implementations
- returns the manager so it's available in tests to customize if needed
We might even extend the manager to have some methods that set up more complex scenarios with one method call, like create_complex_workflow_graph()
.
Shouldn't take too much time to implement and would simplify things, I think.
store_dataset_data_location(dataset=dataset, location=location) | ||
|
||
if updated_files: | ||
_update_datasets_files_metadata(client, updated_files=updated_files, deleted_files=[], delete=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we only update metadata if a file was actually changed during the pull?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we add data from S3, it's not always possible to get the checksum from S3. So, we leave an empty checksum in the metadata and then we calculate the checksum here once we've downloaded the data. This is for such cases.
Co-authored-by: Ralf Grubenmann <[email protected]>
Description
Adds a
dataset pull
command to download/copy external data to the local project.Fixes #2973
TODO: