feat(dataset): copy/pull data from external storage #3066

Merged: 20 commits into develop on Aug 26, 2022

Conversation

@mohammad-alisafaee (Contributor) commented Aug 4, 2022

Description

Adds a dataset pull command to download/copy external data to the local project.

Fixes #2973

TODO:

  • Add documentation once this feature is finalized
  • Test with a private S3 bucket

@mohammad-alisafaee mohammad-alisafaee marked this pull request as ready for review August 8, 2022 05:24
@mohammad-alisafaee mohammad-alisafaee requested a review from a team as a code owner August 8, 2022 05:24
@Panaetius (Member) left a comment:

Looks really nice. I have some suggestions/questions, but the functionality is all good.

@@ -52,6 +53,16 @@ def get_database_helper(database_dispatcher: IDatabaseDispatcher):
    return get_database_helper()


def get_dataset_gateway() -> "DatasetGateway":
@Panaetius (Member) commented:

I'm honestly not sure I'm a big fan of these kinds of methods getting added everywhere. They might make the method body more readable, but they lose the big plus that, in testing, you can pass a mocked instance of an injected class directly to test the method in isolation. With this, we must always set up fake injection when testing single methods.

I quite like being able to do https://github.com/SwissDataScienceCenter/renku-python/blob/develop/tests/core/commands/test_graph.py#L318-L323

Especially if the mocked instances have to be changed between calls, updating injection/replacing existing entries is a nightmare compared to just passing another instance.
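To make the contrast being described concrete, here is a minimal sketch of testing a function that takes its gateway as a plain parameter instead of resolving it via a get_* helper. The function and test names are illustrative, not the actual renku-python API:

# Illustrative only: not the real renku-python code or API.
from unittest.mock import MagicMock


def count_active_datasets(dataset_gateway) -> int:
    """Count the datasets reported by the gateway that is passed in explicitly."""
    return len(dataset_gateway.get_all_active_datasets())


def test_count_active_datasets():
    # No injection context manager needed: configure a mock for this test only.
    gateway = MagicMock()
    gateway.get_all_active_datasets.return_value = [object(), object()]

    assert count_active_datasets(gateway) == 2

Swapping the mocked instance between calls is then just a matter of constructing another mock, with no need to update or replace injection bindings.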

@mohammad-alisafaee (Contributor, Author) replied:

TBH, this is the first test I've seen that doesn't use injection. Testing like this is only possible if the method doesn't call other methods that use injection; otherwise, we still need to set up a fake injection.

I've changed its usage in my code since it doesn't call other injected methods.

@Panaetius (Member) replied:

> TBH, this is the first test that I see which doesn't use injection.

How much of that is because we couldn't do that in the past?
And for smaller, contained methods/functions we should probably just pass things along instead of injecting again and again; in some places I think I went a bit overboard with injecting everything. But your point is valid, it wouldn't work for all cases.

I'm still not sure hiding the injection behind get_* methods makes things cleaner. At least with the way things are now, you see directly in the signature what is needed, and I don't think it makes a significant difference in the number of characters typed, except for the dispatchers with the client = client_dispatcher.current_client ugliness. And the dispatcher stuff hopefully becomes easier when we remove LocalClient (I've written an epic for that, maybe you could take a look at it: #2254).

But I think the biggest pain point is our DI in general. The reason I prefer just passing in classes over faking injection is that faking injection is cumbersome. Helpers like with client_database_injection_manager(client): help, but they're an ugly hack: having to write it for every call in a test is noise, and if you need to mock something else, like ProjectGateway, you need to add a new fixture or patch it into the injection somehow, which is also annoying.

So this discussion just gave me an idea. Ideally, I'd like to have everything essentially work in tests with mocks/Dummy implementations, but with an easy way to override something, as a fixture.

I imagine something like

def my_test(dummy_injections): 
    assert x() == 1 # normal call that works with default setup

    dummy_injections.replace_project(MagicMock(spec=Project, name="other project"))

    assert y() == 1 # now IProjectGateway and IClientDispatcher.current_client.project both return the above project

dummy_injections has everything set up (dummy gateways, dummy database, dummy dispatchers, etc.), but essentially empty/with sane defaults. So it can be used directly in tests without having to do anything, but you can just add stuff to it when needed. Essentially it'd be one manager class with methods for adding datasets, activities, the project, etc., with defaults for the required ones, plus dummy implementations of our interfaces that just return what was set on the manager.

class InjectionManager:
    def __init__(self):
        self._datasets: List[Dataset] = []

    def add_dataset(self, dataset: Dataset):
        self._datasets.append(dataset)


class DummyDatasetGateway(IDatasetGateway):
    _injection_manager = ...

    def get_by_id(self, id) -> Dataset:
        return self._injection_manager._datasets[0]

    def get_all_active_datasets(self) -> List["Dataset"]:
        return self._injection_manager._datasets

Just to illustrate the concept. The dummy_injections fixture:

  • creates the Manager with some default values
  • creates all the Dummy interface implementations and sets the manager on them
  • sets up injection with those Dummy implementations
  • returns the manager so it's available in tests to customize if needed

We might even extend the manager to have some methods that set up more complex scenarios with one method call, like create_complex_workflow_graph().

Shouldn't take too much time to implement and would simplify things, I think.
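As a rough sketch of the idea in the list above, the fixture could look something like the following. The classes and the way the dummy gateway is attached are stand-ins, not the real renku-python interfaces or DI container:

# Illustrative sketch only; the real implementation would bind the dummy
# classes into renku's dependency-injection setup instead of the manager.
from typing import List

import pytest


class Dataset:  # stand-in for renku's Dataset model
    def __init__(self, name: str):
        self.name = name


class InjectionManager:
    """Holds the test data that the dummy implementations serve."""

    def __init__(self):
        self.datasets: List[Dataset] = []

    def add_dataset(self, dataset: Dataset) -> None:
        self.datasets.append(dataset)


class DummyDatasetGateway:
    """Dummy IDatasetGateway-like implementation backed by the manager."""

    def __init__(self, manager: InjectionManager):
        self._manager = manager

    def get_all_active_datasets(self) -> List[Dataset]:
        return self._manager.datasets


@pytest.fixture
def dummy_injections():
    manager = InjectionManager()
    # This is where the dummy implementations would be registered with the
    # DI container; here they are simply attached to the manager.
    manager.dataset_gateway = DummyDatasetGateway(manager)
    return manager

A test would then take dummy_injections as a fixture, call add_dataset(...) (or a scenario helper like create_complex_workflow_graph()) to shape the setup, and exercise the code under test without per-call injection context managers.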

store_dataset_data_location(dataset=dataset, location=location)

if updated_files:
    _update_datasets_files_metadata(client, updated_files=updated_files, deleted_files=[], delete=False)
@Panaetius (Member) commented:

Should we only update metadata if a file was actually changed during the pull?

@mohammad-alisafaee (Contributor, Author) replied:

When we add data from S3, it's not always possible to get a checksum from S3, so we leave an empty checksum in the metadata and calculate the checksum here once we've downloaded the data. This update is for such cases.
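A hedged sketch of that flow, with a stand-in file record and an illustrative hash algorithm rather than renku's actual metadata model or checksum scheme: after the pull, files whose metadata carries an empty checksum get one computed locally and are collected as updated_files.

import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class FileRecord:
    """Stand-in for a dataset file's metadata entry (not renku's model)."""
    path: str
    checksum: str = ""


def sha256_checksum(path: Path) -> str:
    """Hash a downloaded file in chunks; the algorithm here is illustrative."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fill_missing_checksums(files: List[FileRecord], data_dir: Path) -> List[FileRecord]:
    """Return the files whose empty checksums were filled in after the pull."""
    updated_files = []
    for record in files:
        if not record.checksum:  # the storage backend did not provide one at add time
            record.checksum = sha256_checksum(data_dir / record.path)
            updated_files.append(record)
    return updated_files

The returned updated_files would then correspond to the list handed to _update_datasets_files_metadata in the snippet above.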

Base automatically changed from 2970-create-dataset-with-s3-storage to develop August 8, 2022 14:48
@mohammad-alisafaee mohammad-alisafaee enabled auto-merge (squash) August 26, 2022 14:27
@mohammad-alisafaee mohammad-alisafaee merged commit 289b1af into develop Aug 26, 2022
@mohammad-alisafaee mohammad-alisafaee deleted the 2973-copy-from-s3 branch August 26, 2022 15:30