feat(dataset): copy/pull data from external storage #3066

Merged: 20 commits into develop on Aug 26, 2022

Conversation

@mohammad-alisafaee (Contributor) commented Aug 4, 2022

Description

Adds a dataset pull command to download/copy external data to the local project.

Fixes #2973

TODO:

  • Add documentation once this feature is finalized
  • Test with a private S3 bucket

@mohammad-alisafaee mohammad-alisafaee marked this pull request as ready for review August 8, 2022 05:24
@mohammad-alisafaee mohammad-alisafaee requested a review from a team as a code owner August 8, 2022 05:24
@Panaetius (Member) left a comment:

Looks really nice. I have some suggestions/questions, but the functionality is all good.

@@ -52,6 +53,16 @@ def get_database_helper(database_dispatcher: IDatabaseDispatcher):
    return get_database_helper()


def get_dataset_gateway() -> "DatasetGateway":
@Panaetius (Member) commented:

I'm honestly not sure I'm a big fan of these kinds of methods getting added everywhere. They might make the method body more readable, but they lose the big plus that, in testing, you can pass a mocked instance of an injected class directly to test the method in isolation. With this, we must always set up fake injection when testing single methods.

I quite like being able to do https://github.com/SwissDataScienceCenter/renku-python/blob/develop/tests/core/commands/test_graph.py#L318-L323

Especially if the mocked instances have to be changed between calls, updating injection/replacing existing entries is a nightmare compared to just passing another instance.
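To make the contrast being described concrete, here is a minimal sketch of testing a function that takes its gateway as a plain parameter instead of resolving it via a get_* helper. The function and test names are illustrative, not the actual renku-python API:

# Illustrative only: not the real renku-python code or API.
from unittest.mock import MagicMock


def count_active_datasets(dataset_gateway) -> int:
    """Count the datasets reported by the gateway that is passed in explicitly."""
    return len(dataset_gateway.get_all_active_datasets())


def test_count_active_datasets():
    # No injection context manager needed: configure a mock for this test only.
    gateway = MagicMock()
    gateway.get_all_active_datasets.return_value = [object(), object()]

    assert count_active_datasets(gateway) == 2

Swapping the mocked instance between calls is then just a matter of constructing another mock, with no need to update or replace injection bindings.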

@mohammad-alisafaee (Contributor, Author) replied:

TBH, this is the first test I've seen that doesn't use injection. Testing like this is only possible if the method doesn't call other methods that use injection; otherwise, we still need to set up a fake injection.

I've changed its usage in my code since it doesn't call other injected methods.

@Panaetius (Member) replied:

> TBH, this is the first test that I see which doesn't use injection.

How much of that is because we couldn't do that in the past?
And for smaller, contained methods/functions we should probably just pass things along instead of injecting again and again; in some places I think I went a bit overboard with injecting everything. But your point is valid, it wouldn't work for all cases.

I'm still not sure hiding the injection behind get_* methods makes things cleaner. At least with the way things are now, you see directly in the signature what is needed, and I don't think it makes a significant difference in the number of characters typed, except for the dispatchers with the client = client_dispatcher.current_client ugliness. And the dispatcher stuff hopefully becomes easier when we remove LocalClient (I've written an epic for that, maybe you could take a look at it: #2254).

But I think the biggest pain point is our DI in general. The reason I prefer just passing in classes over faking injection is that faking injection is cumbersome. Helpers like with client_database_injection_manager(client): help, but they're an ugly hack: having to write it for every call in a test is noise, and if you need to mock something else, like ProjectGateway, you need to add a new fixture or patch it into the injection somehow, which is also annoying.

So this discussion just gave me an idea. Ideally, I'd like to have everything essentially work in tests with mocks/Dummy implementations, but with an easy way to override something, as a fixture.

I imagine something like

def my_test(dummy_injections): 
    assert x() == 1 # normal call that works with default setup

    dummy_injections.replace_project(MagicMock(spec=Project, name="other project"))

    assert y() == 1 # now IProjectGateway and IClientDispatcher.current_client.project both return the above project

dummy_injections has everything set up (dummy gateways, dummy database, dummy dispatchers, etc.), but essentially empty/with sane defaults. So it can be used directly in tests without having to do anything, but you can just add stuff to it when needed. Essentially it'd be one manager class with methods for adding datasets, activities, the project, etc., with defaults for the required ones, plus dummy implementations of our interfaces that just return what was set on the manager.

class InjectionManager:
    def __init__(self):
        self._datasets: List[Dataset] = []

    def add_dataset(self, dataset: Dataset):
        self._datasets.append(dataset)


class DummyDatasetGateway(IDatasetGateway):
    _injection_manager = ...

    def get_by_id(self, id) -> Dataset:
        return self._injection_manager._datasets[0]

    def get_all_active_datasets(self) -> List["Dataset"]:
        return self._injection_manager._datasets

Just to illustrate the concept. The dummy_injections fixture:

  • creates the Manager with some default values
  • creates all the Dummy interface implementations and sets the manager on them
  • sets up injection with those Dummy implementations
  • returns the manager so it's available in tests to customize if needed

We might even extend the manager to have some methods that set up more complex scenarios with one method call, like create_complex_workflow_graph().

Shouldn't take too much time to implement and would simplify things, I think.
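As a rough sketch of the idea in the list above, the fixture could look something like the following. The classes and the way the dummy gateway is attached are stand-ins, not the real renku-python interfaces or DI container:

# Illustrative sketch only; the real implementation would bind the dummy
# classes into renku's dependency-injection setup instead of the manager.
from typing import List

import pytest


class Dataset:  # stand-in for renku's Dataset model
    def __init__(self, name: str):
        self.name = name


class InjectionManager:
    """Holds the test data that the dummy implementations serve."""

    def __init__(self):
        self.datasets: List[Dataset] = []

    def add_dataset(self, dataset: Dataset) -> None:
        self.datasets.append(dataset)


class DummyDatasetGateway:
    """Dummy IDatasetGateway-like implementation backed by the manager."""

    def __init__(self, manager: InjectionManager):
        self._manager = manager

    def get_all_active_datasets(self) -> List[Dataset]:
        return self._manager.datasets


@pytest.fixture
def dummy_injections():
    manager = InjectionManager()
    # This is where the dummy implementations would be registered with the
    # DI container; here they are simply attached to the manager.
    manager.dataset_gateway = DummyDatasetGateway(manager)
    return manager

A test would then take dummy_injections as a fixture, call add_dataset(...) (or a scenario helper like create_complex_workflow_graph()) to shape the setup, and exercise the code under test without per-call injection context managers.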

store_dataset_data_location(dataset=dataset, location=location)

if updated_files:
    _update_datasets_files_metadata(client, updated_files=updated_files, deleted_files=[], delete=False)
@Panaetius (Member) commented:

Should we only update metadata if a file was actually changed during the pull?

@mohammad-alisafaee (Contributor, Author) replied:

When we add data from S3, it's not always possible to get a checksum from S3, so we leave an empty checksum in the metadata and calculate the checksum here once we've downloaded the data. This update is for such cases.
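A hedged sketch of that flow, with a stand-in file record and an illustrative hash algorithm rather than renku's actual metadata model or checksum scheme: after the pull, files whose metadata carries an empty checksum get one computed locally and are collected as updated_files.

import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class FileRecord:
    """Stand-in for a dataset file's metadata entry (not renku's model)."""
    path: str
    checksum: str = ""


def sha256_checksum(path: Path) -> str:
    """Hash a downloaded file in chunks; the algorithm here is illustrative."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fill_missing_checksums(files: List[FileRecord], data_dir: Path) -> List[FileRecord]:
    """Return the files whose empty checksums were filled in after the pull."""
    updated_files = []
    for record in files:
        if not record.checksum:  # the storage backend did not provide one at add time
            record.checksum = sha256_checksum(data_dir / record.path)
            updated_files.append(record)
    return updated_files

The returned updated_files would then correspond to the list handed to _update_datasets_files_metadata in the snippet above.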

Base automatically changed from 2970-create-dataset-with-s3-storage to develop August 8, 2022 14:48
@mohammad-alisafaee mohammad-alisafaee enabled auto-merge (squash) August 26, 2022 14:27
@mohammad-alisafaee mohammad-alisafaee merged commit 289b1af into develop Aug 26, 2022
@mohammad-alisafaee mohammad-alisafaee deleted the 2973-copy-from-s3 branch August 26, 2022 15:30