[WIP] 'dvc vdir' command for updating large remote datasets #4900


Closed
wants to merge 4 commits

Conversation

BurnzZ
Contributor

@BurnzZ BurnzZ commented Nov 16, 2020

Thank you for the contribution - we'll try to review it as soon as possible. 🙏


1. Overview

Attempts to introduce a Proof of Concept for #4657

NOTES: I haven't added tests yet, in case the entire approach changes based on user feedback. I have, however, tested it in my local environment to make sure the mechanism works.

2. Proposal

The proposed approach is to create a "virtual" directory inside the local workspace that reflects the large data sets contained in remote storage. This way, users can navigate and transact with it without the actual data files being present locally.

3. Current State of the POC

3.1 Sample Workflow

Suppose we have the following remote storage directory structure:

$ tree data/
data
├── train
│   ├── dog.1.jpg
│   ├── ...
│   └── dog.800_000.jpg
└── validation
    ├── dog.800_001.jpg
    ├── ...
    └── dog.1_000_000.jpg

As we can see, it contains one million images, which can take a long time to download. 🐌

The main objective is to create a workflow that avoids downloading 100% of the dataset while still being able to perform the following file operations:

  • cp (because we're still at the POC stage, only this one is implemented for now.)

  • rm

  • mv

    NOTES: Only the cp and rm operations were primarily discussed in #4657 (Mechanism to update a dataset w/o downloading it first). However, I think it's also worth considering the mv use case, since it can potentially affect the overall implementation.

The first step is to perform the following new dvc vdir command with its pull subcommand:

$ dvc vdir pull data/
$ tree data/
data
├── train
└── validation

This does the following:

  1. Recreates the directory structure of the remote storage in the local workspace.
    • This is useful for helping users navigate the directory structure so they can easily add files.
    • However, this becomes a bit of a challenge for the rm and mv operations, since only the directory structure is created and the actual files are not.
  2. Downloads the hash values of the tracked data files into .dvc/cache/xx/<hash>.dir. This is important for recalculating the new hash later on, and it also serves as a reference for the other operations. This is what it looks like:
$ jq '.[0:3]' .dvc/cache/22/e3f61e52c0ba45334d973244efc155.dir
[
  {
    "md5": "ed779276108738fdb2179ccabf9680d9",
    "relpath": "data/train/dog.1.jpg"
  },
  {
    "md5": "10d2a131081a3095726c5721ed31c21f",
    "relpath": "data/train/dog.10.jpg"
  },
  {
    "md5": "0f2bfe74e9c363064087d0cd8a322106",
    "relpath": "data/train/dog.100.jpg"
  }
]
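The .dir file is plain JSON, so the same listing can be inspected from Python as well. A minimal sketch (the helper name is mine, not part of DVC's API):

```python
import json
from pathlib import Path


def load_dir_listing(path):
    """Parse a .dvc/cache/xx/<hash>.dir file: a plain JSON array of
    {"md5": ..., "relpath": ...} entries, one per tracked file."""
    return json.loads(Path(path).read_text())
```

For example, `load_dir_listing(".dvc/cache/22/e3f61e52c0ba45334d973244efc155.dir")[:3]` would return the same three entries shown by the jq call above.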

Now that we have the essential information about the data set, we can add a single file using:

$ dvc vdir cp --local-data ~/my-personal-dataset/dog.1_000_001.jpg ./data/validation/

It does the following:

  • Copies the local <src> file into the workspace
  • Calculates the hash for the new dog.1_000_001.jpg file
  • Updates .dvc/cache/xx/<hash>.dir with the new data entry
  • Updates data.dvc
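The steps above can be sketched as small functions over the .dir listing. This is a simplified illustration only: the helper names and the way the directory hash is derived here are mine, not DVC's internals.

```python
import hashlib
import json


def file_md5(path):
    """MD5 of a file's contents, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def add_entry(entries, md5, relpath):
    """Return a new listing with the copied file added (replacing any
    previous entry at the same relpath), kept sorted by relpath."""
    updated = [e for e in entries if e["relpath"] != relpath]
    updated.append({"md5": md5, "relpath": relpath})
    return sorted(updated, key=lambda e: e["relpath"])


def dir_hash(entries):
    """Stand-in for the <hash>.dir name: hash of the serialized listing."""
    payload = json.dumps(entries, sort_keys=True).encode()
    return hashlib.md5(payload).hexdigest() + ".dir"
```

A `dvc vdir cp` implementation along these lines would copy the file into the workspace, run `file_md5` on it, build the new listing with `add_entry`, and record `dir_hash` plus the updated size/nfiles in data.dvc.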

So far, this is the current status of our workspace:

$ tree data/
data
├── train
└── validation
    └── dog.1_000_001.jpg

$ git status  # note: truncated some other output lines below

        modified:   data.dvc

$ cat data.dvc
outs:
- md5: e1a0cf7fcebe1c12bc0adeaf7ca38dfd.dir
  size: 247416247
  nfiles: 1000001
  path: data

NOTE: The md5, size, and nfiles fields were updated inside the data.dvc file.

The user can then proceed with the usual workflow:

$ dvc push data/

$ git add data.dvc
$ git commit -m "update validation set"
$ git push

4. Other functionalities beyond the MVP/POC

4.1 Proposed additional commands/behavior

To create a more holistic user workflow around "partially updating large data sets", the following features in section 4.1.x are also proposed:

4.1.1 dvc vdir cp (default: <src> file is in remote storage)

The current implementation only works when the --local-data flag is present, since adding files is for now only supported from local sources (this is the case that was specifically required in #4657).

In the future, users could opt to use tracked files from remote storages directly as sources. However, this is mostly an abstraction over:

  1. Download the <src> file from the source remote storage into a temporary local location.
  2. Proceed with the usual dvc vdir cp --local-data <src-somewhere-in-tmp> <dst>
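Since the source file is already tracked, its md5 is available in the listing without hashing anything. A sketch of the two-step abstraction (the `download` callable and the function name are hypothetical):

```python
def vdir_cp_from_remote(entries, src_relpath, dst_relpath, download):
    """Remote-source cp as two steps: materialize the single <src> file
    locally via `download` (a hypothetical callable taking the file's md5
    and a destination path), then update the listing as the local cp does."""
    src = next(e for e in entries if e["relpath"] == src_relpath)
    download(src["md5"], dst_relpath)  # step 1: fetch only <src>
    # step 2: the usual local-source listing update
    updated = [e for e in entries if e["relpath"] != dst_relpath]
    updated.append({"md5": src["md5"], "relpath": dst_relpath})
    return sorted(updated, key=lambda e: e["relpath"])
```

Only the one requested file ever crosses the network; the rest of the dataset stays remote.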

4.1.2 dvc vdir list ✨ new subcommand ✨

(UPDATE: dropped the proposed dvc vdir list in favor of the existing dvc list command. See #4900 (comment) below.)

Since the files are not downloaded when performing dvc vdir pull, users need a way to navigate this "virtual" directory. We can use the file paths in .dvc/cache/xx/<hash>.dir as a reference for listing the files:

$ tree data/
data
├── train
└── validation

$ dvc vdir list data/train
dog.1.jpg
dog.10.jpg
dog.100.jpg

# ... and so on ...

Filtering out the files could then be something like:

  • dvc vdir list data/train | head
  • dvc vdir list data/train | tail
  • dvc vdir list data/train | less
  • dvc vdir list data/train | grep <expr>
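Under the hood, listing is just a filter over the relpaths in the .dir listing; a minimal sketch (the function name is illustrative):

```python
def vdir_list(entries, prefix):
    """Return the names tracked under `prefix`, relative to it,
    without any of the data being present locally."""
    prefix = prefix.rstrip("/") + "/"
    return sorted(
        e["relpath"][len(prefix):]
        for e in entries
        if e["relpath"].startswith(prefix)
    )
```

Because the output is a plain sorted stream of names, the UNIX piping shown above (head, tail, less, grep) composes naturally with it.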

4.1.3 dvc vdir rm ✨ new subcommand ✨

$ dvc vdir rm ./data/validation/dog.999_999.jpg

Does the following:

  • Removes the entry for the deleted file from .dvc/cache/xx/<hash>.dir
  • Recalculates the hash for the entire data/
  • Updates data.dvc accordingly
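The listing update for rm is a one-line filter; recomputing and writing the new dir hash would follow. A sketch (the function name is mine):

```python
def vdir_rm(entries, relpath):
    """Drop one tracked file from the .dir listing; raises if the
    path is not tracked, mirroring a normal rm on a missing file."""
    updated = [e for e in entries if e["relpath"] != relpath]
    if len(updated) == len(entries):
        raise FileNotFoundError(f"'{relpath}' is not tracked")
    return updated
```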

4.1.4 dvc vdir mv ✨ new subcommand ✨

$ dvc vdir mv ./data/validation/dog.800_001.jpg ./data/validation/dog.800_001-renamed.jpg

Does the following:

  • Calculates the hash of the new dog.800_001-renamed.jpg file (or simply reuses the existing hash, since the content is unchanged)
  • Updates the corresponding entries in .dvc/cache/xx/<hash>.dir, specifically the paths that were moved

NOTES: This could just reuse dvc vdir cp and dvc vdir rm underneath.
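Reusing the existing md5 means a rename never rehashes the file itself; only the listing (and hence the directory hash) changes. A sketch under that assumption (the function name is mine):

```python
def vdir_mv(entries, src, dst):
    """Rename a tracked file by rewriting its relpath and reusing its
    md5 (content is unchanged), keeping the listing sorted."""
    if not any(e["relpath"] == src for e in entries):
        raise FileNotFoundError(f"'{src}' is not tracked")
    moved = [
        {"md5": e["md5"], "relpath": dst} if e["relpath"] == src else e
        for e in entries
    ]
    return sorted(moved, key=lambda e: e["relpath"])
```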

5. Caveats

  • A user renaming/moving a file could unintentionally overwrite another file if the <dst> path already belongs to another tracked file.
    • We could implement a check during dvc vdir mv <src> <dst> to see whether another file at <dst> is already being tracked.
  • The scenario is much harder for copying a file with potential conflicts.
    • We could add the same check to dvc vdir cp, but the system could also treat an existing <dst> as a valid and intentional file update.
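The guard described above could be a simple lookup before the rename or copy, with an explicit opt-in for the intentional-overwrite case (the function name and `force` parameter are hypothetical):

```python
def check_dst_conflict(entries, dst, force=False):
    """Raise if <dst> is already tracked, unless the caller explicitly
    opts into treating it as an intentional overwrite."""
    if not force and any(e["relpath"] == dst for e in entries):
        raise FileExistsError(
            f"'{dst}' is already tracked; pass force=True to overwrite"
        )
```

mv could call this unconditionally, while cp could default `force` on (or expose a CLI flag), matching the two cases in the caveats.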

6. Tracking some Thoughts/Ideas

Some thoughts/ideas after discussing with @shcheklein and @dmpetrov.

  1. The dvc vdir pull <target> could simply be merged into the existing dvc pull command by adding a param like --vdir, --dir-only, etc.

  2. The dvc vdir list could also be merged into the existing dvc list command using a new param like --vdir, --dir-only, etc. (UPDATE: dropped the proposed dvc vdir list in favor of the existing dvc list command. See #4900 (comment) below.)

  3. Consider defining --rev for src and dst.

7. Checklist

7.1 Documentation Updates

  • New Command: dvc vdir (doc/command-reference/vdir)
  • New User Guide: "Partially Updating Large Datasets" (doc/user-guide/how-to)
    • Focus on updating datasets without the need to download them first.
    • Also emphasize the dangers mentioned in the "Caveats" section.
    • Show some examples of piping UNIX commands in the dvc vdir list approach in order to better filter large quantities of results.

7.2 Tests

NOTE: I'll be working on this as soon as users are happy with the proposed workflow/approach.

  • Fix Failing tests
  • Create new tests

This only contains the necessary changes to introduce a new command,
such as the help messages. Moreover, this only includes two (2) of
the `dvc vdir` subcommands which are 'pull' and 'add (cp)' and they
are not implemented yet.

Fixes #(4657)
This also changes its usage.

Fixes #(4657)
Implemented these behaviors:
- pull the .dvc/cache/*.dir
- created the directory structure

Fixes #(4657)
Users can now add a data file to a tracked remote storage by
first downloading the virtual dir using `dvc vdir pull`. This
prevents the entire dataset from being downloaded and allows
the user to interact with the files partially.

Fixes #(4657)
@BurnzZ BurnzZ changed the title WIP: 'dvc vdir' command for updating large remote datasets [WIP] 'dvc vdir' command for updating large remote datasets Nov 16, 2020
@efiop
Contributor

efiop commented Nov 19, 2020

Hi @BurnzZ !

Thanks for the PR! Just a heads up that I'll look into this today/tomorrow. Thank you for your patience! 🙏

@karajan1001
Contributor

karajan1001 commented Nov 20, 2020

Wow! So detailed.

@efiop
Contributor

efiop commented Nov 21, 2020

Thanks again for the PR and the detailed proposal! I agree that cp/rm/mv is the set of basic operations we need. As you've noted, we already have very similar commands, like dvc list, that operate on the virtual state already, so I'm wondering whether it would make sense to add this functionality to them, and whether there is a good way (from the UI perspective, e.g. in terms of CLI semantics and possible flags) to do so. Just a thought; curious to hear what you think about it.

@BurnzZ
Contributor Author

BurnzZ commented Nov 21, 2020

Hi @efiop! I appreciate your time taking a look at my proposal.

I added the dvc vdir list in the proposed new subcommands because I initially thought that the existing dvc list command was intended to work only on tracked Git repositories. I thought adding a new functionality to list out remote storages would perhaps break some users' workflows.

I had the wrong idea. Thanks for raising this as I needed that bump to take a look into it further.

Moreover, I've filed an issue in iterative/dvc.org#1963 to add an example in the docs for such use cases.

Let's drop the proposed dvc vdir list as the existing dvc list is already sufficient for this requirement. :) I've updated my proposal above based on this change.

Thanks!

@efiop
Contributor

efiop commented Nov 21, 2020

@BurnzZ Makes sense! And what about the other commands? We already have dvc move and dvc remove (the latter has pretty odd semantics right now, and its usefulness is questionable). WDYT? Again, not pushing or anything; I'm not sure if this is even the correct approach yet, just wondering about your thoughts.


from dataclasses import dataclass

@dataclass
class VirtualDirData:
    operation: str  # e.g. cp, mv, rm
Contributor


Note that we already have things like DvcTree that have copy/move/remove implemented, maybe it would be better to use those somehow?

Contributor Author


Interesting! At the moment, the VirtualDirData is injected inside the BaseTree, so I think I can reuse the implementations in DvcTree. Cheers for the heads up!

@BurnzZ
Contributor Author

BurnzZ commented Nov 22, 2020

Hi @efiop,

Thanks for pointing me to the existing dvc move and dvc remove commands, as I could most certainly utilize a lot of the code underneath them. 🙌

I played with some hypothetical scenarios today and observed (as you said) that each of these commands has separate semantic expectations:

  • dvc move
    • accepts file or dir paths as inputs, not the *.dvc file.
    • moves the file or entire directory to another destination whilst updating the *.dvc filename in the process.
  • dvc remove
    • accepts the *.dvc file as input (not the file or dir paths) and deletes the *.dvc file from the workspace.
    • also handles removal of stages (I think this is the main use case, and the one used most often).

I think it's a bit challenging to set users' expectations if we plug the new functionality into these existing commands (though I might be totally wrong, and perhaps users would find it easier that way), mostly because of the odd semantics.

On the other hand, I think creating the cp/mv/rm operations under the dvc vdir command could give the user a clear expectation that they only operate on partial data in the local workspace. At this early stage, creating an entirely new command would also give us a lot of room to experiment. For example, the "virtual" operations might need some new params that could otherwise convolute the use cases and expectations of the existing dvc move and dvc remove.

Please let me know if you're okay with proceeding with the dvc vdir mv and dvc vdir rm for now, as I can try to start implementing a PoC on them as well. We can then revisit the discussion around integrating them into the existing dvc move and dvc remove when we have a full picture of a functioning workflow. :)

Thanks!

@efiop
Contributor

efiop commented Nov 23, 2020

@BurnzZ Yep, I'm fine with a dvc vdir as a temporary solution 👌

dvc move
accepts file or dir paths as inputs and not the *.dvc file.

Same as dvc vdir mv, right?

move the file or entire directory into another destination whilst updating the *.dvc filename in the process.

For dvc vdir mv, the files are within the same directory (the same dir.dvc), so it won't need to update the filename, but it will need to update the hash.

dvc remove
accepts the *.dvc file as input (and not the file or dir paths) and deletes the *.dvc file from the workspace.

As I've mentioned above, dvc remove has questionable usefulness right now, so changing its behaviour could be discussed. Regarding this particular point, many of our commands (e.g. dvc dag/status/etc) support both stage names and particular output names (including files in a tracked dir). From that standpoint, dvc vdir rm will be removing a file within one dataset that belongs to dir.dvc; same as dvc vdir mv above, it will simply need to update the hash in the dir.dvc.

also handles removal of stages (I think this is where the main use case that is most used often).

👍

@BurnzZ
Contributor Author

BurnzZ commented Nov 24, 2020

@BurnzZ Yep, I'm fine with a dvc vdir as a temporary solution 👌

Cool! Looking forward to seeing all of these come together. 🙏 I'll work on the PoC in my free time this week.

dvc move
accepts file or dir paths as inputs and not the *.dvc file.

Same as dvc vdir mv, right?

Yes, and the same goes for the other operations (cp/rm), as they would mostly be used in cases where only a tiny subset of the data is updated.

However, at the moment dvc move errors out when a specific file (representing a subset of the entire data) is moved, though this is to be expected. (see (1) below)

move the file or entire directory into another destination whilst updating the *.dvc filename in the process.

For dvc vdir mv files are within the same directory, so same dir.dvc so it won't need to update the filename, but will need to update the hash.

Yes, that's right!

As I've mentioned above dvc remove has questionable usefulness right now, so it could be discussed to change its behaviour. Regarding this particular point, many of our commands (e.g. dvc dag/status/etc) support both stage names or particular output names(including files in a tracked dir),

Thanks for reminding me about the stage and output names; I will definitely need to make sure that these also fit into the new functionality.

Come to think of it, reusing dvc remove would actually benefit users, as it's already part of their workflow for transacting with stage or output names.

so from that standpoint dvc vdir rm will be removing a file within one dataset that belongs to dir.dvc, so same as dvc vdir mv above, it will simply need to update the hash in the dir.dvc.

Yep, you're on point here.

I failed to explicitly mention the key differences between dvc move and dvc vdir mv in my previous comment. Cheers for following up to solidify the thoughts. 🙌


I've taken a fresh look at how we might update dvc move and dvc remove so that they fit into the realm of partially updating datasets.

Suppose we have the following workspace:

$ tree new-data
new-data
├── bar
│   └── bar_data.txt
└── foo
    └── foo_data.txt

$ dvc add new-data

At its current state, dvc move would function as:

# (1)
$ dvc move new-data-2/foo/foo_data.txt new-data-2/foo/foo_data-2.txt
# moving files within tracked dirs results in:
#   ERROR: failed to move 'new-data-2/foo/foo_data.txt' -> 'new-data-2/foo/foo_data-2.txt' - Unable to find DVC-file with output 'new-data-2/foo/foo_data.txt'

# (2)
$ dvc move new-data new-data-2  
# moving dirs have no problems

# (3) The *.dir file is still present in the cache. Data files are only deleted.
$ rm -rf new-data/ .dvc/cache/76 .dvc/cache/b1
$ dvc move new-data new-data-2
# this results in:
#	ERROR: unexpected error - [Errno 2] No such file or directory: '/Users/burnzz/dev/iterative/explore/new-data'

while dvc remove would be:

# (4)
$ dvc remove new-data/bar/bar_data.txt
# removing files within tracked dirs results in:
#     ERROR: failed to remove 'new-data/bar/bar_data.txt' - "Stage 'new-data/bar/bar_data.txt' not found inside 'dvc.yaml' file"

# (5)
$ dvc remove new-data
# removing the dir path to the data results in:
#     ERROR: failed to remove 'new-data' - "Stage 'new-data' not found inside 'dvc.yaml' file"

# (6)
$ dvc remove new-data.dvc
# instead of (5), the *.dvc file should be provided as the target

# (7) The *.dir file is still present in the cache. Data files are only deleted.
$ rm -rf new-data/ .dvc/cache/76 .dvc/cache/b1
$ dvc remove new-data.dvc

To fit into our new use case, here are some considerations we observe:

  1. dvc move needs to be updated so it can handle a subset of files in a tracked dir.
  2. All good here. 👍
  3. To simulate a large data set that is not present in the workspace or cache, some dirs were deleted. Only the *.dir file in the cache and the *.dvc file in the workspace remain. dvc move needs to be updated to handle virtual directories.
  4. dvc remove needs to be updated so it can handle a subset of files in a tracked dir (same as (1)).
  5. This is a bit tricky since dvc remove does not currently handle dir paths (the *.dvc file is the expected input). We could update it so that dir paths are supported, while making sure it can properly discern between stage names and out paths.
  6. All good here. 👍
  7. All good here. 👍 (since only the *.dvc file is removed)

At this point, apart from the errors we've observed in (1), (3), (4), and (5), I think it might already be pretty close to what we want to achieve. Both commands need to operate in cases when:

a.) The entire data is available in the workspace.
b.) The data is not present in the workspace but can be found in .dvc/cache/xx/yy.dir (made possible by dvc vdir pull).

So far, the changes we'd make in (5) might be backward incompatible, but at the same time they would be consistent with the other operations. I'll explore some possible scenarios here when developing the PoC.

With this in mind, I'll make sure to check the existing implementations of dvc move and dvc remove to see how straightforward the update would be, if we decide to change their existing behaviors.

Please let me know if I missed any important points above. :) Cheers!

@karajan1001
Contributor

@efiop @BurnzZ
Excuse me, it looks like we are about to reuse dvc remove / dvc move / dvc list instead of adding some new command?
But I have a question: how can we make dvc remove / dvc move / dvc list work on a dataset without downloading it?
Currently, all of them need to pull everything down first. Changing the cache .dir file is only one of their functions; they also take effect on the workspace.

@BurnzZ
Contributor Author

BurnzZ commented Nov 25, 2020

Hello @karajan1001 👋

Excuse me, it looks like that we are about to reuse dvc remove / dvc move / dvc list instead of some new command?

Yes, that's currently on the table, alongside the newly proposed dvc vdir [cp/rm/mv] suite of commands. At this point, though, the proposed dvc vdir list has been dropped, since the existing dvc list already fulfills its functionality.

For the remaining ones, it's not yet decided which route we're going to take. Nonetheless, it's a good exercise to consider both options at this early stage for technical due diligence before moving forward. :)

But I have a question: how can we make dvc remove / dvc move / dvc list work on a dataset without downloading it? Currently, all of them need to pull everything down first. Changing the cache .dir file is only one of their functions; they also take effect on the workspace.

What I'll be doing in the next few days is trying to implement dvc vdir rm by calling a modified dvc remove underneath. The same goes for dvc vdir mv and dvc move.

That way, we have two interfaces to interact with and explore at the beginning of the full POC. The advantage is that if we later settle on a particular interface, it will simply mean changing the commands rather than the implementation.

I'll probably provide another branch so that we can try two (2) different interfaces, say dvc vdir rm vs dvc remove, and see how natural each one feels in a user's workflow. To do that, I'll make sure the POC is fully working so we can at least try it in some of our personal data projects as a pragmatic approach.

In the meantime, do you have any thoughts on possible flags or params we could add to dvc remove / dvc move for partially updating datasets?

On the other hand, how do the sample interfaces from sections 4.1.3 (dvc vdir rm) and 4.1.4 (dvc vdir mv) above feel?

Looking forward to your additional thoughts! :)


@efiop Kindly let me know if the track I'd be taking is okay, or if there's a much better approach to providing the PoC. Cheers!

@karajan1001
Contributor

karajan1001 commented Nov 26, 2020

@BurnzZ
Sounds good. mv, rm, add, and ls are basic filesystem operations on DVC cache files; I agree with separating them at the interface level.

@efiop efiop closed this Nov 26, 2020
@efiop
Contributor

efiop commented Nov 26, 2020

Closing for another attempt/iteration in the future.
