[WIP] 'dvc vdir' command for updating large remote datasets #4900
This only contains the necessary changes to introduce a new command, such as the help messages. Moreover, it only includes two (2) of the `dvc vdir` subcommands, `pull` and `add (cp)`, and they are not implemented yet. Fixes #4657

This also changes its usage. Fixes #4657

Implemented these behaviors:
- pull the `.dvc/cache/*.dir`
- create the directory structure

Fixes #4657

Users can now add a data file to a tracked remote storage by first downloading the virtual dir using `dvc vdir pull`. This prevents the entire dataset from being downloaded and allows the user to interact with the files partially. Fixes #4657
Hi @BurnzZ! Thanks for the PR! Just a heads up that I'll look into this today/tomorrow. Thank you for your patience!
Wow! So detailed.
Thanks again for the PR and detailed proposal! I agree that …
Hi @efiop! I appreciate your time taking a look at my proposal. I added the … I had the wrong idea. Thanks for raising this, as I needed that bump to take a look into it further. Moreover, I've filed an issue in iterative/dvc.org#1963 to add an example in the docs for such use cases. Let's drop the proposed … Thanks!
@BurnzZ Makes sense! And what about other commands? We already have …
```python
@dataclass
class VirtualDirData:
    operation: str  # e.g. cp, mv, rm
```
Note that we already have things like `DvcTree` that have copy/move/remove implemented; maybe it would be better to use those somehow?
Interesting! The `VirtualDirData` is currently being injected inside the `BaseTree`, so I think I can reuse the implementations in `DvcTree`. Cheers for the heads up!
Hi @efiop, thanks for pointing me to the existing commands. I played with some hypothetical scenarios with them today and have observed (like you said) that each of these commands has separate semantic expectations:
I think it's a bit challenging to set the user's expectations when we plug the new functionality into these existing commands (though I might be totally wrong, and perhaps users would find it way easier), mostly because of the odd semantics. On the other hand, I think creating the cp/mv/rm operations under the new `dvc vdir` command would be clearer. Please let me know if you're okay with proceeding with the `dvc vdir` route. Thanks!
@BurnzZ Yep, I'm fine with a `dvc vdir` command.

Same as …

For …

As I've mentioned above …
Cool! Looking forward to seeing all of these come together. I'll work on the PoC in my free time this week.
Yes, as well as the other operations (cp/rm), as they would mostly be used in those cases where only a tiny subset of the data would be updated. However, at the moment …
Yes, that's right!
Thanks for reminding me about the stage and output names; I will definitely need to make sure that these also fit into the new functionality. Come to think of it, re-using …
Yep, you're on point here. I failed to explicitly mention the key differences between … I've taken a look with a fresh pair of eyes at how we might update … Suppose we have the following workspace:
In its current state, …

while …

To fit into our new use case, here are some considerations we observe:
At this point, apart from the errors we've observed in (1), (3), (4), and (5), I think it might already be pretty close to what we want to achieve. They both need to operate on cases when: a) the entire data is available in the workspace. So far, the changes we make in (5) might be backward incompatible, but at the same time they would be consistent with the other operations. I'll explore some possible scenarios here when developing the PoC. With this in mind, I'll make sure to check out the existing implementation of … Please let me know if I missed any important points above. :) Cheers!
@efiop @BurnzZ …
Hello @karajan1001!
Yes, that's currently on the table, alongside the newly proposed … For the remaining ones, it's still not decided which route we're going to take. Nonetheless, it's a good exercise to consider both options at this early stage for technical due diligence before moving forward. :)
What I'll be doing in the next few days is try to implement … In that way, we have two interfaces to interact and explore with at the beginning of the full PoC. The advantage would be that if we decide on a particular interface moving forward, it would simply be a matter of modifying the commands rather than the implementation. I'd probably provide another branch so that we can try two (2) different interfaces, say … In the meantime, do you have some thoughts about possible flags or params we can add to the … On the other hand, how do the sample interfaces from section 4.1.3 look? Looking forward to your additional thoughts! :) @efiop Kindly let me know if the track I'd be taking is okay, or if there's a much better approach to providing the PoC. Cheers!
@BurnzZ …
Closing for another attempt/iteration in the future.
- I have followed the Contributing to DVC checklist.
- If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

Thank you for the contribution - we'll try to review it as soon as possible.
1. Overview
Attempts to introduce a Proof of Concept for #4657
2. Proposal
The proposed approach is to create a "virtual" directory inside the local workspace that reflects the large datasets contained in remote storage. In this way, users can navigate and transact with it without the actual data files being present locally.
3. Current State of the POC
3.1 Sample Workflow
Suppose we have the following remote storage directory structure:
As we can see, it contains 1 million images, which can take a long time to download.
The main objective is to create a workflow that prevents us from downloading 100% of the dataset and yet is still able to perform the following file transactions:

- `cp` (because we're still at the POC stage, only this one is implemented for now)
- `rm`
- `mv`
The first step is to run the new `dvc vdir` command with its `pull` subcommand. This does the following:

- Creates the directory structure locally so that the user can later `add` files and move files around; `rm` and `mv` operations are possible since only the directory structure is created and the actual files are not downloaded.
- Downloads the `.dvc/cache/xx/<hash>.dir` file. This is important for recalculating the new hash later on, and it serves as a reference for other operations.

Now that we have the essential information about the dataset, we can now add a single file using:
It does the following:

- Copies the `<src>` file into the workspace, e.g. the `dog.1_000_001.jpg` file.
- Updates `.dvc/cache/xx/<hash>.dir` with the new data entry.
- Updates `data.dvc`.

So far, this is the current status of our workspace:
The user can then proceed with the usual workflow:
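To make the `pull` step above concrete, here is an illustrative sketch in Python. It assumes the `.dir` cache file is a JSON list of `{"md5": ..., "relpath": ...}` entries (DVC's cache manifest format); the helper name `create_virtual_dir` is hypothetical and not part of DVC.

```python
import json
import os

def create_virtual_dir(dir_manifest_path, workspace_root):
    """Recreate the directory structure of a tracked dataset without
    downloading any data files.

    The .dir cache file is assumed to be a JSON list of
    {"md5": ..., "relpath": ...} entries.
    """
    with open(dir_manifest_path) as f:
        entries = json.load(f)
    for entry in entries:
        # Only create parent directories; the files themselves stay remote.
        parent = os.path.dirname(os.path.join(workspace_root, entry["relpath"]))
        if parent:
            os.makedirs(parent, exist_ok=True)
    return len(entries)
```

Since only directories are created, the workspace stays tiny no matter how many files the dataset tracks.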
4. Other functionalities beyond the MVP/POC
4.1 Proposed additional commands/behavior
To create a more holistic user workflow revolving around "partially updating large data sets", the following features in section 4.1.x are also proposed:
4.1.1 `dvc vdir cp` (default: `<src>` file is in remote storage)

The current implementation only works when the `--local-src` flag is present, since adding additional files is only supported locally (this is what was specifically required in #4657).

In the future, users could opt to use tracked files from remote storages as sources directly. However, it's mostly an illusion, because it's mainly an abstraction of:

- Downloading the `<src>` file from the source remote storage into the local workspace.
- `dvc vdir cp --local-src <src-somewhere-in-tmp> <dst>`
4.1.2 `dvc vdir list` ✨ new subcommand ✨

(UPDATE: dropped the proposed `dvc vdir list` in lieu of the existing `dvc list` command. See #4900 (comment) below.)

Since the files are not downloaded when performing `dvc vdir pull`, users need a way to navigate through this "virtual" directory. We can use the file paths in `.dvc/cache/xx/<hash>.dir` as a reference for listing out the files. Filtering out the files could then be something like:

```
dvc vdir list data/train | head
dvc vdir list data/train | tail
dvc vdir list data/train | less
dvc vdir list data/train | grep <expr>
```
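Under the hood, such a listing could be served straight from the `.dir` manifest without touching the remote. A minimal sketch (the `vdir_list` helper is hypothetical; it assumes the manifest is a JSON list of `{"md5", "relpath"}` entries):

```python
import json

def vdir_list(dir_manifest_path, prefix=""):
    """List tracked file paths from a .dir cache manifest without
    downloading any data. `prefix` narrows the listing, similar to
    passing a sub-path such as data/train."""
    with open(dir_manifest_path) as f:
        entries = json.load(f)
    return sorted(
        e["relpath"] for e in entries if e["relpath"].startswith(prefix)
    )
```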
4.1.3 `dvc vdir rm` ✨ new subcommand ✨

Does the following:

- Updates `.dvc/cache/xx/<hash>.dir`, specifically removing the deleted file.
- Removes the file from `data/`.
- Updates `.dvc/cache/xx/<hash>.dir` with the deleted file entry removed.

4.1.4 `dvc vdir mv` ✨ new subcommand ✨

Does the following:
- Renames the file, e.g. to the `dog.800_001-renamed.jpg` file (or even just reuses the hash).
- Updates `.dvc/cache/xx/<hash>.dir`, specifically the paths that were moved.

5. Caveats
- `dvc vdir mv <src> <dst>` should error out when the `<dst>` path already exists for another file, i.e. if another file at `<dst>` is already being tracked.
- The same applies to `dvc vdir cp`, but the system could treat it as a valid and intentional file update operation.

6. Tracking some Thoughts/Ideas
Some thoughts/ideas after discussing with @shcheklein and @dmpetrov:

1. `dvc vdir pull <target>` could simply be merged into the existing `dvc pull` command by adding a param like `--vdir`, `--dir-only`, etc.
2. The `dvc vdir list` could also be merged into the existing `dvc list` command using a new param like `--vdir`, `--dir-only`, etc. (UPDATE: dropped the proposed `dvc vdir list` in lieu of the existing `dvc list` command. See #4900 (comment) below.)
3. `--rev` for src and dst.

7. Checklist
7.1 Documentation Updates

- `dvc vdir` (doc/command-reference/vdir)
  - `pull` subcommand
  - `list` (not implemented yet for the POC) (UPDATE: dropped the proposed `dvc vdir list` in lieu of the existing `dvc list` command. See #4900 (comment) below.)
  - `cp`
  - `rm` (not implemented yet for the POC)
  - `mv` (not implemented yet for the POC)

7.2 Tests