Skip to content

add: support virtual operation #9389

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 11, 2023
Merged

Conversation

skshetry
Copy link
Collaborator

@skshetry skshetry commented May 2, 2023

To try this out, make sure to re-install all dependencies.

pip install -e ".[dev]"

asciicast

The add and commit takes a path which can be granular.
If it's granular, the path can be of a file or a directory. If the path does not exist, it'll be considered as a remove operation and it will try to commit without the files under that prefix. Otherwise, it'll try to update the dataset.

If the path is already being tracked and is not a granular path, in case of commit, it'll fail and in add, it'll create a new stage as expected.

You can use dvc data status or dvc-data diff <old.dir hash> <new.dir hash> to find changes.

@codecov
Copy link

codecov bot commented May 2, 2023

Codecov Report

Patch coverage: 93.54% and project coverage change: -0.04 ⚠️

Comparison is base (06d8e3c) 91.59% compared to head (556ea1b) 91.56%.

❗ Current head 556ea1b differs from pull request most recent head ec29a1c. Consider uploading reports for the commit ec29a1c to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9389      +/-   ##
==========================================
- Coverage   91.59%   91.56%   -0.04%     
==========================================
  Files         487      488       +1     
  Lines       37857    38060     +203     
  Branches     5443     5462      +19     
==========================================
+ Hits        34677    34851     +174     
- Misses       2624     2645      +21     
- Partials      556      564       +8     
Impacted Files Coverage Δ
dvc/commands/add.py 92.85% <0.00%> (-7.15%) ⬇️
dvc/commands/commit.py 72.41% <0.00%> (-8.36%) ⬇️
dvc/repo/experiments/queue/base.py 86.02% <ø> (ø)
tests/unit/repo/test_repo.py 100.00% <ø> (ø)
dvc/stage/__init__.py 93.42% <88.00%> (-1.27%) ⬇️
dvc/output.py 88.25% <89.10%> (-1.48%) ⬇️
dvc/repo/add.py 98.00% <100.00%> (-2.00%) ⬇️
dvc/repo/commit.py 100.00% <100.00%> (+18.75%) ⬆️
dvc/repo/experiments/executor/base.py 85.29% <100.00%> (ø)
tests/func/test_add.py 98.74% <100.00%> (-0.01%) ⬇️
... and 7 more

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@skshetry skshetry force-pushed the virtual-operation branch 2 times, most recently from 506425a to 40a371b Compare May 5, 2023 11:47
@skshetry skshetry changed the title output: add virtual_update operation add/commit: support virtual operation May 5, 2023
@skshetry skshetry marked this pull request as ready for review May 5, 2023 11:50
@skshetry
Copy link
Collaborator Author

skshetry commented May 5, 2023

@dberenbaum, I have removed prompts from commit, as it likely makes things complicated.

@skshetry skshetry force-pushed the virtual-operation branch from 445c657 to 46290ab Compare May 5, 2023 13:26
@dberenbaum
Copy link
Contributor

@skshetry Let me know when you want me to take a look.

Only semi-related and not a blocker for this PR, but we should really have a summary of changes from add/commit, like 2 files added, 1 file modified, 1 file removed.

@skshetry skshetry force-pushed the virtual-operation branch from 32f6f49 to 1c19be3 Compare May 5, 2023 14:21
@skshetry

This comment was marked as resolved.

@skshetry skshetry force-pushed the virtual-operation branch from c27641d to 22423fe Compare May 8, 2023 07:05
@skshetry

This comment was marked as resolved.

@skshetry
Copy link
Collaborator Author

skshetry commented May 8, 2023

Just one edgecase regarding dvc commit.

dvc commit has ties to the pipelines, in that it is used to force dvc to commit all workspace changes and generate dvc.lock files, etc.

Plain dvc commit without any argument does that. If you do dvc commit <stage>, it will only update the given stage, which makes sense.

But what should dvc commit data do? Ideally, I'd argue that it should also update dependencies, not just output.

Similarly, it feels like users should be able to do dvc commit params.yaml, and update lockfiles.

@skshetry
Copy link
Collaborator Author

skshetry commented May 8, 2023

@dberenbaum, I decided not to implement support for external outputs, so this should be ready, whenever you have time.

@skshetry skshetry force-pushed the virtual-operation branch from 40224cb to 110bd2f Compare May 9, 2023 04:09
Comment on lines 45 to 51
commit_parser.add_argument(
"-f",
"--force",
action="store_true",
default=False,
help="Commit even if hash value for dependencies/outputs changed.",
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a way, removing prompts could be seen as a breaking change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the option itself is more breaking, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also seems like a completely unrelated change

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point @daavoo. I will bring the flag back. But still the default behavior that it no longer fails is a breaking change.

@skshetry skshetry requested a review from a team May 9, 2023 04:18
@skshetry skshetry self-assigned this May 9, 2023
@skshetry skshetry added feature is a feature A: cli Related to the CLI A: data-management Related to dvc add/checkout/commit/move/remove labels May 9, 2023
Copy link
Contributor

@dberenbaum dberenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry Looks amazing! I did a little QA and everything worked as expected. I even like the trick to use dvc add missing_path to remove a file.

But what should dvc commit data do? Ideally, I'd argue that it should also update dependencies, not just output.

I'm not sure I follow this part. Is data something with a data.dvc file? What would be its dependencies?

@skshetry
Copy link
Collaborator Author

skshetry commented May 10, 2023

I'm not sure I follow this part. Is data something with a data.dvc file? What would be
its dependencies?

yes, data could be an output of data.dvc. And it might be a dependency in other pipeline stages.

stages:
  stage1:
    cmd: python train.py
    deps: [data]
    outs: model.pkl

I guess my question is, should dvc commit data also update the lockfile of stage1 for data dependency?
Although it's not a blocker for now.

EDIT: related #2094.

@skshetry
Copy link
Collaborator Author

I even like the trick to use dvc add missing_path to remove a file.

We show a hint like following in data status (see #8033), and I really wanted to support that usecase.

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
  (use "dvc checkout <file>..." to discard changes)
        deleted: dataset/train/0/00002.png

Now, the only thing that does not work is dvc checkout on newly-added file. It works on modified file already (although it prompts user to delete), and if the file is deleted, it checks out back again.
But that is questionable, and there's rm <file> to do that.

See https://clig.dev/#ease-of-discovery.

@skshetry skshetry linked an issue May 10, 2023 that may be closed by this pull request
@skshetry
Copy link
Collaborator Author

skshetry commented May 10, 2023

I just had an interesting discussion with @efiop about dvc commit.

The current behaviour of dvc commit is to store files to the cache, as recorded in .dvc files or dvc.lock files. Example: when you do dvc add --no-commit or dvc repro --no-commit, the hashes are recorded on .dvc/dvc.lock files but the files are not saved to the cache.

When you do dvc commit at a later point, if the changes in the workspace match with what you have recorded in .dvc/dvc.lock files, dvc commit will then happily commit those changes to the cache. If they differ, it'll prompt you to confirm.

dvc commit provides a way to complete DVC commands that track data (dvc add, dvc repro, dvc import, etc.), when they have been used with the --no-commit or --no-exec options.

Also see Rapid Iterations.

My understanding was that dvc commit,

stores the current contents of files and directories tracked by DVC in the cache, and updates dvc.lock or .dvc files if/as needed. This forces DVC to accept any changed contents of tracked data currently in the workspace.

And even though docs agree to a certain extent, it seems like the main motivation behind commit seems to be storing files to cache if the hashes in .dvc files match with workspace, but allowing users to commit from workspace with prompt confirmation. --force force commits from workspace, which was more of my understanding.

So, that's why prompts exist, and @efiop said that it'll be a major breaking change to the commit.


So, now with granular commit, what does dvc commit <granular path> even mean? I'd argue that most of the user's intention will be to force update that <granular path> in the workspace, save that to the cache and update records in .dvc files granularly. This is what this PR does.

But if we go with the current implementation, it'll mean checking if the files for the whole output match with the hash in the .dvc file (meaning no virtual operation) and then only save the <granular path> to the cache. And prompt if they mismatch. Which is what we have implemented in main already. The only thing missing is that we don't fetch .dir file, but with --no-commit scenarios, there's unlikely to be a .dir cache.

@efiop
Copy link
Contributor

efiop commented May 10, 2023

Thank you, @skshetry ! 🙏

I also had a discussion with @dberenbaum and he also agrees so far that the way dvc commit is used these days is mostly with -f in mind and prompt is just getting in the way.

At the same time that dvc commit -f behaviour is compensating for the problems of dvc add like not supporting dvc add . and not supporting pipeline stages.

But also dvc commit in the original form of just trying to transfer stuff to cache is not a good analogy with git commit and probably should've been called something like dvc cache-save or something.

So maybe we could indeed just break dvc commit old behavior here and maybe think about improving dvc add in the future to eventually drop dvc commit completely.

Maybe @dberenbaum could share additional thoughts as well.

@skshetry
Copy link
Collaborator Author

@efiop, if you are okay with this, do you mind reviewing this PR? Thanks. :)

@dberenbaum
Copy link
Contributor

I guess my question is, should dvc commit data also update the lockfile of stage1 for data dependency?

No, I think it's important that it doesn't so that it's clear that stage1 is out of date.

@dberenbaum
Copy link
Contributor

We show a hint like following in data status (see #8033), and I really wanted to support that usecase.

DVC uncommitted changes:

Should we change that hint from dvc commit to dvc add?

@skshetry Is that the primary motivation for changing dvc commit? I agree with your interpretation of dvc commit and it's how I'ved used it, but looking at this again, I don't see that having virtual dvc commit is actually needed. Can we simply leave dvc commit alone and only support such operations with dvc add?

@skshetry skshetry force-pushed the virtual-operation branch from 556ea1b to ec29a1c Compare May 11, 2023 02:45
@skshetry skshetry changed the title add/commit: support virtual operation add: support virtual operation May 11, 2023
@skshetry skshetry merged commit 2342099 into iterative:main May 11, 2023
@skshetry skshetry deleted the virtual-operation branch May 11, 2023 02:46
@skshetry
Copy link
Collaborator Author

skshetry commented May 11, 2023

I have split this PR and merged dvc add related changes and created #9440 for commit related changes.

I find dvc commit convenient, not only it works with data, but also pipelines. Likewise, I find the current behaviour confusing, we should not have different sets of behaviour in dvc add and completely different behaviour in dvc commit.

DVC does not have staging, so dvc add --no-commit/dvc repro --no-commit does not make much sense to me. DVC is a meta-versioning tool, and users should rely on "commit" on their SCM (like Git, etc.). The architecture, with .dir/cache and the fact that dvc on its own is not a versioning tool, does not give us a way to support --no-commit use case.

Even today, --no-commit scenario is broken.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: cli Related to the CLI A: data-management Related to dvc add/checkout/commit/move/remove feature is a feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Mechanism to update a dataset w/o downloading it first
4 participants