This issue was moved to a discussion.

deprecate old .dvc file based run stages? #3936

Closed
skshetry opened this issue Jun 2, 2020 · 14 comments
Labels: discussion (requires active participation to reach a conclusion), enhancement (Enhances DVC)

Comments

@skshetry
Collaborator

skshetry commented Jun 2, 2020

It feels like the dvc.yaml and old *.dvc file approaches don't really mix well, perhaps even more so when we try to explain them in the docs, and in terms of UI/UX (as @shcheklein has pointed out many times already).

So, maybe, we could move away from *.dvc files produced by dvc run?

What I propose is to treat old stage dvcfiles like dvc add-ed files: ones that cannot be reproduced or run, but that still work for checkout and related commands.

To be fair, this does not really improve our codebase, at least in the short term; I am creating this issue because of the debt on the product side.

Also, this can be implemented after 1.0 as well.

cc @dmpetrov

@skshetry skshetry added the discussion requires active participation to reach a conclusion label Jun 2, 2020
@dmpetrov
Member

dmpetrov commented Jun 2, 2020

@skshetry thank you for the proposal. Could you please clarify - is it only about backward compatibility (the existing dvc-files with commands)?

@shcheklein
Member

@skshetry yep, same question. I'm not sure I 100% understood the question. Could you clarify the idea a bit please?

@skshetry
Collaborator Author

skshetry commented Jun 3, 2020

Sorry, it's more of a question. If it doesn't make much sense from a UI/UX and product perspective to support old-style pipeline stages, maybe we should deprecate them? I don't know, I just wanted to hear some thoughts from y'all.

Backward compatibility will not be there, but we can treat those stages as read-only, similar to dvc add-ed files. So checkout/push/pull/gc -a will still work, and repo compatibility will be preserved.

@shcheklein, this proposal more or less arose from the issue you raised in our 1:1 about why we need repro -p, and your point that it does not make sense to you now that we have multi-stage dvcfiles.

@efiop
Contributor

efiop commented Jun 3, 2020

My understanding: old-style dvc files will be treated as dvc add-ed outputs, so we would simply ignore the commands, dependencies, and other params declared in them. That would provide data management compatibility but no pipeline compatibility.
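
A minimal sketch of that reading, not DVC's actual implementation: it takes an old-style stage file that has already been parsed into a dict and keeps only the output entries, dropping `cmd`, `deps`, and any other pipeline fields. The field names follow the old stage-file layout as I understand it, the hash values are placeholders, and the helper name `as_data_only` is hypothetical.

```python
from copy import deepcopy


def as_data_only(stage_file: dict) -> dict:
    """Hypothetical helper: keep only the tracked outputs of a stage file.

    Everything pipeline-related (cmd, deps, wdir, ...) is ignored, so the
    result looks like what `dvc add` would have produced for the same outputs.
    """
    return {"outs": deepcopy(stage_file.get("outs", []))}


# Illustrative old-style stage file content (hashes are placeholders).
old_stage = {
    "cmd": "python clean.py corpus.csv df.h5",
    "deps": [{"path": "corpus.csv", "md5": "placeholder-dep-hash"}],
    "outs": [{"path": "df.h5", "md5": "placeholder-out-hash"}],
}

data_only = as_data_only(old_stage)
print(data_only)
```

With a view like this, checkout/push/pull would still have the output paths and hashes they need, while repro would have nothing to run.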

@dmpetrov
Member

dmpetrov commented Jun 3, 2020

> So that would provide data management compatibility but no pipeline compatibility.

In this case, an old pipeline won't work with dvc repro. Is this the main point of the proposal? So we are taking a step towards the deprecation of the old dvc-files, but not deprecating them 100%.

What are the other pros and cons?

@efiop
Contributor

efiop commented Jun 6, 2020

pros:

  • users will still be able to get access to their pipeline outputs.
  • we don't have to maintain legacy code.

cons:

  • users won't be able to run their pipelines with the newer dvc.

@jorgeorpinel
Contributor

jorgeorpinel commented Jul 24, 2020

> dvc.yaml and old *.dvc file approaches don't really mix well, maybe more so when we try to explain in docs

This has indeed become evident when documenting them together! Full explanation and detailed proposal on how dvc.* files would look in #4278

> maybe, we could move away from *.dvc files produced by dvc run... making old stage dvcfiles as similar to dvc add-ed files, one that cannot really repro or run, but works for checkouts...
> support old-style pipeline stages, maybe, we should deprecate?

This part was confusing, but I think by now it's clear that the most straightforward option is to get rid of .dvc files altogether, integrating their contents into dvc.yaml and dvc.lock. (I came up with other alternatives in #4278, but they would imply a separate set of commands for .dvc files.)

> backward compatibility will not be there
> cons: users won't be able to run their pipelines with the newer dvc

Old .dvc files could be left as optional (not encouraged or even mentioned much in docs) for this.

Upvoting this!

@pared
Contributor

pared commented Jul 28, 2020

I think that if we decide to drop .dvc stages, we should have some "official way" to migrate to 1.X, so that users are not forced to learn how to do it by themselves.

@jorgeorpinel
Contributor

Yes, we should probably have an official migration tool regardless. But .dvc files are no longer considered stages. That's part of the problem 🙂

@shcheklein
Member

@jorgeorpinel I think within docs we can safely assume (and we already do) that .dvc are not stages

@karajan1001
Contributor

We could remind users that the old format is being deprecated whenever an old .dvc stage file is reproduced, and ask whether they would like to migrate now. If so, add the stage to dvc.yaml and remove the old .dvc stage file.
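
The conversion step of such a migration could look roughly like the sketch below. It is only an illustration under assumptions: the old stage file is assumed to be already parsed into a dict with `cmd`, `deps`, and `outs` entries, the dvc.yaml entry is assumed to list plain paths while the dvc.lock entry keeps paths with hashes, and `migrate_stage` is a hypothetical name, not an existing DVC function. Prompting the user and deleting the old file are left out.

```python
def migrate_stage(old: dict, name: str):
    """Hypothetical sketch: split an old stage file into dvc.yaml/dvc.lock entries.

    Returns a pair of dicts keyed by the stage name: the first holds the
    definition (cmd plus plain paths), the second the recorded hashes.
    """
    deps = old.get("deps", [])
    outs = old.get("outs", [])

    # Definition part: only the command and the paths.
    yaml_entry = {"cmd": old["cmd"]}
    if deps:
        yaml_entry["deps"] = [d["path"] for d in deps]
    if outs:
        yaml_entry["outs"] = [o["path"] for o in outs]

    # Lock part: the command plus path/hash pairs, dropping other flags.
    def locked(entries):
        return [{k: e[k] for k in ("path", "md5") if k in e} for e in entries]

    lock_entry = {"cmd": old["cmd"]}
    if deps:
        lock_entry["deps"] = locked(deps)
    if outs:
        lock_entry["outs"] = locked(outs)

    return {name: yaml_entry}, {name: lock_entry}


# Illustrative input (hashes are placeholders).
old_stage_file = {
    "cmd": "python clean.py corpus.csv df.h5",
    "deps": [{"path": "corpus.csv", "md5": "placeholder-dep-hash"}],
    "outs": [{"path": "df.h5", "md5": "placeholder-out-hash", "cache": True}],
}

stages_entry, lock_stages_entry = migrate_stage(old_stage_file, "cleanup")
print(stages_entry)
print(lock_stages_entry)
```

The stage name has to come from somewhere, since old stage files were identified by file name only; here the caller supplies it.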

@jorgeorpinel
Contributor

> within docs we can safely assume (and we already do) that .dvc are not stages

We do (except in outdated docs) @shcheklein, but it's not as safe to assume so IMO. They behave like stages in a way, because they can be reproduced... So it may not be obvious to users.

> If so adding it to dvc.yaml and remove the old .dvc stage file

Great idea to have an automatic migration tool built into repro @karajan1001. It would only apply to actual stage files, not "orphan stages" meaning ones without dependencies — those can stay as .dvc files. Or if we go the way of #4278, then yes, all of them.
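
To make the distinction concrete, here is a rough sketch of how a migration tool might decide which .dvc files to touch. It assumes the file is already parsed into a dict, and `needs_migration` is a hypothetical name. Requiring a `cmd` in addition to dependencies is my own extra assumption on top of the rule above, so that files carrying dependencies but no command are left alone.

```python
def needs_migration(dvc_file: dict) -> bool:
    """Hypothetical check: is this .dvc file an actual pipeline stage?

    A file counts as a stage only if it has both a command and at least one
    dependency; "orphan" files without dependencies stay as .dvc files.
    """
    return bool(dvc_file.get("cmd")) and bool(dvc_file.get("deps"))


# Illustrative inputs (hashes are placeholders).
stage_file = {
    "cmd": "python clean.py corpus.csv df.h5",
    "deps": [{"path": "corpus.csv", "md5": "placeholder-dep-hash"}],
    "outs": [{"path": "df.h5", "md5": "placeholder-out-hash"}],
}
added_file = {"outs": [{"path": "corpus.csv", "md5": "placeholder-out-hash"}]}

print(needs_migration(stage_file))
print(needs_migration(added_file))
```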

@jorgeorpinel
Contributor

Summary of discussion about this in #4278:

PRO

  • My motivation to suggest this is mostly conceptual right now but maybe it has some very practical implications:
    In DVC 1.x we created the pipelines file dvc.yaml, which contains all the stages. From that point on, .dvc files stopped being "stage files" and they only remain as placeholders for data files. They're no longer considered any kind of stage.
    Why can't dvc.yaml (and lock) be used for this? It's a matter of introducing a new top section. Example below:

    data:
    - corpus.csv
    - dataset/

    stages:
      cleanup:
        cmd: python clean.py corpus.csv df.h5
        deps:
        - corpus.csv
      ...

  • you have a state where you have some files tracked by dvc.lock/yaml system and some by .dvc files... and that's the confusing inconsistency

  • For displaying the original data folder structure, we have dvc list .

  • a data: section in dvc.yaml achieves the same goal (separating tracked data from stage outputs), and improves the consistency, making DVC less confusing and easier to understand...
    And we wouldn't deprecate .dvc files, they would be left as optional mechanism both for simple scenarios and for backwards compatibility

  • Alternatively, getting rid of dvc.lock and creating .dvc files for all data specified in dvc.yaml would also be more consistent.

CON

  • If for some reason it's better to have explicit .dvc when people deal with data - let's keep it.
  • I like having the .dvc files. It makes sense in my head for the workflow... It helps to keep the whole project structure and is more convenient: to imagine the structure and explore the paths without downloading all of the data
  • .dvc should not be defining outputs and dependencies. They define files and directories (that's a problem in terminology and docs). There is a clear difference though between those files/directories - those in pipelines are outputs.
  • complicates the data management (non-pipeline) usage. Imagine a simple case: a single data.xml file and you don't care about pipelines
  • since we allow multiple dvc.yamls (one per directory), which one should we modify on dvc add?

7 participants