Skip to content

dump yaml: .dvc and dvc.lock in 1.1 version, dvc.yaml in 1.2 version #4380

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

skshetry
Copy link
Collaborator

@skshetry skshetry commented Aug 12, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
    Requires documentation updates noting some of the things from this description.

The solution feels like a kludge, caused by two different parsers each of them with their own bugs/botched implementation.

For now, the effect of this PR will be that it will change any y/n keys to its string counterpart on dvc.lock and .dvc files. I have tried to see compatibility with pre and post version of this PR, and have found it to be compatible (it's hard to make any guarantees given the differences though).

dvc.yaml should remain a compatible-subset of 1.1 and 1.2 for the time being. This was partly done because we introduced y for plots (which wasn't introduced on .dvc files luckily), which is a boolean in YAML 1.1. Another thing is that, till now, we have no reported issues for dvc.yaml and we only use a subset of functionality there.

Alternatives to consider:

  1. I haven't looked, but we can change the "regex" that pyyaml uses for parsing floats. This way, we'll have partial 1.2 support. But, it won't be 1.1 compatible for sure. Another advantage of this is, it will be very minimal change and should fix the issue till the next one pops-up.

  2. Deprecate now (should have a very minimal effect), and move to YAML 1.2. It's twice as slow, but it does not make much difference until the stages are in four-figure numbers.

Things to note:

  1. If we consider other alternatives, this PR will still be required for the correct roundtrip for experiments/hyper-parameter optimization.
  2. The JSON load issue will be fixed in a separate PR.
  3. It's hard to say if this is fully compatible given the nature of two different libraries. However, I have tested this on a small subset of hand-created YAML files. Seems to be working fine.
  4. For same reasons as 3, It's difficult to say that we won't have similar problems in the future. 3 and 4 issues can be solved to a high degree with not using round-trip loading/dumping on dvc.lock file (we dump a stage without keeping it's older comments in dvc.lock, however other stages' comments do survive).

Fixes #4281

Daniele TrifirΓ² and others added 2 commits August 12, 2020 19:04
* only dump dvc.lock in 1.1, dump dvc.yaml in 1.2
@efiop
Copy link
Contributor

efiop commented Aug 13, 2020

Deprecate now (should have a very minimal effect), and move to YAML 1.2. It's twice as slow, but it does not make much difference until the stages are in four-figure numbers.

This sounds good, but I'm worried about some of our users that use a very large number of dvc-files in their projects. We've tried to optimize collection before, but it is still not instant and this change will make it slower. We've been talking about alternative approaches there, that might even mitigate this change.

Do you also suggest moving to 1.2 for metrics/params? Or would 1.1. stay there?

Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, though I wonder whether version in parse and dump methods shouldn't be required. That way we could force choosing a proper version.

Copy link
Contributor

@pmrowla pmrowla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's pretty ugly to be mixing versions like this, but it seems like there's not much we can do about it for now

@pmrowla
Copy link
Contributor

pmrowla commented Aug 14, 2020

From a discussion w/@efiop: since we are planning on supporting specifying params via command line for the experiments feature, we will eventually want to move to 1.2 for params as well.

For this experiments feature we need to load/edit/dump params files and preserve comments. Preserving comments requires using ruamel, which defaults to 1.2

@skshetry
Copy link
Collaborator Author

@pmrowla, I'd say we provide repo-wide config just in case as with this hack PR, it can still dump in 1.1 format.

@efiop
Copy link
Contributor

efiop commented Aug 14, 2020

@skshetry Repo config option to make dvc use 1.1 or 1.2 in dvcfiles/params/metrics?

@skshetry
Copy link
Collaborator Author

Discussed with @efiop to use YAML 1.2 everywhere. Closing this PR, I'll re-create another one with some minor adjustments.

@skshetry skshetry closed this Aug 14, 2020
@skshetry skshetry deleted the fix-1.1-lock branch August 14, 2020 12:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LockFile class dump/load uses different libraries
4 participants