Skip to content

QA: params/vars and loops in dvc.yaml (2.0) round 2 #5165

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
1 task done
jorgeorpinel opened this issue Dec 27, 2020 · 12 comments
Closed
1 task done

QA: params/vars and loops in dvc.yaml (2.0) round 2 #5165

jorgeorpinel opened this issue Dec 27, 2020 · 12 comments
Labels
A: templating Related to the templating feature discussion requires active participation to reach a conclusion enhancement Enhances DVC ui user interface / interaction

Comments

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Dec 27, 2020

Continues #4854

Bugs?

UI/UX

  • Foreach loops: since do: always contains all the regular stage fields, what about avoiding do and go straight to cmd, etc.? Like this:
    stages:
      something:
        foreach: ...
        cmd: python script.py --thresh ${item...}
        deps: ...

    This way the only syntax change is to add the foreach field, which makes the whole entry into a stage group.

  • If a directory path in vars, repro prints a generic ERROR: unexpected error - [Errno 21] Is a directory: '{path}'. which while informative enough, seems like it could be better handled. (Fix on parametrization: fix error when the file in vars is a directory #5171)

Feature design/ Side-effects

  • There seems to be no way to indicate you don't want any params file to be loaded (even if it exists for other purposes). Earlier on you could use: none.
  • Similarly, there's no way to ONLY use an alternative params file e.g. params.json and avoid the default (params.yaml). I feel like the default behavior should be to skip params.yaml if any params files are specified.
  • vars cannot overwrite any values in params files, but params files accept repeated objects IF they can be merged without conflict. E.g.: params.json {"grp": {"a": 1}} params2.json {"grp": {"b": 2}} can both be included even when grp is overloaded, but you can't define grp at all in vars.
@jorgeorpinel jorgeorpinel added bug Did we break something? enhancement Enhances DVC product: VSCode Integration with VSCode extension ui user interface / interaction labels Dec 27, 2020
@jorgeorpinel jorgeorpinel changed the title [WIP] QA: params/vars and loops in dvc.yaml (2.0) round #2 QA: params/vars and loops in dvc.yaml (2.0) round #2 Dec 27, 2020
@skshetry skshetry changed the title QA: params/vars and loops in dvc.yaml (2.0) round #2 QA: params/vars and loops in dvc.yaml (2.0) round 2 Dec 28, 2020
@skshetry
Copy link
Collaborator

skshetry commented Dec 28, 2020

dvc repro executes stage groups alphabetically e,g, if my foreach list is foo, bar, baz (in that order) still the stages gets reproed (and saved to dvc.locl) in this order: stg@bar, stg@baz, stg@foo

Not a bug. The ordering is based on the DAG. Other than that, it might be difficult to promise to order (and, not that it matters). Lock file has its own way of ordering things wherever it matters, it does not have to work the way you expect.

Foreach loops: since do: always contains all the regular stage fields, what about avoiding do and go straight to cmd, etc.?

The reasoning for this is that we have two levels of vars: local and global. Three if you count the foreach's key and value, but they come from global vars most of the time.

do helps them separate. Not in a clean way that I'd like, as wdir depends on global variables rather than local variables as mentioned in iterative/dvc.org#2052 (comment).

... the whole entry into a stage group.

I don't like this term at all. They are not a group in any way. If you are confusing it with dvc repro <foreach_name>, then that's purely a convenience function.

There seems to be no way to indicate you don't want any params file to be loaded (even if it exists for other purposes). (Earlier on you could use: none.)

I brought that in the discussion with @dmpetrov. The main reasoning was that most people will use params.yaml by default, and if they do use other files, that either means they use separate files like params.json and don't use params.yaml or they do use params.yaml for common configs but use other files for specialized params.

And, if you don't want any parameterization, don't use it at all. If you only want to declare variables in vars section only, then name it a bit differently so they don't collide.

vars cannot overwrite any values in params files, but params files accept repeated objects

vars has the same behaviour whether you use files or declare it in vars itself. It never allows you to overwrite, you can only recursively load them. So, the following should work in vars:

vars:
  grp:
    b: 3

This is useful in the top-level vars, but rarely useful in the stage-level vars (where, key: value is a better choice).

We have decided that if someone wants to have an "overwrite" method, we'd introduce a new keyword as follows in the future:

overwrite_vars: true
vars:
  grp: 3

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Dec 29, 2020

The ordering is based on the DAG

So if there are no DAG connections in a stage group, why use any other order than what is specified in dvc.yaml? We even assign 0...n indices to some lists given to foreach. That seems much more intuitive IMO but @efiop @shcheklein can you please weight in?

it does not have to work the way you expect.

DVC should not work in a predictable/intuitive way? I probably didn't understand what you meant with that @skshetry.

they come from global vars most of the time

I think we are assuming there. OR do we know that users actually prefer global vs. local vars?

do: helps them separate

I don't have a strong opinion on this since it's already implemented that way but I think it looks cleaner without do:, and the change vs. regular stages is minimal: you just add the foreach: field to the spec (it could even be called something else like set:). Again, something for others to weight in, hopefully.

They are not a group in any way

image

https://www.merriam-webster.com/dictionary/group

I see you fixed the "repro prints a generic ERROR" bug, great! 👍

no way to indicate you don't want any params file

most people will use params.yaml... if you don't want any parameterization, don't use it at all... If you only want to declare variables in vars section only, then name it a bit differently

No strong opinion either but those are workarounds. What if I want parameterization with vars with the same names as what is in params.yaml? Since it doesn't let me overwrite them, use: none was useful for that (I'm not saying that exact syntax should be brought back). Up to you guys, I'm just leaving all the "QA" results here for consideation.

no way to ONLY use an alternative params file e.g. params.json... (Maybe) the default behavior should be to skip params.yaml if any params files are specified?

Is there a reasoning to keep that default behavior? Thanks

vars has the same behaviour whether you use files or declare it in vars itself

You're right! I confirmed this does work consistently 👍 (except that grp in your example should be a - list item).

@shcheklein
Copy link
Member

why use any other order than what is specified in dvc.yaml?

any additional guarantee reduces flexibility (let's say we decide to run them in parallel at some point and people already expect certain order). I would wait for someone to ask this, for a specific use case to start considering this at all.

but I think it looks cleaner without do:

I think @skshetry mentioned that do is needed. Not sure what do you suggest in terms of avoiding it?

They are not a group in any way

so, according to that screenshot I tend to agree with @skshetry - since they do not represent a unit in any way

@jorgeorpinel

This comment has been minimized.

@shcheklein
Copy link
Member

@jorgeorpinel

I don't see why it's needed

hmm ... what about the explanation @skshetry provided, with global/local vars?

So you're saying maybe we'll introduce some new field signaling to run them in parallel?

no. We might run them in parallel by default, nothing prevents us from doing this at the moment.

My point is that if there's no way to build a DAG nor an order gets specified ...

I'm not sure I understand this to be honest. Could you clarify please?

Why does that reduce flexibility?

you kinda set specific thing (execution order in this case). This is by definition reduces flexibility - we have to maintain that specification. E.g. it makes harder to run stuff in parallel by default.

In fact I've noticed that changing the command or other parts of regular stages

Yes. We don't guarantee any order.

@jorgeorpinel

This comment has been minimized.

@shcheklein
Copy link
Member

shcheklein commented Dec 30, 2020

On the "group" term

@jorgeorpinel it's a language construct to generate stages to my mind. Not sure how else I would describe it. As I mentioned I would analyze how users describe/ask about it in the first place- and would use that language (it should be correct though, e.g. we don't like group, because group is a very specific term in math/languages, and in our case it's not a group, these stages are not different from any other stages. Also, you can take a look at some other tools that use loops in yaml.

Writing this, I see where this confusion comes from. There are two languages now - dvc.yaml and dvc.lock. On the dvc.yaml level it's indeed looks like a single "entity", but on the dvc.lock level - it's a bunch of stages, not connected to each other in any special way.

I'm btw would be totally fine to use "group" in some informal way. The misleading would be to make it formal (use in title, and everywhere else). At least we should about some better term here.

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Dec 30, 2020

My last comments on that

it's not a group, these stages are not different from any other stages

OK I think see what your concern is now. We can def. find another term for the "official name" of the feature.

I would analyze how users describe/ask about it in the first place

Agree on this. I'm just not sure where to find that info...

on the dvc.lock level - it's a bunch of stages, not connected

They're indirectly connected by the stage name prefixes e.g. mystg@1, mystg@2, etc. and always printed next to each other (I think).

p.p.s. I'm not sure why this terminology discussion is happening here (it's not something I reported related to QA). Let's move this part to iterative/dvc.org#2052 (review) please!

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Dec 30, 2020

@shcheklein answers to the other comments (which I just noticed):

I don't see why do: is needed

hmm ... what about the explanation @skshetry provided, with global/local vars?

TBH I didn't understand that explanation 😅. Here's an example with all 3 types of substitutions, and without do:

vars:
- glb: foo

stages:
  mystages:
    vars:
    - loc: bar
    foreach:
    - 1
    - 2
    # do: needed?
    cmd: echo ${glb} ${loc} ${item}

I will create a separate issue on the order of stage execution and summarize/clarify there, since it seems to be more general than the scope of this QA ticket.

@jorgeorpinel
Copy link
Contributor Author

OK, created #5181 for the order thing.

The only matter still in discussion here is about the do syntax, which as I've said from the beginning is not major from my PoV. Please consider my latest comment and example on that and also, feel free to close this ticket if the decision is to keep it.

I'm also building a 3rd (hopefully last) QA ticket around 2.0 pipelines: #5180

@jorgeorpinel jorgeorpinel removed the bug Did we break something? label Dec 31, 2020
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Dec 31, 2020

I lied, there's 2 other questions still pending:

  • no way to indicate you don't want to load params file (previously use: none)
  • no way to only use an alternative params file (skip/overwrite params.yaml)

My proposal would be to address both (not necessarily a priority) with something like

vars:
- none/null # or introduce a special comment like # dvc-skip-params.yaml
- foo: bar # This can now "override" values in params.yaml
- params.json # This will be the only params file loaded.

Or (for the latter one) perhaps if you specify params files then by default DVC shouldn't load params.yaml (so no none needed in that case)

@jorgeorpinel jorgeorpinel added the discussion requires active participation to reach a conclusion label Jan 22, 2021
@jorgeorpinel
Copy link
Contributor Author

Closing in favor of #5312

@skshetry skshetry added the A: templating Related to the templating feature label May 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: templating Related to the templating feature discussion requires active participation to reach a conclusion enhancement Enhances DVC ui user interface / interaction
Projects
None yet
Development

No branches or pull requests

3 participants