This issue was moved to a discussion.

Extract checksums to a common state file #2940

Closed
dmpetrov opened this issue Dec 12, 2019 · 5 comments
Labels: discussion (requires active participation to reach a conclusion), feature request (requesting a new feature)

Comments

@dmpetrov
Member

Currently, all the checksums are scattered among DVC-files. This was a design decision to simplify git merge for ML experiments: changes to a single data file or DVC stage stay localized in one DVC-file. However, we have learned that in many cases the -X theirs strategy is the best way to bring ML experiments into another branch without manual merging, so it is a good time to revisit this design decision.

There are three issues with keeping checksums in many DVC-files:

  1. It makes DVC-files hard for users to read
  2. DVC (a tool) has to modify files - not the best practice
  3. Automation tools (like CD4ML) usually cannot make a Git commit after dvc repro, so the changes in the repo (changed DVC-files) need to be copied somewhere (e.g. GitLab artifacts). Having all the changes in a single file would be convenient for such tools.

To solve the issues above, it might be worth extracting all the checksums into a separate "state" file. For example: Dvc.state, <anyname>.dvcstate, or .dvc/state

Note, this is not the same as the current .dvc/state which is an ephemeral (not committed to Git) DB file. The state file needs to be committed to Git.

Example: Terraform keeps all the infrastructure configuration in *.tf files but stores state in a single, separate file terraform.tfstate.
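To make the proposal concrete, here is a minimal sketch of what the extraction could look like. It is only an illustration under stated assumptions: DVC-files are taken as already-parsed dicts whose `outs` entries carry `md5` and `path` keys, the state file is plain JSON, and the helper names `extract_checksums` and `dump_state` are hypothetical, not part of DVC.

```python
import json
import posixpath


def extract_checksums(dvc_files):
    """Split parsed DVC-files into checksum-free definitions and one state mapping.

    `dvc_files` maps a DVC-file path to its parsed content (a dict).
    Returns (definitions, state); `state` maps a repo-relative output path
    to its md5. The layout of both structures is an assumption.
    """
    definitions, state = {}, {}
    for dvc_path, content in dvc_files.items():
        # Output paths are treated as relative to the DVC-file's directory.
        base = posixpath.dirname(dvc_path)
        outs = []
        for out in content.get("outs", []):
            out = dict(out)  # copy, so the caller's data is not mutated
            checksum = out.pop("md5", None)
            if checksum is not None:
                state[posixpath.join(base, out["path"])] = checksum
            outs.append(out)
        definitions[dvc_path] = dict(content, outs=outs)
    return definitions, state


def dump_state(state):
    """Serialize the state with sorted keys so Git diffs stay stable."""
    return json.dumps(state, indent=2, sort_keys=True) + "\n"
```

With this split, the per-stage files contain only what a human wrote, and a tool run such as dvc repro would touch only the single state file.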

Related issues: this FR might be related to the single DAG FR #1871

@dmpetrov dmpetrov added the feature request Requesting a new feature label Dec 12, 2019
@dmpetrov dmpetrov changed the title Extract checksums to state file Extract checksums to a common state file Dec 12, 2019
@dmpetrov
Member Author

An idea: dvc add file might not need to create any *.dvc files at all. All the information would go to the common state file, and *.dvc files would be needed only for stages. This is really good!

However, there is a small disadvantage: it might lead to empty directories, which Git does not track.

@efiop efiop added the discussion requires active participation to reach a conclusion label Dec 12, 2019
@ghost

ghost commented Dec 12, 2019

We have several issues that discuss the way we organize the state on files.
It would be great if we could start creating a clear boundary in the source code between the index and the content addressable storage.

With this approach, it doesn't matter how we identify each block/object in the store (gathering a collection of stage files, querying a single file, or splitting it across several files: checksums, pipelines, artifacts).
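As a rough illustration of that boundary, the sketch below keeps the index (path to object key) and the content-addressable store (object key to bytes) as two separate classes that only meet in small helper functions. All names here (`ObjectStore`, `Index`, `add`, `checkout`) are hypothetical and in-memory only; this is not how DVC is structured internally.

```python
import hashlib


class ObjectStore:
    """Content-addressable storage: blobs keyed by the MD5 of their bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.md5(data).hexdigest()
        self._objects[key] = data  # identical content maps to the same key
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


class Index:
    """Maps workspace paths to object keys; knows nothing about blob storage."""

    def __init__(self):
        self._entries = {}

    def set(self, path: str, key: str) -> None:
        self._entries[path] = key

    def lookup(self, path: str) -> str:
        return self._entries[path]


def add(index: Index, store: ObjectStore, path: str, data: bytes) -> None:
    """Store the content, then record which object the path points to."""
    index.set(path, store.put(data))


def checkout(index: Index, store: ObjectStore, path: str) -> bytes:
    """Resolve the path through the index, then fetch the object."""
    return store.get(index.lookup(path))
```

Because the store never sees paths, the index could be backed by scattered stage files, a single state file, or several files without any change to the storage side.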

@ghost

ghost commented Dec 12, 2019

> 2. DVC (a tool) has to modify files - not the best practice

Could you elaborate what would be the best practice?

@ghost

ghost commented Dec 12, 2019

> Example: Terraform keeps all the infrastructure configuration in *.tf files but stores state in a single, separate file terraform.tfstate.

@dmpetrov, by the way, terraform has an option to submit the state to a remote/shared space. For Terraform this can't be done through GitHub because the state includes sensitive information (API keys?), but with DVC it is only checksums paired with files. It would be even simpler if we moved to a prefix-based approach instead of a path-based one, since directories wouldn't need special treatment.
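One way to read the prefix-based idea, sketched under assumptions: the state is a flat mapping from key to checksum, and a "directory" is just every key sharing a prefix, much like keys in an object store. The helper name `entries_under` and the sample checksums are hypothetical.

```python
def entries_under(state, prefix):
    """Return the entries whose key starts with `prefix`.

    No directory objects exist; a trailing "/" in the prefix is what keeps
    "data/raw/" from also matching keys such as "data/raw_old/x.csv".
    """
    return {key: value for key, value in state.items() if key.startswith(prefix)}


# Flat state: every tracked file is one key, directories are implied.
state = {
    "data/raw/a.csv": "111",
    "data/raw/b.csv": "222",
    "data/raw_old/a.csv": "333",
    "model.pkl": "444",
}

raw_files = entries_under(state, "data/raw/")
```

Here `raw_files` holds only the two keys under "data/raw/", without any separate record for the directory itself.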

@dmpetrov
Member Author

> 2. DVC (a tool) has to modify files - not the best practice
>
> Could you elaborate what would be the best practice?

Best practice: files are edited by humans only, and no software writes to files that are under Git control. If software does need to write something that goes under Git control, it is better to localize the places where the modifications happen.

> terraform has an option to submit the state to a remote/shared space.

@MrOutis are you talking about Terraform Cloud? If so, it seems like a different use case that can be implemented on top.

> It would be great if we could start creating a clear boundary in the source code between the index and the content addressable storage.

Is it only about an internal code redesign? Yeah, the separation is needed.

@efiop efiop closed this as completed May 3, 2021
@iterative iterative locked and limited conversation to collaborators May 3, 2021