Use git/dvc APIs instead of actually checking out revisions #1688

Closed
shcheklein opened this issue Mar 6, 2019 · 6 comments · Fixed by #1709
Labels: enhancement, refactoring

Comments

@shcheklein
Member

A few DVC commands accept the --all-branches and --all-tags options, namely dvc metrics show, dvc gc, dvc fetch, etc. For all of them, we need to be able to analyze the content of DVC metafiles across different Git revisions. Right now this is done by running git checkout (and then dvc checkout if we need to get the file's content from the cache). This approach is fragile, depends on the current state of the workspace (e.g. whether there are uncommitted changes), and is even dangerous.

We should instead use the Git API (e.g. ls-tree, or ls-files?) and the DVC API to access the necessary files, directories, etc. directly.
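As a rough illustration of the idea, here is a minimal sketch that reads files from arbitrary revisions through `git ls-tree` and `git show <rev>:<path>` without touching the working tree. The helper names (`git`, `list_files`, `read_file`, `dvc_files_by_rev`) are hypothetical and not part of DVC; the sketch assumes `repo_dir` is the repository root, that the `git` executable is on `PATH`, and that DVC metafiles can be recognized by a `.dvc` suffix.

```python
import subprocess


def git(repo_dir, *args):
    """Run a git command inside repo_dir and return its raw stdout (bytes)."""
    result = subprocess.run(
        ["git", "-C", repo_dir, *args],
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.stdout


def list_files(repo_dir, rev):
    """List every file path recorded in the tree of `rev`, without a checkout."""
    # -r recurses into subtrees; -z gives NUL-separated, unquoted paths.
    out = git(repo_dir, "ls-tree", "-r", "--name-only", "-z", rev)
    return [p.decode() for p in out.split(b"\0") if p]


def read_file(repo_dir, rev, path):
    """Return the bytes of `path` as committed in `rev`, without a checkout."""
    return git(repo_dir, "show", f"{rev}:{path}")


def dvc_files_by_rev(repo_dir, revs):
    """Map each revision to {path: contents} for its *.dvc files (assumed suffix)."""
    return {
        rev: {
            path: read_file(repo_dir, rev, path)
            for path in list_files(repo_dir, rev)
            if path.endswith(".dvc")
        }
        for rev in revs
    }
```

Because nothing is checked out, uncommitted changes in the workspace are neither read nor modified, which is the property the proposal is after.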

The current implementation is here: https://github.com/iterative/dvc/blob/master/dvc/scm/base.py#L79 and it is used in two places: https://github.com/iterative/dvc/blob/master/dvc/repo/__init__.py#L239 and https://github.com/iterative/dvc/blob/master/dvc/repo/metrics/show.py#L144.

Directly related issue: #1009

@ei-grad
Contributor

ei-grad commented Mar 6, 2019

Could someone please assign this to me, as agreed with @dmpetrov?

@ei-grad
Contributor

ei-grad commented Mar 6, 2019

I'm making a list of the places in the code that use the filesystem interface on files checked out by dvc.scm.Base.brancher, and hence need to be corrected.

As discussed, I'm not going to do a large refactoring or make broad improvements while working on this issue. Still, it is probably a good idea to introduce a better Git interface library, such as libgit2 or dulwich, and use it to access Git objects directly in all of DVC's interactions with Git, instead of calling the git executable via the GitPython wrapper and accessing the filesystem. No code changes will be made in this direction for now, but it might be a good time to start discussing such a refactoring.

@ei-grad
Contributor

ei-grad commented Mar 6, 2019

Here is a call graph of the functions that use dvc.scm.Base.brancher(). The gray ones don't use the filesystem directly.

[Image: 2019-03-06-172329_1151x476_scrot]

I originally made it just for myself, to understand the related parts of the DVC codebase, but I think it is better to share it here.

@shcheklein
Member Author

@ei-grad I've invited you as a collaborator to the project. I think I'll be able to assign you after you accept the invitation.

ei-grad self-assigned this Mar 6, 2019
@ei-grad
Contributor

ei-grad commented Mar 6, 2019

Great! I can even assign myself now :).

shcheklein added the enhancement and refactoring labels Mar 6, 2019
@ei-grad
Contributor

ei-grad commented Mar 6, 2019

As I currently understand it, Repo.stages and Repo.find_outs_by_path should be rewritten to use SCM methods for listing files and reading their contents, and these methods should be implemented for the Base and Git SCM backends. There will also be some small fixes, such as passing file-like objects instead of paths to the Stage object's methods.
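To make the shape of that change concrete, here is a hedged sketch of what such an SCM-level interface could look like. All names below (`TreeReader`, `InMemoryTree`, `find_stage_files`, `load_stage_text`) are hypothetical illustrations, not DVC's actual classes or the code that was eventually merged; the in-memory backend stands in for a real Git backend so the example runs on its own.

```python
import io
from abc import ABC, abstractmethod


class TreeReader(ABC):
    """Hypothetical SCM interface: list and read files at a given revision."""

    @abstractmethod
    def list_files(self, rev):
        """Return the paths that exist in revision `rev`."""

    @abstractmethod
    def open(self, rev, path):
        """Return a binary file-like object for `path` as of `rev`."""


class InMemoryTree(TreeReader):
    """Stand-in backend; a Git backend would read Git objects instead."""

    def __init__(self, revisions):
        # revisions: {rev: {path: bytes}}
        self._revisions = revisions

    def list_files(self, rev):
        return list(self._revisions[rev])

    def open(self, rev, path):
        return io.BytesIO(self._revisions[rev][path])


def find_stage_files(tree, rev):
    """Collect stage files for `rev` via the interface, not the filesystem."""
    # Assumption: stage files are identified by a .dvc suffix.
    return sorted(p for p in tree.list_files(rev) if p.endswith(".dvc"))


def load_stage_text(fobj):
    """Accept a file-like object instead of a path, as suggested above."""
    return fobj.read().decode()
```

With this split, code that walks revisions depends only on the two interface methods, so the working tree never has to be switched to another branch or tag.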
