Skip to content

remote: use .dir checksum existence to infer file contents existence #3632

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Apr 14, 2020

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Apr 13, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

push/gc behavior changes from #3600

  • status: (also affects push/pull) when checking .dir checksum exists on the remote, if the .dir checksum exists, presume that checksums for all files contained in that directory also exist on the remote
  • push: push .dir checksums after file checksums. If any file in the directory fails to be uploaded, do not upload the .dir file
  • gc: remove .dir checksums before file checksums

pmrowla added 4 commits April 13, 2020 17:04
- `used_cache()`/`get_used_cache()` in repo/stage/output now return
  tuples of (dir_cache, file_cache) instead of one flat/merged cache
- affects all commands which use `cache_exists()` (remote status)
@pmrowla pmrowla requested review from efiop, pared and skshetry April 13, 2020 11:17
@@ -429,16 +436,14 @@ def get_used_cache(self, **kwargs):
)
)
logger.warning(msg)
return NamedCache()
return None, NamedCache()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would make sense to push dir into NamedCache? So that NamedCache could contain both dirs and files(e.g. `cache["local"].dirs)? Just an idea, not sure if it would be nicer looking in the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NamedCache supports nesting cache items (for .dir checksums) after 02aaf42, the end result is a bit cleaner now

@efiop
Copy link
Contributor

efiop commented Apr 14, 2020

@pmrowla Please check the tests, travis is failing.

@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 14, 2020

@efiop test issue is resolved

@efiop
Copy link
Contributor

efiop commented Apr 14, 2020

@pmrowla Check DS please too, e.g. this https://deepsource.io/gh/iterative/dvc/run/b2b2c666-0af4-4fb4-aed2-0210d2946dfc/python/PYL-R1704/ doesn't look good.

Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor test mocker decision to be made.

@efiop efiop requested review from skshetry and pared April 14, 2020 12:36
Co-Authored-By: Saugat Pachhai <[email protected]>
@pmrowla pmrowla self-assigned this Apr 14, 2020
@pmrowla pmrowla added the performance improvement over resource / time consuming tasks label Apr 14, 2020
@efiop efiop merged commit 7275843 into iterative:master Apr 14, 2020
@pmrowla pmrowla deleted the push-gc-dir branch April 15, 2020 04:29
)
return dict(dir_status, **file_status)

def _status(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit late to the party :) excellent PR overall. will mention a few things that I noticed reviewing that will might help later somehow.

this function definitely wants some split I would say... extract presentation logic out for example, and the that checks dir and excludes it? or may be some other refactoring ... but it's too long and complicated now

overall "internal-client" functions should be easy to read, even if it feels that extraction does not make much sense (e.g. helpers that are used only in a single place) think how will developers read this. It's easier to read story-like main function and go into details if needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance improvement over resource / time consuming tasks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants