Skip to content

[WIP] cache/remote: drop dos2unix MD5 by default #5337

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 12 commits into from

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Jan 26, 2021

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

  • Adds stubs for cache/remote ODB class/refactor
    • currently only supports reading/writing a single yaml config file containing schema version
  • Disables dos2unix conversion by default when computing MD5 hashes for new files
    • Can be explicitly re-enabled by configuring core.dos2unix to be true
  • MD5 hashes for existing files will not be re-computed. Any files previously tracked in DVC will still be referred to using their dos2unix-MD5.
    • When pushing a new file with a "true" MD5 that matches an existing dos2unix-MD5 contained in a remote, the old file will not be replaced
  • By default any remotes will be "migrated" to the new 2.0 ODB schema version (the only migration step taken at this time is to write the config/version file)

Will close #4658.
Related to #829.

@pmrowla pmrowla self-assigned this Jan 26, 2021
return schema[version]


class BaseODB:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is now this is really more of an ODBConfig class, but given that nothing else will probably get refactored into ODB for 2.0 I did not spend too much time thinking about how everything should be organized in the long run

Copy link
Collaborator

@skshetry skshetry Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmrowla, is this later going to evolve and be used in chunking?

Copy link
Contributor Author

@pmrowla pmrowla Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry yes, that's the plan for now at least

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe then just ObjectDB?

@@ -450,6 +452,7 @@ def create_taskset(amount):

@index_locked
def push(self, cache, named_cache, jobs=None, show_checksums=False):
self.odb.migrate_config()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating/writing the config should really be done as needed after an initial odb.write/save/etc operation, but since we don't do anything else with ODB at the moment writing the config file to a remote on push should be sufficient for 2.0

@@ -90,6 +90,10 @@ def hash_jobs(self):
or self.HASH_JOBS
)

@cached_property
def enable_dos2unix(self):
return self.repo and self.repo.config["core"].get("dos2unix")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this really belongs in BaseTree long term, but currently we compute MD5's from both RepoTree and LocalTree

def test_caches(tmp_dir, dvc, caplog, mocker):
from dvc.tree.ssh import SSHTree

mocker.patch.object(SSHTree, "exists", return_value=False)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that we will try to load a config file on remote connection, this has to be mocked out for these kinds of tests with a fake remote URL otherwise the test will hang until the network call times out

@pmrowla pmrowla changed the title cache/remote: drop dos2unix MD5 by default [WIP] cache/remote: drop dos2unix MD5 by default Jan 26, 2021
@efiop efiop self-requested a review January 26, 2021 12:35
@efiop efiop mentioned this pull request Feb 5, 2021
2 tasks
efiop added a commit to efiop/dvc that referenced this pull request Feb 5, 2021
Checkout is an operation that builds object from cache(aka object
database) in a particular tree(aka filesystem). Thus it shouldn't be
part of the object database itself nor of the filesystem.

Pre-requisite for iterative#5337
efiop added a commit that referenced this pull request Feb 5, 2021
Checkout is an operation that builds object from cache(aka object
database) in a particular tree(aka filesystem). Thus it shouldn't be
part of the object database itself nor of the filesystem.

Pre-requisite for #5337
efiop added a commit to efiop/dvc that referenced this pull request Feb 6, 2021
This makes all cache objects go through the same logic, so that all of
the appropriate rules are applied to them (e.g. proper permissions).

Pre-requisite for iterative#5337
efiop added a commit that referenced this pull request Feb 6, 2021
This makes all cache objects go through the same logic, so that all of
the appropriate rules are applied to them (e.g. proper permissions).

Pre-requisite for #5337
@efiop efiop mentioned this pull request Feb 8, 2021
2 tasks
@pmrowla
Copy link
Contributor Author

pmrowla commented Feb 9, 2021

Will refactor this into a new PR after @efiop's recent cache/object changes

@pmrowla pmrowla closed this Feb 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

drop dos2unix behaviour
2 participants