Postmortem: Files could not be fetched from remotes in specific circumstances #8967
Comments
Another thing that we always discuss with @pmrowla regarding that logic is that it is quite complicated, and we hope to replace or improve it by incorporating indexes into odbs and maybe basing our workflow on index building rather than on tricky prefix estimations.
Thanks folks! This is very helpful. A few comments on this: First, how about we introduce the "time to triage" (where we should be identifying this is
I can't come up with other ideas, but I wonder what actual steps we can take to improve this? More reviews (can slow us down though), creating checklists / sharing the plan (not sure it can help in this case), more comments for tests? Do you remember what actually caused the test to be removed? Did it look too complicated? Did we remove it because we want to migrate to the new logic?
Makes sense. It would also be really great to have an epic (story, whatever) for visibility and knowledge sharing, including a checklist of items to check that we could start collecting (off the top of my head: this edge case). And the last question: should we pull the broken version from PyPI, or deprecate it? (I'm not sure we've done this before, but in case we detect a severe issue with data reliability, should we consider deprecating versions?)
We migrated all of the remote tests to the
Thanks @pmrowla! Just a suggestion / an idea for another possible step: write test descriptions (e.g. link to a ticket, mention that it's an important edge case, etc). Not sure if it would have helped here though.
I think it’s better to ask for review in these kinds of refactoring. It’ll force you to make small and isolated PRs. At worst, reviews won’t make any difference. At best, we might detect these issues.
High level summary
Performance-related refactoring in `dvc-objects` re-introduced an old bug where files with an MD5 hash starting with `00` would be reported as missing from a remote in specific situations even though the files had been pushed properly. This would result in incorrect `dvc status -c` output as well as preventing DVC from fetching the files.
Timeline
All times in UTC+9
`dvc-objects`
`dvc-objects` and releases [dvc-objects 0.19.3](https://github.com/iterative/dvc-objects/releases/tag/0.19.3)
`dvc[testing]`, and fix merged into DVC main (testing: re-add 00 prefix remote traverse tests #8965)
Perf indicators
Impact
Users with large remotes on specific object-storage based clouds could encounter this bug when trying to fetch files with MD5 hashes beginning with `00`. Which clouds were affected depends on the individual fsspec `fs.find()` implementation (confirmed on S3 and Azure, issue not reproducible with GCS). Encountering the bug is also dependent on triggering the traverse-based existence check behavior in DVC (which is based on the relative difference between total number of files in the remote and total number of files the user wishes to fetch).
Estimating a number or percentage of impacted DVC users is not feasible given the specific edge-case nature of the bug.
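For readers unfamiliar with the traverse-based check, here is a rough sketch of the kind of heuristic described above. It is not DVC's actual code: the function names, the single-prefix sampling, and the `traverse_ratio` threshold are all assumptions made for illustration.

```python
from typing import Iterable

# Two hex characters give 256 possible prefixes ("00" through "ff").
PREFIX_COUNT = 256


def estimate_remote_size(sample_prefix_hashes: Iterable[str]) -> int:
    """Extrapolate the total object count from the objects listed under one prefix.

    Hypothetical helper: assumes hashes are spread evenly across all prefixes.
    """
    return len(set(sample_prefix_hashes)) * PREFIX_COUNT


def choose_strategy(
    wanted_count: int,
    estimated_remote_size: int,
    traverse_ratio: float = 0.1,  # hypothetical threshold, not taken from DVC
) -> str:
    """Pick between listing the whole remote and checking objects one by one."""
    if wanted_count >= estimated_remote_size * traverse_ratio:
        # Many wanted files relative to the remote: one full listing is cheaper.
        return "traverse"
    # Few wanted files relative to the remote: check each object individually.
    return "exists"
```

Because the choice depends on the ratio between the two counts, the same remote can take either code path depending on how many files a user asks for, which is part of why the bug only appeared in specific circumstances.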
Root cause analysis
This issue had been encountered and fixed in the past (#4141, #6089), and tests for this edge case were added in DVC when the issue was fixed at that time. However, when the remote plugins were separated into their own sub-projects and remote tests were refactored/migrated into `dvc[testing]`, the specific tests for this bug were mistakenly removed entirely and not migrated into the new remote test framework (b6f2a8e).
Given that this was an issue we have encountered and fixed in the past, it should have been caught. Unfortunately, since the old tests were lost, the bug was not caught when it was re-introduced.
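As an illustration of the kind of regression test that was lost, here is a minimal sketch. It does not reproduce the real `dvc[testing]` tests: `hashes_from_listing` and `find_missing` are hypothetical helpers, and the `<2-char prefix>/<rest of hash>` object layout is an assumption. The docstring links back to the original tickets, along the lines suggested in the comments.

```python
from typing import Iterable, Set


def hashes_from_listing(paths: Iterable[str]) -> Set[str]:
    """Rebuild full hashes from object paths shaped like '<prefix>/<rest>'."""
    hashes = set()
    for path in paths:
        parts = path.split("/")[-2:]
        if len(parts) == 2:  # skip entries that lack a prefix directory
            hashes.add("".join(parts))
    return hashes


def find_missing(wanted: Iterable[str], listed_paths: Iterable[str]) -> Set[str]:
    """Return the wanted hashes that do not appear in a remote listing."""
    return set(wanted) - hashes_from_listing(listed_paths)


def test_00_prefix_is_not_reported_missing() -> None:
    """Regression test for #4141 and #6089.

    Objects whose hash starts with '00' must be found by a traverse-style
    existence check, exactly like objects under any other prefix.
    """
    listing = ["00/ab12", "ff/cd34"]
    assert find_missing({"00ab12", "ffcd34"}, listing) == set()
```

A real test would need to run against each remote type and go through the traverse code path, since the bug depended on the individual `fs.find()` implementation.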
Prevention