Adding lock mechanism to prevent on_disk_cache downloading twice #409
Conversation
Doing cleanup now, but the principle had to remain. The main problem is that we had to put 'promise' files in place to avoid downloading twice.
I think we need to add a dependency to portalocker. I think our options are:
- Soft dependency - only use the locking mechanism when portalocker is installed, but otherwise the DataPipes here still work without it
- Semi-soft (?) - raise warnings/errors when these DataPipes are used but portalocker isn't installed. Maybe only in DLv2 when mp is enabled (potentially over-engineering)?
- Hard dependency - always use it and add it to our list of dependencies
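For illustration, the soft/semi-soft options above could be sketched like this: fall back to a warning no-op when portalocker is missing. The helper names `lock_file`/`unlock_file` are hypothetical, not part of the actual PR.

```python
# Sketch of the "soft dependency" option: use portalocker when available,
# otherwise degrade to a no-op lock with a warning (the "semi-soft" variant).
try:
    import portalocker

    def lock_file(fh):
        # Exclusive advisory lock on the open file handle.
        portalocker.lock(fh, portalocker.LOCK_EX)

    def unlock_file(fh):
        portalocker.unlock(fh)

except ImportError:
    import warnings

    def lock_file(fh):
        warnings.warn(
            "portalocker is not installed; on-disk cache is not protected "
            "against concurrent downloads"
        )

    def unlock_file(fh):
        pass
```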
LGTM, thank you!!
Added as a hard dependency.
TODOs:
Had to add a promise concept to make sure that workers wait for the 'main' worker to finish downloading/unpacking. To avoid DataLoader starvation we also fetch all file names into an infinite buffer.
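The waiting side of the promise concept can be sketched as a simple poll loop: a worker blocks until the 'main' worker deletes the promise file. `wait_for_promise` is a hypothetical helper for illustration, not the PR's actual implementation.

```python
import os
import time


def wait_for_promise(path, poll_interval=0.01):
    """Block until the promise file next to `path` disappears, i.e. until
    the 'main' worker has finished downloading/unpacking the file."""
    promise_path = path + ".promise"
    while os.path.exists(promise_path):
        time.sleep(poll_interval)
    return path
```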
Thank you soooo much for adding this fix. I do have one comment on the 1-to-n scenario.
self.assertEqual(2, len(all_files))
self.assertEqual("str", result[0][1])
Just want to verify: len(result) should be 1, right?
Nope. I'm creating one additional file inside of _slow_fn, so it would be the 'downloaded' file and the 'pid' file.
file_exists = len(data) > 0
if not file_exists:
    result = False
    promise_fh.seek(0)
    promise_fh.write("[dataloader session uid]")
    promise_fh.truncate()
    promise_fh.flush()

return result
When the cached op is 1-to-n, like decompression from an archive, if any decompressed file is missing or has an incorrect hash, we can directly return False with no need to check the other files, IMHO.
There is a chance that multiple processes lock different decompressed files for the same archive. Then both processes will run decompression -> race condition again.
So, I think we should lock over data rather than filepaths (data represents the compressed archive in this case). A process that observes a promise file over data can directly return True.
WDYT?
Unfortunately data could be a URL or something else, so it is hard to lock on it.
But this situation is covered.
Imagine data generates two file names: file1 and file2.
The initial pass (empty FS) will add two locks, file1.promise and file2.promise, and will go the 'False' route.
The second (and every subsequent) pass will see that the files are missing, but will fail to create the promises and go down the 'file exists' route, which will lead them to wait for file1.promise and file2.promise to disappear.
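The "fail to create the promise" step described above relies on exclusive file creation, which could be sketched like this (a hypothetical helper, assuming `O_EXCL` semantics on the cache filesystem):

```python
import os


def try_create_promise(filepath):
    """Attempt to exclusively create filepath + '.promise'.

    Returns True if this process won the race and should perform the
    download/decompression; False if another process already holds the
    promise and this process should wait for it to disappear instead.
    """
    try:
        # O_EXCL makes creation atomic: exactly one process succeeds.
        fd = os.open(filepath + ".promise", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fh:
        fh.write("[dataloader session uid]")
    return True
```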
If it is a URL, is it possible to create and lock root/URL.promise in the file system?
I think we should have a similar lock for NevermindHttpReader to prevent multiple processes from downloading the same file?
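One wrinkle with the root/URL.promise idea is that a raw URL is not a valid file name; it would need to be mapped to one first, e.g. by hashing. A minimal sketch (the helper name is hypothetical):

```python
import hashlib
import os


def promise_path_for_url(root, url):
    """Map a URL to a lockable promise-file path under `root`.

    Hashing makes the name filesystem-safe and deterministic, so every
    process derives the same lock path for the same URL.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(root, digest + ".promise")
```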
Oh, I see. The file_exists flag is used for processes to recognize that this file or its parent archives are going to be processed by another process.
Thanks for your explanation. LGTM
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Stack from ghstack:
Fixes #144
Differential Revision: D36489060