Adding lock mechanism to prevent on_disk_cache downloading twice #409

Closed

Conversation

Contributor

@VitalyFedyunin commented May 16, 2022

VitalyFedyunin added a commit that referenced this pull request May 16, 2022
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 16, 2022
VitalyFedyunin added a commit that referenced this pull request May 16, 2022
VitalyFedyunin added a commit that referenced this pull request May 16, 2022
@VitalyFedyunin VitalyFedyunin requested a review from ejguan May 16, 2022 19:28
@VitalyFedyunin
Contributor Author

Doing cleanup now, but the principle has to remain. The main problem is that we have to create 'promise' files to avoid downloading twice.

VitalyFedyunin added a commit that referenced this pull request May 16, 2022
Contributor

@NivekT left a comment

I think we need to add a dependency on portalocker. Our options are:

  1. Soft dependency - only use the locking mechanism when portalocker is installed, but the DataPipes here still work without it
  2. Semi-soft (?) - raise a warning/error when these DataPipes are used but portalocker isn't installed. Maybe only in DLv2 when mp is enabled (potentially over-engineering)?
  3. Hard dependency - always use it and add it to our list of dependencies
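
For reference, option 1 (the soft dependency) is often implemented with a guarded import. This is only an illustrative sketch, not the actual torchdata code; `file_lock` and `HAS_PORTALOCKER` are hypothetical names:

```python
# Hypothetical sketch of option 1 (soft dependency): use portalocker's
# inter-process lock when it is installed, otherwise fall back to a
# no-op context manager so the DataPipes keep working without it.
import contextlib

try:
    import portalocker
    HAS_PORTALOCKER = True
except ImportError:
    HAS_PORTALOCKER = False


def file_lock(path, timeout=60):
    """Return an inter-process lock on `path`, or a no-op stand-in."""
    if HAS_PORTALOCKER:
        return portalocker.Lock(path, timeout=timeout)
    return contextlib.nullcontext()
```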

VitalyFedyunin added a commit that referenced this pull request May 16, 2022
Contributor

@ejguan left a comment

LGTM, thank you!!

@VitalyFedyunin
Contributor Author

Added as hard dependency.

@VitalyFedyunin
Contributor Author

TODOs:
[ ] Remove lambda for full test coverage
[ ] Add a timeout to make sure we exit after waiting on a promise for too long (default 2 min), as this can happen when filename_fns are different
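
The timeout item above could look roughly like this. An illustrative sketch only; `wait_for_promise` is a hypothetical helper, and the 2-minute default mirrors the TODO:

```python
# Poll until the promise file disappears, raising after `timeout`
# seconds so a worker is not blocked forever when, e.g., filename_fns
# differ between processes and the promise is never released.
import os
import time


def wait_for_promise(promise_path, timeout=120, poll_interval=0.01):
    start = time.monotonic()
    while os.path.exists(promise_path):
        if time.monotonic() - start > timeout:
            raise TimeoutError(
                f"Waited more than {timeout}s for {promise_path}"
            )
        time.sleep(poll_interval)
```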

VitalyFedyunin added a commit that referenced this pull request May 16, 2022
VitalyFedyunin added a commit that referenced this pull request May 16, 2022
VitalyFedyunin added a commit that referenced this pull request May 17, 2022
@VitalyFedyunin
Contributor Author

Had to add a promise concept to make sure that workers wait for the 'main' worker to finish downloading/unpacking. To avoid DataLoader starvation, we also fetch all file names into an infinite buffer.

@VitalyFedyunin VitalyFedyunin requested a review from ejguan May 17, 2022 20:22
VitalyFedyunin added a commit that referenced this pull request May 17, 2022
VitalyFedyunin added a commit that referenced this pull request May 17, 2022
VitalyFedyunin added a commit that referenced this pull request May 18, 2022
VitalyFedyunin added a commit that referenced this pull request May 18, 2022
VitalyFedyunin added a commit that referenced this pull request May 18, 2022
Contributor

@ejguan left a comment

Thank you soooo much for adding this fix. I do have one comment on the 1-to-n scenario.

Comment on lines +630 to +631
self.assertEqual(2, len(all_files))
self.assertEqual("str", result[0][1])
Contributor

Just want to verify: len(result) should be 1, right?

Contributor Author

Nope. I'm creating one additional file inside _slow_fn, so it would be the 'downloaded' file and the 'pid' file.

Comment on lines +222 to +230
file_exists = len(data) > 0
if not file_exists:
result = False
promise_fh.seek(0)
promise_fh.write("[dataloader session uid]")
promise_fh.truncate()
promise_fh.flush()

return result
Contributor

When the cached op is 1-to-n, like decompression from an archive, if any decompressed file is missing or has an incorrect hash, we can directly return False with no need to check the other files, IMHO.
There is a chance that multiple processes lock different decompressed files from the same archive. Then both processes will run decompression -> race condition again.

So, I think we should lock over data rather than filepaths (data represents the compressed archive in this case). A process that observes a promise file over data can directly return True.

WDYT?

Contributor Author

Unfortunately, data could be a URL or something else, so it is hard to lock on it.

But this situation is covered.

Imagine data generates two file names: file1 and file2.

The initial pass (empty FS) will add two locks, file1.promise and file2.promise, and will go down the 'False' route.

Every subsequent pass will see that the files are missing, but will fail to create the promises and go down the 'file exists' route, which will lead it to wait for file1.promise and file2.promise to disappear.
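
The two routes described above hinge on promise creation being atomic, so exactly one process wins per file. A minimal sketch of that check (illustrative names; this sketch uses O_EXCL to show the idea, which is one way to get the atomic create-if-absent behavior):

```python
# One process atomically creates `<file>.promise` and takes the 'False'
# (do the download/unpack) route; every other process hits
# FileExistsError and takes the 'file exists' (wait) route.
import os


def try_acquire_promise(filepath):
    promise = filepath + ".promise"
    try:
        # O_CREAT | O_EXCL makes create-if-absent atomic: exactly one winner.
        fd = os.open(promise, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True   # we own the work
    except FileExistsError:
        return False  # someone else is already on it
```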

Contributor

@NivekT May 18, 2022

If it is a URL, is it possible to create and lock root/URL.promise in the file system?

I think we should have a similar lock for HttpReader to prevent multiple processes from downloading the same file? Nevermind.
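
Creating root/URL.promise would need a filesystem-safe name, since raw URLs contain characters that are unsafe in filenames. One hypothetical way to do it (not part of this PR) is to hash the URL:

```python
# Illustrative only: derive a promise path for an arbitrary URL by
# hashing it under `root`.
import hashlib
import os


def url_promise_path(root, url):
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(root, digest + ".promise")
```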

Contributor

Oh, I see. The file_exists flag is used by processes to recognize that this file or its parent archive is going to be processed by another process.

VitalyFedyunin added a commit that referenced this pull request May 18, 2022
VitalyFedyunin added a commit that referenced this pull request May 18, 2022
Contributor

@ejguan left a comment

Thanks for your explanation. LGTM

VitalyFedyunin added a commit that referenced this pull request May 18, 2022
VitalyFedyunin added a commit that referenced this pull request May 18, 2022
@VitalyFedyunin
Contributor Author

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

VitalyFedyunin added a commit that referenced this pull request May 18, 2022
VitalyFedyunin added a commit that referenced this pull request May 18, 2022
@VitalyFedyunin
Contributor Author

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

VitalyFedyunin added a commit that referenced this pull request May 18, 2022
@VitalyFedyunin
Contributor Author

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
4 participants