Adding lock mechanism to prevent on_disk_cache downloading twice #409
Conversation
Doing cleanup now, but the principle had to remain. The main problem is that we had to put 'promise' files in place to avoid downloading twice.
I think we need to add a dependency to portalocker. I think our options are:
- Soft dependency - only use the locking mechanism when portalocker is installed, but otherwise the DataPipes here still work without it
- Semi-soft (?) - raise warnings/errors when these DataPipes are used but portalocker isn't installed. Maybe only in DLv2 when mp is enabled (potentially over-engineering)?
- Hard dependency - always use it and add it to our list of dependencies
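For illustration, the soft/semi-soft options above could be sketched like this: fall back to a warning no-op when portalocker is missing. The helper names `lock_file`/`unlock_file` are hypothetical, not part of the actual PR.

```python
# Sketch of the "soft dependency" option: use portalocker when available,
# otherwise degrade to a no-op lock with a warning (the "semi-soft" variant).
try:
    import portalocker

    def lock_file(fh):
        # Exclusive advisory lock on the open file handle.
        portalocker.lock(fh, portalocker.LOCK_EX)

    def unlock_file(fh):
        portalocker.unlock(fh)

except ImportError:
    import warnings

    def lock_file(fh):
        warnings.warn(
            "portalocker is not installed; on-disk cache is not protected "
            "against concurrent downloads"
        )

    def unlock_file(fh):
        pass
```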
LGTM, thank you!!
Added as a hard dependency.
TODOs:
Had to add a promise concept to make sure that workers wait for the 'main' worker to finish downloading/unpacking. To avoid DataLoader starvation we also fetch all file names into an infinite buffer.
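The waiting side of the promise concept can be sketched as a simple poll loop: a worker blocks until the 'main' worker deletes the promise file. `wait_for_promise` is a hypothetical helper for illustration, not the PR's actual implementation.

```python
import os
import time


def wait_for_promise(path, poll_interval=0.01):
    """Block until the promise file next to `path` disappears, i.e. until
    the 'main' worker has finished downloading/unpacking the file."""
    promise_path = path + ".promise"
    while os.path.exists(promise_path):
        time.sleep(poll_interval)
    return path
```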
Thank you soooo much for adding this fix. I do have one comment on the 1-to-n scenario.
self.assertEqual(2, len(all_files))
self.assertEqual("str", result[0][1])
Just want to verify: len(result) should be 1, right?
Nope. I'm creating one additional file inside of _slow_fn, so it would be the 'downloaded' file and the 'pid' file.
file_exists = len(data) > 0
if not file_exists:
    result = False
    promise_fh.seek(0)
    promise_fh.write("[dataloader session uid]")
    promise_fh.truncate()
    promise_fh.flush()

return result
When the cached op is 1-to-n, like decompression from an archive, if any decompressed file is missing or has an incorrect hash, we can directly return False with no need to check the other files, IMHO.
There is a chance that multiple processes lock different decompressed files for the same archive. Then both processes will run decompression -> race condition again.
So, I think we should lock over data rather than filepaths (data represents the compressed archive in this case). A process that observes a promise file over data can directly return True.
WDYT?
Unfortunately data could be a URL or something else, so it is hard to lock on it.
But this situation is covered.
Imagine data generates two file names: file1 and file2.
The initial pass (empty FS) will add two locks, file1.promise and file2.promise, and will go the 'False' route.
The second (and every subsequent) pass will see that the files are missing, but will fail to create the promises and go down the 'file exists' route, which will lead them to wait for file1.promise and file2.promise to disappear.
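The "fail to create the promise" step described above relies on exclusive file creation, which could be sketched like this (a hypothetical helper, assuming `O_EXCL` semantics on the cache filesystem):

```python
import os


def try_create_promise(filepath):
    """Attempt to exclusively create filepath + '.promise'.

    Returns True if this process won the race and should perform the
    download/decompression; False if another process already holds the
    promise and this process should wait for it to disappear instead.
    """
    try:
        # O_EXCL makes creation atomic: exactly one process succeeds.
        fd = os.open(filepath + ".promise", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fh:
        fh.write("[dataloader session uid]")
    return True
```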
If it is a URL, is it possible to create and lock root/URL.promise in the file system?
I think we should have a similar lock for NevermindHttpReader to prevent multiple processes from downloading the same file?
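One wrinkle with the root/URL.promise idea is that a raw URL is not a valid file name; it would need to be mapped to one first, e.g. by hashing. A minimal sketch (the helper name is hypothetical):

```python
import hashlib
import os


def promise_path_for_url(root, url):
    """Map a URL to a lockable promise-file path under `root`.

    Hashing makes the name filesystem-safe and deterministic, so every
    process derives the same lock path for the same URL.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(root, digest + ".promise")
```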
Oh, I see. The file_exists flag is used for processes to recognize that this file or its parent archives are going to be processed by another process.
Thanks for your explanation. LGTM
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Stack from ghstack:
Fixes #144
Differential Revision: D36489060