HADOOP-18177. Document prefetching architecture. #4205

ahmarsuhail · 2022-04-20T11:10:01Z

Description of PR

Documents usage and architecture of the prefetching input stream.

hadoop-yetus · 2022-04-20T14:25:20Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 49s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	markdownlint	0m 0s		markdownlint was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
			_ feature-HADOOP-18028-s3a-prefetch Compile Tests _
+1 💚	mvninstall	42m 8s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	mvnsite	0m 55s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	shadedclient	66m 19s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	0m 38s		the patch passed
-1 ❌	blanks	0m 0s	/blanks-tabs.txt	The patch 1 line(s) with tabs.
+1 💚	mvnsite	0m 40s		the patch passed
+1 💚	shadedclient	23m 36s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
+1 💚	asflicense	0m 43s		The patch does not generate ASF License warnings.
		93m 33s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/2/artifact/out/Dockerfile
GITHUB PR	#4205
Optional Tests	dupname asflicense mvnsite codespell markdownlint
uname	Linux dfc2242ac9ef 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	feature-HADOOP-18028-s3a-prefetch / `e5e9ea3`
Max. process+thread count	522 (vs. ulimit of 5500)
modules	C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/2/console
versions	git=2.25.1 maven=3.6.3
Powered by	Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

dannycjones

Looks good, I like the examples explaining the behaviours.

I've added a few comments, I think a bit of info was lost when taking it from a doc writing app to markdown.

I'm not intimately familiar with the prefetching implementation details, so another reviewer may be able to dig deeper there if needed.

dannycjones · 2022-04-21T08:55:34Z

hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md

+
+
+
+[Image: image.png]


What image?

let's add alt text when embedding the image.

also, not sure what the best practice is here for adding images to the docs. should they be stored in the repo?

they go into hadoop-tools/hadoop-aws/src/site/resources/images ; other modules have examples of this

sorry, I didn't mean to commit this. I was initially using the image in the blogpost: https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0, but that diagram is very high level so I'm not sure if it adds much here. Do we think having that will help? Or do we think we need a lower level architecture diagram?

dannycjones · 2022-04-21T09:00:39Z

hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md

+
+This input stream implements prefetching and caching to improve read performance of the input stream. A high level overview of this feature can also be found on [this](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0) blogpost.
+
+With prefetching, we divide the file into blocks of a fixed size (default is 8MB), associate buffers to these blocks, and then read data into these buffers asynchronously. We also potentially cache these blocks.


I might drop the number of references to the default in case we want to change it in the future.

good point, have removed
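
For readers following the thread, the block division described in the quoted paragraph can be sketched in a few lines of Java. This is only an illustration of the arithmetic, not the code in the feature branch: the class `BlockMathSketch` and its method names are hypothetical, and 8 MiB is used purely as an example block size.

```java
/** Hypothetical helper: maps byte offsets of a remote file onto fixed-size blocks. */
final class BlockMathSketch {
  private final long fileSize;
  private final int blockSize;

  BlockMathSketch(long fileSize, int blockSize) {
    if (fileSize < 0 || blockSize <= 0) {
      throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
    }
    this.fileSize = fileSize;
    this.blockSize = blockSize;
  }

  /** Number of blocks in the file, counting a final partial block. */
  int numBlocks() {
    return (int) ((fileSize + blockSize - 1) / blockSize);
  }

  /** Index of the block that contains the given byte offset. */
  int blockNumber(long offset) {
    if (offset < 0 || offset >= fileSize) {
      throw new IllegalArgumentException("offset out of range: " + offset);
    }
    return (int) (offset / blockSize);
  }

  /** Offset in the file at which the given block starts. */
  long blockStart(int blockNumber) {
    return (long) blockNumber * blockSize;
  }

  /** Length of the given block; only the last block may be shorter than blockSize. */
  int blockLength(int blockNumber) {
    return (int) Math.min(blockSize, fileSize - blockStart(blockNumber));
  }

  public static void main(String[] args) {
    int exampleBlockSize = 8 * 1024 * 1024;            // example value only
    BlockMathSketch blocks = new BlockMathSketch(20L * 1024 * 1024, exampleBlockSize);
    System.out.println(blocks.numBlocks());                     // 3
    System.out.println(blocks.blockNumber(9L * 1024 * 1024));   // 1
    System.out.println(blocks.blockLength(2));                  // 4194304
  }
}
```

A 20 MiB file therefore maps onto two full blocks and one 4 MiB block, and each of those blocks would be the unit that gets read into a buffer or cached.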

steveloughran

this is a great start.

can you split lines up so that there's no more than 100 chars per line.
this makes it easier to comment on (you can comment on the line with the specific text)
and will help with diffs in future.

one thing the doc made clear to me is that now that there are also local files for cached blocks, it gets hard to distinguish file-on-s3 from file-on-local-disk-containing-part of that s3 file.

maybe we need to distinguish them more clearly. i did comment that maybe the s3 object should be called an object, but that gets confused with java objects, and doesn't apply to other stores

i'm going to propose that in the doc and the source we

* use "remote file" to refer to the file in the remote object/file store.
* use "block file" to refer to a local file containing a block of the remote file.

this is a bit like how we use S3ADataBlocks to refer to the data we are buffering before uploading

if we go with this, we will have to rename SingleFilePerBlockCache to something else, like BlockFileCache.

steveloughran · 2022-04-22T16:16:05Z

hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md

+
+|Property    |Meaning    |Default    |
+|---	|---	|---	|
+|fs.s3a.prefetch.enabled    |Enable the prefetch input stream    |TRUE |


use backticks around the configuration names and values; all values must be valid if passed in as the config strings. they will be.

use true for true, rather than the capitalised value

steveloughran · 2022-04-22T16:17:31Z

hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md

+|Property    |Meaning    |Default    |
+|---	|---	|---	|
+|fs.s3a.prefetch.enabled    |Enable the prefetch input stream    |TRUE |
+|fs.s3a.prefetch.block.size    |Size of a block    |8MB    |


8MB isn't going to be a valid value. 8M should be
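
To illustrate why `8M` would be accepted and `8MB` would not, here is a minimal sketch. It assumes the property is read with a human-readable byte-size parser that takes a single-letter binary suffix (k, m, g, t, p, e), in the style of Hadoop's `Configuration.getLongBytes`. The class `SizeSuffixSketch` is hypothetical and is not the Hadoop implementation.

```java
/** Hypothetical parser for sizes such as "8M"; mimics single-letter binary suffixes. */
final class SizeSuffixSketch {
  // Each position is another factor of 1024: k, m, g, t, p, e.
  private static final String SUFFIXES = "kmgtpe";

  static long parseBytes(String value) {
    String s = value.trim();
    if (s.isEmpty()) {
      throw new IllegalArgumentException("empty size");
    }
    char last = Character.toLowerCase(s.charAt(s.length() - 1));
    if (Character.isDigit(last)) {
      return Long.parseLong(s);                  // plain number of bytes
    }
    int index = SUFFIXES.indexOf(last);
    if (index < 0) {
      throw new IllegalArgumentException(
          "Invalid size suffix '" + last + "' in '" + value + "'");
    }
    long number = Long.parseLong(s.substring(0, s.length() - 1));
    // multiplyExact fails loudly on overflow instead of wrapping around.
    return Math.multiplyExact(number, 1L << (10 * (index + 1)));
  }

  public static void main(String[] args) {
    System.out.println(parseBytes("8M"));        // 8388608
    try {
      parseBytes("8MB");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Under that assumption the trailing `B` is treated as the suffix, which is not one of the accepted letters, so a default documented as `8MB` could not be pasted back in as a config value.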

steveloughran · 2022-04-22T16:46:02Z

hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md

+
+
+
+[Image: image.png]


they go into hadoop-tools/hadoop-aws/src/site/resources/images ; other modules have examples of this

steveloughran · 2022-04-22T16:46:37Z

hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md

+in.read(buffer, 0, 8MB)
+```
+
+For the first read call, there is no valid buffer yet. `ensureCurrentBuffer()` is called, and for the first read(), prefetch count is set as 1.


add backticks around 'read()'

steveloughran · 2022-04-22T16:47:46Z

hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md

+
+### Basic Concepts
+
+* **File** : A binary blob of data stored on some storage device.


should the doc use Object to refer to something in s3, to distinguish it from the local FS files?

hadoop-yetus · 2022-04-25T18:23:51Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	18m 10s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	markdownlint	0m 0s		markdownlint was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
			_ feature-HADOOP-18028-s3a-prefetch Compile Tests _
+1 💚	mvninstall	41m 49s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	mvnsite	0m 55s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	shadedclient	65m 51s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	0m 39s		the patch passed
-1 ❌	blanks	0m 0s	/blanks-tabs.txt	The patch 1 line(s) with tabs.
+1 💚	mvnsite	0m 39s		the patch passed
+1 💚	shadedclient	23m 6s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
+1 💚	asflicense	0m 42s		The patch does not generate ASF License warnings.
		109m 59s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/3/artifact/out/Dockerfile
GITHUB PR	#4205
Optional Tests	dupname asflicense mvnsite codespell markdownlint
uname	Linux f50a219ad738 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	feature-HADOOP-18028-s3a-prefetch / `24380d9`
Max. process+thread count	520 (vs. ulimit of 5500)
modules	C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/3/console
versions	git=2.25.1 maven=3.6.3
Powered by	Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

hadoop-yetus · 2022-04-25T18:37:54Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	18m 3s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	markdownlint	0m 0s		markdownlint was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
			_ feature-HADOOP-18028-s3a-prefetch Compile Tests _
+1 💚	mvninstall	41m 46s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	mvnsite	0m 56s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	shadedclient	66m 3s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	0m 38s		the patch passed
-1 ❌	blanks	0m 0s	/blanks-tabs.txt	The patch 1 line(s) with tabs.
+1 💚	mvnsite	0m 41s		the patch passed
+1 💚	shadedclient	23m 22s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
+1 💚	asflicense	0m 42s		The patch does not generate ASF License warnings.
		110m 18s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/4/artifact/out/Dockerfile
GITHUB PR	#4205
Optional Tests	dupname asflicense mvnsite codespell markdownlint
uname	Linux ba91b572c6c8 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	feature-HADOOP-18028-s3a-prefetch / `24380d9`
Max. process+thread count	521 (vs. ulimit of 5500)
modules	C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/4/console
versions	git=2.25.1 maven=3.6.3
Powered by	Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

dannycjones

Looking good, mainly just looking at the markdown line breaks for future maintainability

hadoop-yetus · 2022-04-26T10:37:59Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 52s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	markdownlint	0m 0s		markdownlint was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
			_ feature-HADOOP-18028-s3a-prefetch Compile Tests _
+1 💚	mvninstall	41m 38s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	mvnsite	0m 56s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	shadedclient	66m 33s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	0m 41s		the patch passed
-1 ❌	blanks	0m 0s	/blanks-tabs.txt	The patch 1 line(s) with tabs.
+1 💚	mvnsite	0m 45s		the patch passed
+1 💚	shadedclient	24m 10s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
+1 💚	asflicense	0m 41s		The patch does not generate ASF License warnings.
		94m 31s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/5/artifact/out/Dockerfile
GITHUB PR	#4205
Optional Tests	dupname asflicense mvnsite codespell markdownlint
uname	Linux 9472cd5389b4 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	feature-HADOOP-18028-s3a-prefetch / `ba1d26a`
Max. process+thread count	518 (vs. ulimit of 5500)
modules	C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/5/console
versions	git=2.25.1 maven=3.6.3
Powered by	Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

hadoop-yetus · 2022-04-26T11:17:59Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 54s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 1s		No case conflicting files found.
+0 🆗	codespell	0m 1s		codespell was not available.
+0 🆗	markdownlint	0m 1s		markdownlint was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
			_ feature-HADOOP-18028-s3a-prefetch Compile Tests _
+1 💚	mvninstall	42m 23s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	mvnsite	0m 53s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	shadedclient	66m 45s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	0m 39s		the patch passed
-1 ❌	blanks	0m 0s	/blanks-eol.txt	The patch has 30 line(s) that end in blanks. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
-1 ❌	blanks	0m 0s	/blanks-tabs.txt	The patch 1 line(s) with tabs.
+1 💚	mvnsite	0m 40s		the patch passed
+1 💚	shadedclient	23m 36s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
+1 💚	asflicense	0m 43s		The patch does not generate ASF License warnings.
		94m 11s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/6/artifact/out/Dockerfile
GITHUB PR	#4205
Optional Tests	dupname asflicense mvnsite codespell markdownlint
uname	Linux afff6318d18d 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	feature-HADOOP-18028-s3a-prefetch / `125152e`
Max. process+thread count	589 (vs. ulimit of 5500)
modules	C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/6/console
versions	git=2.25.1 maven=3.6.3
Powered by	Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

dannycjones

Last things to address are the Yetus failures due to trailing spaces and use of tabs. With that, +1 from me.

dannycjones · 2022-04-26T11:23:43Z

hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md

+Once a buffer has been acquired by `CachingBlockManager`, if the buffer is in a *READY* state, it is
+returned. 
+This means that data was already read into this buffer asynchronously by a prefetch. 
+If it’s state is *BLANK,* then data is read into it using 


we can move out the comma or drop it here

Suggested change

-If it’s state is *BLANK,* then data is read into it using
+If it’s state is *BLANK* then data is read into it using

dannycjones · 2022-04-26T11:25:58Z

hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md

+The size of this pool is = prefetch block count + 1. 
+If the prefetch block count is 8, the buffer pool has a size of 9.
+* If the pool is not yet at capacity, create a new buffer and add it to the pool.
+* If it’s at capacity, check if any buffers with state = done can be released. 


I noticed Yetus formats these ’ weirdly. Can we switch to the standard single quote wherever ’ occurs?
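
As a side note on the quoted text, the pool sizing rule (prefetch block count + 1) can be sketched as below. This is a simplified illustration under stated assumptions, not the buffer pool in the feature branch: `BufferPoolSketch` is a hypothetical class, and "buffers with state = done can be released" is modelled here as the caller handing a buffer back through `release()`.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical bounded pool holding at most (prefetch block count + 1) buffers. */
final class BufferPoolSketch {
  private final int capacity;
  private final int bufferSize;
  private final Deque<ByteBuffer> released = new ArrayDeque<>();
  private int created;

  BufferPoolSketch(int prefetchBlockCount, int bufferSize) {
    this.capacity = prefetchBlockCount + 1;
    this.bufferSize = bufferSize;
  }

  int capacity() {
    return capacity;
  }

  /** Returns a buffer, or null when the pool is at capacity and nothing was released. */
  synchronized ByteBuffer tryAcquire() {
    if (created < capacity) {
      created++;                                 // not at capacity: create a new buffer
      return ByteBuffer.allocate(bufferSize);
    }
    ByteBuffer reusable = released.poll();       // at capacity: reuse a released buffer
    if (reusable != null) {
      reusable.clear();
    }
    return reusable;
  }

  /** Hands a buffer back so that a later tryAcquire() can reuse it. */
  synchronized void release(ByteBuffer buffer) {
    released.push(buffer);
  }

  public static void main(String[] args) {
    BufferPoolSketch pool = new BufferPoolSketch(8, 16);
    System.out.println(pool.capacity());         // 9
  }
}
```

With a prefetch block count of 8, nine buffers can be outstanding at once; a tenth request has to wait for a release, which is the situation the second bullet in the quoted text describes.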

hadoop-yetus · 2022-04-26T13:49:00Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 50s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	markdownlint	0m 0s		markdownlint was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
			_ feature-HADOOP-18028-s3a-prefetch Compile Tests _
+1 💚	mvninstall	42m 6s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	mvnsite	0m 55s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	shadedclient	66m 12s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	0m 38s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	mvnsite	0m 40s		the patch passed
+1 💚	shadedclient	23m 34s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
+1 💚	asflicense	0m 43s		The patch does not generate ASF License warnings.
		93m 28s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/7/artifact/out/Dockerfile
GITHUB PR	#4205
Optional Tests	dupname asflicense mvnsite codespell markdownlint
uname	Linux 918724e3c41c 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	feature-HADOOP-18028-s3a-prefetch / `658396a`
Max. process+thread count	601 (vs. ulimit of 5500)
modules	C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/7/console
versions	git=2.25.1 maven=3.6.3
Powered by	Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

hadoop-yetus · 2022-04-26T14:22:19Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 54s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	markdownlint	0m 0s		markdownlint was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
			_ feature-HADOOP-18028-s3a-prefetch Compile Tests _
+1 💚	mvninstall	41m 59s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	mvnsite	0m 56s		feature-HADOOP-18028-s3a-prefetch passed
+1 💚	shadedclient	66m 18s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	0m 38s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	mvnsite	0m 41s		the patch passed
+1 💚	shadedclient	23m 32s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
+1 💚	asflicense	0m 43s		The patch does not generate ASF License warnings.
		93m 36s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/8/artifact/out/Dockerfile
GITHUB PR	#4205
Optional Tests	dupname asflicense mvnsite codespell markdownlint
uname	Linux 416a76a29726 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	feature-HADOOP-18028-s3a-prefetch / `9558361`
Max. process+thread count	525 (vs. ulimit of 5500)
modules	C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4205/8/console
versions	git=2.25.1 maven=3.6.3
Powered by	Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

dannycjones

lgtm!

steveloughran

+1, thanks. merging

Contributed by Ahmar Suhail

This is a rollup patch of the HADOOP-18028 S3A performance input stream feature branch.

Contains

HADOOP-18028. High performance S3A input stream (apache#4109)
This is the merge of the HADOOP-18028 S3A performance input stream. This patch on its own is incomplete and must be accompanied by all other commits with HADOOP-18028 in their git commit message. Consult the JIRA for that list.
Contributed by Bhalchandra Pandit.

HADOOP-18180. Replace use of twitter util-core with java futures in S3A prefetching stream (apache#4115)
Contributed by PJ Fanning.

HADOOP-18177. Document prefetching architecture. (apache#4205)
Contributed by Ahmar Suhail

HADOOP-18175. fix test failures with prefetching s3a input stream (apache#4212)
Contributed by Monthon Klongklaew

HADOOP-18231. S3A prefetching: fix failing tests & drain stream async. (apache#4386)
* adds in new test for prefetching input stream
* creates streamStats before opening stream
* updates numBlocks calculation method
* fixes ITestS3AOpenCost.testOpenFileLongerLength
* drains stream async
* fixes failing unit test
Contributed by Ahmar Suhail

HADOOP-18254. Disable S3A prefetching by default. (apache#4469)
Contributed by Ahmar Suhail

HADOOP-18190. Collect IOStatistics during S3A prefetching (apache#4458)
This adds iOStatisticsConnection to the S3PrefetchingInputStream class, with new statistic names in StreamStatistics. This stream is not (yet) IOStatisticsContext aware.
Contributed by Ahmar Suhail.

Change-Id: I48f217086531c12d6e2f0f91e39f17054a74d20f

This is the preview release of the HADOOP-18028 S3A performance input stream. It is still stabilizing, but ready to test.

Contains

HADOOP-18028. High performance S3A input stream (apache#4109)
Contributed by Bhalchandra Pandit.

HADOOP-18180. Replace use of twitter util-core with java futures (apache#4115)
Contributed by PJ Fanning.

HADOOP-18177. Document prefetching architecture. (apache#4205)
Contributed by Ahmar Suhail

HADOOP-18175. fix test failures with prefetching s3a input stream (apache#4212)
Contributed by Monthon Klongklaew

HADOOP-18231. S3A prefetching: fix failing tests & drain stream async. (apache#4386)
* adds in new test for prefetching input stream
* creates streamStats before opening stream
* updates numBlocks calculation method
* fixes ITestS3AOpenCost.testOpenFileLongerLength
* drains stream async
* fixes failing unit test
Contributed by Ahmar Suhail

HADOOP-18254. Disable S3A prefetching by default. (apache#4469)
Contributed by Ahmar Suhail

HADOOP-18190. Collect IOStatistics during S3A prefetching (apache#4458)
This adds iOStatisticsConnection to the S3PrefetchingInputStream class, with new statistic names in StreamStatistics. This stream is not (yet) IOStatisticsContext aware.
Contributed by Ahmar Suhail

HADOOP-18379 rebase feature/HADOOP-18028-s3a-prefetch to trunk
HADOOP-18187. Convert s3a prefetching to use JavaDoc for fields and enums.
HADOOP-18318. Update class names to be clear they belong to S3A prefetching
Contributed by Steve Loughran

Change-Id: I6511c51c3580c57eb72e8ea686c88e3917d12a06

This is the preview release of the HADOOP-18028 S3A performance input stream. It is still stabilizing, but ready to test.

Contains

HADOOP-18028. High performance S3A input stream (apache#4109)
Contributed by Bhalchandra Pandit.

HADOOP-18180. Replace use of twitter util-core with java futures (apache#4115)
Contributed by PJ Fanning.

HADOOP-18177. Document prefetching architecture. (apache#4205)
Contributed by Ahmar Suhail

HADOOP-18175. fix test failures with prefetching s3a input stream (apache#4212)
Contributed by Monthon Klongklaew

HADOOP-18231. S3A prefetching: fix failing tests & drain stream async. (apache#4386)
* adds in new test for prefetching input stream
* creates streamStats before opening stream
* updates numBlocks calculation method
* fixes ITestS3AOpenCost.testOpenFileLongerLength
* drains stream async
* fixes failing unit test
Contributed by Ahmar Suhail

HADOOP-18254. Disable S3A prefetching by default. (apache#4469)
Contributed by Ahmar Suhail

HADOOP-18190. Collect IOStatistics during S3A prefetching (apache#4458)
This adds iOStatisticsConnection to the S3PrefetchingInputStream class, with new statistic names in StreamStatistics. This stream is not (yet) IOStatisticsContext aware.
Contributed by Ahmar Suhail

HADOOP-18379 rebase feature/HADOOP-18028-s3a-prefetch to trunk
HADOOP-18187. Convert s3a prefetching to use JavaDoc for fields and enums.
HADOOP-18318. Update class names to be clear they belong to S3A prefetching
Contributed by Steve Loughran

Change-Id: I3eca19564dc0c0cb83184f4a42605dbafd908937

ahmarsuhail added 2 commits April 20, 2022 12:05

adds in prefetching architecture

b089cf6

fixes formatting errors

e5e9ea3

dannycjones suggested changes Apr 21, 2022

steveloughran requested changes Apr 22, 2022

ahmarsuhail added 2 commits April 25, 2022 17:31

updates doc as per review comments

11d33c3

fixes typo

24380d9

adds in block file description

ba1d26a

dannycjones suggested changes Apr 26, 2022

updates formatting

125152e

dannycjones suggested changes Apr 26, 2022

ahmarsuhail added 2 commits April 26, 2022 13:13

fixes yetus errors

658396a

update quotation marks

9558361

dannycjones approved these changes Apr 26, 2022

apache deleted a comment from hadoop-yetus Apr 26, 2022

steveloughran approved these changes Apr 26, 2022

steveloughran merged commit f4d016f into apache:feature-HADOOP-18028-s3a-prefetch Apr 26, 2022

ahmarsuhail added a commit to ahmarsuhail/hadoop that referenced this pull request May 19, 2022

HADOOP-18177. Document prefetching architecture. (apache#4205)

b653cfa

Contributed by Ahmar Suhail

steveloughran pushed a commit to steveloughran/hadoop that referenced this pull request Jul 28, 2022

HADOOP-18177. Document prefetching architecture. (apache#4205)

538ddf8

Contributed by Ahmar Suhail

steveloughran mentioned this pull request Aug 17, 2022

HADOOP-18028. High performance S3A input stream #4752

Merged

4 tasks

ahmarsuhail deleted the HADOOP-18177-document-prefetching branch October 7, 2022 08:35

steveloughran mentioned this pull request Apr 14, 2023

HADOOP-18028. High performance S3A input stream (#4752) #5559

Closed

4 tasks

