HADOOP-13278. S3AFileSystem mkdirs does not need to validate parent path components #100

apetresc · 2016-06-15T21:33:20Z

According to S3 semantics, there is no conflict if a bucket contains a key named a/b and also a directory named a/b/c. "Directories" in S3 are, after all, nothing but prefixes.

However, the mkdirs call in S3AFileSystem does go out of its way to traverse every parent path component for the directory it's trying to create, making sure there's no file with that name. This is suboptimal for three main reasons:

Wasted API calls, since the client is getting metadata for each path component
This can cause major problems with buckets whose permissions are being managed by IAM, where access may not be granted to the root bucket, but only to some prefix. When you call mkdirs, even on a prefix that you have access to, the traversal up the path will cause you to eventually hit the root bucket, which will fail with a 403 - even though the directory creation call would have succeeded.
Some people might actually have a file that matches some other file's prefix... I can't see why they would want to do that, but it's not against S3's rules.

I've opened a ticket on the Hadoop JIRA. This pull request is a simple patch that just removes this portion of the check. I have tested it with my team's instance of Spark + Luigi, and can confirm it works, and resolves the aforementioned permissions issue for a bucket on which we only had prefix access.

This is my first ticket/pull request against Hadoop, so let me know if I'm not following some convention properly :)

Given S3 semantics, there's actually no problem with having a/b/c be a prefix even if a/b or a is already a file. So there's no need to check for it - it wastes API calls and can lead to problems with access control if the caller only has permissions starting at some prefix.

In the processor graph, the topological sort missed adding to the visited set during graph traversal. This caused wrong graph being generated for single-node loop. This is fixed in the patch. Also fixed the maxPartition method not handling empty collection correctly. Added a few new unit tests for these. Also adjust the timing of previous async commit unit tests so it can run more reliably. Long term wise we need to fix the timer inside the AsyncRunLoop tests. Author: Xinyu Liu <[email protected]> Reviewers: Jacob Maes <[email protected]> Closes apache#100 from xinyuiscool/SAMZA-1172

Also, cleaning up PRs: closes apache#206, closes apache#47, closes apache#72, closes apache#87, closes apache#96, closes apache#100, closes apache#107, and closes apache#112.

asfgit force-pushed the trunk branch from 25ac447 to 4d1fac5 Compare April 15, 2017 19:10

apetresc closed this Oct 10, 2019

apetresc deleted the s3a-root-path-components branch October 10, 2019 18:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HADOOP-13278. S3AFileSystem mkdirs does not need to validate parent path components #100

HADOOP-13278. S3AFileSystem mkdirs does not need to validate parent path components #100

Uh oh!

apetresc commented Jun 15, 2016

Uh oh!

Uh oh!

HADOOP-13278. S3AFileSystem mkdirs does not need to validate parent path components #100

HADOOP-13278. S3AFileSystem mkdirs does not need to validate parent path components #100

Uh oh!

Conversation

apetresc commented Jun 15, 2016

Uh oh!

Uh oh!