HADOOP-13371 S3a Globber, WiP #204

steveloughran · 2017-03-18T18:35:34Z

This is what I'd done up to the point where I stopped looking at it: it copies over the FileSystem.globber and then adds an initial scale test.

I think we could do a great scale test against the landsat repo.
I chose to copy the FileSystem.globber class rather than try to tweak it for subclassing, so that we can do much more in here, and to set things up for a globStatus call which would return a remote iterator rather than an array.

kazuyukitanimura · 2017-03-20T17:00:58Z

Hi @steveloughran
Thank you for sharing this S3A globber. I started reading the code, but at a high level, what has already been done, and what still needs to be done?

steveloughran · 2017-03-27T11:01:55Z

What an s3a globber can do is skip the recursive treewalk of pseudo-directories as it does the regexp. Instead it could ask for all children of a path, before doing the filtering. Look at HADOOP-13208 for the details.
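As a rough illustration of the "ask for all children, then filter" idea, here is a minimal in-memory sketch. It does not use the Hadoop or AWS APIs: the `FlatGlobSketch` class, the `listAllUnder` helper (standing in for a single flat, delimiter-free listing) and the `globToRegex` conversion are all hypothetical names introduced for this example, and only `*` and `?` are handled, unlike the real globber.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FlatGlobSketch {

  /** Convert a simple glob ('*' and '?' only) to a regex; wildcards never cross '/'. */
  static Pattern globToRegex(String glob) {
    StringBuilder sb = new StringBuilder();
    for (char c : glob.toCharArray()) {
      if (c == '*') {
        sb.append("[^/]*");
      } else if (c == '?') {
        sb.append("[^/]");
      } else {
        sb.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return Pattern.compile(sb.toString());
  }

  /** Index of the first '*' or '?' in the glob, or -1 if there is none. */
  static int indexOfWildcard(String glob) {
    for (int i = 0; i < glob.length(); i++) {
      char c = glob.charAt(i);
      if (c == '*' || c == '?') {
        return i;
      }
    }
    return -1;
  }

  /** Hypothetical stand-in for one flat listing of every key under a prefix. */
  static List<String> listAllUnder(List<String> store, String prefix) {
    List<String> result = new ArrayList<>();
    for (String key : store) {
      if (key.startsWith(prefix)) {
        result.add(key);
      }
    }
    return result;
  }

  /** List everything under the longest wildcard-free directory prefix, then filter client-side. */
  static List<String> globFlat(List<String> store, String glob) {
    int wildcard = indexOfWildcard(glob);
    String prefix = wildcard < 0
        ? glob
        : glob.substring(0, glob.lastIndexOf('/', wildcard) + 1);
    Pattern pattern = globToRegex(glob);
    List<String> matches = new ArrayList<>();
    for (String key : listAllUnder(store, prefix)) {
      if (pattern.matcher(key).matches()) {
        matches.add(key);
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    List<String> store = List.of(
        "data/2016/01/part-00.csv",
        "data/2017/01/part-00.csv",
        "data/2017/02/_SUCCESS",
        "data/2017/02/sub/part-00.csv");
    System.out.println(globFlat(store, "data/2017/*/part-0?.csv"));
  }
}
```

The point of the sketch is that no per-directory listing happens: the pseudo-directory structure is only ever examined by the regex, after the keys have been fetched.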

This patch is the preamble: a copy of the existing globber, and the initial performance test which would be the baseline for measuring speedups. Look at some of the other dir operations to see how we measure that: not in elapsed time, but in the number of HTTP requests made.

I should warn that this is not trivial to do well. I put the work aside once I had come up with some example directory layouts which would overload a naive "ask for all then filter" scheme; any deep and wide tree where the wildcard was near the top of the tree and very selective would end up asking for way too much data, and discarding it all. Listing is complex.
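To make the concern concrete, here is a small sketch that counts what a naive flat listing would fetch and throw away for a wide tree with a selective wildcard near the top. Everything in it is an assumption made for illustration: the `SelectiveGlobCost` class, the synthetic `root/d<i>-skip/f<j>.csv` layout, the pattern `root/*-keep/f0.csv`, and a page size of 1000 keys per list request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SelectiveGlobCost {

  /** Assumed maximum number of keys returned by one list request. */
  static final int PAGE_SIZE = 1000;

  /** Build a wide tree: root/d<i>-skip/f<j>.csv, except that d0 is named d0-keep. */
  static List<String> buildKeys(int dirs, int filesPerDir) {
    List<String> keys = new ArrayList<>();
    for (int i = 0; i < dirs; i++) {
      String dir = "root/d" + i + (i == 0 ? "-keep" : "-skip") + "/";
      for (int j = 0; j < filesPerDir; j++) {
        keys.add(dir + "f" + j + ".csv");
      }
    }
    return keys;
  }

  /** Returns {keysListed, listRequests, keysMatched} for "list all under root/, then filter". */
  static long[] flatCost(List<String> keys) {
    // Regex form of the glob root/*-keep/f0.csv
    Pattern pattern = Pattern.compile("root/[^/]*-keep/f0\\.csv");
    long listed = 0;
    long matched = 0;
    for (String key : keys) {
      if (key.startsWith("root/")) {
        listed++;
        if (pattern.matcher(key).matches()) {
          matched++;
        }
      }
    }
    long requests = (listed + PAGE_SIZE - 1) / PAGE_SIZE; // round up to whole pages
    return new long[] {listed, requests, matched};
  }

  public static void main(String[] args) {
    long[] cost = flatCost(buildKeys(100, 500));
    System.out.println("listed=" + cost[0] + " requests=" + cost[1]
        + " matched=" + cost[2] + " discarded=" + (cost[0] - cost[2]));
  }
}
```

With 100 directories of 500 files each, this prints `listed=50000 requests=50 matched=1 discarded=49999`. A treewalk would list `root/` once and descend only into the single matching directory, so in this layout the flat scheme is the more expensive one.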

For now we're doing s3guard, which switches to DynamoDB for consistency as well as performance. If you do want to improve client performance, this is somewhere where we would all benefit from you getting involved. I don't see any of us going near other bits of speedup until after s3guard is in, because everything will be stamping on the same lines of code and making merging a pain. And s3guard is probably going to obsolete a lot of the speedup proposed here on any bucket which has the DDB backing tables.

We might also find that on an s3guard-enabled system, the glob algorithm changes, so again, that'd be something we'd want to include in the work.

steveloughran · 2018-02-12T13:11:37Z

wontfix. S3guard is needed for consistency, and as it delivers the speedup we need at the same time, making traumatic changes to the core code is hard to justify right now.


steveloughran added 2 commits March 18, 2017 17:59

HADOOP-13371 S3A globber

e4c7c50

HADOOP-13371 sync up with latest branch-2

a99ffb8

This was referenced Mar 20, 2017

HADOOP-13371. S3A globber to use bulk listObject call over recursive directory scan #203

Closed

HADOOP-14235. S3A Path does not understand colon (:) when globbing #206

Closed

steveloughran closed this Feb 12, 2018
