steveloughran
Contributor

This is what I'd done up to the point I stopped looking at it: it copies over the FileSystem.globber and then adds an initial scale test.

  1. I think we could do a great scale test against the landsat repo
  2. I chose not to try and tweak the FileSystem.globber class for subclassing. Working on a copy means we can do much more in here, and it sets things up for a globStatus call which would return a remote iterator rather than an array
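As a rough illustration of why an iterator-returning glob matters, here is a minimal sketch of a lazily filtered listing. Everything in it is hypothetical: `RemoteIter` is a local stand-in with the same shape as Hadoop's `RemoteIterator` (both methods may throw `IOException`), and each in-memory page stands in for one LIST response. It is not code from this patch.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

/** Hypothetical sketch: results are produced on demand instead of as one array. */
final class LazyGlobSketch {

  /** Local stand-in shaped like Hadoop's RemoteIterator; not the real interface. */
  interface RemoteIter<T> {
    boolean hasNext() throws IOException;
    T next() throws IOException;
  }

  /** Filters a paged source lazily: a page is fetched only when more results are needed. */
  static final class FilteringIter implements RemoteIter<String> {
    private final Iterator<List<String>> pages;
    private final Predicate<String> filter;
    private Iterator<String> current = Collections.emptyIterator();
    private String lookahead;
    int pagesFetched;  // how many simulated LIST responses have been consumed

    FilteringIter(List<List<String>> pages, Predicate<String> filter) {
      this.pages = pages.iterator();
      this.filter = filter;
    }

    @Override
    public boolean hasNext() throws IOException {
      while (lookahead == null) {
        if (current.hasNext()) {
          String candidate = current.next();
          if (filter.test(candidate)) {
            lookahead = candidate;
          }
        } else if (pages.hasNext()) {
          current = pages.next().iterator();  // stands in for one LIST request
          pagesFetched++;
        } else {
          return false;
        }
      }
      return true;
    }

    @Override
    public String next() throws IOException {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      String result = lookahead;
      lookahead = null;
      return result;
    }
  }
}
```

A caller that stops after the first few matches never triggers the later pages, which an array-returning call cannot avoid.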

@kazuyukitanimura

Hi @steveloughran
Thank you for sharing this S3A globber. I started reading the code, but at a high level, what has already been done, and what still needs to be done?

@steveloughran
Contributor Author

steveloughran commented Mar 27, 2017

What an s3a globber can do is skip the recursive treewalk of pseudo-directories as it does the regexp. Instead it could ask for all children of a path, before doing the filtering. Look at HADOOP-13208 for the details.
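To make the "ask for all children, then filter" idea concrete, here is a minimal sketch against an in-memory sorted key set rather than a real bucket. The class and method names are hypothetical, only `*` and `?` are handled, and it matches object keys only (the real glob also matches directories), so treat it as an outline of the approach and not as the HADOOP-13208 design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.regex.Pattern;

/** Hypothetical sketch: flat-list everything under the glob's fixed prefix, then filter. */
final class FlatListGlobSketch {

  /** Leading part of the glob up to the last '/' before any wildcard. */
  static String fixedPrefix(String glob) {
    int wild = 0;
    while (wild < glob.length() && glob.charAt(wild) != '*' && glob.charAt(wild) != '?') {
      wild++;
    }
    int lastSlash = glob.lastIndexOf('/', wild - 1);
    return glob.substring(0, lastSlash + 1);
  }

  /** Translates the glob to a regex; '*' and '?' never cross a '/'. */
  static Pattern toRegex(String glob) {
    StringBuilder sb = new StringBuilder();
    for (char c : glob.toCharArray()) {
      if (c == '*') {
        sb.append("[^/]*");
      } else if (c == '?') {
        sb.append("[^/]");
      } else {
        sb.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return Pattern.compile(sb.toString());
  }

  /** One flat scan of the keys sharing the prefix, filtered client-side. */
  static List<String> glob(NavigableSet<String> keys, String glob) {
    String prefix = fixedPrefix(glob);
    Pattern pattern = toRegex(glob);
    List<String> matches = new ArrayList<>();
    // In a sorted set, keys sharing a prefix are contiguous and start at the first key >= prefix.
    for (String key : keys.tailSet(prefix, true)) {
      if (!key.startsWith(prefix)) {
        break;
      }
      if (pattern.matcher(key).matches()) {
        matches.add(key);
      }
    }
    return matches;
  }
}
```

For a glob such as `data/*/2017/part-?.csv` the scan is limited to keys under `data/`, and no per-directory listing is issued for the intermediate pseudo-directories.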

This patch is the preamble: a copy of the existing globber, and the initial performance test which would be the baseline for measuring speedups. Look at some of the other dir operations to see how we measure that: the metric is not elapsed time, but the number of HTTP requests made.

I should warn that this is not trivial to do well. I put the work aside once I had thought of some example directory layouts which would overload a naive "ask for all then filter" scheme: any deep and wide tree where the wildcard was near the top of the tree and very selective would end up asking for way too much data, and discarding nearly all of it. Listing is complex.
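The trade-off can be put into numbers with a small cost model. This is a sketch under stated assumptions, not a measurement: the key layout is synthetic, the class is hypothetical, each simulated LIST response is assumed to carry at most 1000 entries, and the walk issues a listing for every glob component, which is a simplification of what the existing globber does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Hypothetical cost model: counts simulated LIST requests against an in-memory key set. */
final class GlobCostSketch {
  static final int PAGE_SIZE = 1000;  // assumed maximum entries per LIST response

  final TreeSet<String> keys = new TreeSet<>();
  long requests;  // simulated LIST requests issued so far
  long entries;   // entries returned across all responses so far

  private void charge(int returned) {
    requests += Math.max(1, (returned + PAGE_SIZE - 1) / PAGE_SIZE);
    entries += returned;
  }

  /** Flat listing: every key under the prefix, paged. */
  List<String> listFlat(String prefix) {
    List<String> out = new ArrayList<>();
    for (String key : keys.tailSet(prefix, true)) {
      if (!key.startsWith(prefix)) break;
      out.add(key);
    }
    charge(out.size());
    return out;
  }

  /** Delimiter listing: immediate children only; pseudo-directories end in '/'. */
  List<String> listDir(String dir) {
    TreeSet<String> children = new TreeSet<>();
    for (String key : keys.tailSet(dir, true)) {
      if (!key.startsWith(dir)) break;
      int slash = key.indexOf('/', dir.length());
      children.add(slash < 0 ? key : key.substring(0, slash + 1));
    }
    charge(children.size());
    return new ArrayList<>(children);
  }

  /** Only '*' and '?' are handled; neither crosses a '/'. */
  static Pattern toRegex(String glob) {
    StringBuilder sb = new StringBuilder();
    for (char c : glob.toCharArray()) {
      if (c == '*') sb.append("[^/]*");
      else if (c == '?') sb.append("[^/]");
      else sb.append(Pattern.quote(String.valueOf(c)));
    }
    return Pattern.compile(sb.toString());
  }

  /** Level-by-level walk: one listDir per surviving directory per glob component. */
  List<String> treewalk(String glob) {
    String[] parts = glob.split("/");
    List<String> dirs = new ArrayList<>();
    dirs.add("");
    for (int i = 0; i < parts.length; i++) {
      boolean last = (i == parts.length - 1);
      Pattern pattern = toRegex(parts[i]);
      List<String> next = new ArrayList<>();
      for (String dir : dirs) {
        for (String child : listDir(dir)) {
          boolean isDir = child.endsWith("/");
          String name = child.substring(dir.length(), child.length() - (isDir ? 1 : 0));
          // Simplification: intermediate components match directories, the last matches files.
          if (isDir != last && pattern.matcher(name).matches()) next.add(child);
        }
      }
      dirs = next;
    }
    return dirs;
  }

  /** Ask for everything under the wildcard-free prefix, then filter client-side. */
  List<String> flat(String glob) {
    int wild = 0;
    while (wild < glob.length() && glob.charAt(wild) != '*' && glob.charAt(wild) != '?') wild++;
    String prefix = glob.substring(0, glob.lastIndexOf('/', wild - 1) + 1);
    Pattern pattern = toRegex(glob);
    List<String> out = new ArrayList<>();
    for (String key : listFlat(prefix)) {
      if (pattern.matcher(key).matches()) out.add(key);
    }
    return out;
  }

  /** Synthetic layout: tops x parts x files keys named tNN/pNN/fNNN. */
  static GlobCostSketch build(int tops, int parts, int files) {
    GlobCostSketch store = new GlobCostSketch();
    for (int t = 0; t < tops; t++)
      for (int p = 0; p < parts; p++)
        for (int f = 0; f < files; f++)
          store.keys.add(String.format(Locale.ROOT, "t%02d/p%02d/f%03d", t, p, f));
    return store;
  }

  public static void main(String[] args) {
    GlobCostSketch walk = build(20, 50, 100);
    walk.treewalk("*7/p00/f000");
    GlobCostSketch all = build(20, 50, 100);
    all.flat("*7/p00/f000");
    System.out.println("treewalk: " + walk.requests + " requests, " + walk.entries + " entries");
    System.out.println("flat:     " + all.requests + " requests, " + all.entries + " entries");
  }
}
```

On this 100,000-key layout the selective top-level glob `*7/p00/f000` costs the walk 5 simulated requests and 320 entries, while the flat listing pages through all 100,000 keys in 100 requests to keep 2 of them. Move the wildcard down, as in `t07/*/f000`, and the picture reverses: 52 requests for the walk against 5 for the flat listing. Any real scheme would need some way of choosing between the two.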

For now we're doing s3guard, which switches to DynamoDB for consistency as well as performance. If you do want to improve client performance, this is somewhere where we would all benefit from you getting involved. I don't see any of us going near other bits of speedup until after s3guard is in, because everything will be stamping on the same lines of code and making merging a pain. And s3guard is probably going to obsolete a lot of the speedup proposed here on any bucket which has the DDB backing tables.

We might also find that on an s3guard-enabled system the glob algorithm changes, so again, that'd be something we'd want to include in the work.

@steveloughran
Contributor Author

wontfix. S3guard is needed for consistency, and since it delivers the speedup we need at the same time, making traumatic changes to the core code is hard to justify right now.

shanthoosh pushed a commit to shanthoosh/hadoop that referenced this pull request Oct 15, 2019
Author: Jacob Maes <[email protected]>

Reviewers: Yi Pan (Data Infrastructure) <[email protected]>, Boris Shkolnik <[email protected]>

Closes apache#204 from jmakes/samza-1236-tutorial-1