An S3 backend. #23

Closed · wants to merge 10 commits into from
Conversation

@csm csm commented Sep 2, 2016

This is all pretty experimental, and I'm not sure I understand everything a backend needs to do. It's totally naive, but this might be the start of something useful. Kind of cool that it only took ~100 lines of code to do 8-)

(defrecord S3Backend [#_service bucket]
  core/IBackend
  (new-session [_]
    (atom {:writes 0 :deletes 0}))
  ;; ... remaining IBackend methods elided in this review hunk
  )
Collaborator

Does the s3 API tell you about how much money you're spending on I/O? This could be really interesting if we could track the # of bytes uploaded, for example :)

Author

The API doesn't track that; it's something you need to track manually.

However, all data transfer into S3 is free, and there's a $0.005 fee per 1,000 PUTs. It's also neat if you're fetching data from an EC2 instance in the same region, since that data transfer is also free; you just pay for requests.
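Manual tracking along those lines could hang off the per-session stats atom the diff already creates. A rough sketch, assuming the `record-write!` and `estimated-cost` names (both hypothetical) and hard-coding the request rate quoted above:

```clojure
;; Hypothetical sketch: extend the session's {:writes 0 :deletes 0} stats
;; to also count bytes, and estimate request cost from the write count.
(def ^:const put-cost-per-request (/ 0.005 1000)) ; $0.005 per 1,000 PUTs

(defn record-write!
  "Bump the session's write count and byte total after a successful PUT."
  [session nbytes]
  (swap! session #(-> %
                      (update :writes inc)
                      (update :bytes-written (fnil + 0) nbytes))))

(defn estimated-cost
  "Estimated request cost in dollars for this session so far.
  Data transfer into S3 is free, so only PUTs are charged here."
  [session]
  (* (:writes @session 0) put-cost-per-request))
```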

Collaborator

Interesting--does Amazonica batch writes? If not, the tracing GC algo would let you avoid needing any auxiliary objects, reducing the number of PUTs.

Author

It shouldn't batch writes -- it's a thin shim on the Java SDK, and that doesn't do any batching. And yeah, doing something more sensible for GC would eliminate additional requests.


I think confetti does some diffing for you, though; code here: https://github.com/confetti-clj/s3-deploy/blob/master/src/confetti/s3_deploy.clj. This is a great idea; keep it up!

@dgrnbrg
Collaborator

dgrnbrg commented Sep 2, 2016

This is awesome, thank you!

I see that you implemented bidirectional reference tracking--I am wondering whether a tracing GC might be more appropriate. To do this, we'd only need to store the unidirectional reference from index to child, as well as storing the set of anchored roots. Then, to do a GC, we'd start from each root, recursively storing all the IDs of nodes that are reachable from those roots. Finally, we would stream the list of all keys in the bucket, and delete anything we didn't store in the set we built up before.
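That mark-and-sweep could be sketched directly over the unidirectional references; here `child-ids`, `all-keys`, and `delete!` are hypothetical stand-ins for the backend's actual lookup, key-listing, and delete operations:

```clojure
;; Hypothetical sketch of the tracing GC described above: walk the
;; index->child references from the anchored roots, then delete every
;; bucket key that was never reached.
(defn reachable-ids
  "Walk from the root IDs, accumulating every ID reachable through
  index->child references."
  [child-ids roots]
  (loop [seen #{} frontier (set roots)]
    (if (empty? frontier)
      seen
      (let [id   (first frontier)
            kids (remove seen (child-ids id))]
        (recur (conj seen id)
               (into (disj frontier id) kids))))))

(defn sweep!
  "Stream all keys in the bucket and delete anything not reachable
  from the anchored roots."
  [child-ids roots all-keys delete!]
  (let [live (reachable-ids child-ids roots)]
    (doseq [k all-keys
            :when (not (live k))]
      (delete! k))))
```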

This could be really interesting to look at with B=1000+ and buffer-size=5000+--multi MB nodes could actually make this an interesting way to store indexed data w/o a huge overhead :)

@csm
Author

csm commented Sep 2, 2016

I was also considering doing some of the bookkeeping in (e.g.) DynamoDB instead, and not having empty marker objects, but just wanted to get something that sort-of works first. I'll think some more on a better approach.

I have a use case where I'm going to be storing potentially lots of data, which might not ever be queried, and queries wouldn't necessarily need to worry about inconsistent latencies. But, we'd want storage to be as cheap as we can get away with, which is why I'm considering S3. So yeah, multi-MB nodes would fit this fairly well :)

@dgrnbrg
Collaborator

dgrnbrg commented Sep 2, 2016

I just threw up a draft of a tracing GC at #24--I've hardly tested it at all, but at least it compiles. I think that with a bit of polish, it could handle space reclamation without needing to bring in Dynamo or use extra API calls.

The test builds two random B-trees, uses one of
them as a GC root, and includes the other in the lazy
sequence of "all keys" to be dealt with by the collector.
After running a GC, we assert that the `deleted-fn` has
been invoked against all the "dead" nodes.
@dgrnbrg
Collaborator

dgrnbrg commented Sep 6, 2016

Thanks to @cddr, #24 has now been tested and had some bugs ironed out (plus #25 massively reduced the I/O cost of the tracing GC).

Would you like to take a stab at integrating that into the S3 backend? It should let the S3 backend avoid any explicit bookkeeping, without too much I/O overhead. To do so, you should be able to merge or rebase #24 into this branch and then just call the tracing GC.
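Feeding the collector "all keys" could look something like the following; this is only a sketch, and the `:object-summaries` / `:truncated?` / `:next-marker` field names are assumptions about how Amazonica keywordizes the SDK's `ObjectListing` result:

```clojure
;; Hypothetical wiring: a lazy seq of every key in the bucket, paging
;; through Amazonica's list-objects, suitable as the GC's "all keys" input.
(require '[amazonica.aws.s3 :as s3])

(defn all-keys
  "Lazily page through the bucket, yielding every object key."
  ([bucket] (all-keys bucket nil))
  ([bucket marker]
   (lazy-seq
     (let [resp (s3/list-objects (cond-> {:bucket-name bucket}
                                   marker (assoc :marker marker)))
           ks   (map :key (:object-summaries resp))]
       (if (:truncated? resp)
         (concat ks (all-keys bucket (:next-marker resp)))
         ks)))))
```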

@dgrnbrg
Collaborator

dgrnbrg commented Sep 19, 2016

It looks like the test depends on a program called fakes3? You can probably add that to the .travis.yml file to get the CI build to pass.

@csm
Author

csm commented Sep 19, 2016

I did some digging into the Travis docs, and I'm not sure if fakes3 is available as an install (it's usually installed as a gem).

I can go ahead and give it a try, though :/

@dgrnbrg
Collaborator

dgrnbrg commented Sep 19, 2016

I think a gem install in the before_install of the .travis file should work
(sorry about formatting, I'm on mobile)
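Concretely, that suggestion might look like the fragment below; the fakes3 CLI flags, port, and `script` line are assumptions, not taken from this repo's config:

```yaml
# Sketch only: exact fakes3 flags/port and the test command are assumptions.
language: clojure
before_install:
  - gem install fakes3
  - fakes3 -r /tmp/fakes3 -p 4569 &
script:
  - lein test
```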

@csm csm closed this Nov 10, 2019