An S3 backend. #23
Conversation
(defrecord S3Backend [#_service bucket]
  core/IBackend
  (new-session [_]
    (atom {:writes 0 :deletes 0}))
Does the s3 API tell you about how much money you're spending on I/O? This could be really interesting if we could track the # of bytes uploaded, for example :)
The API doesn't track that; it's something you need to track manually.
However, all data transfer into S3 is free, and there's a $0.005 fee per 1,000 PUTs. It's also neat if you're fetching data from an EC2 instance in the same region: the data transfer is free there too, so you just pay for requests.
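To put that in perspective (using the $0.005 per 1,000 PUTs figure above; exact pricing varies by region), writing a million objects would run about 1,000,000 / 1,000 × $0.005 = $5 in PUT requests alone, which is why cutting down extra writes per node matters.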
Interesting--does Amazonica batch writes? If not, the tracing GC algo would allow you to avoid needing any auxiliary objects, to reduce the number of puts.
It shouldn't batch writes -- it's a thin shim on the Java SDK, and that doesn't do any batching. And yeah, doing something more sensible for GC would eliminate additional requests.
I think confetti does some diffing for you though, code here: https://github.com/confetti-clj/s3-deploy/blob/master/src/confetti/s3_deploy.clj. This is a great idea, keep it up!
This is awesome, thank you! I see that you implemented bidirectional reference tracking--I am wondering whether a tracing GC might be more appropriate.

To do this, we'd only need to store the unidirectional reference from index to child, plus the set of anchored roots. Then, to do a GC, we'd start from each root, recursively collecting the IDs of all nodes reachable from those roots. Finally, we would stream the list of all keys in the bucket, and delete anything that isn't in the set we built up.

This could be really interesting to look at with B=1000+ and buffer-size=5000+: multi-MB nodes could actually make this an interesting way to store indexed data without a huge overhead :)
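To make that concrete, here's a rough Clojure sketch of the mark-and-sweep idea (not code from this PR; `child-ids`, `list-all-keys`, and `delete-key!` are hypothetical helpers standing in for whatever the backend actually exposes):

```clojure
;; Hypothetical helpers assumed here (not part of this PR):
;;   child-ids     - node id -> seq of child node ids (index -> child refs)
;;   list-all-keys - returns a (possibly lazy) seq of every key in the bucket
;;   delete-key!   - deletes one key from the bucket

(defn reachable-ids
  "Mark phase: walk down from the anchored roots, collecting every
  node id reachable from them."
  [child-ids roots]
  (loop [frontier (vec roots)
         live     #{}]
    (if-let [id (peek frontier)]
      (if (contains? live id)
        (recur (pop frontier) live)
        (recur (into (pop frontier) (child-ids id))
               (conj live id)))
      live)))

(defn sweep!
  "Sweep phase: stream every key in the bucket and delete anything
  that wasn't marked live."
  [list-all-keys delete-key! live]
  (doseq [k (list-all-keys)
          :when (not (contains? live k))]
    (delete-key! k)))

(defn gc!
  "Mark, then sweep."
  [child-ids list-all-keys delete-key! roots]
  (sweep! list-all-keys delete-key!
          (reachable-ids child-ids roots)))
```

The nice property is that it needs no auxiliary objects at all: just the index-to-child references the nodes already carry, plus the set of anchored roots.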
I was also considering doing some of the bookkeeping in (e.g.) DynamoDB instead, and not having empty marker objects, but I just wanted to get something that sort of works first. I'll think some more on a better approach. I have a use case where I'll be storing potentially lots of data that might never be queried, and queries wouldn't necessarily need to worry about inconsistent latencies. But we'd want storage to be as cheap as we can get away with, which is why I'm considering S3. So yeah, multi-MB nodes would fit this fairly well :)
I just threw up a draft of a tracing GC at #24--I've hardly tested it at all, but at least it compiles. I think that with a bit of polish, it could handle space reclamation without needing to bring in Dynamo or use extra API calls.
The test builds two random b-trees, uses one of them as a gc root, and includes the other in the lazy sequence of "all keys" to be handled by the collector. After running a gc, we assert that the `deleted-fn` has been invoked against all the "dead" nodes.
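Roughly, that test has this shape (a sketch only; `build-random-tree`, `all-node-ids`, and `run-gc!` are stand-in names, not the actual functions in #24):

```clojure
;; Sketch of the test shape described above; build-random-tree,
;; all-node-ids, and run-gc! are hypothetical stand-ins for the real
;; helpers in #24. Assumes (:require [clojure.test :refer [deftest is]]).

(deftest tracing-gc-deletes-only-unreachable-nodes
  (let [live-tree  (build-random-tree)   ; anchored as the gc root
        dead-tree  (build-random-tree)   ; only present in "all keys"
        deleted    (atom #{})
        deleted-fn (fn [k] (swap! deleted conj k))]
    (run-gc! {:roots      [live-tree]
              :all-keys   (concat (all-node-ids live-tree)
                                  (all-node-ids dead-tree))
              :deleted-fn deleted-fn})
    ;; every node of the unanchored tree, and nothing else, was deleted
    (is (= (set (all-node-ids dead-tree)) @deleted))))
```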
Thanks to @cddr, #24 has now been tested and had some bugs ironed out (plus #25 massively reduced the I/O cost of the tracing GC). Would you like to take a stab at integrating that into the S3 backend? It should allow the S3 backend to avoid any explicit bookkeeping, and to do so without too much I/O overhead. To do so, you should be able to merge or rebase #24 into this branch, and then just call the tracing GC.
# Conflicts:
#	project.clj
This pulls in datacrypt-project#24, since I'd like to test out the GC too.
It looks like the test depends on a program called
I did some digging into the Travis docs, and I'm not sure if I can go ahead and give it a try, though :/
I think a `gem install` in the `before_install` section of the `.travis.yml` file should work.
I came up with a better approach, https://github.com/csm/konserve-ddb-s3, which builds off the work of https://github.com/replikativ/konserve, https://github.com/replikativ/hitchhiker-tree, and https://github.com/replikativ/datahike. Thanks so much for this code!
This is all pretty experimental, and I'm not sure I understand everything a backend needs to do. It's totally naive, but this might be the start of something useful. Kind of cool that it only took ~100 lines of code to do 8-)