
Assess integrating ownership #77

Open
@frankmcsherry


Timely dataflow exclusively moves owned data around. Although operators are defined on &Stream references, the closures and such they take as arguments can rely on getting buffers of owned data. The map operator will, no matter what, be handed owned data, and if this means that the contents of a stream need to be cloned, timely will do that.
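Concretely, a minimal example of the current behavior (using timely's to_stream, map, and inspect operators; details may differ across versions). The second consumer of the stream obliges timely to clone the buffer contents:

```rust
extern crate timely;

use timely::dataflow::operators::{ToStream, Map, Inspect};

fn main() {
    timely::example(|scope| {
        // A stream of owned Strings.
        let stream = (0..10).to_stream(scope)
                            .map(|x| format!("record {}", x));

        // Both consumers' closures receive owned Strings; to satisfy the
        // second subscription, timely clones the stream's contents.
        stream.map(|text| text.len())
              .inspect(|len| println!("length: {}", len));
        stream.map(|text| text.to_uppercase())
              .inspect(|text| println!("upper: {}", text));
    });
}
```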

This seems a bit aggressive in cases where you want to determine the length of some strings, or where you have a stream of large structs (e.g., 75-field DB records) and various consumers want to pick out the few fields that interest them.

It seems possible (though not obviously a good idea) that this ownership could be promoted, so that timely dataflow programmers can control cloning and such by having:

  1. Owned streams, Stream<Data>, with methods that consume the stream and provide owned elements of type Data.

  2. Reference streams, &Stream<Data>, whose methods provide access only to &Data elements, which the observer could then immediately clone if they want to reconstruct the old behavior. Maybe a .cloned() method, analogous to Iterator's .cloned() method? (A rough sketch follows the list.)
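Neither flavor exists today; purely as a hypothetical sketch (all names and signatures invented for illustration), the split might look like:

```rust
use std::marker::PhantomData;

// Hypothetical: not timely's actual Stream type.
pub struct Stream<Data> { _marker: PhantomData<Data> }

impl<Data: Clone> Stream<Data> {
    /// 1. Consuming the stream (self by value) entitles the caller to owned elements.
    pub fn map<D2, L: FnMut(Data) -> D2 + 'static>(self, _logic: L) -> Stream<D2> {
        unimplemented!()
    }

    /// 2. Borrowing the stream (&self) provides access only to references.
    pub fn map_ref<D2, L: FnMut(&Data) -> D2 + 'static>(&self, _logic: L) -> Stream<D2> {
        unimplemented!()
    }

    /// Analogue of Iterator::cloned: clone each element to recover owned behavior.
    pub fn cloned(&self) -> Stream<Data> {
        self.map_ref(|x| x.clone())
    }
}
```

One appeal of this split is that the cost of copying becomes visible at the call site: a .cloned() in the program text marks exactly where the old behavior is being reconstructed.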

I think this sounds good in principle, but it involves a lot of careful detail in the innards of timely. Disruption is fine if it fixes or improves things, but there are several consequences of this sort of change (sketches for several of the items below follow the list). For example,

  1. Right now Exchange pacts require shuffling the data, and this essentially requires owned data when we want to shuffle to other threads within the same process. It does not require it for single-threaded execution (because the exchange is a no-op), nor for inter-process exchange (because we serialize the data, which we can do from a reference).

  2. Similarly, exchanged data emerges as "unowned" references when it comes out of Abomonation. Most operators would still meet this with a clone, though for large value types the compiler can in principle optimize the clone away. To determine the length of a String, the code would almost certainly still clone the string before checking its length (it does this already, so this isn't a defect so much as a missed opportunity).

  3. Anything we do with references needs to happen as the data are produced at their source, or else involve Rc<_> types wrapping buffers, which prevents the transfer of owned data. This is because once we enqueue buffers for other operators, we shift the flow of control upward and lose track of whether the references remain valid. This could be fine, because we can register actions the way we currently register "listeners" with the Tee pusher, but it would need some thought.

  4. Operators like concat currently just forward batches of owned data. If instead they wanted to move references around, it seems they would need to act as a proxy for any interested listeners, forwarding those listeners' interests upstream to each of their sources (requiring that whatever closure handles the data be cloneable, or something similar, if we want this behavior).

  5. Such a "register listeners" approach could work very well for operators like map, filter, flat_map, and maybe others, whose behavior would be to wrap a supplied closure with another closure, so that a sequence of such operators turns into one (complicated) closure that the compiler could at least work to optimize down.
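On item 1, a std-library-only sketch (mpsc stands in for the intra-process exchange channel; no timely types are involved): moving data to another thread transfers ownership, while serialization reads through a reference and leaves ownership in place.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // Intra-process exchange: the String itself crosses threads, so the
    // sender must own it (or clone it to retain a copy).
    let (tx, rx) = mpsc::channel::<String>();
    tx.send(String::from("payload")).unwrap();
    thread::spawn(move || {
        assert_eq!(rx.recv().unwrap(), "payload");
    }).join().unwrap();

    // Inter-process exchange: serializing to bytes needs only a reference,
    // so the record can stay where it is.
    let record = String::from("payload");
    let bytes: &[u8] = record.as_bytes();
    assert_eq!(bytes.len(), 7);
}
```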
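On item 2, a sketch against the abomonation crate's encode and decode (assuming its current signatures, in which decode hands back a reference into the byte buffer rather than an owned value):

```rust
extern crate abomonation;

use abomonation::{encode, decode};

fn main() {
    let record = ("hello".to_string(), 42u64);

    // Encoding reads through a reference; no ownership is required.
    let mut bytes = Vec::new();
    unsafe { encode(&record, &mut bytes).unwrap(); }

    // Decoding yields &(String, u64), not an owned value: the "unowned"
    // references the item describes. An operator wanting an owned String
    // must clone it, even if it only needs the length.
    if let Some((result, _remaining)) = unsafe { decode::<(String, u64)>(&mut bytes) } {
        let owned: String = result.0.clone(); // the clone the item laments
        assert_eq!(owned.len(), 5);
    }
}
```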
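On item 3, a sketch of the Rc<_> situation: once a buffer is shared among listeners, reads through references are easy, but no listener can extract owned data while another handle is alive.

```rust
use std::rc::Rc;

fn main() {
    let buffer: Rc<Vec<String>> = Rc::new(vec!["a".to_string(), "b".to_string()]);

    // Two listeners hold handles to the same buffer and read by reference.
    let listener1 = Rc::clone(&buffer);
    let listener2 = Rc::clone(&buffer);
    assert_eq!(listener1.len(), 2);

    // Recovering owned data fails while other handles exist, which is
    // exactly the "preventing the transfer of owned data" above.
    assert!(Rc::try_unwrap(listener2).is_err());
}
```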
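And on item 5, a toy illustration (no timely types; everything invented) of how a chain of map and filter stages could collapse into a single closure driven at the source:

```rust
fn main() {
    // The downstream "listener" ultimately interested in the data.
    let sink = |len: usize| println!("length: {}", len);

    // A map(|s| s.len()) stage wraps the listener...
    let mapped = move |s: &str| sink(s.len());

    // ...and a filter(|s| !s.is_empty()) stage wraps that.
    let composed = move |s: &str| if !s.is_empty() { mapped(s) };

    // The source drives one (inlinable) closure per record: no intermediate
    // buffers, queues, or clones.
    for &record in ["", "hello", "timely"].iter() {
        composed(record);
    }
}
```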

Mostly, shifting around where computation happens, and turning parts of the timely dataflow graph into fake elements that are actually stacks of closures, is a bit of a disruption for timely dataflow. But it could be that doing it sanely ends up with a system where we do less copying, more per-batch work with data in the cache, eager filtering and projection, and lots of other good stuff.
