Description
Timely dataflow exclusively moves owned data around. Although operators are defined on `&Stream` references, the closures and such they take as arguments can rely on getting buffers of owned data. The `map` operator will, no matter what, be handed owned data, and if this means that the contents of a stream need to be cloned, timely will do that.

This seems a bit aggressive in cases where you want to determine the length of some strings, or where you have a stream of large structs (e.g., 75-field DB records) and various consumers want to pick out a few fields that interest them.
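As a minimal sketch of the current behavior (using timely's `example` harness; exact operator signatures may differ across versions), note that `map`'s closure receives an owned `String` even when a borrow would suffice:

```rust
extern crate timely;

use timely::dataflow::operators::{ToStream, Map, Inspect};

fn main() {
    timely::example(|scope| {
        // `map` hands the closure an owned String, so computing a length
        // takes ownership of (and discards) the whole string, even though
        // a `&str` would do.
        vec!["hello".to_owned(), "world".to_owned()]
            .to_stream(scope)
            .map(|text: String| text.len())
            .inspect(|len| println!("length: {}", len));
    });
}
```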
It seems possible (though not obviously a good idea) that this ownership could be promoted so that timely dataflow programmers can control cloning and such by having

- Owned streams, `Stream<Data>`, with methods that consume the stream and provide owned elements of type `Data`.
- Reference streams, `&Stream<Data>`, whose methods provide only access to `&Data` elements, which the observer could then immediately clone if they want to reconstruct the old behavior. Maybe a `.cloned()` method analogous to `Iterator`'s `.cloned()` method?
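A hypothetical sketch of what that split could look like (invented signatures, not timely's actual API; `map_ref` is a made-up name): the owned variant takes `self`, the borrowing variant takes `&self`, and `cloned` recovers today's behavior:

```rust
use std::marker::PhantomData;

/// Hypothetical stream handle; `PhantomData` stands in for the real graph plumbing.
pub struct Stream<D> { _marker: PhantomData<D> }

impl<D: Clone + 'static> Stream<D> {
    /// Consuming variant: takes `self`, and the closure receives owned `D`.
    pub fn map<O>(self, _logic: impl FnMut(D) -> O + 'static) -> Stream<O> {
        Stream { _marker: PhantomData }
    }

    /// Borrowing variant: takes `&self`, and the closure only ever sees `&D`.
    pub fn map_ref<O>(&self, _logic: impl FnMut(&D) -> O + 'static) -> Stream<O> {
        Stream { _marker: PhantomData }
    }

    /// Analogue of `Iterator::cloned`: recover owned elements (today's behavior).
    pub fn cloned(&self) -> Stream<D> {
        self.map_ref(Clone::clone)
    }
}
```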
I think this sounds good in principle, but it involves a lot of careful detail in the innards of timely. Disruption is fine if it fixes or improves things, but there are several consequences of this sort of change. For example,
- Right now `Exchange` pacts require shuffling the data, and this essentially requires owned data when we want to shuffle to other threads in process. It does not require it for single-threaded execution (because the exchange is a no-op) nor for inter-process exchange (because we serialize the data, which we can do from a reference).
- Similarly, exchanged data emerge as "unowned" references when they come out of Abomonation. Most operators will still hit this with a `clone`, though for large value types the compiler can in principle optimize this down. To determine the length of a `String`, the code would almost certainly still clone the string to check its length (it does this already, so this isn't a defect so much as a missed opportunity).
- Anything we do with references needs to happen as the data are produced at their source, or involve `Rc<_>` types wrapping buffers and preventing the transfer of owned data (see the `Rc` sketch after this list). This is because once we enqueue buffers for other operators, we shift the flow of control upward and lose an understanding of whether the references remain valid. This could be fine, because we can register actions the way we currently register "listeners" with the `Tee` pusher, but it would need some thought.
- Operators like `concat` currently just forward batches of owned data. If instead they wanted to move references around, it seems like they would need to act as a proxy for any interested listeners, forwarding their interests upstream to each of their sources (requiring that whatever closure is used to handle the data be cloneable or something, if we want this behavior).
- Such a "register listeners" approach could work very well for operators like `map`, `filter`, `flat_map`, and maybe others, where their behavior would be to wrap a supplied closure with another closure, so that a sequence of such operators turns into a (complicated) closure that the compiler could at least work to optimize down; a fused-closure sketch follows this list.
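On the `Rc<_>` point above, a minimal sketch in plain Rust (no timely types) of what refcounted buffers buy and cost: consumers share `&Data` access cheaply, but nobody can take the buffer as owned data while other handles survive:

```rust
use std::rc::Rc;

fn main() {
    // One owned batch, produced once at the source.
    let batch: Rc<Vec<String>> = Rc::new(vec!["grace".into(), "hopper".into()]);

    // Two "consumers" share the batch; no string is cloned.
    let lengths_view = Rc::clone(&batch);
    let print_view = Rc::clone(&batch);

    let lengths: Vec<usize> = lengths_view.iter().map(|s| s.len()).collect();
    println!("{:?}", lengths);
    for s in print_view.iter() {
        println!("{}", s);
    }

    // The cost: ownership cannot be transferred onward while handles survive.
    // `Rc::try_unwrap` only succeeds once `lengths_view` and `print_view` drop.
    assert!(Rc::try_unwrap(batch).is_err());
}
```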
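And for the closure-wrapping idea in the last bullet, a hypothetical sketch (`Listener`, `map_listener`, and `filter_listener` are invented names, not timely's API): each operator wraps the downstream listener, so a `filter`/`map`/`inspect` chain fuses into one closure that a source drives with borrowed data:

```rust
// A listener consumes borrowed elements of type D.
type Listener<D> = Box<dyn FnMut(&D)>;

// Turn a listener on O into a listener on D by applying `logic` first.
fn map_listener<D: 'static, O: 'static>(
    logic: impl Fn(&D) -> O + 'static,
    mut downstream: Listener<O>,
) -> Listener<D> {
    Box::new(move |d: &D| downstream(&logic(d)))
}

// Only forward elements satisfying the predicate.
fn filter_listener<D: 'static>(
    predicate: impl Fn(&D) -> bool + 'static,
    mut downstream: Listener<D>,
) -> Listener<D> {
    Box::new(move |d: &D| if predicate(d) { downstream(d) })
}

fn main() {
    // Build the fused pipeline from the sink back toward the source:
    // filter(non-empty) -> map(String -> length) -> inspect(print).
    let inspect: Listener<usize> = Box::new(|len| println!("length: {}", len));
    let mapped = map_listener(|s: &String| s.len(), inspect);
    let mut fused = filter_listener(|s: &String| !s.is_empty(), mapped);

    // The "source" drives the fused closure with borrowed data; nothing clones.
    for s in &["hello".to_owned(), String::new(), "world".to_owned()] {
        fused(s);
    }
}
```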
Mostly, shifting around where computation happens, and whether parts of the timely dataflow graph are just fake elements that are actually a stack of closures, is a bit of a disruption for timely dataflow. But it could be that doing it sanely ends up with a system where we do less copying, more per-batch work with data in the cache, eager filtering and projection, and lots of other good stuff.