data: add DataProvider gRPC service spec #4314
Summary: This patch adds a gRPC interface that mirrors the `DataProvider` Python interface. The intent is to define a `GrpcDataProvider` Python type that is backed by a gRPC service.

The particular service definition here is what I’ve used in drafts of a data provider implementation [in Go][go] and [in Rust][rust], so I can attest that it’s at least mostly reasonable on both the client and the server. It’s mostly a one-to-one port of the Python interface and admits an easy Python client.

[go]: https://github.com/wchargin/tensorboard-data-server
[rust]: https://github.com/wchargin/rustboard-server

Test Plan: It builds: `bazel build //tensorboard/data/proto:protos_all_py_pb2_grpc`.

wchargin-branch: data-provider-proto
wchargin-source: 5fc30be05498a9cd1ff6beeb18ddaea33a92a74d
I’ve included proto support for all three data classes here, but I’m
Looking pretty good, just have some fairly low-level questions.
```proto
string name = 2;
// Wall time of earliest recorded event, as floating-point seconds since
// epoch (same as event file format).
double start_time = 3;
```
We used `google.protobuf.Timestamp` for this for TB.dev - maybe a bit nicer? I guess we might not want that for the actual `wall_time` values where it adds conversion overhead, and I could see an argument for consistency in timestamp treatment throughout and/or with `DataProvider`, but where it's not particularly performance or overhead sensitive it does seem nicer to use a proper type.
It’s indeed much slower for individual wall times:

Empirically, I tested this change on a log directory with one tag and 1000 runs, each time series with 2000 steps (downsampled to 1000). For a `ReadScalars` RPC to read the entire dataset:

- with `Timestamp`s:
  - `grpc_cli` request time: 602.0 ms ± 46.3 ms
  - response size: 23719710 bytes
- with `double`s:
  - `grpc_cli` request time: 273.4 ms ± 39.6 ms
  - response size: 17992890 bytes

The request time is faster by a factor of 2.2, and the response payload is 24% smaller. Python request deserialization times are on par with `grpc_cli` request times.
wchargin/tensorboard-data-server-go@ff0a517
We use `float64` everywhere: the wire format source of truth, the data provider interface, in RustBoard memory, on the frontend. A `Timestamp` also has greater precision than `float64`, so embedding a float wall time suggests precision that is not there. Technically, it’s also a lossy conversion, though for times after 1970-02-21 this won’t matter.

So while I certainly appreciate `Timestamp` on its merits, and later in RustBoard we’ll use `std::time::Instant` for some internal bookkeeping, I’m pretty disinclined to introduce `Timestamp`s here just because.
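To make the precision point concrete, here is a small pure-Python sketch (hypothetical helper names, not code from this patch) showing that a `Timestamp`-style `(seconds, nanos)` pair carries more resolution than a `float64` wall time can represent for present-day epochs:

```python
def timestamp_to_float(seconds, nanos):
    # Collapse a Timestamp-style (seconds, nanos) pair into float64
    # seconds since epoch, the representation used on the wire here.
    return seconds + nanos / 1e9

def float_to_timestamp(t):
    # The reverse conversion a server would need if the wire type were
    # Timestamp; any rounding happens at this boundary.
    seconds = int(t)
    nanos = round((t - seconds) * 1e9)
    return seconds, nanos

# Around 2020 (~1.6e9 seconds), float64 resolution is roughly 2**-22 s
# (~240 ns), so single-nanosecond differences vanish in the conversion:
a = timestamp_to_float(1_600_000_000, 0)
b = timestamp_to_float(1_600_000_000, 1)
assert a == b
```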
Ok, I can believe it makes a substantial performance difference for the datum-level wall times; I had already suggested as much in my original comment. The `ListRunsResponse` results have far fewer timestamps, but it's fine to leave it as-is here too, since that's your preference.

I mentioned it just since in the grand scheme of things, when they're not a significant performance cost, I think having timestamp types is preferable to not having them. Every `double` timestamp we put in a protobuf makes it take more effort for calling code to use those types, so I'd rather that we at least weigh the tradeoffs when it comes up versus just making floating point timestamps our go-to everywhere.
> Ok, I can believe it makes a substantial performance difference for the datum-level wall times; I had already suggested as much in my original comment.

yep :-) (reading my own comment, I guess it wasn’t explicit, but I was replying to the suggestion to use them here specifically, and understand that you weren’t suggesting that we use them everywhere)

> I mentioned it just since in the grand scheme of things, when they're not a significant performance cost, I think having timestamp types is preferable to not having them. Every `double` timestamp we put in a protobuf makes it take more effort for calling code to use those types,

I buy this general argument, but when I switched from `Timestamp`s to `double`s, I actually found that it simplified both the client and the server code, dropping representation-converting helpers on both sides.

So I think we can probably agree on the following tradeoffs, from the perspective of using `Timestamp`s:

1. Pro: More semantically appropriate type, improving readability and self-documentation.
2. Pro: More consistent with TensorBoard.dev messages.
3. Neutral: For `ListRuns`, no significant performance impact.
4. Con: Less consistent with the data provider API, increasing client complexity.
5. Con: Less consistent with other messages in the same file, where using packed primitives is a significant performance boost.
6. Con: Not actually the native representation, increasing server complexity and potentially either suggesting higher-than-actual precision or (theoretically, not realistically) representing lossily.
   - Somewhat undermines (1), since the type is more semantically appropriate from first principles but not actually a faithful representation.

…and while as resident powerful-type-systems evangelist I’m certainly sympathetic to (1), I’m finding myself most swayed by (5) and by the sub-point of (6).

Leaving it as is.
```proto
message ListScalarsRequest {
  // ID of experiment in which to query data.
  string experiment_id = 1;
  // Required filter for plugin name. If omitted, an empty message is implied.
```
This comment confuses me a bit. If it's required, how can it be omitted without resulting in an error? Or at least, what is it that makes this "Required" while RunTagFilter is "Optional"?
Fair. What I meant is, “your results will always be filtered by plugin name, so if you omit this then you’re filtering on `plugin_name == ""`; contrariwise, if you omit `RunTagFilter`, we won’t filter on runs or tags”. I can clarify.
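As a toy illustration of those semantics (hypothetical data model, not the real service implementation), filtering always happens on plugin name, so an unset `plugin_name` matches only the proto3 default empty string:

```python
def list_scalars(time_series, plugin_name=""):
    # Results are always filtered by plugin name; leaving plugin_name
    # unset (proto3 default "") filters on the empty string rather than
    # skipping the filter.
    return [ts for ts in time_series if ts["plugin_name"] == plugin_name]

series = [
    {"tag": "loss", "plugin_name": "scalars"},
    {"tag": "input", "plugin_name": "images"},
]
assert list_scalars(series, plugin_name="scalars") == [series[0]]
assert list_scalars(series) == []  # omitted filter matches "" only
```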
Okay… though given that the behavior if omitted is pretty much never going to do what the caller intends, maybe it's better to stick with calling it "required" and just return an invalid request error if `plugin_name` is the empty string? (I'd tend to argue the same thing for `Downsample` as well.)
Agh… I’m so torn. I agree, and have been burned by this myself when using `grpc_cli` to send requests (to TensorBoard.dev, which has the same semantics) without specifying `downsample` and wondering why the response was empty. But I’ve also been burned so many times by systems that treat zero as not a number when there’s clearly a continuous interpretation of the specification.

I’m happy to make `plugin_name` an error, since requiring that plugin names not be empty seems perfectly reasonable. We already require that dashboard names not be empty, which is related, though distinct. But empty and non-empty strings differ in relevant ways (e.g., disappearing from path components, not making inductive progress) that zero and positive integers don’t. And it’s so easy to imagine user flows that do want to set `downsample { num_points: 0 }`:

> **TensorBoard Export-Your-Data™ Wizard**
>
> How many points do you want from each of these time series?
>
> - accuracy: 1000
> - gan_output/image/0: 10
> - layer0/weights: 0
>
> [Download Archive] → 💥 kaboom!

I’ll break the tie by deferring to your intuition and making both cases an error, but I would be happy to change `downsample` back to continuous if we change our minds.
Actually I wasn't suggesting that we make `downsample { num_points: 0 }` an error; I was suggesting only that we make the absence of `downsample` entirely an error (sorry if this was ambiguous). So code that populates the `Downsample` message has no discontinuity at zero; it's only code that doesn't set `Downsample` at all. That takes care of the `grpc_cli` case.
Ah, got it. Yeah, that sounds pretty reasonable. I generally like the semantics of “omitted messages are implied to be empty”, but I’m happy to deviate from it where required or sensible (e.g., `RunTagFilter`), and this seems like such an occasion. Sent you #4333.
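The semantics settled on here can be sketched as follows (illustrative pure Python, with a naive strided downsampler standing in for the real one): omitting `downsample` entirely is an invalid request, but `num_points == 0` stays legal and just yields an empty result, so there is no discontinuity at zero for callers that do set it.

```python
def read_scalars(series, downsample=None):
    # Absence of `downsample` is an error; num_points == 0 is fine.
    if downsample is None:
        raise ValueError("downsample.num_points must be set")
    n = downsample["num_points"]
    if n == 0:
        return []
    if n >= len(series):
        return list(series)
    # Naive strided downsampling, purely for illustration.
    stride = len(series) / n
    return [series[int(i * stride)] for i in range(n)]

points = list(range(10))
assert read_scalars(points, {"num_points": 0}) == []
assert read_scalars(points, {"num_points": 5}) == [0, 2, 4, 6, 8]
```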
```proto
// A column-major sequence of scalar points. Arrays `step`, `wall_time`, and
// `value` have the same lengths.
message ScalarData {
  repeated int64 step = 1 [packed = true];
```
In proto3 `packed=true` is on by default, so this isn't really needed: https://developers.google.com/protocol-buffers/docs/encoding#packed
Oh! TIL; thanks.
I kind of like having it anyway for readability. I can remove it if
you’d prefer. What do you think?
I'm moderately opposed to leaving it in, since I feel like it'll create confusion regarding its effect (namely, none, for proto3 files, and all of our proto files in OSS are proto3). If some repeated varint fields are labeled and some aren't, it could mislead the reader into assuming there's a functional difference or at least some meaningful difference in intent. Of course we could resolve that by requiring the annotation everywhere, but that seems overzealous given it has no functional effect, and surveying our existing code, the only places that use it are a handful of older TF protos (newer TF protos generally don't seem to use it):
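For readers less familiar with the wire format, here is a rough pure-Python sketch of what packing changes (illustrative only; real encoding is done by the protobuf runtime): a packed repeated varint field pays one key and one length prefix for the whole array, while the unpacked form repeats the key once per element.

```python
def varint(n):
    # Protobuf base-128 varint encoding of a non-negative integer.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_unpacked(field_number, values):
    # One key (wire type 0, varint) per element.
    key = varint(field_number << 3 | 0)
    return b"".join(key + varint(v) for v in values)

def encode_packed(field_number, values):
    # One key (wire type 2, length-delimited) plus a payload of varints.
    payload = b"".join(varint(v) for v in values)
    key = varint(field_number << 3 | 2)
    return key + varint(len(payload)) + payload

steps = list(range(1000))
assert len(encode_packed(1, steps)) < len(encode_unpacked(1, steps))
```

In proto3, both spellings of a repeated scalar numeric field produce the packed encoding, which is why the annotation is redundant.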
Cool, no problem. Removed annotations and noted packing in a comment.
```proto
  }
}

// A column-major sequence of tensor points. Arrays `step`, `wall_time`, and
```
I assume the main motivation for doing this column-major unlike TB.dev is compactness of the wire representation? Or was there another motivation?
I imagine it's significant space overhead for scalars where each point is very small, but it seems less significant for tensors and blobs, so I'd be a bit curious to hear some reasoning for this, since it does have the downside of not guarding against having arrays of different lengths.
Compactness and consistency. I don’t feel strongly about it. FWIW, the Python client just calls `zip(x.step, x.wall_time, x.value)`, so guarding against different lengths doesn’t complicate the code.

Can make this `repeated TensorDatum` instead, if you prefer?
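Concretely, the client-side reconstruction described here looks roughly like this (with a stand-in class for the column-major message, not the generated proto type):

```python
class ScalarData:
    # Stand-in for the column-major proto message: three parallel arrays.
    def __init__(self, step, wall_time, value):
        self.step = step
        self.wall_time = wall_time
        self.value = value

def to_points(x):
    # zip over the parallel arrays yields row-major
    # (step, wall_time, value) tuples. Note that zip silently truncates
    # to the shortest array, so an explicit length check is still cheap
    # insurance against malformed responses.
    if not (len(x.step) == len(x.wall_time) == len(x.value)):
        raise ValueError("parallel arrays have mismatched lengths")
    return list(zip(x.step, x.wall_time, x.value))

data = ScalarData([0, 1], [12.5, 13.5], [0.25, 0.5])
assert to_points(data) == [(0, 12.5, 0.25), (1, 13.5, 0.5)]
```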
It's fine as-is, I was just curious. I forgot `zip()` can do more than 2 iterables, which indeed makes it pretty straightforward to reconstruct.
Cool.
wchargin-branch: data-provider-proto wchargin-source: dd71499e2fb2217daa8428817de4cc80623fbc70
Thanks! Took most of your suggestions wholesale; left the others
unresolved for easy reading (but replied to all).
wchargin-branch: data-provider-proto wchargin-source: e145c5aadedadbf28fbaa1444375946a13bee269
Thanks for the review!
wchargin-branch: data-provider-proto wchargin-source: 909e850898036d5a078d871b7886805af0dffb08
Summary: In #4314, we added a proto definition for the data provider interface. We now generate the corresponding Rust bindings. Generated with `bazel run //tensorboard/data/server:update_protos`. Test Plan: Run the `:rustboard_core_doc_server`, and note that the `proto` package has bindings for both the normal messages and the gRPC service. wchargin-branch: rust-data-protos wchargin-source: 27b45c9fe4afefdbec8113c9323b620776b47db0
Summary: Suggested by @nfelt on #4314. This hits a flexibility/usability middle ground between “omitted messages are implied to be empty” (which makes it easy to accidentally set `num_points: 0` and wonder why the result is empty) and “you may not set `num_points: 0`” (which excludes legitimate use cases). Test Plan: None needed; this is just a doc update for now, as the service does not yet have any clients or servers. wchargin-branch: data-downsample-required wchargin-source: fc16494b1fb98129c5265b18aa42a1e3ba2ba71c
Summary: The data provider API uses `list_plugins` to populate the list of active dashboards. I forgot to add a corresponding RPC in #4314, since I’d hacked around it in my prototypes, but we do want to be able to implement this cleanly. Test Plan: It builds: `bazel build //tensorboard/data/proto:protos_all_py_pb2_grpc`. wchargin-branch: data-list-plugins-rpc wchargin-source: 6ab2a2acb154faac3310c4bac690554d7a9e1e74
Summary:
This patch adds a gRPC interface that mirrors the `DataProvider` Python interface. The intent is to define a `GrpcDataProvider` Python type that is backed by a gRPC service.

The particular service definition here is what I’ve used in drafts of data provider implementations in Go and in Rust, so I can attest that it’s at least mostly reasonable on both the client and the server. It’s mostly a one-to-one port of the Python interface and admits an easy Python client.

Test Plan:
It builds: `bazel build //tensorboard/data/proto:protos_all_py_pb2_grpc`.

wchargin-branch: data-provider-proto