Skip to content

uploader: request ServerInfo from frontend #2879

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Nov 15, 2019

Conversation

wchargin
Copy link
Contributor

@wchargin wchargin commented Nov 2, 2019

Summary:
This commit integrates the new ServerInfo RPC with the uploader. It’s
not currently enabled by default: the current behavior is the same as
the existing behavior, except that experiment URLs now properly have a
trailing slash. We’ll soon remove the hard-coded API backend endpoint
behavior to enable this by default.

Test Plan:
Running a test frontend and a test backend, we observe the following
behavior with different arguments:

--origin --api_endpoint URL origin Backend
empty empty prod prod
empty prod prod prod
empty test prod test
test empty test test
test test test test
test prod test prod

Here, “test” in the --origin column is like http://localhost:8080,
and “test” in the --api_endpoint column is like localhost:10000.
Note that the no-argument case is equivalent to the explicitly-empty
argument case because both arguments have empty default values.

Explicitly specifying --origin https://tensorboard.dev, with any value
of --api_endpoint, fails with “Corrupt response from backend” because
server-side support has not yet been rolled out. This is expected.

Specifying --origin http://localhost:0 or any other unreachable host
fails with ECONNREFUSED and a nice message.

My test frontend is configured to reject clients below version 2.0.0 and
warn on clients below version 2.0.1. Changing the local version.py to
2.0.0a0 or 2.0.1a0 exercises these cases.

Finally, double-checked that building the Pip package, installing it,
and running tensorboard dev list properly uses the production backend
and prints URLs that resolve to the production frontend.

wchargin-branch: uploader-serverinfo-request

Summary:
This commit adds an RPC definition by which the uploader can connect to
the frontend web server at the start of an upload session. This resolves
a number of outstanding issues:

  - The frontend can tell the uploader which backend server to connect
    to, rather than requiring a hard-coded endpoint in the uploader.
  - The frontend can tell the uploader how to generate experiment URLs,
    rather than requiring the backend server to provide this information
    (which it can’t, really, in general).
  - The frontend can check whether the uploader client is recent enough
    and instruct the end user to update if it’s not.
  - The frontend can warn the user about transient issues in case the
    service is down, degraded, under maintenance, etc.

An endpoint `https://tensorboard.dev/api/uploader` on the server will
provide this information.

Test Plan:
Unit tests suffice.

wchargin-branch: uploader-serverinfo-protos
Summary:
This commit integrates the new `ServerInfo` RPC with the uploader. It’s
not currently enabled by default: the current behavior is the same as
the existing behavior, except that experiment URLs now properly have a
trailing slash. To enable the behavior, change the `--api_endpoint` flag
to an empty string. We’ll soon make this the default.

Test Plan:
Write a simple WSGI server with a placeholder response:

```python
from wsgiref import simple_server

import werkzeug

data = (
    b"\x0a\x02"  # compatibility
    b"\x08\x01"  # 1.verdict
    b"\x12\x19"  # api_server
    b"\x0a\x17api.tensorboard.dev:443"  # 1.endpoint
    b"\x1a\x36"  # url_format
    b"\x0a\x2bhttps://www.example.com/experiment/{{EID}}/"  # 1.template
    b"\x12\x07{{EID}}"  # 1.placeholder
)

@werkzeug.Request.application
def app(request):
  return werkzeug.Response(data)

simple_server.make_server("localhost", 1234, app).serve_forever()
```

Then run with `--origin http://localhost:1234 --api_endpoint ""` and
note that the uploader uploads to the production backend but prints URLs
under `https://www.example.com`. Try the following changes:

  - After `1.verdict`, add `b"\x12\x05ahoy!"` to add a details message,
    and then in `compatibility` change `\x02` to `\x09` (length tag).
  - In `1.verdict`, change `\x01` (OK) to `\x02` (warning) or `\x03`
    (error).
  - Remove the `api_server` and `1.endpoint` lines and verify that the
    uploader fails on the client side.
  - Change the whole response to just `\x0a\x01` and verify that the
    uploader fails with a “truncated message” parse error.

wchargin-branch: uploader-serverinfo-request
wchargin-branch: uploader-serverinfo-protos
wchargin-source: 16d41149a8ce9e2e816df874217a7da10d83810a
wchargin-branch: uploader-serverinfo-protos
wchargin-source: 16d41149a8ce9e2e816df874217a7da10d83810a
wchargin-branch: uploader-serverinfo-request
wchargin-source: f59f688bf6b944d162bfd40b4daf760ed2220011
wchargin-branch: uploader-serverinfo-protos
wchargin-source: f4f3624d7a37f1f2229970b0e80c28052f8b000a
wchargin-branch: uploader-serverinfo-request
wchargin-source: 0f24c0533a9e1c2a304f357117a20a7a76bd5c2c
wchargin-branch: uploader-serverinfo-protos
wchargin-source: d11627f29373cb13ececc774071c3cce365a7136
wchargin-branch: uploader-serverinfo-request
wchargin-source: 9a504c52657f3297c45ffe15b2204a49023142f2
wchargin-branch: uploader-serverinfo-request
wchargin-source: 9a504c52657f3297c45ffe15b2204a49023142f2
type=str,
default='api.tensorboard.dev:443',
help='URL for the API server accepting write requests.')
help='Experimental. Direct URL for the API server accepting '
'write requests. If set, will skip initial server handshake.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment on the other PR, but if passing --origin in addition to --api_endpoint it makes sense to me to still do the server handshake.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

def _handle_server_info(info):
compat = info.compatibility
if compat.verdict == server_info_pb2.VERDICT_WARN:
sys.stderr.write('Warning: %s\n' % compat.details)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional, but maybe "Server warning" just to make it a bit clearer in debugging where this is coming from?

Same, also optional, for "Error".

Copy link
Contributor Author

@wchargin wchargin Nov 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm; I like that idea, but “Server error” makes it sound like the server
has an error. We expect these messages usually to be “your client is out
of date”, which is more like a client error than a server error.

What do you think of using “Warning [from server]:” instead?

Warning [from server]: Experiencing transient errors; be patient.

Error [from server]: Please upgrade to TensorBoard 3.0.0 or later.

else:
# OK or unknown; assume OK.
if compat.details:
sys.stderr.write('%s\n' % compat.details)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sys.stderr.flush() too?

Though now that I actually go look at the docs, apparently both it and stdout are supposed to be line-buffered, so maybe the fact that I usually flush() after doing a raw write() is just cargo culting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fascinating. I had always assumed that stderr was unbuffered, like it is
in every other environment, because that’s one of the main points of
stderr. And so it was in Python 2, but indeed this appears to have
changed in Python 3. I’ll keep this in mind from now on; thanks.

(It’s not clear to me whether we should flush here: superfluous in the
interactive case, but not when outputting to a file. I guess the perf
isn’t really a problem since this isn’t in a loop, so I’ll go ahead and
add it.)

wchargin-branch: uploader-serverinfo-request
wchargin-source: dc8743d2e4a8896a10b5606768d44dea23868cd2
wchargin-branch: uploader-serverinfo-request
wchargin-source: dc8743d2e4a8896a10b5606768d44dea23868cd2
@wchargin wchargin changed the base branch from wchargin-uploader-serverinfo-protos to master November 9, 2019 02:21
@wchargin
Copy link
Contributor Author

wchargin commented Nov 9, 2019

Re-review not yet required.

wchargin-branch: uploader-serverinfo-request
wchargin-source: 4530572af22b41a4f61b573eb6998b30475b6e97
wchargin-branch: uploader-serverinfo-request
wchargin-source: 4530572af22b41a4f61b573eb6998b30475b6e97
@wchargin wchargin requested a review from nfelt November 14, 2019 18:40
@wchargin
Copy link
Contributor Author

Changes addressed, mostly straightforwardly, and tested with a deployed
dev server, which correctly connects and prints working links.

Copy link
Contributor

@nfelt nfelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like submitting this as-is would mean that tb-nightly's uploader would see a regression in behavior against current tensorboard.dev (until a new deploy with the server-side change) because it won't compute the correct URL pattern and would instead erroneously use localhost.

I'd prefer to avoid that so we don't have to worry as much about sequencing the changes. Can we just do that by replacing the http://localhost:8080 passed to create_server_info() with https://tensorboard.dev?

Thinking about it more I'm kind of inclined to avoid http://localhost:8080 as a default value anyway, even after there's no backwards compatibility concern, since we really have no idea what might be running on http://localhost:8080. It seems better to not even attempt to emit a URL if we have no clue whether it will resolve, and instead just emit the raw EID in that case or something, since if a user is developing with no frontend the URL is meaningless anyway, and if they do have a frontend, it would be better to explicitly pass it into --origin.

"http://localhost:8080", flags.api_endpoint)
origin = flags.origin or _DEFAULT_ORIGIN
server_info = server_info_lib.fetch_server_info(
flags.origin or _DEFAULT_ORIGIN)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should just be origin here right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; thanks.

Copy link
Contributor Author

@wchargin wchargin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, you’re right. It connects correctly, but emits URLs with the wrong
origin. This certainly isn’t intentional; it was meant to be both
forward and backward compatible.

So I think the right table is

Origin given No origin
API given Handshake, then override endpoint Construct fake info without handshake
No API Handshake Set origin to default, then proceed as ↖

where currently we’re in the top-right cell by default—as it’s the only
cell that doesn’t end with a handshake—and use https://tensorboard.dev
as the fake origin. After the server-side change rolls out, we move to
the bottom-right cell by default, and can optionally change the behavior
of the top-right to emit no URL instead of either tensorboard.dev or
localhost. But we need that behavior for now for compatibility.

Thanks for the catch; will fix.

"http://localhost:8080", flags.api_endpoint)
origin = flags.origin or _DEFAULT_ORIGIN
server_info = server_info_lib.fetch_server_info(
flags.origin or _DEFAULT_ORIGIN)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; thanks.

Copy link
Contributor

@nfelt nfelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that matches my understanding.

wchargin-branch: uploader-serverinfo-request
wchargin-source: 3e74ef13b1117399321272c1bb487e000664b872
wchargin-branch: uploader-serverinfo-request
wchargin-source: 3e74ef13b1117399321272c1bb487e000664b872
@wchargin wchargin requested a review from nfelt November 14, 2019 22:30
@wchargin
Copy link
Contributor Author

(updated test plan)

origin = flags.origin or _DEFAULT_ORIGIN
server_info = server_info_lib.fetch_server_info(
flags.origin or _DEFAULT_ORIGIN)
if flags.api_endpoint and not flags.origin:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we actually should also remove the argparse-facing default value for --api_endpoint too, because otherwise passing --origin=https://staging.tensorboard.dev will not behave as desired, since the flags.api_endpoint override behavior will still kick in and default to the prod API. We can instead just set the hardcoded API endpoint as a fallback behavior used for the create_server_info() path only, which gets us closer to our desired final state.

Then we'd have the following code for now:

_DEFAULT_ORIGIN = 'https://tensorboard.dev'
_HARDCODED_API_ENDPOINT = 'api.tensorboard.dev:443'

def _get_server_info(flags):
  origin = flags.origin or _DEFAULT_ORIGIN
  if not flags.origin:
    # Temporary fallback to hardcoded API endpoint in the unspecified case.
    api_endpoint = flags.api_endpoint or _HARDCODED_API_ENDPOINT
    return server_info_lib.create_server_info(origin, api_endpoint)
  # same from here down

And after the server-side rollout, we'd remove _HARDCODED_API_ENDPOINT and update it to:

if not flags.origin and flags.api_endpoint:
  return server_info_lib.create_server_info(origin, flags.api_endpoint)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we actually should also remove the argparse-facing default
value for --api_endpoint too, because otherwise passing
--origin=https://staging.tensorboard.dev will not behave as desired,
since the flags.api_endpoint override behavior will still kick in
and default to the prod API

Correct (this is why the test plan passes --api_endpoint "" explicitly
in each case). This was fine with me, as it’s only temporary and only
for devs, so I preferred to keep the minor inconvenience so that the
later change could just be a simple flag flip rather than a flag flip
plus removing some logic.

Does this make sense? I’m happy to make the change if you still prefer
it that way.

Copy link
Contributor

@nfelt nfelt Nov 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sorry, should have also read the updated test plan.

That said, even though it's only dev-facing, I don't really love that it's possible to run --origin=https://staging.tensorboard.dev and encounter pretty misleading behavior (data uploaded to the wrong place, printed URL doesn't resolve). It seems pretty easy for one of us to accidentally hit that during testing. I'd be more comfortable making that work correctly now; it seems fine to me that the later change removes fallback logic since I think that's actually more reflective of what we're doing than just flipping a flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem; that makes sense. Done. I think that you’re probably right,
as this simplifies the test plan, which is a good sign. :-)

wchargin-branch: uploader-serverinfo-request
wchargin-source: 0622d92604e72a5d661759d8ce39002433e354ef
Copy link
Contributor

@nfelt nfelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and actually read the test plan this time too :)

@wchargin
Copy link
Contributor Author

Thanks for your patience and helpful reviews!

@wchargin wchargin merged commit e76c6a9 into master Nov 15, 2019
@wchargin wchargin deleted the wchargin-uploader-serverinfo-request branch November 15, 2019 22:13
wchargin added a commit that referenced this pull request Nov 21, 2019
Summary:
We’ve deployed production servers that support the handshake protocol
specified in #2878 and implemented on the client in #2879. This commit
enables that protocol by default.

Test Plan:
Running `bazel run //tensorboard -- dev list` still properly connects
and prints valid URLs. Re-running with the TensorBoard version patched
to `2.0.0a0` (in `version/version.py`) properly causes a handshake
failure. Setting `--origin` to point to a non-prod frontend properly
connects to the appropriate backend. Setting `--api_endpoint` to point
to a non-prod backend connects directly, skipping the handshake, and
printing `https://tensorboard.dev` URLs. Specifying both `--origin` and
`--api_endpoint` performs the handshake and overrides the backend
server only, printing URLs corresponding to the listed frontend.

Running `git grep api.tensorboard.dev` no longer finds any code results.

As a double check, building the Pip package and running it in a new
virtualenv still has a working `tensorboard dev upload` flow.

wchargin-branch: uploader-handshake
wchargin added a commit that referenced this pull request Nov 22, 2019
Summary:
We’ve deployed production servers that support the handshake protocol
specified in #2878 and implemented on the client in #2879. This commit
enables that protocol by default.

Test Plan:
Running `bazel run //tensorboard -- dev list` still properly connects
and prints valid URLs. Re-running with the TensorBoard version patched
to `2.0.0a0` (in `version/version.py`) properly causes a handshake
failure. Setting `--origin` to point to a non-prod frontend properly
connects to the appropriate backend. Setting `--api_endpoint` to point
to a non-prod backend connects directly, skipping the handshake, and
printing `https://tensorboard.dev` URLs. Specifying both `--origin` and
`--api_endpoint` performs the handshake and overrides the backend
server only, printing URLs corresponding to the listed frontend.

Running `git grep api.tensorboard.dev` no longer finds any code results.

As a double check, building the Pip package and running it in a new
virtualenv still has a working `tensorboard dev upload` flow.

wchargin-branch: uploader-handshake
wchargin added a commit to wchargin/tensorboard that referenced this pull request Nov 25, 2019
Summary:
This commit integrates the new `ServerInfo` RPC with the uploader. It’s
not currently enabled by default: the current behavior is the same as
the existing behavior, except that experiment URLs now properly have a
trailing slash. We’ll soon remove the hard-coded API backend endpoint
behavior to enable this by default.

Test Plan:
Running a test frontend and a test backend, we observe the following
behavior with different arguments:

| `--origin` | `--api_endpoint` | → | URL origin | Backend |
|------------|------------------|---|------------|---------|
| empty      | empty            |   | prod       | prod    |
| empty      | prod             |   | prod       | prod    |
| empty      | test             |   | prod       | test    |
| test       | empty            |   | test       | test    |
| test       | test             |   | test       | test    |
| test       | prod             |   | test       | prod    |

Here, “test” in the `--origin` column is like `http://localhost:8080`,
and “test” in the `--api_endpoint` column is like `localhost:10000`.
Note that the no-argument case is equivalent to the explicitly-empty
argument case because both arguments have empty default values.

Explicitly specifying `--origin https://tensorboard.dev`, with any value
of `--api_endpoint`, fails with “Corrupt response from backend” because
server-side support has not yet been rolled out. This is expected.

Specifying `--origin http://localhost:0` or any other unreachable host
fails with `ECONNREFUSED` and a nice message.

My test frontend is configured to reject clients below version 2.0.0 and
warn on clients below version 2.0.1. Changing the local `version.py` to
`2.0.0a0` or `2.0.1a0` exercises these cases.

Finally, double-checked that building the Pip package, installing it,
and running `tensorboard dev list` properly uses the production backend
and prints URLs that resolve to the production frontend.

wchargin-branch: uploader-serverinfo-request
wchargin added a commit to wchargin/tensorboard that referenced this pull request Nov 25, 2019
Summary:
We’ve deployed production servers that support the handshake protocol
specified in tensorflow#2878 and implemented on the client in tensorflow#2879. This commit
enables that protocol by default.

Test Plan:
Running `bazel run //tensorboard -- dev list` still properly connects
and prints valid URLs. Re-running with the TensorBoard version patched
to `2.0.0a0` (in `version/version.py`) properly causes a handshake
failure. Setting `--origin` to point to a non-prod frontend properly
connects to the appropriate backend. Setting `--api_endpoint` to point
to a non-prod backend connects directly, skipping the handshake, and
printing `https://tensorboard.dev` URLs. Specifying both `--origin` and
`--api_endpoint` performs the handshake and overrides the backend
server only, printing URLs corresponding to the listed frontend.

Running `git grep api.tensorboard.dev` no longer finds any code results.

As a double check, building the Pip package and running it in a new
virtualenv still has a working `tensorboard dev upload` flow.

wchargin-branch: uploader-handshake
@wchargin wchargin mentioned this pull request Nov 25, 2019
wchargin added a commit that referenced this pull request Nov 25, 2019
Summary:
This commit integrates the new `ServerInfo` RPC with the uploader. It’s
not currently enabled by default: the current behavior is the same as
the existing behavior, except that experiment URLs now properly have a
trailing slash. We’ll soon remove the hard-coded API backend endpoint
behavior to enable this by default.

Test Plan:
Running a test frontend and a test backend, we observe the following
behavior with different arguments:

| `--origin` | `--api_endpoint` | → | URL origin | Backend |
|------------|------------------|---|------------|---------|
| empty      | empty            |   | prod       | prod    |
| empty      | prod             |   | prod       | prod    |
| empty      | test             |   | prod       | test    |
| test       | empty            |   | test       | test    |
| test       | test             |   | test       | test    |
| test       | prod             |   | test       | prod    |

Here, “test” in the `--origin` column is like `http://localhost:8080`,
and “test” in the `--api_endpoint` column is like `localhost:10000`.
Note that the no-argument case is equivalent to the explicitly-empty
argument case because both arguments have empty default values.

Explicitly specifying `--origin https://tensorboard.dev`, with any value
of `--api_endpoint`, fails with “Corrupt response from backend” because
server-side support has not yet been rolled out. This is expected.

Specifying `--origin http://localhost:0` or any other unreachable host
fails with `ECONNREFUSED` and a nice message.

My test frontend is configured to reject clients below version 2.0.0 and
warn on clients below version 2.0.1. Changing the local `version.py` to
`2.0.0a0` or `2.0.1a0` exercises these cases.

Finally, double-checked that building the Pip package, installing it,
and running `tensorboard dev list` properly uses the production backend
and prints URLs that resolve to the production frontend.

wchargin-branch: uploader-serverinfo-request
wchargin added a commit that referenced this pull request Nov 25, 2019
Summary:
We’ve deployed production servers that support the handshake protocol
specified in #2878 and implemented on the client in #2879. This commit
enables that protocol by default.

Test Plan:
Running `bazel run //tensorboard -- dev list` still properly connects
and prints valid URLs. Re-running with the TensorBoard version patched
to `2.0.0a0` (in `version/version.py`) properly causes a handshake
failure. Setting `--origin` to point to a non-prod frontend properly
connects to the appropriate backend. Setting `--api_endpoint` to point
to a non-prod backend connects directly, skipping the handshake, and
printing `https://tensorboard.dev` URLs. Specifying both `--origin` and
`--api_endpoint` performs the handshake and overrides the backend
server only, printing URLs corresponding to the listed frontend.

Running `git grep api.tensorboard.dev` no longer finds any code results.

As a double check, building the Pip package and running it in a new
virtualenv still has a working `tensorboard dev upload` flow.

wchargin-branch: uploader-handshake
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants