notebook: don’t filter polled instances by PID (#4407)

wchargin · web-flow · commit 546d1b6e42b9 · 2020-12-01T10:47:14.000-08:00
Summary:
When the `%tensorboard` cell magic is invoked, we compute a cache key for the “hermetic environment”, primarily the args to `%tensorboard` and the working directory. We first check whether any running TensorBoard instances match that cache key, and launch a new instance if none do. But then, while polling for the new instance to have launched, we had a different matching criterion, checking for a process ID match instead of a cache key match.

The idea was that “is this TensorBoard instance’s PID equal to the PID of the subprocess that we just spawned?” would be a more reliable check. But on Windows ((╯°□°）╯︵ ┻━┻) this is not the case, presumably because the `tensorboard` console script has some kind of wrapper process in certain versions of Python. This manifested as “`%tensorboard` always times out on the first invocation, but works immediately when I invoke it again”, since invoking it again triggers the cache key check rather than the PID check.

So we now just check by cache key in all cases, and the logic is consistent, if a bit less precise overall.

Fixes #4300.

Test Plan:
Still works for me on Linux, with both new and existing TensorBoard processes across multiple (concurrent) cache keys. @stephanwlee can repro the bug and confirm the fix on Windows with Python 3.8.

wchargin-branch: notebook-poll-no-pid-filter
diff --git a/tensorboard/manager.py b/tensorboard/manager.py
@@ -401,13 +401,10 @@ def start(arguments, timeout=datetime.timedelta(seconds=60)):
       A `StartReused`, `StartLaunched`, `StartFailed`, or `StartTimedOut`
       object.
     """
-    match = _find_matching_instance(
-        cache_key(
-            working_directory=os.getcwd(),
-            arguments=arguments,
-            configure_kwargs={},
-        ),
+    this_cache_key = cache_key(
+        working_directory=os.getcwd(), arguments=arguments, configure_kwargs={},
     )
+    match = _find_matching_instance(this_cache_key)
     if match:
         return StartReused(info=match)
 
@@ -438,9 +435,11 @@ def start(arguments, timeout=datetime.timedelta(seconds=60)):
                 stdout=_maybe_read_file(stdout_path),
                 stderr=_maybe_read_file(stderr_path),
             )
-        for info in get_all():
-            if info.pid == p.pid and info.start_time >= start_time_seconds:
-                return StartLaunched(info=info)
+        info = _find_matching_instance(this_cache_key)
+        if info:
+            # Don't check that `info.pid == p.pid`, since on Windows that may
+            # not be the case: see #4300.
+            return StartLaunched(info=info)
     else:
         return StartTimedOut(pid=p.pid)
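
To illustrate the failure mode described in the summary, here is a minimal standalone sketch, not the code in `tensorboard/manager.py`. It assumes a launcher whose spawned PID differs from the PID that the server eventually registers, as presumably happens with the console-script wrapper on Windows. `InstanceInfo`, `find_by_pid`, `find_by_cache_key`, and `poll` are hypothetical names invented for this example, and the PIDs and cache key are made up.

```python
import dataclasses
import time
from typing import Callable, List, Optional


@dataclasses.dataclass(frozen=True)
class InstanceInfo:
    """Hypothetical stand-in for a registered TensorBoard instance record."""

    pid: int
    cache_key: str


def find_by_pid(instances: List[InstanceInfo], pid: int) -> Optional[InstanceInfo]:
    # Old polling criterion: match on the PID of the process we spawned.
    for info in instances:
        if info.pid == pid:
            return info
    return None


def find_by_cache_key(instances: List[InstanceInfo], key: str) -> Optional[InstanceInfo]:
    # New polling criterion: match on the cache key, as the reuse check does.
    for info in instances:
        if info.cache_key == key:
            return info
    return None


def poll(
    find: Callable[[], Optional[InstanceInfo]],
    attempts: int = 3,
    interval_seconds: float = 0.0,
) -> Optional[InstanceInfo]:
    # Call `find` up to `attempts` times; None here models a timeout.
    for _ in range(attempts):
        info = find()
        if info is not None:
            return info
        time.sleep(interval_seconds)
    return None


# Assumed scenario: we spawned PID 100 (a wrapper), but the real server
# registered itself under PID 101 with the expected cache key.
spawned_pid = 100
this_cache_key = "example-key"
registered = [InstanceInfo(pid=101, cache_key=this_cache_key)]

by_pid = poll(lambda: find_by_pid(registered, spawned_pid))
by_key = poll(lambda: find_by_cache_key(registered, this_cache_key))
```

In this scenario the PID-based poll never finds the instance (the analogue of `StartTimedOut`), while the cache-key poll finds it immediately. The trade-off noted in the summary still applies: a cache-key match cannot tell a freshly launched instance apart from another instance that happens to share the key.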