[WIP][DO NOT ACCEPT][Runtime] Add runtime Executor class #2133
Conversation
Description: This commit adds `Executor`, an interface for a component of the Glow runtime that is capable of running a partitioned directed acyclic Glow graph and returning the results of that execution. This commit also adds `ThreadPoolExecutor`, an implementation of this interface that runs graphs using a thread pool whose workers usher every received request to completion through a series of state transitions.

Testing: It compiles.

Documentation: Comments.
```cpp
#include <string>

namespace glow {
namespace runtime {
```
I considered adding this as well and think it's a good idea.
I like the clear divide it makes. I think it would be the only sub namespace in the codebase. @opti-mix we've discussed namespaces in the past, what are your thoughts?
I think using separate namespaces for runtime and the like is fine.
@SplitInfinity A question about this: I'm thinking e.g. about the use-case where we have a bunch of pre-compiled functions and run some of them based on incoming requests.
@SplitInfinity A couple more questions:
@SplitInfinity Do we have a design document with some of the envisioned typical use-cases and workflows described? I think it would be very useful for this kind of API. Otherwise it is too easy to miss or forget some important use-cases.
Thanks for the quick feedback!

@nickgg and I talked about this. Maybe it is a good idea to make a design document instead of jumping right into coding.
```cpp
  std::map<Function *, std::list<Function *>> outgoing;
  // Output placeholder names for each function.
  std::map<Function *, std::list<std::string>> outputs;
};
```
Why did you go with multiple maps of `Function*` -> Thing rather than one map of `Function*` -> struct of Things? These are likely very small maps, but still, it's more lookups.

I know you said you hadn't updated this since the discussion this morning, but just a reminder that `Function*` isn't an identifier available at this level: we need something like (DeviceNetworkID + FunctionName + DeviceID?).
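A rough sketch of the consolidated shape the reviewer suggests. The struct and key type here are hypothetical stand-ins (the real key would be whatever composite identifier replaces `Function*`):

```cpp
#include <list>
#include <map>
#include <string>

// Hypothetical: group the per-function metadata into one struct, keyed by a
// stable string identifier rather than Function*, which isn't available at
// this level.
struct FunctionMeta {
  std::list<std::string> outgoing; // identifiers of dependent functions
  std::list<std::string> outputs;  // output placeholder names
};

// One lookup now fetches everything known about a function.
std::map<std::string, FunctionMeta> functionMeta;
```

With this layout, each function's edges and outputs are fetched with a single `at()` call instead of one lookup per parallel map.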
```cpp
/// This enum lists the available executors.
enum class ExecutorKind {
  ThreadPool, // Executor backed by a thread pool.
};
```
What other ExecutorKinds do you envisage?
```cpp
/// A mutex to guard accesses to class members.
/// TODO: This can probably be eliminated by making certain members
/// std::atomic.
std::mutex mtx_;
```
Potentially, in the future, we could avoid this by using a stranded thread pool, which keeps all accesses on the same thread.
```cpp
ThreadPoolExecutor::ThreadPoolExecutor(unsigned numWorkers) {
  // Initialize all workers and make each one run workerMain.
  for (unsigned i = 0; i < numWorkers; i++) {
    std::thread th(std::bind(&ThreadPoolExecutor::workerMain, this));
```
std::bind is hard to debug around, how about a lambda here?
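A sketch of the lambda version, with a simplified, hypothetical class shape (`workerMain` here just records that it ran, standing in for the real worker loop):

```cpp
#include <atomic>
#include <thread>
#include <vector>

class ThreadPoolExecutor {
public:
  explicit ThreadPoolExecutor(unsigned numWorkers) {
    for (unsigned i = 0; i < numWorkers; i++) {
      // Capture `this` so the worker can call the member function directly;
      // this shows up far more clearly in stack traces than std::bind.
      workers_.emplace_back([this]() { workerMain(); });
    }
  }

  void joinAll() {
    for (auto &w : workers_)
      if (w.joinable()) w.join();
  }

  ~ThreadPoolExecutor() { joinAll(); }

  unsigned started() const { return started_.load(); }

private:
  // Stand-in for the real worker loop.
  void workerMain() { started_.fetch_add(1); }

  std::atomic<unsigned> started_{0};
  std::vector<std::thread> workers_;
};
```

The lambda also avoids the argument-binding surprises of `std::bind` (e.g. silently copying arguments) and lets the compiler inline the call.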
```cpp
/// manage execution.
struct ExecutorFunctionDAG {
  // The list of functions to run for this DAG, topologically sorted.
  std::list<Function *> functions;
```
We also need to store which device each function is loaded onto.
Why `list` instead of `vector` or `SmallVector`?
```cpp
  // If isMoreWork() returned true but getNext() returned nothing, this work
  // item has more work that needs doing but not until some dependencies are
  // fulfilled. Requeue it so that another worker will look at it again.
```
This feels like spinning over getNext(). Could you instead trigger processWorkItem() in markSuccess?
@opti-mix @SplitInfinity Both Function and CompiledFunction are needed in the DeviceManager (for compile and execute, respectively), and if at all possible I think we should not share them with the Executor. Any object sharing in this context will require solving thread safety in the Graph. The interface in #2134 uses identifiers for Module and Function.

My personal opinion is that callbacks are fine here, at least for the prototype. We should make sure we're getting a benefit before jumping in on something more complicated (and maybe not cross-platform?).
@SplitInfinity @nickgg There seems to be some commonality in structure between this and #2134. Both implement a basic work queue, but with some specific details, like the DAG structure in this case. Are these implementation-specific details so great that we can't use a generic executor in both cases and avoid all the extra code for synchronization and such? Could we just re-enqueue the next piece of work here, and could #2134's CPUDeviceManager be implemented with just a generic executor with one thread?
Good point @jackm321. I think we should move the ONNXIFI thread pool class into somewhere common and use that everywhere.
Callbacks should be fine for a prototype as long as most or all important use-cases can be expressed in their terms. I provided some examples where the use of callbacks could be rather difficult. But it may very well be that we don't need to handle such use-cases. That's why I asked about a doc describing the high-level design, the responsibilities of the entity (be it Executor or DeviceManager), and the envisioned important use-cases.
What are the exact responsibilities of DeviceManager? Is it described anywhere, e.g. at least in the high-level class comments? Why does it need to control compilation? |
@opti-mix - there are some design notes in the top-level task #2045, and @gcatron can point you at the design doc. Can you give me an example of something specific that cannot be done, or is difficult to achieve, with the callback pattern? The DeviceManager piece is the interface to the physical (or, in the case of CPU backends, logical) device. Specifically, it holds the backend, which does the compilation, and so in the design it's the part that controls compilation. It makes sense to have compile & run in the same place since they're device-specific, I think.
Thanks for the links! Appreciated.
For example, waiting for a set of multiple events to happen in a specific or any order (e.g. one of the actions/all actions are completed). Events may stem from the execution of different NN models running on the device. Or, e.g. you transfer data to the device using DMA (asynchronously) and you need to check that all/some of transfers have completed before you can proceed. Generally, IMHO, events (because they are data) can be composed better than callbacks (which is code). BTW, you may want to have a look at e.g. OpenCL for an example of APIs which are very much event-based, rather than callback-based.
Oh, I see. The idea that DeviceManager holds the backend is a possible solution, but it could be argued that the backend could also live outside. And whatever it produces (e.g. CompiledFunctions) could be then run on different DeviceManagers "compatible" with this backend (e.g. you have multiple devices of the same kind available). It would be interesting to see the pros/cons of different approaches considered. But this is a longer discussion. |
I scanned the code. It seems that the `ThreadPoolExecutor` takes a DAG as a work item and keeps scanning those items for one pass and re-enqueuing them. This is a lot of overhead, and the granularity of the task is a bit coarse.

IMO, the work item could be a node in the DAG. When you get a DAG to schedule, you can enqueue all its root nodes, and use callbacks to mark successor nodes; when the dependency counter of a successor node goes to 0, enqueue it to the task queue. In this case, all the nodes in the DAG will be scheduled in topological order onto the thread pool, and you don't need to do any scanning, which saves some bookkeeping like `completedFunctions_` and `inflightFunctions_`.
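A minimal single-threaded sketch of the dependency-counter scheduling described above (all names are illustrative, not Glow types; in the real executor the ready queue would feed the thread pool, and the decrement would happen in the completion callback):

```cpp
#include <map>
#include <queue>
#include <string>
#include <vector>

// Hypothetical DAG: each node has a list of successors and a counter of
// unmet dependencies.
struct DAG {
  std::map<std::string, std::vector<std::string>> successors;
  std::map<std::string, unsigned> dependencies;
};

// Enqueue the roots, then "execute" ready nodes; when a node finishes,
// decrement each successor's counter and enqueue it on reaching zero.
// Returns the execution order, which is a topological order of the DAG.
std::vector<std::string> runTopological(DAG &dag) {
  std::vector<std::string> order;
  std::queue<std::string> ready;
  for (auto &kv : dag.dependencies)
    if (kv.second == 0)
      ready.push(kv.first); // roots have no dependencies
  while (!ready.empty()) {
    std::string node = ready.front();
    ready.pop();
    order.push_back(node); // stand-in for executing the function
    for (auto &succ : dag.successors[node])
      if (--dag.dependencies[succ] == 0)
        ready.push(succ); // all dependencies met; now schedulable
  }
  return order;
}
```

No scanning or re-enqueuing of partially done DAGs is needed: every node is enqueued exactly once, when its last dependency completes.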
```cpp
Context *postCtx = (ctx_->contexts).at(postrequisite);
Placeholder *p;
if ((p = postModule->getPlaceholderByName(output))) {
  postCtx->insert(p, ctx->get(p)->clone());
```
Does this involve a copy of the tensor data?
I agree with @opti-mix. It's really just IO here rather than actual CPU time, because you are just sending the workload to the accelerator and waiting. An event-based async interface is more suitable here.
@opti-mix @yinghai I think you may be conflating the DeviceManager with accessing the real Device. For any given device we may want an event based interface, but the specifics of those interfaces will differ in syntax and semantics in ways we don't currently know. Maybe for a particular device they do use epoll/select, maybe the firmware does blocking calls, maybe they do something really strange, they could even use callbacks (?!). The safest and most flexible thing we can do here is to have as minimal and generic an interface to the device as is practical, and isolate the platform specific code to one place - in this design the accelerator specific DeviceManager implementations.