[WIP][DO NOT ACCEPT][Runtime] Add runtime Executor class #2133
Conversation
Description: This commit adds `Executor`, an interface for a component of the Glow runtime that is capable of running a partitioned directed acyclic Glow graph and returning the results of that execution. This commit also adds `ThreadPoolExecutor`, an implementation of this interface that runs graphs using a thread pool whose workers usher every received request to completion through a series of state transitions.

Testing: It compiles.

Documentation: Comments.
```cpp
#include <string>

namespace glow {
namespace runtime {
```
I considered adding this as well and think it's a good idea.
I like the clear divide it makes. I think it would be the only sub namespace in the codebase. @opti-mix we've discussed namespaces in the past, what are your thoughts?
I think using separate namespaces for runtime and the like is fine.
@SplitInfinity A question about this: I'm thinking e.g. about the use-case where we have a bunch of pre-compiled functions and run some of them based on incoming requests.
@SplitInfinity A couple more questions:
@SplitInfinity Do we have a design document with some of the envisioned typical use-cases and workflows described? I think it would be very useful for this kind of API. Otherwise it is too easy to miss or forget some important use-cases.
Thanks for the quick feedback!

@nickgg and I talked about this. Maybe it is a good idea to make a design document instead of jumping right into coding.
```cpp
  std::map<Function *, std::list<Function *>> outgoing;
  // Output placeholder names for each function.
  std::map<Function *, std::list<std::string>> outputs;
};
```
Why did you go with multiple maps of `Function*` -> Thing rather than one map of `Function*` -> struct of Things? These are likely very small maps, but still, it's more lookups.

I know you said you hadn't updated this since the discussion this morning, but just a reminder that `Function*` isn't an identifier available at this level: we need something like (DeviceNetworkID + FunctionName + DeviceID?).
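A rough sketch of the consolidated shape the reviewer suggests. The struct and key type here are hypothetical stand-ins (the real key would be whatever composite identifier replaces `Function*`):

```cpp
#include <list>
#include <map>
#include <string>

// Hypothetical: group the per-function metadata into one struct, keyed by a
// stable string identifier rather than Function*, which isn't available at
// this level.
struct FunctionMeta {
  std::list<std::string> outgoing; // identifiers of dependent functions
  std::list<std::string> outputs;  // output placeholder names
};

// One lookup now fetches everything known about a function.
std::map<std::string, FunctionMeta> functionMeta;
```

With this layout, each function's edges and outputs are fetched with a single `at()` call instead of one lookup per parallel map.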
```cpp
/// This enum lists the available executors.
enum class ExecutorKind {
  ThreadPool, // Executor backed by a thread pool.
};
```
What other ExecutorKinds do you envisage?
```cpp
/// A mutex to guard accesses to class members.
/// TODO: This can probably be eliminated by making certain members
/// std::atomic.
std::mutex mtx_;
```
Potentially, in the future, we could avoid this by using a stranded thread pool, which keeps all accesses on the same thread.
```cpp
ThreadPoolExecutor::ThreadPoolExecutor(unsigned numWorkers) {
  // Initialize all workers and make each one run workerMain.
  for (unsigned i = 0; i < numWorkers; i++) {
    std::thread th(std::bind(&ThreadPoolExecutor::workerMain, this));
```
std::bind is hard to debug around, how about a lambda here?
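A sketch of the lambda version, with a simplified, hypothetical class shape (`workerMain` here just records that it ran, standing in for the real worker loop):

```cpp
#include <atomic>
#include <thread>
#include <vector>

class ThreadPoolExecutor {
public:
  explicit ThreadPoolExecutor(unsigned numWorkers) {
    for (unsigned i = 0; i < numWorkers; i++) {
      // Capture `this` so the worker can call the member function directly;
      // this shows up far more clearly in stack traces than std::bind.
      workers_.emplace_back([this]() { workerMain(); });
    }
  }

  void joinAll() {
    for (auto &w : workers_)
      if (w.joinable()) w.join();
  }

  ~ThreadPoolExecutor() { joinAll(); }

  unsigned started() const { return started_.load(); }

private:
  // Stand-in for the real worker loop.
  void workerMain() { started_.fetch_add(1); }

  std::atomic<unsigned> started_{0};
  std::vector<std::thread> workers_;
};
```

The lambda also avoids the argument-binding surprises of `std::bind` (e.g. silently copying arguments) and lets the compiler inline the call.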
```cpp
/// manage execution.
struct ExecutorFunctionDAG {
  // The list of functions to run for this DAG, topologically sorted.
  std::list<Function *> functions;
```
We also need to store which device each function is loaded onto.
Why `list` instead of `vector` or `SmallVector`?
```cpp
  // If isMoreWork() returned true but getNext() returned nothing, this work
  // item has more work that needs doing but not until some dependencies are
  // fulfilled. Requeue it so that another worker will look at it again.
```
This feels like spinning over getNext(). Could you instead trigger processWorkItem() in markSuccess?
@opti-mix @SplitInfinity Both Function and CompiledFunction are needed in the DeviceManager (for compile and execute, respectively), and if at all possible I think we should not share them with the Executor. Any object sharing in this context will require solving thread safety in the Graph. The interface in #2134 uses identifiers for Module and Function.

My personal opinion is that callbacks are fine here, at least for the prototype. We should make sure we're getting a benefit before jumping in on something more complicated (and maybe not cross-platform?).
@SplitInfinity @nickgg There seems to be some commonality in structure between this and #2134. Both implement a basic work queue, but with some specific details, like the DAG structure in this case. Are these implementation-specific details so great that we can't use a generic executor in both cases and avoid all the extra code for synchronization and such? Could we just re-enqueue the next piece of work here, and could #2134's CPUDeviceManager be implemented with just a generic executor with one thread?
Good point @jackm321. I think we should move the ONNXIFI thread pool class into somewhere common and use that everywhere.
Callbacks should be fine for a prototype as long as most or all important use-cases can be expressed in their terms. I provided some examples where the use of callbacks could be rather difficult. But it may very well be that we don't need to handle such use-cases. That's why I asked about a doc describing the high-level design, the responsibilities of the entity (be it Executor or DeviceManager), and the envisioned important use-cases.
What are the exact responsibilities of DeviceManager? Is it described anywhere, e.g. at least in the high-level class comments? Why does it need to control compilation? |
@opti-mix - there are some design notes in the top-level task #2045, and @gcatron can point you at the design doc. Can you give me an example of something specific that cannot be done, or is difficult to achieve, with the callback pattern? The DeviceManager piece is the interface to the physical (or, in the case of CPU backends, logical) device. Specifically, it holds the backend, which does the compilation, and so in the design it's the part that controls compilation. It makes sense to have compile & run in the same place since they're device-specific, I think.
Thanks for the links! Appreciated.
For example, waiting for a set of multiple events to happen in a specific or any order (e.g. one of the actions/all actions are completed). Events may stem from the execution of different NN models running on the device. Or, e.g. you transfer data to the device using DMA (asynchronously) and you need to check that all/some of transfers have completed before you can proceed. Generally, IMHO, events (because they are data) can be composed better than callbacks (which is code). BTW, you may want to have a look at e.g. OpenCL for an example of APIs which are very much event-based, rather than callback-based.
Oh, I see. The idea that DeviceManager holds the backend is a possible solution, but it could be argued that the backend could also live outside. And whatever it produces (e.g. CompiledFunctions) could be then run on different DeviceManagers "compatible" with this backend (e.g. you have multiple devices of the same kind available). It would be interesting to see the pros/cons of different approaches considered. But this is a longer discussion. |
I scanned the code. It seems that the `ThreadPoolExecutor` takes a DAG as a work item and keeps scanning those items for one pass and re-enqueuing them. This is a lot of overhead, and the granularity of the task is a bit coarse.

IMO, the work item could be a node in the DAG. When you get a DAG to schedule, you can enqueue all its root nodes, and use callbacks to mark successor nodes; when the dependency counter of a successor node goes to 0, enqueue it to the task queue. In this case, all the nodes in the DAG will be scheduled in topological order onto the thread pool, and you don't need to do any scanning, which saves some bookkeeping like `completedFunctions_` and `inflightFunctions_`.
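A minimal single-threaded sketch of the dependency-counter scheduling described above (all names are illustrative, not Glow types; in the real executor the ready queue would feed the thread pool, and the decrement would happen in the completion callback):

```cpp
#include <map>
#include <queue>
#include <string>
#include <vector>

// Hypothetical DAG: each node has a list of successors and a counter of
// unmet dependencies.
struct DAG {
  std::map<std::string, std::vector<std::string>> successors;
  std::map<std::string, unsigned> dependencies;
};

// Enqueue the roots, then "execute" ready nodes; when a node finishes,
// decrement each successor's counter and enqueue it on reaching zero.
// Returns the execution order, which is a topological order of the DAG.
std::vector<std::string> runTopological(DAG &dag) {
  std::vector<std::string> order;
  std::queue<std::string> ready;
  for (auto &kv : dag.dependencies)
    if (kv.second == 0)
      ready.push(kv.first); // roots have no dependencies
  while (!ready.empty()) {
    std::string node = ready.front();
    ready.pop();
    order.push_back(node); // stand-in for executing the function
    for (auto &succ : dag.successors[node])
      if (--dag.dependencies[succ] == 0)
        ready.push(succ); // all dependencies met; now schedulable
  }
  return order;
}
```

No scanning or re-enqueuing of partially done DAGs is needed: every node is enqueued exactly once, when its last dependency completes.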
```cpp
Context *postCtx = (ctx_->contexts).at(postrequisite);
Placeholder *p;
if ((p = postModule->getPlaceholderByName(output))) {
  postCtx->insert(p, ctx->get(p)->clone());
```
Does this involve a copy of the tensor data?
I agree with @opti-mix. It's really just IO here rather than actual CPU time, because you are just sending the workload to the accelerator and waiting. An event-based async interface is more suitable here.
@opti-mix @yinghai I think you may be conflating the DeviceManager with accessing the real Device. For any given device we may want an event based interface, but the specifics of those interfaces will differ in syntax and semantics in ways we don't currently know. Maybe for a particular device they do use epoll/select, maybe the firmware does blocking calls, maybe they do something really strange, they could even use callbacks (?!). The safest and most flexible thing we can do here is to have as minimal and generic an interface to the device as is practical, and isolate the platform specific code to one place - in this design the accelerator specific DeviceManager implementations.