Skip to content

[WIP][DO NOT ACCEPT][Runtime] Add runtime Executor class #2133

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

SplitInfinity
Copy link
Contributor

Description
This commit adds Executor, an interface for a component of the Glow
runtime that is capable of running a partitioned directed acyclic Glow
graph and returning the results of that execution. This commit also adds
ThreadPoolExecutor, an implementation of this interface that runs
graphs using a thread pool whose workers usher every received request to
completion through a series of state transitions.

Testing:
It compiles.

Documentation
The code has comments.

Description:
This commit adds `Executor`, an interface for a component of the Glow
runtime that is capable of running a partitioned directed acyclic Glow
graph and returning the results of that execution. This commit also adds
`ThreadPoolExecutor`, an implementation of this interface that runs
graphs using a thread pool whose workers usher every received request to
completion through a series of state transitions.

Testing:
It compiles.

Documentation:
Comments.
#include <string>

namespace glow {
namespace runtime {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considered adding this also and think it's a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the clear divide it makes. I think it would be the only sub namespace in the codebase. @opti-mix we've discussed namespaces in the past, what are your thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using separate namespaces for runtime and the like is fine.

@opti-mix
Copy link
Contributor

opti-mix commented Dec 6, 2018

@SplitInfinity Question about Functions. The current APIs use Functions, which means that they are also responsible for compiling them. But may be they should actually use CompiledFunctions and let the compilation be controlled by the client of this API? WDYT?

I'm thinking e.g. about the use-case, where we have a bunch of pre-compiled functions and run some of them based on incoming requests.

@opti-mix
Copy link
Contributor

opti-mix commented Dec 7, 2018

@SplitInfinity A couple more questions:

  • What about cancellation of a running DAG? Should we support this scenario? If so, what would be the API? IMHO, the run method (and may be any other async method) should return a token associated with the action performed by such a method. This would allow us to refer to this action later, which can be very useful for different reasons.

  • Your APIs seem to be callback-based. I'm not 100% convinced that it is a right approach. For example, we may run multiple DAGs at the same time and/or we may need to wait for completion (or any other events) from multiple DAGs/actions. This is rather difficult to achieve with callbacks-based APIs, but it probably much simpler using something like tokens described above and an API like epoll where you wait for events on a bunch of tokens...

@opti-mix
Copy link
Contributor

opti-mix commented Dec 7, 2018

@SplitInfinity Do we have a design document with some of the envisioned typical use-cases and workflows described? I think it would be very useful for this kind of APIs. Otherwise it is too easy to miss/forget some important use-cases.

@SplitInfinity SplitInfinity changed the title [WIP][Runtime] Add runtime Executor class [WIP][DO NOT ACCEPT][Runtime] Add runtime Executor class Dec 7, 2018
@SplitInfinity
Copy link
Contributor Author

SplitInfinity commented Dec 7, 2018

@opti-mix

Thanks for the quick feedback!

Question about Functions. The current APIs use Functions, which means that they are also responsible for compiling them. But may be they should actually use CompiledFunctions and let the compilation be controlled by the client of this API? WDYT?

I used Function because I did not understand exactly the role of the Executor in the runtime. After discussing more with the team today, I realized that this is only used in the execution path and not in the compilation path, so using CompiledFunction instead makes sense.

What about cancellation of a running DAG? Should we support this scenario? If so, what would be the API? IMHO, the run method (and may be any other async method) should return a token associated with the action performed by such a method. This would allow us to refer to this action later, which can be very useful for different reasons.

Your APIs seem to be callback-based. I'm not 100% convinced that it is a right approach. For example, we may run multiple DAGs at the same time and/or we may need to wait for completion (or any other events) from multiple DAGs/actions. This is rather difficult to achieve with callbacks-based APIs, but it probably much simpler using something like tokens described above and an API like epoll where you wait for events on a bunch of tokens...

@nickgg and I talked about using std::future instead of passing around callbacks, but decided against it because the standard library futures are not as powerful as something like folly::Future. I would be open to using futures instead though.

Maybe it is a good idea to make a design document instead of jumping right into coding.

std::map<Function *, std::list<Function *>> outgoing;
// Output placeholder names for each function.
std::map<Function *, std::list<std::string>> outputs;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you go with multiple maps of Function* -> Thing rather than one map of Function* -> Struct of Things? These are likely very small maps, but still, more lookups.

I know you said you hadn't updated this since the discussion this morning but just reminding Function*isn't an identifier available at this level: need something like (DeviceNetworkID + FunctionName + DeviceID?).

/// This enum lists the available executors.
enum class ExecutorKind {
ThreadPool, // Executor backed by a thread pool.
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What other ExecutorKinds do you envisage?

/// A mutex to guard accesses to class members.
/// TODO: This can probably be eliminated by making certain members
/// std::atomic.
std::mutex mtx_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially, in the future, could avoid this via using a Stranded thread pool which keeps accesses on the same thread.

ThreadPoolExecutor::ThreadPoolExecutor(unsigned numWorkers) {
// Intialize all workers and make each one run workerMain.
for (unsigned i = 0; i < numWorkers; i++) {
std::thread th(std::bind(&ThreadPoolExecutor::workerMain, this));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::bind is hard to debug around, how about a lambda here?

/// manage execution.
struct ExecutorFunctionDAG {
// The list of functions to run for this DAG, topologically sorted.
std::list<Function *> functions;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to store also which device each function is loaded into

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why list instead of vector or SmallVector?


// If isMoreWork() returned true but getNext() returned nothing, this work
// item has more work that needs doing but not until some dependencies are
// fulfilled. Requeue it so that another worker will look at it again.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like spinning over getNext(). Could you instead trigger processWorkItem() in markSuccess?

@nickgg
Copy link
Contributor

nickgg commented Dec 7, 2018

@opti-mix @SplitInfinity - Both Function and CompiledFunction are needed in the DeviceManager (for compile and execute respectively), and if at all possible I think we should not share them with the Executor. Any object sharing in this context will require solving thread safety in the Graph.

The interface in #2134 uses identifiers for Module and Function (DeviceNetworkIDand function name) which are used to reference the particular sub-graph-function to execute.

Personal opinion is that callbacks are fine here, at least for the prototype. We should make sure were getting a benefit before jumping in on something more complicated (and maybe not cross platform?).

@jackm321
Copy link
Contributor

jackm321 commented Dec 7, 2018

@SplitInfinity @nickgg There seems to be a some commonality between this and #2134 in structure. Both have implemented a basic work queue structure but with some specific details like the dag structure in this case. Are these implementation specific details so great that we aren't able to use a generic executor in both cases and avoid all the extra code for synchronization and stuff? Can we just reenqueue the next piece of work here and can #2134's CPUDeviceManager be implemented with just a generic executor with 1 thread?

@SplitInfinity
Copy link
Contributor Author

Good point @jackm321

I think we should move the ONNXIFI thread pool class into somewhere common and use that everywhere.

@opti-mix
Copy link
Contributor

opti-mix commented Dec 7, 2018

@nickgg

Personal opinion is that callbacks are fine here, at least for the prototype. We should make sure were getting a benefit before jumping in on something more complicated (and maybe not cross platform?).

Callbacks should be fine for a prototype as long as most/all important use-cases can be expressed in their terms. I provided some exampled where the use of call-backs could be rather difficult. But it may very well be that we don't need to handle such use-cases. That's why I asked about a doc describing the high-level design, responsibilities of the entity (be it Executor or DeviceManager) and envisioned important use-cases.

Both Function and CompiledFunction are needed in the DeviceManager (for compile and execute respectively)

What are the exact responsibilities of DeviceManager? Is it described anywhere, e.g. at least in the high-level class comments? Why does it need to control compilation?

@nickgg
Copy link
Contributor

nickgg commented Dec 7, 2018

@opti-mix - there's some design notes n the top level task #2045 and @gcatron can point you at the design doc. Can you give me an example of something specific that cannot be done, or is difficult to achieve with the callbacks pattern?

The DeviceManager piece is the interface to the physical (or in the case of CPU backends, logical) device. Specifically it holds the backend, which does the compilation, and so in the design it's the part that controls the compilation. It makes sense to have compile & run in the same place since they're device specific, I think.

@opti-mix
Copy link
Contributor

opti-mix commented Dec 7, 2018

@nickgg

@opti-mix - there's some design notes n the top level task #2045 and @gcatron can point you at the design doc.

Thanks for the links! Appreciated.

Can you give me an example of something specific that cannot be done, or is difficult to achieve with the callbacks pattern?

For example, waiting for a set of multiple events to happen in a specific or any order (e.g. one of the actions/all actions are completed). Events may stem from the execution of different NN models running on the device. Or, e.g. you transfer data to the device using DMA (asynchronously) and you need to check that all/some of transfers have completed before you can proceed. Generally, IMHO, events (because they are data) can be composed better than callbacks (which is code).

BTW, you may want to have a look at e.g. OpenCL for an example of APIs which are very much event-based, rather than callback-based.

The DeviceManager piece is the interface to the physical (or in the case of CPU backends, logical) device. Specifically it holds the backend, which does the compilation, and so does the compilation.

Oh, I see. The idea that DeviceManager holds the backend is a possible solution, but it could be argued that the backend could also live outside. And whatever it produces (e.g. CompiledFunctions) could be then run on different DeviceManagers "compatible" with this backend (e.g. you have multiple devices of the same kind available). It would be interesting to see the pros/cons of different approaches considered. But this is a longer discussion.

Copy link
Contributor

@yinghai yinghai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I scanned the code. It seems that the ThreadPoolExecutor takes a DAG as an work item and keep scanning those items for one pass and re-enqueuing them. This is a lot of overhead. And the granularity of the task is a bit coarse.

IMO, the work item could be a node in the DAG. When you get a DAG to schedule, you can enqueue all its root nodes. And use callbacks to mark its successor nodes and if the dependency counter of the successor node goes to 0, enqueue it to the task queue. In this case, all the nodes in the DAG will be scheduled in topological order onto the thread pool. And you don't need to do scanning and save some bookkeeping like completedFunctions_ and inflightFunctions_.

/// manage execution.
struct ExecutorFunctionDAG {
// The list of functions to run for this DAG, topologically sorted.
std::list<Function *> functions;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why list instead of vector or SmallVector?

Context *postCtx = (ctx_->contexts).at(postrequisite);
Placeholder *p;
if ((p = postModule->getPlaceholderByName(output))) {
postCtx->insert(p, ctx->get(p)->clone());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this involve copy of tensor data?

@yinghai
Copy link
Contributor

yinghai commented Dec 7, 2018

Your APIs seem to be callback-based. I'm not 100% convinced that it is a right approach. For example, we may run multiple DAGs at the same time and/or we may need to wait for completion (or any other events) from multiple DAGs/actions. This is rather difficult to achieve with callbacks-based APIs, but it probably much simpler using something like tokens described above and an API like epoll where you wait for events on a bunch of tokens...

I agree with @opti-mix. It's really just IO here instead of actual CPU time because you are just sending the workload to accel and waiting. Event based async interface is more suitable here.

@nickgg
Copy link
Contributor

nickgg commented Dec 7, 2018

@opti-mix @yinghai I think you may be conflating the DeviceManager with accessing the real Device. For any given device we may want an event based interface, but the specifics of those interfaces will differ in syntax and semantics in ways we don't currently know. Maybe for a particular device they do use epoll/select, maybe the firmware does blocking calls, maybe they do something really strange, they could even use callbacks (?!). The safest and most flexible thing we can do here is to have as minimal and generic an interface to the device as is practical, and isolate the platform specific code to one place - in this design the accelerator specific DeviceManager implementations.

@SplitInfinity SplitInfinity deleted the runtime-executor branch January 15, 2019 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants