
Model Execution API #87


Closed
pyu10055 opened this issue Aug 28, 2020 · 69 comments

Comments

@pyu10055
Collaborator

There are a couple of questions about the existing execution API:

  1. The current model execution API requires users to provide output buffers before execution. This is not very convenient: it is an extra step, and the user might not know the output shape beforehand. Also, for many models the output shape depends on the input shape, so it is an extra burden for users to figure it out.

  2. The current execution is built on a compilation of the full graph. While the execution API does not prevent users from executing a sub-graph of the model, it is not clear why the pre-compilation is needed and whether it should be internal to the execution, so that it can take care of sub-graph execution.

@huningxin
Contributor

huningxin commented Aug 31, 2020

  1. This is not very convenient: it is an extra step, and the user might not know the output shape beforehand.

Good catch! One possible solution is to allow user code to query the output shape, as Android NNAPI does (@miaowang14).

2. The current execution is built on a compilation of the full graph. While the execution API does not prevent users from executing a sub-graph of the model,

I don't quite get this issue. Could you please elaborate on the case where the API compiles the full graph but executes a sub-graph?

@gramalingam
Contributor

gramalingam commented Aug 31, 2020

But the output shape may not always be statically known. It would be good if the API allows for the possibility of the output buffer being allocated during execution when its size is known. Can we treat the call to "setOutput" as optional (not required)?

@huningxin
Contributor

But the output shape may not always be statically known. It would be good if the API allows for the possibility of the output buffer being allocated during execution when its size is known.

+1. The API may also need to allow specifying the input shape, which is not currently supported.

Can we treat the call to "setOutput" as optional (not required)?

Then how would we get the execution results? Could you please elaborate on the idea?

@gramalingam
Contributor

The execution API essentially needs to allow a way to specify the inputs, optionally specify output buffers, and get back the outputs. I think various possibilities exist for this. In the long run, it may actually be useful to support multiple signatures for different ways of doing this (mostly as a syntactic convenience).

One option would be to have a return result (for example, a dictionary mapping output-name to its value).
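
As a rough illustration (a hypothetical sketch, not settled API), the calling pattern could look like this:

// Hypothetical sketch: startCompute resolves to a dictionary mapping each output
// name to its value, so no setOutput call is needed beforehand.
execution.setInput('input1', inputBuffer1);
execution.setInput('input2', inputBuffer2);
const results = await execution.startCompute();
console.log(results['output']);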

@wchao1115
Collaborator

One option would be to have a return result (for example, a dictionary mapping output-name to its value).

+1.

I'm also looking at a few related Operand type issues. Maybe we can make one sweeping change.

  1. The Operand type should be limited to tensor types i.e. only tensor-float32 type is allowed and not float32. Non-tensor arrays should use sequences, lists, or TypedArray for clarity since a tensor can also be a GPU resource.
  2. Quantized tensor as a type should be avoided as it forces all the defined operators to support quantized type, which isn't realistic (OperandType of gemm / matmul return  #84). It was previously agreed that quantized types should be factored out (Remove quantization-specific params from OperandDescriptor #44).
  3. Operand should support sync and async download of tensor data to a TypedArray. This should address the first issue above.
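
A rough sketch of what point 3 could look like, assuming the graph has already been executed (outputOperand and the readSync/read method names are hypothetical):

// Hypothetical sketch: download an output operand's tensor data into a TypedArray.
const resultSync = outputOperand.readSync();     // synchronous, blocks until the data is on the CPU
const resultAsync = await outputOperand.read();  // asynchronous, resolves when the download completes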

@huningxin
Contributor

huningxin commented Sep 8, 2020

The execution API essentially needs to allow a way to specify the inputs, optionally specify output buffers, and get back the outputs.

+1.

According to the current Execution API, the missing functionalities are specifying the input shape and getting the output shape and data. So semantically, we may extend the setInput method with an optional shape parameter and add getOutputShape and readOutput methods, such as:

 interface Execution {
-  void setInput(DOMString name, ArrayBufferView data);
+  void setInput(DOMString name, ArrayBufferView data, optional sequence<long> shape);
   void setOutput(DOMString name, ArrayBufferView data);
   Promise<void> startCompute();
+  sequence<long> getOutputShape(DOMString name);
+  Promise<void> readOutput(DOMString name, ArrayBufferView buffer);
 };

With that, the sample code for dynamic input/output shape would be

// Create an Execution object for the compiled model.
const execution = compilation.createExecution();

const inputShape = [2, 2, 2, 2];
// Setup the input buffers with value 1.
const inputBuffer1 = new Float32Array(sizeOfShape(inputShape)).fill(1);
const inputBuffer2 = new Float32Array(sizeOfShape(inputShape)).fill(1);

// Specify the input data and its shape.
execution.setInput('input1', inputBuffer1, inputShape);
execution.setInput('input2', inputBuffer2, inputShape);

// Start the asynchronous computation.
await execution.startCompute();

// Get the output shape and allocate the buffer.
const outputShape = execution.getOutputShape('output');
const outputBuffer = new Float32Array(sizeOfShape(outputShape));

// Read the output values into the allocated buffer.
await execution.readOutput('output', outputBuffer);

// The computed result is now in outputBuffer.
console.log(outputBuffer);

One option would be to have a return result (for example, a dictionary mapping output-name to its value).

This could be a syntactic improvement. One difference is that the above proposal allows getting the value of a single output instead of returning them all.

@huningxin
Contributor

I'm also looking at a few related Operand type issues.

@wchao1115 , I agree there is room to improve Operand. However, Operand is only used for graph building in the current API. Do you propose to use it also for graph execution?

@wchao1115
Collaborator

@huningxin I understand that Operand as it is defined now acts more like a graph edge and not necessarily a data type like tensor. But in practice the two concepts are inseparable; it would be much simpler for a graph API to think of them as one. Both TensorFlow and PyTorch handle it this way.

Concretely, a few things stand out for me:

  1. The operator API already accepts both attribute and tensor arguments, with the former specified as native data types, e.g. float, bool, etc. Therefore there is no point in also defining an operand of a float32 data type as a separate notion from an operand of a tensor-float32 data type. An operand with float32 type should just mean a graph edge that would carry a tensor of type float32.
  2. Although requiring a CPU buffer view to construct an input operand may make sense, an output operand could be backed by a GPU tensor that is constructed on the fly by the graph execution process. So, requiring the app to always download the output tensor back to the CPU buffer view doesn't make sense, since it may want to pass the GPU resource along to other API that may accept it e.g. with ML models that emit output images, etc. A graph execution may also produce multiple outputs, and the app may choose to ignore some of them.
  3. Quantized tensor further clouds the Operand type with more procedural data that doesn't really contribute to the identity of the operand type (as a graph edge). As previously pointed out, it also unnecessarily raises the requirement that all the operator API must support quantized types; a requirement that actually cannot be fulfilled in all cases.

A leaner design in my mind would be one with an operand exposing a method for the app to optionally download its content to an ArrayBuffer only if they want to. This call would fail if it's made before the graph is executed since the output tensor won't be known then. With that API addition, it turns out that execution.setOutput becomes entirely unnecessary since a model would be previously created with the named dictionary of the output operands (see nn.createModel) and that an ArrayBuffer of an output operand is created by the execution and not the construction of the graph.
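
A minimal sketch of that flow, assuming a hypothetical readInto method on the output operand and a statically known output shape:

// Hypothetical sketch: outputs are already named at model creation, so no setOutput.
const model = await nn.createModel([{name: 'output', operand: output}]);
const compilation = await model.createCompilation({powerPreference: 'low-power'});
const execution = compilation.createExecution();
execution.setInput('input', inputBuffer);
await execution.startCompute();

// Downloading to the CPU is optional and only valid after the graph has executed.
const outputBuffer = new Float32Array(sizeOfShape(outputShape));
await output.readInto(outputBuffer);  // would fail if called before startCompute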

@kpu

kpu commented Sep 8, 2020

2. Quantized tensor as a type should be avoided as it forces all the defined operators to support quantized type, which isn't realistic (#84). It was previously agreed that quantized types should be factored out (#44).

In #17 it was "RESOLVED: The specification will reference a subset of the ONNX operations". The ONNX operations include type constraints i.e. https://github.com/onnx/onnx/blob/master/docs/Operators.md#MatMulInteger . Can we just follow them?

@huningxin
Contributor

huningxin commented Sep 9, 2020

So, requiring the app to always download the output tensor back to the CPU buffer view doesn't make sense, since it may want to pass the GPU resource along to other API that may accept it e.g. with ML models that emit output images, etc.

That's a good point. Besides passing the GPU-backed output tensor to other APIs, I think the Execution API should be able to accept an output tensor as an input. That would enable pipelining multiple graph executions without moving data across devices.

A leaner design in my mind would be one with an operand exposing a method for the app to optionally download its content to an ArrayBuffer only if they want to. This call would fail if it's made before the graph is executed since the output tensor won't be known then.

My understanding is that the input and output tensors of the Execution API should be device dependent, e.g. GPU resources. However, Operand is device independent: for example, nn.input is just a symbolic operand that never has values, and nn.constant doesn't know on which device to store its values until compilation.

So I am thinking about adding a Tensor interface that is backed by device memory. The Execution would accept input Tensors and output Tensors. The Tensor interface would have methods for async shape querying and data downloading. As a Compilation has a target device, we may allow creating Tensors from a Compilation. A code sketch would be:

// Create an Execution object for the compiled model.
const execution = compilation.createExecution();

const inputShape = [2, 2, 2, 2];
// Setup the input buffers with value 1.
const inputBuffer1 = new Float32Array(sizeOfShape(inputShape)).fill(1);
const inputBuffer2 = new Float32Array(sizeOfShape(inputShape)).fill(1);

// Create the input and output tensors for compilation device.
const inputTensor1 = compilation.device.tensor(float32TensorType, inputBuffer1);
const inputTensor2 = compilation.device.tensor(float32TensorType, inputBuffer2);
const outputTensor = compilation.device.tensor();

// Set input and output tensors to an execution.
execution.setInput('input1', inputTensor1);
execution.setInput('input2', inputTensor2);
execution.setOutput('output', outputTensor);

// Start the computation; it is no longer async, as reading the tensor back is async.
execution.startCompute();

// Query the output shape, await for shape inference done.
const outputShape = await outputTensor.getDimensions();
const outputBuffer = new Float32Array(sizeOfShape(outputShape));

// Read the data back, await for inference done.
await outputTensor.readInto(outputBuffer);
console.log(outputBuffer);

The execution.setOutput may still be useful, as it would allow reusing a Tensor from one execution to another.

@gramalingam
Contributor

Hi @wchao1115 : one clarification about the attributes and operands: I think that the primary distinction between attributes and operands is that attributes represent values that are known statically, at the time the graph/model is built; on the other hand, operands represent values that typically will not be known statically (though, in some special cases they might be).

So, one key question we should address is how we want to handle scalar values that are known only at runtime (and not statically). Many frameworks use tensors of rank 0 to represent such scalars. We could do the same. In that case, restricting operand types to tensor-float32 and not float32 would work, though it may incur a small performance penalty. But since these scalars are not going to be the performance bottleneck, it should be okay.
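
For instance (a sketch; using an empty dimensions list to denote a rank-0 tensor is an assumption, not something the spec has defined):

// A runtime scalar modeled as a rank-0 'tensor-float32' operand (assumed notation).
const x = nn.input('x', {type: 'tensor-float32', dimensions: [2, 2]});
const alpha = nn.input('alpha', {type: 'tensor-float32', dimensions: []});
const y = nn.mul(x, alpha);  // alpha's value is supplied at compute time, not at graph-build time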

@wchao1115
Collaborator

@gramalingam That's a good point. We haven't fully defined the notion of tensor scalar. From the graph execution standpoint, tensors are device-specific resources, either as inputs or outputs of the process.

@wchao1115
Collaborator

@huningxin I think this is too complicated since the developers now need to deal with 3 different notions -- operand, tensor, and buffer. Besides, the tensors are now created by compilation, which raises more questions than it answers.

Even though tensors are device-dependent, it doesn't mean that we have to surface them as a type to the users. I think the current notion of nn.input is perfectly fine. If in the future we need the ability to construct an operand out of other device resources, e.g. a video frame, there is nothing stopping us from defining an nn.inputFromVideoFrame method. We don't need to expose the notion of a device-specific resource as a type to the users; we just need to give them a way to work with it. TF.js also does it this way. Their so-called Tensor type, in spite of its name, is still a conceptual type and not a device-dependent resource type.

Same for the output operand: as long as there is a way to retrieve the computed content in it and put it in a CPU buffer or a future resource type, there is no need to expose the notion of a device-specific resource type to the users.

Also, if you look at how nn.createModel is defined, we already require through this API that the users specify the dictionary of all output operands, even before compilation. So all that we need is to expose a way to get the content out of them, and we won't even need setOutput anymore. We also don't need to pass the output operands to startCompute, as they're already defined in the model.

@huningxin
Contributor

huningxin commented Sep 10, 2020

Also, if you look at how nn.createModel is defined, we already require through this API that the users specify the dictionary of all output operands, even before compilation. So all that we need is to expose a way to get the content out of them, and we won't even need setOutput anymore. We also don't need to pass the output operands to startCompute, as they're already defined in the model.

Users may compile the same Model into multiple Compilations with different CompilationOptions targeting different underlying devices. In this case, there is only one output Operand associated with the Model, and this Operand might not be able to represent the different device resources. For example:

const model = await nn.createModel([{name: 'output', operand: output}]);

// e.g. target iGPU
const compilation1 = await model.createCompilation({powerPreference: 'low-power'});
// e.g. target dGPU
const compilation2 = await model.createCompilation({powerPreference: 'high-performance'});

// Create Execution objects for the compiled models.
const execution1 = await compilation1.createExecution();
const execution2 = await compilation2.createExecution();

execution1.startCompute();
execution2.startCompute();

// Which device to read back, iGPU or dGPU?
await output.readInto(buffer);

@gramalingam
Contributor

Hi @huningxin : to read the output value into a buffer, we would need to invoke a "getValue(operand)" method on either execution or on a result returned by execution.startCompute(), rather than an operand.
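
Roughly (both variants below are hypothetical sketches):

// Variant 1: invoke getValue on the execution after it completes.
await execution.startCompute();
const value1 = execution.getValue(outputOperand);

// Variant 2: invoke getValue on a result object returned by startCompute.
const result = await execution.startCompute();
const value2 = result.getValue(outputOperand);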

@wchao1115 : I assume that the existing buffer abstractions suffice to represent memory on different devices? Assuming that, I think the primary distinction between a "Tensor" and a "buffer" would be that a "Tensor" attaches a "shape" to a buffer. I think that is a useful addition ... in most frameworks there is such a distinction. I assume the existing Buffer abstraction does not have a shape associated with it?

@huningxin
Contributor

to read the output value into a buffer, we would need to invoke a "getValue(operand)" method on either execution

This might not allow passing the GPU-backed output to other APIs or to a following execution without moving data between GPU and CPU.

or on a result returned by execution.startCompute(),

This would allow the above use case. However, as an execution is for one-shot inference, this might not allow reusing the device memory across multiple executions/inferences.

rather than an operand.

I agree we should not read the value from an operand.

I assume that the existing buffer abstractions suffice to represent memory on different devices?

If you mean ArrayBufferView, I suppose it only represents CPU memory. WebGPU is defining GPUBuffer for GPU memory.

I assume the existing Buffer abstraction does not have a shape associated with it?

Neither ArrayBufferView nor GPUBuffer has a shape.
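
For example, a GPUBuffer is created from a byte size and usage flags only (a WebGPU sketch, assuming a GPUDevice named device); any shape information has to be tracked separately:

// WebGPU sketch: the buffer knows its byte length, not its tensor shape.
const gpuBuffer = device.createBuffer({
  size: 2 * 2 * 2 * 2 * Float32Array.BYTES_PER_ELEMENT,  // byte size only
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC
});
// A shape like [2, 2, 2, 2] would have to be carried alongside the buffer.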

@wchao1115
Collaborator

Can this work?
(I don't recall why we would need an execution in addition to a compilation; also simplify the method names for readability.)

// model produces 2 outputs
const model = await nn.createModel([{name: 'output1', operand: output1}, {name: 'output2', operand: output2}]);
const compilation = await model.compile();
const buffers = await compilation.compute(['output1']);
// 'buffers' is sequence<ArrayBufferView> 
// 'buffers[0]' holds a filled CPU buffer of just the content of output1

@huningxin
Contributor

I don't recall why we would need an execution in addition to a compilation

The previous discussion is at #39 (comment). Probably we need to revisit that.

also simplify the method names for readability

+1

Can this work?

Two open questions:

  1. The inputs are not set for the compute call. Is that missing?
  2. If the output is on the GPU, returning sequence<ArrayBufferView> would download the data to CPU memory. It might not allow the use case @wchao1115 mentioned in #87 (comment):

it may want to pass the GPU resource along to other API that may accept it e.g. with ML models that emit output images, etc.

@gramalingam
Contributor

gramalingam commented Sep 10, 2020

I prefer a single "compilation.compute(...)" method (as suggested by Chai).

Re. Ningxin's point 1: yes: we need to supply the input-values as parameters to the "compute" method.

Re. Ningxin's point 2: I think if we want to support this usecase, we have two options.
(a) The caller specifies (using parameters to the "compute" method) whether they want the output copied to CPU or not.
(b) We don't do any copying in "compute" and the caller copies the results to the CPU separately (if they want to).

I prefer option (b). So, basically, generalize Chai's suggestion to:

const outputs = await compilation.compute(input_values, desired_output_names);

where outputs is a sequence of output_type, and output_type is a union that includes tensors in GPU as well as CPU etc.

I think it is useful to have a type representing a tensor, regardless of where it is located.

@wchao1115
Collaborator

The inputs are not set for the compute call. Is that missing?

I omitted the nn.input part in my sample code above for brevity. The input side is the same as before. If we want to accept an external resource for an input operand, we could add new methods e.g. nn.inputFromXXX in the future.

we need to supply the input-values as parameters to the "compute" method.

Not sure why. I don't think we need to pass the input operands to compute, as they're already known and labeled in the graph.

it may want to pass the GPU resource along to other API that may accept it e.g. with ML models that emit output images, etc.

I looked at both canvas-2d and canvas-WebGL to understand how to move the output tensor data to either endpoint, and it looks like both paths do start with an ArrayBufferView, which unfortunately requires a copy. But in most known scenarios the bottlenecks are on how to get the data into the model, so requiring a copy-out may not be too bad. Future resource sharing with WebGPU may be a possibility, but that would probably need to happen at the context level to ensure cross-adapter resource sharing. I think assuming selective CPU downloads on outputs is probably a safe bet.
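
For instance, a sketch of the canvas-2d copy-out path, assuming the output has already been downloaded into a Float32Array of RGBA values normalized to [0, 1] and that canvas, width and height are defined by the app:

// Copy a downloaded output tensor (height x width x 4, RGBA) into a 2D canvas.
// This is the extra CPU-side copy discussed above.
const ctx = canvas.getContext('2d');
const imageData = ctx.createImageData(width, height);
for (let i = 0; i < outputBuffer.length; ++i) {
  imageData.data[i] = Math.round(outputBuffer[i] * 255);
}
ctx.putImageData(imageData, 0, 0);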

@gramalingam
Contributor

Not sure why. I don't think we need to pass the input operands to compute, as they're already known and labeled in the graph.

We know what the inputs (that is, the input variables) of the graph are when we build the graph, but not the values of these inputs, which are determined only when we call "compute" and will vary from one call of "compute" to another.

@wchao1115
Collaborator

@gramalingam the input values are known, right? That's why it's the input.

(copied from the spec sample)

// Setup the input buffers with value 1.
const inputBuffer1 = new Float32Array(TENSOR_SIZE).fill(1);
const inputBuffer2 = new Float32Array(TENSOR_SIZE).fill(1);

// Associate the input buffers to model’s inputs.
execution.setInput('input1', inputBuffer1);
execution.setInput('input2', inputBuffer2);

@gramalingam
Contributor

Yes, we can specify the input values using execution.setInput(...) . But if we want to eliminate execution and directly call compute from a compilation, we need some mechanism to achieve the same goal as execution.setInput.

@wchao1115
Collaborator

wchao1115 commented Sep 10, 2020

Yes, we can specify the input values using execution.setInput(...) . But if we want to eliminate execution and directly call compute from a compilation, we need some mechanism to achieve the same goal as execution.setInput.

That would be Compilation.setInput. I just thought Compilation and Execution are redundant. It looks like we can make do with just one stateful object instead of two.

interface Compilation {
  void setInput(DOMString name, ArrayBufferView data);
  Promise<sequence<ArrayBufferView>> compute(sequence<DOMString> outputNames);
};
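
Usage under that shape would be roughly (a sketch, reusing the buffers from the earlier samples):

// Sketch: a single stateful object handles both input binding and compute.
compilation.setInput('input1', inputBuffer1);
compilation.setInput('input2', inputBuffer2);
const buffers = await compilation.compute(['output1']);
// buffers[0] holds the content of 'output1' as an ArrayBufferView.
console.log(buffers[0]);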

@huningxin
Contributor

huningxin commented Sep 11, 2020

I prefer a single "compilation.compute(...)" method (as suggested by Chai).

+1.

Re. Ningxin's point 1: yes: we need to supply the input-values as parameters to the "compute" method.

+1. This would make the Compilation stateless.

(b) We don't do any copying in "compute" and the caller copies the results to the CPU separately (if they want to).

+1. This would enable passing the GPU resource along to other APIs even after the compute method.

I think it is useful to have a type representing a tensor, regardless of where it is located.

+1. I propose adding a Tensor interface and having Compilation.compute operate on Tensors instead of ArrayBufferViews.

There are some usages with code sketches (updated).

  • Create input and output tensors for a compilation:
// Create input tensors with full specified shape.
const tensor1 = compilation.tensor(inputShape1);
const tensor2 = compilation.tensor(inputShape2);
// Create output tensor without specifying shape.
const tensor3 = compilation.tensor();
  • Write data into the input tensors.
const inputBuffer1 = new Float32Array(sizeOfShape(inputShape1)).fill(value);
tensor1.writeFrom(inputBuffer1);
// Similar code for tensor2.
  • Compute is not async, as reading back of tensor is async.
compilation.compute({'input1': tensor1, 'input2': tensor2}, {'output': tensor3});
  • Async read back the output into a buffer
const outputBuffer = new Float32Array(sizeOfShape(outputShape));
await tensor3.readInto(outputBuffer);
  • If the output shape is dynamic, users could query the shape before allocating buffer.
// Might wait for shape inference to be done.
const outputShape = await tensor3.getDimensions();
const outputBuffer = new Float32Array(sizeOfShape(outputShape));
  • A following compute could reuse the tensors without creating new ones.
// The second inference.
tensor1.writeFrom(inputBuffer1);
tensor2.writeFrom(inputBuffer2);
compilation.compute({'input1': tensor1, 'input2': tensor2}, {'output': tensor3});
await tensor3.readInto(outputBuffer);
  • An output tensor could be passed to another compilation's (different graph) compute without downloading its data.
tensor1.writeFrom(inputBuffer1);
tensor2.writeFrom(inputBuffer2);
compilation.compute({'input1': tensor1, 'input2': tensor2}, {'output': tensor3});
compilation2.compute({'input': tensor3}, {'output': tensor4});
// Only download the final result.
await tensor4.readInto(outputBuffer);

@wchao1115
Collaborator

wchao1115 commented Sep 11, 2020

@huningxin It seems a key difference in this discussion so far is whether we should define a notion of Tensor in the API as a dedicated resource that can be implemented and passed in and out of the API, potentially enabling resource sharing with other API.

This idea is conceptually sound with slightly greater complexity, but my concern is on the implementation side, especially for the GPU implementation of the API. The first problem is that you actually don't want to map a Tensor to a GPU resource like a UAV buffer, one to one. In practice it's far more common to pool and reuse a single UAV buffer for multiple sets of tensor data with proper alignments. Frequent creation of GPU resources is prohibitively expensive.

Secondly, activities such as upload/download of tensor data from/to CPU buffers are commonly pipelined in the same command list along with shader dispatches that carry out the computation of graph operations, so that the entire command list can be pipelined and executed in one go. It would be highly inefficient to break up the command list just to upload the tensor data for instance.

And lastly, there are some types of tensor data, e.g. constant weight data, that sometimes require special treatment depending on the underlying hardware; a process that involves an additional copy pass before issuing shader dispatches. It's hard to properly locate the timing of that copy pass when the tensor data is associated with Tensor API objects that can operate independently from the rest of the graph execution process.

Based on this set of difficulties, I'm still advocating against defining a Tensor type at the API level and instead continuing with the current approach of explicit upload/download methods off the Compilation interface.

@huningxin
Contributor

huningxin commented Sep 11, 2020

@wchao1115 , thanks for sharing your thoughts on the GPU implementation; that is extremely helpful.

The first problem is that you actually don't want to map a Tensor to a GPU resource like a UAV buffer, one to one.

+1. I think we might not need to maintain a one-to-one mapping between a Tensor and a GPU resource. A tensor could just "record" the logical resource usage, such as being the input or output of a compute. The real resource allocation is up to the implementation.

Secondly, activities such as upload/download of tensor data from/to CPU buffers are commonly pipelined in the same command list along with shader dispatches that carry out the computation of graph operations

That's a good point.

To test my understanding, I would slightly change the sketch code of the inference loop as:

tensor1.writeFrom(inputBuffer1);
tensor2.writeFrom(inputBuffer2);
compilation.compute({'input1': tensor1, 'input2': tensor2}, {'output': tensor3});
await tensor3.readInto(outputBuffer);

When running the above code, a GPU implementation might be able to delay execution until the last step (await tensor3.readInto(outputBuffer)). When the user wants to read the data back, the GPU implementation could pipeline the CPU memory uploads for tensor1 and tensor2, the shader dispatches of the compiled operations, and the CPU memory download from tensor3 into one command list, and execute it in one go.

And lastly, there are some types of tensor data, e.g. constant weight data, that sometimes require special treatment depending on the underlying hardware;

The constant data would not be represented by Tensor. We use nn.constant to associate the constant data with a model. For a GPU implementation, the special treatment, including an additional copy pass, could be handled in model.compile.

Does this work?

@wchao1115
Collaborator

wchao1115 commented Sep 11, 2020

I think we might not need to maintain a one-to-one mapping between a Tensor and a GPU resource. A tensor could just "record" the logical resource usage, such as being the input or output of a compute.

In practice the backing implementation of the API Tensor type would most likely be just a sort of handle with a refcounted pointer to the context, which complicates the implementation since there will need to be additional bookkeeping going on internally. It'll also tax resource management and garbage collection with small handles moving around. The API object gives a false impression that it can do something useful.

When running the above code, a GPU implementation might be able to delay execution until the last step (await tensor3.readInto(outputBuffer)).

That is very unintuitive and misleading to developers. It is very natural to expect that operations like shader dispatches happen at compute. The download copy to a CPU buffer should be entirely optional. I'm hesitant about an API that doesn't lend itself well to intuition. It is very important to convey the intent of the API so people know what to expect.

The constant data would not be represented by Tensor. We use nn.constant to associate the constant data with a model.

Understood. I'm not suggesting that we change the notion of nn.constant. I'm pointing out that during graph execution the implementation will need to tag the tensors backing constant operands for that extra initialization stage, as many GPU drivers pre-process weight data ahead of execution. By splitting out a notion of an API Tensor, it's one additional thing to keep track of during graph execution.

@kpu

kpu commented Sep 11, 2020

I'm pointing out that during graph execution the implementation will need to tag the tensors backing constant operands for that extra initialization stage, as many GPU drivers pre-process weight data ahead of execution. By splitting out a notion of an API Tensor, it's one additional thing to keep track of during graph execution.

This "pre-process weight data" is called packing #86 and also happens with CPU drivers. To generalize slightly, the tensor need not be a constant but simply reused enough to justify packing it. For example, a transformer encoder's outputs will be consulted repeatedly by a transformer decoder.

You appear to be assuming that graph compilation is the right place to pack parameter matrices. Is your graph compilation general enough to support graph structures that differ for every input, like syntactic parsing and the dynet toolkit (https://github.com/clab/dynet/)? If not, then graph compilation is not the place to stuff packing.

@huningxin
Contributor

@wchao1115

You're talking about something like this?

Semantics-wise, yes. There are some syntax tweaks. I propose using record<K, V>, which allows the literal syntax for mapping names to inputs/outputs.

dictionary Input {
  required ArrayBufferView buffer;
  sequence<long> dimensions; // optional
};

dictionary Output {
  ArrayBufferView buffer; // optional
  sequence<long> dimensions; // optional
};

typedef record<DOMString, Input> Inputs;
typedef record<DOMString, Output> Outputs;

interface Compilation {
  Promise<Outputs> compute(Inputs inputs, optional Outputs outputs);
};

With that, user could pass inputs like following code:

let results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}});

and can access output by named property:

console.log(results.c.dimensions);
console.log(results.c.buffer);

WDYT?

@pyu10055
Collaborator Author

pyu10055 commented Sep 17, 2020

@huningxin @wchao1115 Are we combining the compilation and execution into a single method compilation.compute?
For users to supply the Input/Output parameters, do they need to know the shape prior to calling the compute method?

@pyu10055
Collaborator Author

pyu10055 commented Sep 17, 2020

@huningxin @wchao1115 Slowly reading through the thread, I think the proposal solves my concerns with the original API.

One extra question is about the concept of compilation for the end user. It looks like the compilation becomes implicit, since it is combined with execution, so it is not clear to me why we need to introduce this concept. Especially for end users who do not care about controlling the compilation, it would be much clearer to have an execute method directly on the Model interface:

  model.execution(Inputs inputs, optional Outputs outputs);

@huningxin
Contributor

huningxin commented Sep 17, 2020

@pyu10055

Are we combining the compilation and execution into a single method compilation.compute?

No. They are separated.

The compilation would be done by nn.compile as:

const a = nn.input('a', descriptorA);
const b = nn.constant(descriptorB, bufferB);
const c = nn.matmul(a, b);
const compilation = await nn.compile({'c': c}, {powerPreference: 'low-power'});

The execution is done by compilation.compute as:

let results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}});
console.log(results.c.dimensions);
console.log(results.c.buffer);

For users to supply the Input/Output parameters, do they need to know the shape prior to calling the compute method?

It would not require users to know the output shape prior to calling the compute method. According to compute's signature, Promise<Outputs> compute(Inputs inputs, optional Outputs outputs), when the outputs parameter is not provided, the compute method returns the output buffer and shape, as the above sample code shows.

For static input shape, for example

const a = nn.input({type: 'tensor-float32', dimensions: [2, 2]});

The users can just provide the buffer when calling the compute method, such as

let results = await compilation.compute({'a': {buffer: bufferA}});

For dynamic input shape, for example

const a = nn.input({type: 'tensor-float32', dimensions: [-1, 2]});

The users need to provide both buffer and shape, such as:

let results = await compilation.compute({'a': {buffer: bufferA, dimensions: [2, 2]}});

@pyu10055
Collaborator Author

pyu10055 commented Sep 17, 2020

@huningxin Actually this goes back a little bit; take the following graph as an example:

const a = nn.input('a', descriptorA);
const b = nn.constant(descriptorB, bufferB);
const c = nn.matmul(a, b);
const d = nn.matmul(a, c);
const compilation = await nn.compile({'d': d}, {powerPreference: 'low-power'});

The compilation is done for the full graph. Now I want to execute the subgraph which has c as the output. Can the previous compilation be used for this new execution:

let results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}}, {'c': {...}});
console.log(results.c.dimensions);
console.log(results.c.buffer);

Or does the user need to create a new compilation?

const compilation2 = await nn.compile({'c': c}, {powerPreference: 'low-power'});

@huningxin
Contributor

huningxin commented Sep 17, 2020

@pyu10055

Or does the user need to create a new compilation?

Yes. The users could create a new compilation for that subgraph.

I am thinking about an alternative way that allows specifying multiple outputs when compiling, such as:

const compilation = await nn.compile({'c': c, 'd': d}, {powerPreference: 'low-power'});

The users may just compute either c or d.

let results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}}, {'c': {}});
console.log(results.c.dimensions);
console.log(results.c.buffer);

results = await compilation.compute({'a': {buffer: bufferA, dimensions: shapeA}}, {'d': {}});
console.log(results.d.dimensions);
console.log(results.d.buffer);

Would this work?

@pyu10055
Collaborator Author

pyu10055 commented Sep 17, 2020

@huningxin This would work, but from a usability point of view, why do users need to care about the compilation step?
Can the compilation be an internal implementation detail that is hidden from the end users? The reason is that I don't see the benefit of calling a compile + compute API compared to a single API:

  model.execution(Inputs inputs, optional Outputs outputs, {powerPreference: 'low-power'});

@huningxin
Contributor

@pyu10055

why do users need to care about the compilation step?

I think the major reason is that compilation takes time and the users may want to handle that. @wchao1115 shared more insights at #87 (comment).

One example: with the webnn-polyfill, the compilation time and first inference time of the LeNet example are:

compilation elapsed time: 582.70 ms
execution elapsed time: 27.80 ms

As we discussed, since there is no dedicated compilation step in tf.js, the webnn-polyfill emulates the compilation by letting tf.js infer once. The first inference compiles the WebGL kernels, so it takes much longer than the following inferences. I guess the compilation in a native API would be similar.
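
A sketch of that warm-up idea in tf.js terms (the warmUp helper and runGraph callback below are illustrative, not the actual polyfill internals):

// Run the graph once on dummy inputs so the WebGL kernels are compiled ahead of
// the first real inference, then dispose of the temporary tensors.
async function warmUp(runGraph, inputShapes) {
  const dummyInputs = inputShapes.map((shape) => tf.zeros(shape));
  const outputs = runGraph(dummyInputs);             // builds and runs the tf.js ops
  await Promise.all(outputs.map((t) => t.data()));   // forces kernel compilation and execution
  tf.dispose(outputs);
  tf.dispose(dummyInputs);
}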

@wchao1115
Collaborator

This would work, but from a usability point of view, why do users need to care about the compilation step?

It comes down to the choice between ease-of-use and control. A separate compile step gives the users of the API more control to decide when would be an ideal time to spend on model compilation. For the scenarios described earlier in the thread, it is important to keep it.
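
For example (a sketch using the compile/compute split proposed above), an app can pay the compilation cost up front, e.g. during start-up, and keep only the inference cost on the hot path:

// Pay the compilation cost at a time the app chooses (e.g. while showing a loading screen).
console.time('compile');
const compilation = await nn.compile({'c': c}, {powerPreference: 'low-power'});
console.timeEnd('compile');

// Later, the hot path only pays the inference cost.
console.time('compute');
const results = await compilation.compute({'a': {buffer: bufferA}});
console.timeEnd('compute');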

@huningxin I'm not familiar with record<K,V>. Is it supported in the w3c webidl spec? Can you point me to the standard definition? I'm not sure what the c in this expression results.c.dimensions stands for.

@pyu10055
Collaborator Author

pyu10055 commented Sep 17, 2020

@huningxin @wchao1115 Thanks for the explanation, it makes sense to have a dedicated compile step.
From an API point of view, I am not convinced that Compilation is a good encapsulation, while Model is a much more concrete concept.
In the following abstract example, the user can use the compile method to cache the compilation, and call execute to take advantage of the cache.

class Model {
  compile() {
  }
  execute() {
  }
}

// they can do the compile explicitly
model.compile({'c': c}, {powerPreference: 'low-power'});
let results = await model.execute({'a': {buffer: bufferA, dimensions: shapeA}}, {'c': {}});
console.log(results.c.dimensions);
console.log(results.c.buffer);

// they can also call the execute, which will trigger compilation as needed.
results = await model.execute({'a': {buffer: bufferA, dimensions: shapeA}}, {'d': {}});
console.log(results.d.dimensions);
console.log(results.d.buffer);

The idea is that we do not need to sacrifice ease-of-use for control. Thoughts?

@huningxin
Contributor

@wchao1115

I'm not familiar with record<K,V>. Is it supported in the w3c webidl spec? Can you point me to the standard definition? I'm not sure what the c in this expression results.c.dimensions stands for.

The link to record definition is https://heycam.github.io/webidl/#idl-record.

The c stands for the output named c. With record, users can access it as a named property. The other syntax is indexing by name, e.g. results['c'].dimensions.

@huningxin
Contributor

huningxin commented Sep 18, 2020

@pyu10055

In the following abstract example, the user can use the compile method to cache the compilation, and call execute to take advantage of the cache.

I have a few open questions regarding your proposal:

  • Should model.compile be an async method?
  • How would the users create a model? Why do the users need to supply the output operands to model.compile?

With today's spec, the users create a model by nn.createModel, such as

const model = await nn.createModel([{name: 'c', operand: c}]);

Then the users can call model.compile.

Actually, I propose folding the model creation and compilation into one nn.compile step, such as:

-const model = await nn.createModel([{name: 'c', operand: c}]);
-const compilation = await model.compile({powerPreference: 'low-power'});
+const compilation = await nn.compile({'c': c}, {powerPreference: 'low-power'});
  • After the users call model.compile, the model would be in a compiled state, correct? Then could the users call model.compile again with different options? Say:
model.compile({'c': c}, {powerPreference: 'low-power'});
model.compile({'c': c}, {powerPreference: 'high-performance'});  // is this allowed?

Model might be a more familiar concept to the developers. I am open to giving Compilation a better name, such as ExecutableModel or CompiledModel. And do you also suggest changing the compute method name to execute? Say:

partial interface NeuralNetworkContext {
  Promise<ExecutableModel> compile(NamedOperands outputs, optional CompilationOptions options = {});
};

interface ExecutableModel {
  Promise<Outputs> execute(Inputs inputs, optional Outputs outputs);
};

With that, there would be only one change compared to your sample code:

- model.compile({'c': c}, {powerPreference: 'low-power'});
+ const model = await nn.compile({'c': c}, {powerPreference: 'low-power'});
  let results = await model.execute({'a': {buffer: bufferA, dimensions: shapeA}}, {'c': {}});
  console.log(results.c.dimensions);
  console.log(results.c.buffer);

WDYT?

@pyu10055
Collaborator Author

pyu10055 commented Sep 18, 2020

@huningxin first let me answer your questions:

  1. Should model.compile be an async method?
    It should be an async method to allow users to potentially time the following operations.
  2. How would the users create a model? Why do the users need to supply the output operands to model.compile?
    Basically the model is a wrapper around the topology with helper methods to facilitate compilation/execution/disposal/etc. Maybe it is the same concept as your Context class. The output operands for compilation are optional, but by specifying them, users can compile a subgraph. The outputs of the compilation calls are stored within the Model.

My initial reaction to the execution API is that the model topology is created using operands, but there is no clear ownership of those operands. The topology seems to live in a global context, and I may create multiple topologies and even share nodes across them. It is also possible to change the topology, and it is not clear how the changes would affect the compilation.

Model can act as a context within which the topology is created, and any compilations should also be associated with that context. A topology change should invalidate the existing compilations. That is also the reason I see the benefit of a separate compilation step, though I think many users might not need to be concerned with that.

@huningxin
Contributor

huningxin commented Sep 19, 2020

@pyu10055

but there is no clear ownership of those operands.

According to the current API, the operands are just used to describe the graph topology for model creation (by nn.createModel of today's spec) or compilation (by nn.compile of my new proposal). The operands are not associated with model or compiled model.

It is also possible to change the topology, and it is not clear how the changes would affect the compilation.

Given that, the topology change would not affect the existing compilations. For example

const a = nn.input('a', descA);
const b = nn.constant(descB, bufferB);
let c = nn.add(a, b);
const compilation1 = await nn.compile({'c': c});  // compile "c = a + b"

c = nn.mul(a, b);  // This would not affect compilation1.
const compilation2 = await nn.compile({'c': c});  // compile "c = a * b"

let results = await compilation1.compute({'a': {buffer: bufferA}});
console.log(results.c.buffer);  // results.c.buffer = bufferA + bufferB

results = await compilation2.compute({'a': {buffer: bufferA}});
console.log(results.c.buffer);  // results.c.buffer = bufferA * bufferB

Does it make sense?

@wchao1115
Collaborator

I think the point @pyu10055 is making is that the transient state of the topology being constructed seems to be living inside the nn context, which makes it hard for developers to manage that state separately from the context itself. Assuming we expose the notion of a topology (or model) to manage it instead, we could do something like this:

const t1 = nn.createTopology();
const a = t1.input('a', descA);
const b = t1.constant('b', descB, bufferB);
const c = t1.add(a, b);
const t2 = nn.createTopology();
const x = t2.input('x', descX);
const y = t2.constant('y', descY, bufferY);
const z1 = t2.mul(x, b);  // z1 = x * b
const m1 = await t2.compile({'z1': z1});  // not allowed! cross topology operand detected.
const z2 = t2.mul(x, y);  // z2 = x * y
const m2 = await t2.compile({'z2': z2});  // ok!

By introducing the notion of a topology in the API, it acts as an agent to the context that owns the transient state of the topology being constructed. This means t1 and t2 may be garbage-collected separately whenever they are no longer used, without affecting the state of the context itself, which might be long-lived. Otherwise, we would need a clearTopology method exposed from the context to wipe out this transient state every now and then, since you can never be sure what's left in there at any given time.

@gramalingam
Contributor

Yes, to the last point @wchao1115 makes. (I had earlier assumed we would create different nn contexts for different models.) Regardless of what we call it (model, topology, context), it is useful to have different instances for different models. However, we still need to specify what can be shared across different instances. E.g., I don't think nodes should be shared. They should belong to a unique model/topology.

Another advantage of having this model/topology/context be an object is that, in principle, it may be possible to have both a graph-builder implementation of the interface and an "eager" implementation that evaluates the ops as the graph is constructed (in the longer run). This is doable, at least in the absence of control-flow ops. But maybe this is a digression/distraction.

@wchao1115
Collaborator

I think the topology idea has merit and should help with the transfer learning scenario where a topology may be altered after it is created but before compile. I see that eager could be implemented as you describe -- as another implementation of the nn interface with the compile method being optional.

@huningxin
Contributor

huningxin commented Sep 23, 2020

@wchao1115

I think the point @pyu10055 is making is that the transient state of the topology being constructed seems to be living inside the nn context, which makes it hard for developers to manage that state separately from the context itself.

The topology is represented by the wired operands, e.g. c = nn.matmul(a, b). The nn.compile({'c': c}) method compiles the topology into the executable model. After that, these operands would not be referenced by either the model or the context. So they can be garbage-collected, and the developers would not need to manage them explicitly.

@gramalingam

it is useful to have different instances for different models. However, we still need to specify what can be shared across different instances. E.g., I don't think nodes should be shared. They should belong to a unique model/topology.

Although the developers may reuse the same operands to describe the different topologies, these operands are not referenced or shared by the compiled models. For example

const a = nn.input('a', descA);
const b = nn.constant(descB, bufferB);
const c = nn.add(a, b);
const compilation1 = await nn.compile({'c': c});  // compile "c = a + b"

const d = nn.mul(a, b);
const compilation2 = await nn.compile({'d': d});  // compile "d = a * b"
// a, b, c and d can be garbage-collected.

Another advantage of having this model/topology/context be an object is that, in principle, it may be possible to have both a graph-builder implementation of the interface and an "eager" implementation that evaluates the ops as the graph is constructed (in the longer run).

It may be possible by allowing the user to get an "eager" context and read the operand buffer within that context. This is based on @wchao1115's idea. For example:

const nn = navigator.ml.getNeuralNetworkContext('eager');
const a = nn.constant(descA, bufferA);
const b = nn.constant(descB, bufferB);
const c = nn.add(a, b);
await c.buffer();

@wchao1115
Collaborator

wchao1115 commented Sep 23, 2020

[Note: For the sake of this discussion, I purposefully borrow the word topology to differentiate it from the notion of model, which is the immutable output of the existing createModel method, to avoid confusion. I agree that createModel along with its output doesn't seem to serve much purpose and could be let go.]

As for the topology, its main advantage is to convey to the developers and implementers of the API that if they have state that is specific to a particular graph building session, they can store it here and leave all the global state that affects all graph building sessions in the context itself. I believe the addition of this abstraction could make the API clearer and more intuitive to all parties. I'm a bit concerned when we have to make statements like

these operands would not be referenced by either the model or the context.

as it seems to impose non-obvious implementation choices and constraints, which may not hold up in certain situations.

The issue of how to support eager execution has been brought up in the past, and though it's relevant to the general design of the API, I'd rather have it tracked in a separate issue to keep this thread focused on the specific issue originally raised. In my view we generally agree that eager can be thought of as an implementation variant of the same context interface, and that the introduction of the notion of topology doesn't seem to prevent supporting eager execution in the future.

@huningxin
Contributor

@wchao1115 , thanks for the detailed explanation.

As for the topology, its main advantage is to convey to the developers and implementers of the API that if they have state that is specific to a particular graph building session, they can store it here and leave all the global state that affects all graph building sessions in the context itself.

I agree it makes sense to allow multiple graph building sessions. Now we only have a global one.

these operands would not be referenced by either the model or the context.

as it seems to impose non-obvious implementation choices and constraints, which may not hold up in certain situations.

I agree this should be an implementation detail.

The issue of how to support eager execution has been brought up in the past, and though it's relevant to the general design of the API, I'd rather have it tracked in a separate issue to keep this thread focused on the specific issue originally raised.

+1

As @pyu10055 mentioned,

Model can act as a context within which the topology is created, and any compilations should also be associated with that context.

Probably we could repurpose the Model interface for the graph building session instead of treating it as an immutable one. So we can keep the Model interface and add the graph building methods to it. Its usage would be the same as the Topology that @wchao1115 proposed. For example:

interface NeuralNetworkContext {
  Model createModel();
};

interface Model {
  Operand input(DOMString name, OperandDescriptor desc);
  Operand constant(OperandDescriptor desc, ArrayBufferView value);

  Operand add(Operand a, Operand b);
  Operand mul(Operand a, Operand b);
  // and other operations
  
  Promise<Compilation> compile(NamedOperands outputs,  optional CompilationOptions options = {});
};
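
Usage of that shape would look roughly like this (sketch):

// Build, compile and execute with the repurposed (mutable) Model interface.
const model = nn.createModel();
const a = model.input('a', descA);
const b = model.constant(descB, bufferB);
const c = model.add(a, b);
const compilation = await model.compile({'c': c}, {powerPreference: 'low-power'});
const results = await compilation.compute({'a': {buffer: bufferA}});
console.log(results.c.buffer);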

@wchao1115
Collaborator

wchao1115 commented Sep 25, 2020

Thanks @huningxin. My only concern with the model naming is that it implies immutability to most people. What we're looking for here is a name for mutable graph building state. I'm not fixed on the word topology, but I think at least that wording does imply an arrangement or build-up of a graph, as in topology: the way in which constituent parts are interrelated or arranged. Suggestions are welcome.

@pyu10055
Collaborator Author

pyu10055 commented Sep 25, 2020

@wchao1115 @huningxin I agree that a topology shall usually be immutable after creation, but the concept of a model is usually used for constructing a topology. If you look at any of the model building APIs (Keras for example), the topology is part of the model object, whether it is SequentialModel or FunctionalModel.
In Ningxin's LeNet example, he is constructing a model. And if we expect to support training in the future, having a model concept is also useful.

@wchao1115
Collaborator

@pyu10055 I'm not sure I understand your comment. Are you making a case for calling it a model?

@huningxin
Contributor

huningxin commented Sep 27, 2020

@wchao1115 @pyu10055 we may consider aligning with the model-loader API, where a model should be immutable. @jbingham, please correct me if I am wrong.

So probably we could change the current NeuralNetworkContext to ModelBuilder, which would be the counterpart of ModelLoader. Developers would create a builder (through navigator.ml.createModelBuilder) for a specific graph building session. And we may still keep the Model interface as the product of the builder and the loader.

The code sketch would be:

// building a model
const builder = navigator.ml.createModelBuilder();
const a = builder.input('a', descA);
const b = builder.constant(descB, bufferB);
const c = builder.matmul(a, b);
const m1 = builder.createModel({'c': c});

// loading a model
const loader = navigator.ml.createModelLoader();
const m2 = await loader.load(modelUrl);

@jbingham

jbingham commented Sep 27, 2020 via email

@gramalingam
Contributor

@huningxin : the ModelBuilder idea makes sense. As for loading a model, why can't it be just:

// loading a model
const m2 = await navigator.ml.load(modelUrl);

If the indirection serves a purpose, that's fine with me.

@zolkis
Collaborator

zolkis commented Oct 1, 2020

const m2 = await navigator.ml.load(modelUrl)

This makes sense IMHO, perhaps even with an optional arg for options.

@wchao1115
Collaborator

Per PR #94
