Replies: 4 comments 4 replies
-
@apbose is there a way to create a hard subset of complex that we can support easily and grow from there?
-
There needs to be more detail here. For example, there are many possible designs: do we need to change the graph signature, do we put the unpacking code in the graph, or does the runtime need to detect and insert decompositions for complex inputs at the runtime level?
-
There is likely some sort of data structure we need here that marks insertion points so we can replace the graph later, and that holds the original graph and, later on, the new graph.
-
@apbose write some standalone cases where we have the original complex graph and the expected target graph
-
Complex number handling in Torch-TensorRT
TL;DR
This RFC proposes the addition of complex number support in Torch-TensorRT. TensorRT does not support complex numbers, but with the use of rotary embeddings as positional embeddings, complex numbers play an important role in how these embeddings are applied.
Goal
To support the multi-GPU example of the Llama 3 model running end to end
Use case
Through this feature we intend to demonstrate the end-to-end forward pass of a Torch-TensorRT compiled Llama 3 distributed model on multiple GPUs. Below illustrates how complex numbers are inputs to the Llama 3 model:
The query and key vectors are viewed as complex, while the freq vectors are computed in polar form with complex frequency.
The reason we encounter this only in distributed examples is that when we compile the model using
torch.compile(distributed_model, backend="torch_tensorrt")
the distributed tensors are hoisted to inputs when the model is wrapped with
aot_autograd
leading to complex inputs to the Torch-TensorRT compiled graph. Ref: pytorch/pytorch#136289
Implementation Stages
Complex unpacking
Convert the complex numbers into a tuple of real and imaginary parts: a complex number denoted by x + iy should be provided as input in the form (x, y).
This involves modifying the metadata shape and data type of the complex nodes, as well as the subsequent operations that take these complex numbers as input.
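The unpacking described above can be sketched with PyTorch's built-in view functions, which expose a complex tensor as a real tensor whose last dimension holds the (real, imag) pair:

```python
import torch

# A complex64 tensor shaped like a rotary-embedding input to the Llama 3 model.
x = torch.randn(1, 512, 1, 64, dtype=torch.complex64)

# view_as_real exposes x + iy as a real tensor with a trailing dimension of 2:
# element [..., 0] is the real part, element [..., 1] the imaginary part.
unpacked = torch.view_as_real(x)
print(unpacked.shape, unpacked.dtype)  # torch.Size([1, 512, 1, 64, 2]) torch.float32

# view_as_complex is the exact inverse, so no information is lost.
repacked = torch.view_as_complex(unpacked)
assert torch.equal(repacked, x)
```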
Numeric truncation
In the above, complex64 should be unpacked to a tuple of float32. Similarly, complex128 should also end up as a tuple of float32; since it naturally unpacks to float64, the truncate_flag has to be used to cast the parts down.
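The truncation step can be illustrated as follows (a sketch only; the exact flag plumbing inside Torch-TensorRT is not shown here):

```python
import torch

# complex128 unpacks to float64 parts...
x128 = torch.randn(4, 8, dtype=torch.complex128)
parts = torch.view_as_real(x128)
assert parts.dtype == torch.float64

# ...so an explicit cast down to float32 is needed; this is the effect the
# truncate flag would trigger (illustrative, not the actual Torch-TensorRT code).
truncated = parts.to(torch.float32)
assert truncated.dtype == torch.float32
```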
Function signature modification
Identify the boundary of the operations affected by the complex inputs. Below is an example of what this looks like in the Llama 3 model for the rotary embedding operation.
eg:
The signature of these complex operations needs to be modified so that there are no graph breaks, and so that the complex unpacking is handled as well.
Unification of pre_lowering and post_lowering pass for distributed and non distributed
The
pre_lowering
and
post_lowering
passes need to be uniform across both the distributed and non-distributed cases.
Diagram
In the above, there has to be additional handling in the Torch-TensorRT runtime. All the above will be called via an API in the post-lowering passes.
API changes
We will discuss these APIs citing the example of rotary embeddings in Llama3 model
Detection stage
torch_tensorrt/dynamo/lowering/passes/pass_utils.py
The above API should return the subgraph. The meta-info of the subgraph can be captured in a class named complexSubGraphInfo
With respect to Llama 3, it is the subgraph denoted in the figure below. This graph broadly captures the operations in the rotary embeddings of the query and key vectors. Please note that the freq node is denoted by
reshape_default_12
, which is the same for both the query and key vectors. The rotated query and key vector embeddings from the view_as_real nodes are then inputs, together with the value vector, to the scaled_dot_product_attention torch node. That means for the design below, each subgraph can have multiple anchor nodes. With n_heads = 32 and n_kv_heads = 8, there are 32/8 = 4 such subgraphs to be captured.
anchor_node = [view_as_real, view_as_real_1], subgraph_nodes = [view_as_complex, mul_2, slice_1, reshape_default_12, mul_3, view_as_complex_1], input_nodes = [reshape_default_10, arg2_1, reshape_default_11]
Below elaborates on the detector class which returns the subgraph guided by the anchor nodes.
It can be called as
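Since the detector itself is not reproduced here, the following is a hypothetical sketch of what complexSubGraphInfo and the detector could look like; the names ComplexSubGraphInfo, ComplexOpDetector, and find_complex_subgraphs are illustrative, not the actual Torch-TensorRT API. It walks backwards from each view_as_real anchor node to the complex placeholder inputs:

```python
from dataclasses import dataclass

import torch
from torch.fx import symbolic_trace


@dataclass
class ComplexSubGraphInfo:
    anchor_nodes: list    # e.g. the view_as_real nodes
    subgraph_nodes: list  # operations between the complex inputs and the anchors
    input_nodes: list     # placeholders feeding the complex subgraph


class ComplexOpDetector:
    def is_anchor(self, node) -> bool:
        # Match both the public op and its aten overload (seen in post-AOT graphs).
        return node.target in (torch.view_as_real, torch.ops.aten.view_as_real.default)

    def find_complex_subgraphs(self, gm) -> list:
        subgraphs = []
        for node in gm.graph.nodes:
            if not self.is_anchor(node):
                continue
            nodes, inputs, seen = [], [], set()
            stack = list(node.all_input_nodes)
            while stack:  # backward traversal from the anchor to the placeholders
                n = stack.pop()
                if n in seen:
                    continue
                seen.add(n)
                if n.op == "placeholder":
                    inputs.append(n)
                else:
                    nodes.append(n)
                    stack.extend(n.all_input_nodes)
            subgraphs.append(ComplexSubGraphInfo([node], nodes, inputs))
        return subgraphs


# Minimal usage on a toy rotary-style graph:
class Rotary(torch.nn.Module):
    def forward(self, x, freqs):
        return torch.view_as_real(torch.view_as_complex(x) * freqs)


gm = symbolic_trace(Rotary())
subgraphs = ComplexOpDetector().find_complex_subgraphs(gm)
```

For the toy graph this yields one subgraph with the single view_as_real anchor, the mul and view_as_complex nodes in between, and the two placeholders as input nodes.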
Decomposition stage
Here we can decompose the complex input into input.real and input.imag, and concatenate along the last dimension for torch_inputs in
torch_tensorrt/dynamo/backend/backends.py
. These would be at the indices returned above.
Graph Rewrite stage
torch_tensorrt/dynamo/lowering/passes/complex_graph_rewrite
Modifying the graph nodes and their signature can be done through
torch.fx.subgraph_rewriter.replace_pattern_with_filters()
with appropriate match filters. The above is explained in the diagram below with respect to the Llama 3 example.
Blue arrows denote the graph modifications/rewrite passes.
Below represents the modified target graph
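The rewrite described above can be sketched with torch.fx's subgraph rewriter. This is a minimal sketch, not the actual pass: the pattern covers only a bare complex multiply, and the replacement assumes freqs already arrives decomposed as a real tensor with a trailing dimension of 2 (as the decomposition stage would produce); the real pass would cover the full rotary-embedding subgraph.

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.subgraph_rewriter import replace_pattern_with_filters


# Pattern: a complex multiply expressed through view_as_complex / view_as_real.
def pattern(x, freqs):
    return torch.view_as_real(torch.view_as_complex(x) * freqs)


# Replacement: the same product on real tensors, using
# (a + bi)(c + di) = (ac - bd) + (ad + bc)i.
# freqs is assumed already decomposed to a real tensor with last dim 2.
def replacement(x, freqs):
    real = x[..., 0] * freqs[..., 0] - x[..., 1] * freqs[..., 1]
    imag = x[..., 0] * freqs[..., 1] + x[..., 1] * freqs[..., 0]
    return torch.stack([real, imag], dim=-1)


class Rotary(torch.nn.Module):
    def forward(self, x, freqs):
        return torch.view_as_real(torch.view_as_complex(x) * freqs)


gm = symbolic_trace(Rotary())
# match_filters can constrain matches (e.g. by node metadata); empty here.
matches = replace_pattern_with_filters(gm, pattern, replacement, match_filters=[])

# After the rewrite, gm takes only real tensors (last dim = 2) as inputs.
x = torch.randn(2, 4, 2)
f = torch.randn(2, 4, 2)
out = gm(x, f)
```

The rewritten graph produces the same values as the original complex computation, which can be checked by packing the real inputs back into complex tensors and comparing.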
All the above need to be called sequentially in
torch_tensorrt/dynamo/backend/backends.py
Further to be explored are the changes in the runtimes in
_PythonTorchTensorRTModule.py
,
_TorchTensorRTModule.py
, and
_CudaGraphsTorchTensorRTModule.py
since we are modifying the inputs.
Runtime changes
In all the above runtimes, the input is now processed so that the partitioned graph module receives a reshaped input of real and imaginary parts, with the last dimension equal to 2 (e.g. 1, 512, 1, 64 -> 1, 512, 1, 64, 2). The input tensors therefore need to be reshaped and fed to the corresponding runtimes.
This can be done at the runtime level, where we can insert the decomposition.
In
_PythonTorchTensorRTModule.py
this can be done in
setup_input_tensors
. When the dtype or shape of contiguous_inputs[i] differs from the compiled engine's input dtype or shape, we change the shape and dtype of the input. We could use the same API as in the decomposition stage. A similar analysis can be done for _TorchTensorRTModule.py
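A minimal sketch of that runtime-side adaptation, using a hypothetical helper (adapt_complex_input is illustrative, not an actual Torch-TensorRT function): when a runtime input is complex, expose it to the engine as a real tensor with a trailing dimension of 2.

```python
import torch


def adapt_complex_input(t: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: reshape a complex runtime input so the compiled
    engine sees real parts, with the last dimension equal to 2."""
    if t.is_complex():
        # view_as_real requires the complex tensor to be contiguous.
        return torch.view_as_real(t.contiguous())
    return t


# e.g. (1, 512, 1, 64) complex64 -> (1, 512, 1, 64, 2) float32
inp = torch.randn(1, 512, 1, 64, dtype=torch.complex64)
adapted = adapt_complex_input(inp)
```

Real-valued inputs pass through unchanged, so the same code path can serve both complex and non-complex engines.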