[Compile] Conditional compilation. Introduce compile_ranges #24252
Conversation
def __call__(self, *args) -> Any:
Btw, does this PR work, or is it mostly WIP? (Are you sure that the graph generated ends up being dynamic on the specific range that is passed?)
There's one problem that I don't know how to solve yet. Let's say we're compiling with ranges [2, 16] and (16, 4096]. Each compilation needs its own ShapeEnv (environment with symbols in it), which has the batch_size constrained to the particular range.
So what we should do is, for each range, take the current ShapeEnv (which thinks the batch_size is dynamic on range [2, 4096]), clone it, constrain it to the current range (e.g. [2, 16]), and use this throughout the compilation.
I don't know how to "clone" ShapeEnvs. Is there anything else we can do here @laithsakka @bobrenjc93 ?
@bobrenjc93 reminded me that that is what https://github.com/pytorch/pytorch/blob/fecd9686f543487793e0c55977555b2cdbae1a73/torch/fx/experimental/symbolic_shapes.py#L3904-L3919 is for
It already works, leaving aside a PyTorch standalone_compile issue that should be fixed in the new PyTorch release in this commit. The graphs for each range are generated dynamically, and fusions are applied differently in each graph.
Dynamo traces out a graph that is fully dynamic over the batch_size. We should tell torch.compile what we know about the batch_size for each range, for example, that the range is constrained to [2, 16]. This will help it generate better code. In order to do this, you'll need to grab the SymInt that is the batch_size and add constraints to it.
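A minimal sketch of that idea, assuming we can get our hands on the batch-size SymInt (not the PR's actual code); `constrain_range` is PyTorch's existing helper for narrowing a symbol's value range:

```python
import torch
from torch.fx.experimental.symbolic_shapes import constrain_range

def constrain_batch_size(batch_size: torch.SymInt,
                         compile_range: tuple[int, int]) -> None:
    # Narrow the symbol's value range in its owning ShapeEnv so the
    # compiler may assume lo <= batch_size <= hi for this compilation.
    lo, hi = compile_range
    constrain_range(batch_size, min=lo, max=hi)
```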
Ok, got it. These are the hints for torch.compile that I meant at the meeting. Thanks, I'll add ShapeEnv here.
If we are using is_applicable_for_range (the current form of the PR), this is fine. If we want to go with the other approach [see my other comment on the PR], which is more complicated (I think if we do that, we want a reason), then yeah, this is problematic.
return compile_range is not None and (
    compile_range[0]
    == compile_range[1]) and (compile_range[1] % tp_size == 0)
The way I originally thought of doing this is something like:
return statically_known_true(batch_size % tp_size == 0)
If we are able to access the batch_size SymInt here, then we are able to query things about it.
cc @laithsakka @bobrenjc93 on if I'm butchering this API
Could you elaborate on how statically_known_true is going to improve the existing approach? Is it more stable?
Instead of implementing your own range analysis, PyTorch already encodes range information in the SymInts themselves. So this is more of a code-reuse thing.
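For illustration, a hedged sketch of that reuse, assuming the pass can reach the batch-size SymInt (names here are illustrative): statically_known_true returns True only when the expression is provably true for the symbol's entire value range, and it does not install new guards.

```python
from torch.fx.experimental.symbolic_shapes import statically_known_true

def pass_is_applicable(batch_size, tp_size: int) -> bool:
    # batch_size may be a SymInt carrying range info from its ShapeEnv;
    # the comparison builds a SymBool that PyTorch evaluates against
    # that range, with no hand-rolled range analysis needed.
    return statically_known_true(batch_size % tp_size == 0)
```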
So it really depends on the goals of those ranges. If the goal is solely/mainly to allow custom passes to branch on ranges, this is fine. In fact, it's simpler than mutating the shape env and having to fork it.
Also, we can then keep the invariant that Inductor itself does not specialize, and run the same checks here (which we do not have yet).
On the other hand, if someone really thinks that Inductor itself can do significantly better if we actually specialize the shape env, then yeah, we would have to do something else.
But it sounds to me like the intention is the former?
@ilmarkov out of curiosity, do you have a sense of how much of a perf win you'll get out of this (and from which models)?
@bobrenjc93 Without multiple graphs, our fallback (for the large input sizes, i.e. when we don't use allreduce fusion) uses either custom ops or non-optimized PyTorch operations, which are slower than the Triton operations torch.compile generates. I think a reasonable perf comparison was done in #19830
if isinstance(runtime_shape, int):
    dynamic_shapes = "from_example_inputs"
if isinstance(compile_range, tuple):
Ah, I do not like how standalone_compile calls now receive this dynamic_shapes option!!! "from_example_inputs" is a very hacky approach to tell Inductor not to specialize; I guess it will work until we hit an issue.
if compile_range[0] == compile_range[1]:
    dynamic_shapes = "from_example_inputs"
else:
    dynamic_shapes = "from_graph"
both "from_graph" and "from_tracing_context" here have the same effect of getting the shape env we traced the DS graph with? if yes lets do less divergence.
sym_shape_indices: list[int],
compiled_graph_for_general_shape: Callable,
vllm_backend: VllmBackend):
sym_shape_indices: list[int], vllm_backend: VllmBackend):
So with this change you are forcing a range to be specified and removing the path for compiled_graph_for_general_shape; if a range is not specified, we throw. Did we document this anywhere? Probably around the config?
index,
len(self.compile_submod_names),
sym_shape_indices,
# compiled_graph_for_dynamic_shape,
Did you want to remove it?
else:
    dynamic_shapes = "from_graph"
else:
    dynamic_shapes = "from_tracing_context"
Is this even reachable now? It seems like we force a range to be specified. Ditto for having compile_range optional.
def test_compile_ranges():
    vllm_config = VllmConfig(compilation_config=CompilationConfig(
        level=CompilationLevel.PIECEWISE,
        compile_ranges_split_points=[8, 32],
- can we test with empty split ranges?
- can we test with some ranges that get translated to specializations?
Ah, it seems users can still specify self.compile_sizes in addition to this.
Yeah, empty means a single compile size.
def get_compile_ranges(self) -> list[tuple[int, int]]:
    """Get the compile ranges for the compilation config."""
    compile_ranges_split_points = self.compile_ranges_split_points
If a user specifies compile_sizes = 16 and split points [1, 100, 1000], we would split the ranges into [1, 16], [16, 16], [16, 100], ... I wonder if we want [16, 16] and [1, 100] instead.
But [16, 16] will be inside [1, 100]. We want to have non-overlapping ranges.
I mean, you would compile identical graphs twice in that case, no? The user did not explicitly ask to split at 16.
In the dispatch you can always track singleton ranges and dispatch to them first, before the ranges.
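A hedged sketch of that dispatch order (names are illustrative, not from the PR): exact compile sizes are looked up before the enclosing ranges, so a user-requested size wins even when it falls inside a wider range.

```python
from typing import Callable

def dispatch(batch_size: int,
             singleton_graphs: dict[int, Callable],
             range_graphs: dict[tuple[int, int], Callable]) -> Callable:
    # Exact (singleton) sizes take priority over ranges.
    if batch_size in singleton_graphs:
        return singleton_graphs[batch_size]
    for (lo, hi), graph in range_graphs.items():
        if lo <= batch_size <= hi:
            return graph
    raise ValueError(f"no compiled graph covers batch size {batch_size}")
```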
One good side effect of this, other than custom passes: each range is tuned in Inductor with a hint from that range, meaning we can also use this to ensure that small inputs vs. large inputs are max-autotuned with separate hints.
Splitting ranges would also work for unbacked symbols, which is good! (Well, except that we would have to override the hint for unbacked symbols with the actual example value when we do the range compilations, cc @bobrenjc93.)
Here is one concern with this: it will make the soundness story with respect to the dynamic shapes added by Inductor harder. The ideal, and only actually right, fix is to use unbacked symbols, but unbacked comes with a perf hit. With this, we now have so much more branching that we would need to track Inductor guards for each of those compilations.
Second part of splitting #22086
Dynamic Graph dispatch via compile_ranges: Introduces a new configuration option, compile_ranges, as an alternative to compile_sizes. This enables dynamic dispatch to different compiled graphs based on the input batch size.
With this approach, when allreduce fusion is enabled, vLLM adds an additional compile-range split point in order to separate the graphs: 1. one with fused allreduce for small-to-medium shape inputs; 2. one with NCCL-based allreduce for large shape inputs.
The existing compile_sizes feature is extended and generalized with compile_ranges. Defined by split points, these ranges allow vllm to dynamically dispatch requests to specific, pre-compiled graphs based on input batch size. For example, a configuration of (32, 64) defines three distinct ranges: [1, 32), [32, 64), and [64, max_num_batched_tokens). This provides granular control, allowing developers to statically enable or disable fusions within each graph to optimize performance for different batch sizes.
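As an illustration of those semantics, a small sketch (the function name and exact boundary handling are assumed from the description above, not taken from the PR):

```python
def ranges_from_split_points(split_points: list[int],
                             max_num_batched_tokens: int) -> list[tuple[int, int]]:
    # Boundaries are 1, each split point, and the max token count;
    # consecutive pairs form the half-open ranges described above.
    bounds = [1, *sorted(split_points), max_num_batched_tokens]
    return list(zip(bounds, bounds[1:]))

# Split points (32, 64) with max_num_batched_tokens=4096 give the
# three ranges [1, 32), [32, 64), [64, 4096) from the description.
assert ranges_from_split_points([32, 64], 4096) == [(1, 32), (32, 64), (64, 4096)]
```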
Purpose
Corresponding RFC: #23113
The primary motivation for these changes is to enhance vllm's performance and adaptability for diverse workloads. By supporting allreduce fusion without custom ops and introducing dynamic graph dispatch, we empower users to fine-tune vllm for more efficient and scalable inference.
Test Plan
Added tests for compile_ranges