docs/source/developer-guide/perf-benchmarking.md (22 changes: 10 additions & 12 deletions)
@@ -8,14 +8,14 @@ Expect breaking API changes.
```

TensorRT LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it
-easier for users to reproduce our officially published [performance overview](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the following:
+easier for users to reproduce our officially published [performance overview](../performance/perf-overview.md#throughput-measurements). `trtllm-bench` provides the following:

- A streamlined way to build tuned engines for benchmarking across a variety of models and platforms.
- An entirely Python workflow for benchmarking.
- Ability to benchmark various flows and features within TensorRT LLM.
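
As an orientation, a typical run pairs a model identifier with a benchmark subcommand and a prepared dataset. The sketch below is only illustrative (the dataset path is a placeholder, and the concrete steps and options are covered in the workflow sections later in this guide):

```shell
# Minimal illustrative throughput run; see the dataset-preparation and
# workflow sections below for the full set of supported options.
trtllm-bench --model meta-llama/Llama-2-7b-hf \
  throughput \
  --dataset /path/to/prepared_dataset.jsonl
```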

`trtllm-bench` executes all benchmarks using [in-flight batching] -- for more information see
-the [in-flight batching section](../advanced/gpt-attention.md#in-flight-batching) that describes the concept
+the [in-flight batching section](../features/attention.md#inflight-batching) that describes the concept
in further detail.

## Before Benchmarking
@@ -67,7 +67,7 @@ sudo nvidia-smi boost-slider --vboost <max_boost_slider>

While `trtllm-bench` should be able to run any network that TensorRT LLM supports, the following is the list
that has been validated extensively and is the same listing as seen on the
-[Performance Overview](./perf-overview.md) page.
+[Performance Overview](../performance/perf-overview.md) page.

- [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- [meta-llama/Llama-2-70b-hf](https://huggingface.co/meta-llama/Llama-2-70b-hf)
@@ -98,8 +98,8 @@ Export your token in the `HF_TOKEN` environment variable.
- `FP8`
- `NVFP4`

-For more information about quantization, refer to [](../reference/precision.md) and
-the [support matrix](../reference/precision.md#support-matrix) of the supported quantization methods for each network.
+For more information about quantization, refer to [](../features/quantization.md) and
+the [support matrix](../features/quantization.md#model-supported-matrix) of the supported quantization methods for each network.
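
As a hedged illustration only: one way to exercise a quantized model in the PyTorch workflow is to point `--model` at a pre-quantized checkpoint. The checkpoint name and the `--backend` flag below are assumptions that should be checked against the support matrix and the `trtllm-bench throughput --help` output:

```shell
# Hypothetical sketch: benchmark an FP8 pre-quantized checkpoint with the
# PyTorch workflow. The checkpoint name, backend flag, and dataset path are
# placeholders/assumptions to verify for your setup.
trtllm-bench --model nvidia/Llama-3.1-8B-Instruct-FP8 \
  throughput \
  --dataset /path/to/prepared_dataset.jsonl \
  --backend pytorch
```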

```{tip}
Although TensorRT LLM supports more quantization modes than listed above, `trtllm-bench` currently only configures for
@@ -155,11 +155,9 @@ python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-3

### Running with the PyTorch Workflow

-To benchmark the PyTorch backend (`tensorrt_llm._torch`), use the following command with [dataset](#preparing-a-dataset) generated from previous steps. The `throughput` benchmark initializes the backend by tuning against the
-dataset provided via `--dataset` (or the other build mode settings described [above](#other-build-modes)).
-Note that CUDA graph is enabled by default. You can add additional pytorch config with
-`--extra_llm_api_options` followed by the path to a YAML file. For more details, please refer to the
-help text by running the command with `--help`.
+To benchmark the PyTorch backend (`tensorrt_llm._torch`), use the following command with [dataset](#preparing-a-dataset) generated from previous steps. The `throughput` benchmark initializes the backend by tuning against the dataset provided via `--dataset` (or the other build mode settings described above).
+
+Note that CUDA graph is enabled by default. You can add additional pytorch config with `--extra_llm_api_options` followed by the path to a YAML file. For more details, please refer to the help text by running the command with `--help`.
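
For illustration, a minimal sketch of the kind of YAML file `--extra_llm_api_options` can point to is shown below. The key names (`cuda_graph_config`, `kv_cache_config`, `free_gpu_memory_fraction`) are assumptions based on the LLM API options and may differ between releases, so confirm them against the `--help` text and the LLM API reference for your version:

```yaml
# extra-llm-api-config.yml -- a hedged sketch, not an exhaustive reference.
# Key names are assumptions; verify them against the TensorRT LLM version
# you are benchmarking with.
cuda_graph_config:
  enable_padding: true          # pad requests so more batch shapes reuse captured graphs
  max_batch_size: 1024          # largest batch size to capture a CUDA graph for
kv_cache_config:
  free_gpu_memory_fraction: 0.9 # fraction of free GPU memory reserved for the KV cache
```

Such a file would then be passed as `--extra_llm_api_options ./extra-llm-api-config.yml` on the `throughput` command line.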

```{tip}
The command below specifies the `--model_path` option. The model path is optional and used only when you want to run a locally
@@ -310,7 +308,7 @@ Each subdirectory should contain the LoRA adapter files for that specific task.
To benchmark multi-modal models with the PyTorch workflow, you can follow a similar approach to the one above.

First, prepare the dataset:
-```
+```python
python ./benchmarks/cpp/prepare_dataset.py \
--tokenizer Qwen/Qwen2-VL-2B-Instruct \
--stdout \
@@ -334,7 +332,7 @@ Sample dataset for multimodal:
```

Run the benchmark:
-```
+```python
trtllm-bench --model Qwen/Qwen2-VL-2B-Instruct \
throughput \
--dataset mm_data.jsonl \