Benchmarking V2: framework impl #40486
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
cc @ydshieh maybe? not sure who to ping for daily benchmark runs
@McPatate would be a better person :-)
Did a first pass, will continue tomorrow.
Awesome work! 🔥
# Run with custom parameters
python run_benchmarks.py \
    --warmup-iterations 5 \
Nice to have this as a flag!
"measurement_iterations": 5 | ||
} | ||
}, | ||
"measurements": { |
Would it make sense to have individual events (or aggregates over small time buckets) to be able to plot the data on top of having variance / means?
Maybe in a later version anyway.
I'm all for doing this in a later version. This is fine as a first step, but we can definitely stat nerd the thing.
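For reference, one possible shape for that later version (a sketch only; the keys and values below are purely illustrative, not what this PR emits):

```python
# Hypothetical shape: per-iteration samples stored next to the aggregate stats,
# so downstream tooling could plot distributions instead of only mean/std.
measurements = {
    "time_to_first_token_seconds": {
        "mean": 0.142,
        "std": 0.011,
        "samples": [0.151, 0.139, 0.138, 0.144, 0.140],  # one entry per measurement iteration
    }
}
```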
benchmark_v2/run_benchmarks.py
Outdated
format='[%(levelname)s - %(asctime)s] %(name)s: %(message)s',
handlers=[
    logging.StreamHandler(sys.stdout),
    logging.FileHandler(f'benchmark_run_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log')
Not sure we really need a file logger, maybe enable this with a cmd line arg?
It proved useful for development; I'll put it behind a cmd flag, disabled by default.
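A minimal sketch of what that could look like (the `--log-to-file` flag name is an assumption, not the final interface):

```python
import argparse
import logging
import sys
from datetime import datetime

parser = argparse.ArgumentParser()
parser.add_argument("--log-to-file", action="store_true", help="Also write logs to a timestamped file")
args = parser.parse_args()

# Always log to stdout; only add the file handler when explicitly requested.
handlers = [logging.StreamHandler(sys.stdout)]
if args.log_to_file:
    handlers.append(
        logging.FileHandler(f'benchmark_run_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log')
    )
logging.basicConfig(
    format="[%(levelname)s - %(asctime)s] %(name)s: %(message)s",
    handlers=handlers,
)
```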
benchmark_v2/run_benchmarks.py
Outdated
    logger: logging.Logger
) -> str:
    """Generate a summary report of all benchmark runs."""
    summary_file = os.path.join(output_dir, "benchmark_summary.json")
Perhaps we should add a timestamp to the file name so multiple runs won't overwrite the existing content.
Also, not sure a file is needed, stdout is ok imo
On second thought, I'm not even sure this file is useful.
I kept it at the end with a timestamp.
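A minimal sketch of the timestamped variant (the file name pattern and `output_dir` value are assumptions):

```python
import os
from datetime import datetime

output_dir = "benchmark_results"  # assumed location, not taken from the PR
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
# Each run gets its own summary file, so repeated runs don't overwrite earlier results.
summary_file = os.path.join(output_dir, f"benchmark_summary_{timestamp}.json")
```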
import torch

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
Is this necessary now that we use xet for most repos?
Not really, removing it.
benchmark_v2/benchmark_framework.py
Outdated
import torch


class CUDATimer:
`class GPUTimer:` instead of `class CUDATimer:`, perhaps? `CUDATimer` sounds a bit narrow given CUDA is optional.
Or something like `ArchAwareTimer`, idk 😄
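For what it's worth, a device-aware timer along those lines could be sketched like this (the fallback behavior and attribute names are assumptions, not the PR's implementation):

```python
import time
import torch

class GPUTimer:
    """Times a code region with CUDA events when a GPU is available, perf_counter otherwise."""

    def __enter__(self):
        if torch.cuda.is_available():
            self._start_event = torch.cuda.Event(enable_timing=True)
            self._end_event = torch.cuda.Event(enable_timing=True)
            self._start_event.record()
        else:
            self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        if torch.cuda.is_available():
            self._end_event.record()
            torch.cuda.synchronize()
            # CUDA events report elapsed time in milliseconds.
            self.elapsed_seconds = self._start_event.elapsed_time(self._end_event) / 1000.0
        else:
            self.elapsed_seconds = time.perf_counter() - self._start
```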
benchmark_v2/benchmark_framework.py
Outdated
time_to_first_token: Optional[float] = None
latency: float = 0.0
I think it could be neat to have the unit in the name like so:
time_to_first_token: Optional[float] = None  ->  time_to_first_token_seconds: Optional[float] = None
latency: float = 0.0  ->  latency_seconds: float = 0.0
for clarity, like below
or at least a doc string for each parameter
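A hedged sketch of what that could look like, with units in the names and a comment per field (the field set is illustrative, not the PR's final one):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TimingResult:
    # Seconds from the start of generation until the first token is emitted.
    time_to_first_token_seconds: Optional[float] = None
    # End-to-end latency of the generation call, in seconds.
    latency_seconds: float = 0.0
    # Decoding throughput over the whole call, in tokens per second.
    tokens_per_second: Optional[float] = None
```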
time_to_first_token: Optional[float] = None
latency: float = 0.0
tokens_per_second: Optional[float] = None
time_per_output_token_seconds: Optional[float] = None
Isn't this the same as latency?
Ok just googled what everything is, I think ITL (inter-token latency) is clearer than TPOT, but from what I can see it looks like TPOT is more widely used.
Do you think it's ok for us to use ITL instead?
I'm fine with ITL. Personally, I'm a tok/sec guy, but we also have that. :)
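For reference, this is roughly how the three metrics discussed here relate under the usual definitions (a sketch; the PR's exact formulas may differ):

```python
from typing import Optional, Tuple

def derive_metrics(
    latency_seconds: float, time_to_first_token_seconds: float, num_generated_tokens: int
) -> Tuple[float, Optional[float]]:
    # Throughput over the whole generation call.
    tokens_per_second = num_generated_tokens / latency_seconds
    # Inter-token latency (ITL, a.k.a. TPOT): decode time spread over the tokens after the first.
    inter_token_latency_seconds = (
        (latency_seconds - time_to_first_token_seconds) / (num_generated_tokens - 1)
        if num_generated_tokens > 1
        else None
    )
    return tokens_per_second, inter_token_latency_seconds
```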
@dataclass
class TimingResult:
Is this for one event or is this storing averages?
This is for storing the results of a complete benchmark scenario. This is the data that gets serialized into the JSON output. (and this is where we could add the time-series data later)
In that case perhaps we should add _mean[_] to field names of this class? I assume the values are means.
Pardon, late evening brainfart. `TimingResult` represents the data point of a benchmark iteration, while `BenchmarkStatistics` holds the derived stats.
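In other words, roughly this split (a hedged sketch; only `from_measurements`, `mean` and `std` appear in the snippets above, the rest is assumed):

```python
import statistics
from dataclasses import dataclass
from typing import List

@dataclass
class BenchmarkStatistics:
    """Aggregate stats derived from the per-iteration TimingResult values."""
    name: str
    mean: float
    std: float

    @classmethod
    def from_measurements(cls, name: str, measurements: List[float]) -> "BenchmarkStatistics":
        # Each entry in `measurements` comes from one TimingResult, i.e. one benchmark iteration.
        return cls(
            name=name,
            mean=statistics.mean(measurements),
            std=statistics.stdev(measurements) if len(measurements) > 1 else 0.0,
        )
```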
benchmark_v2/benchmark_framework.py
Outdated
def __init__(self, sample_interval: float = 0.1, logger: logging.Logger = None):
    self.sample_interval = sample_interval
    self.logger = logger or logging.getLogger(__name__)
    self.monitoring = False
Is using an `Event` not preferable here?
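A minimal sketch of the `Event`-based variant, assuming the monitor samples from a background thread (`sample_interval` comes from the diff above; the class and method names are assumed):

```python
import threading

class GPUMonitor:
    def __init__(self, sample_interval: float = 0.1):
        self.sample_interval = sample_interval
        self._stop_event = threading.Event()

    def _monitor_loop(self):
        # Event.wait doubles as the sleep between samples and exits promptly once stop() is called.
        while not self._stop_event.wait(self.sample_interval):
            ...  # collect a GPU utilization / memory sample here

    def start(self):
        self._stop_event.clear()
        self._thread = threading.Thread(target=self._monitor_loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()
```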
benchmark_v2/benchmark_framework.py
Outdated
    return self._default_prompt

@property
def model_type(self) -> str:
What's the purpose of `model_type`?
Leftover from another iteration, removed it. :)
benchmark_v2/benchmark_framework.py
Outdated
if ttft_measurements:
    ttft_stats = BenchmarkStatistics.from_measurements("time_to_first_token", ttft_measurements)
    scenario_results["measurements"]["time_to_first_token"] = asdict(ttft_stats)
scenario_results["measurements"]["time_to_first_token"] = asdict(ttft_stats) | |
scenario_results["measurements"]["time_to_first_token_seconds"] = asdict(ttft_stats) |
benchmark_v2/benchmark_framework.py
Outdated
if tokens_per_sec_measurements:
    self.logger.info(f"Throughput: {tps_stats.mean:.2f}±{tps_stats.std:.2f} tokens/sec (mean±std)")
if tpot_measurements:
    self.logger.info(f"TPOT: {tpot_stats.mean:.4f}±{tpot_stats.std:.4f}s/token (mean±std)")
Is TPOT clear/widely used?
logger.info(f"Created {len(scenarios)} benchmark scenarios") | ||
|
||
# Create runner and execute benchmarks | ||
runner = BenchmarkRunner(logger, output_dir) |
Will there be cases where we run multiple `Benchmark`s with a single `BenchmarkRunner`?
`BenchmarkRunner` will always run the benchmark for a single model, but all the scenarios for that model.
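So usage would look roughly like this (a hedged sketch; only `BenchmarkRunner(logger, output_dir)` appears in the snippet above, the `run` method name is hypothetical):

```python
# One runner per model; it executes every scenario defined for that model.
runner = BenchmarkRunner(logger, output_dir)
for scenario in scenarios:
    scenario_results = runner.run(scenario)  # hypothetical method name
```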
benchmark_v2/benchmark_framework.py
Outdated
import torch


class WithGPU(TypedDict):
I didn't necessarily mean to take my suggestion literally, it was the essence of it that was important 😄
`class GPUMetrics(TypedDict):` may be better than `class WithGPU(TypedDict):` for this one.
GPUMetrics sounds a bit better, yes. Done.
gpu_monitoring_status: str


class NoGPU(TypedDict):
That one I don't know, this should be good enough
Kept this one, seems to capture the essence well. :D
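For context, the two result shapes discussed in this thread could look roughly like this (only `gpu_monitoring_status` appears in the diff; the other fields are assumptions):

```python
from typing import TypedDict

class GPUMetrics(TypedDict):
    gpu_utilization_mean: float      # assumed field
    gpu_memory_used_mb_mean: float   # assumed field
    gpu_monitoring_status: str

class NoGPU(TypedDict):
    gpu_monitoring_status: str       # e.g. "disabled" when no GPU is present (assumed value)
```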
lgtm overall!
Do you have the space ready? 👀
Thank you! The space is coming with tomorrow's PR and GH Actions setup, will post it here as well :)
What does this PR do?
This PR is another iteration of reworking the benchmarking flow in Transformers. The goal is to have a flow similar to the one for Diffusers: daily reports to HF Datasets and visualization in Gradio Spaces.
This PR focuses on the framework fundamentals; export to Datasets, GH Actions, and support for more models will come in follow-ups.
From the wishlist, the first iteration includes the following
I put everything into a _v2 folder so we can keep the existing framework intact until this stabilizes a bit.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.