
Conversation

ahadnagy (Contributor)

What does this PR do?

This PR is another iteration of reworking the benchmarking flow in Transformers. The goal is to have a flow similar to the one for Diffusers: daily reports to HF Datasets, and visualization in Gradio Spaces.

This PR focuses on the framework fundamentals; export to Datasets, GH Actions, and more model support will come in follow-ups.

From the wishlist, the first iteration includes the following:

  • JSON output with all the scenarios as an array
  • support for different attention packages and SDPA backends
  • compiled, kernelized scenarios
  • HW info and utilization collection
  • abstractions for making the ModelBenchmark code leaner and more standardized

I put everything into a _v2 folder so we can keep the existing framework intact until this stabilizes a bit.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@LysandreJik LysandreJik requested a review from McPatate August 27, 2025 12:18
@Rocketknight1 (Member)

cc @ydshieh maybe? not sure who to ping for daily benchmark runs

@ydshieh (Collaborator) commented Aug 28, 2025

@McPatate would be a better person :-)

@McPatate McPatate (Member) left a comment

Did a first pass, will continue tomorrow.
Awesome work! 🔥


# Run with custom parameters
python run_benchmarks.py \
--warmup-iterations 5 \
Member:

Nice to have this as a flag!

"measurement_iterations": 5
}
},
"measurements": {
Member:

Would it make sense to have individual events (or aggregates over small time buckets) so the data can be plotted, on top of having variance/means?

Member:

Maybe in a later version anyway

Contributor Author:

I'm all for doing this in a later version. This is fine as a first step, but we can definitely stat nerd the thing.
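
For reference, a minimal sketch of what that could look like in the output, with raw per-iteration samples stored next to the aggregates (the "samples" field is hypothetical, not part of this PR's schema):

scenario_results["measurements"]["latency_seconds"] = {
    "mean": 0.42,  # aggregate stats, as today
    "std": 0.03,
    "samples": [0.40, 0.41, 0.45, 0.43, 0.41],  # raw per-iteration values for plotting
}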

format='[%(levelname)s - %(asctime)s] %(name)s: %(message)s',
handlers=[
logging.StreamHandler(sys.stdout),
logging.FileHandler(f'benchmark_run_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log')
Member:

Not sure we really need a file logger, maybe enable this with a cmd line arg?

Contributor Author:

It proved useful during development; I'll put it behind a cmd flag, disabled by default.
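
A minimal sketch of that, assuming an argparse-style flag (the flag name --log-to-file is hypothetical):

import argparse
import logging
import sys
from datetime import datetime

parser = argparse.ArgumentParser()
parser.add_argument("--log-to-file", action="store_true")  # hypothetical flag name
args = parser.parse_args()

handlers = [logging.StreamHandler(sys.stdout)]
if args.log_to_file:
    # Only create the log file when explicitly requested.
    handlers.append(logging.FileHandler(f'benchmark_run_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'))
logging.basicConfig(format="[%(levelname)s - %(asctime)s] %(name)s: %(message)s", handlers=handlers)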

logger: logging.Logger
) -> str:
"""Generate a summary report of all benchmark runs."""
summary_file = os.path.join(output_dir, "benchmark_summary.json")
Member:

Perhaps we should add a timestamp to the file name so multiple runs won't overwrite the existing content.
Also, not sure a file is needed, stdout is ok imo

Contributor Author:

On second thought, I'm not even sure this file is useful.

Contributor Author:

I kept it at the end with a timestamp.
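
For illustration, a sketch of the timestamped variant (output_dir stands in for the argument passed to the summary function above):

import os
from datetime import datetime

output_dir = "benchmark_results"  # placeholder; comes from the function argument above
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
summary_file = os.path.join(output_dir, f"benchmark_summary_{timestamp}.json")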


import torch

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
Member:

Is this necessary now that we use xet for most repos?

Contributor Author:

Not really, removing it.

import torch


class CUDATimer:
Member:

Suggested change
class CUDATimer:
class GPUTimer:

perhaps? CUDATimer sounds a bit narrow given cuda is optional

Member:

Or something like ArchAwareTimer idk 😄
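
A rough sketch of what a device-aware timer could look like, using CUDA events when a GPU is available and falling back to wall-clock time otherwise (names and structure are illustrative, not the PR's implementation):

import time

import torch


class GPUTimer:
    """CUDA events when a GPU is available, wall clock otherwise."""

    def __enter__(self):
        self._use_cuda = torch.cuda.is_available()
        if self._use_cuda:
            self._start = torch.cuda.Event(enable_timing=True)
            self._end = torch.cuda.Event(enable_timing=True)
            self._start.record()
        else:
            self._t0 = time.perf_counter()
        return self

    def __exit__(self, *exc):
        if self._use_cuda:
            self._end.record()
            torch.cuda.synchronize()  # wait for the recorded events to complete
            self.elapsed_seconds = self._start.elapsed_time(self._end) / 1000.0  # ms -> s
        else:
            self.elapsed_seconds = time.perf_counter() - self._t0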

Comment on lines 185 to 186
time_to_first_token: Optional[float] = None
latency: float = 0.0
Member:

I think it could be neat to have the unit in the name like so:

Suggested change
time_to_first_token: Optional[float] = None
latency: float = 0.0
time_to_first_token_seconds: Optional[float] = None
latency_seconds: float = 0.0

for clarity, like below

Member:

or at least a doc string for each parameter

time_to_first_token: Optional[float] = None
latency: float = 0.0
tokens_per_second: Optional[float] = None
time_per_output_token_seconds: Optional[float] = None
Member:

Isn't this the same as latency?

Member:

OK, just googled what everything is. I think ITL (inter-token latency) is clearer than TPOT, but from what I can see, TPOT is more widely used.
Do you think it's ok for us to use ITL instead?

Contributor Author:

I'm fine with ITL. Personally, I'm a tok/sec guy, but we also have that. :)
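
For reference, the relationship between the metrics being discussed, as a sketch (the helper name is hypothetical):

def inter_token_latency_seconds(latency_seconds, ttft_seconds, new_tokens):
    # ITL/TPOT: time spent after the first token, averaged over the remaining tokens;
    # tokens/sec for the decode phase is roughly the inverse of this value.
    return (latency_seconds - ttft_seconds) / (new_tokens - 1)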



@dataclass
class TimingResult:
Member:

Is this for one event or is this storing averages?

Contributor Author:

This is for storing the results of a complete benchmark scenario. This is the data that gets serialized into the JSON output. (and this is where we could add the time-series data later)

Member:

In that case perhaps we should add a _mean suffix to the field names of this class? I assume the values are means.

Contributor Author:

Pardon, late-evening brainfart. TimingResult represents the data point of a single benchmark iteration, while BenchmarkStatistics holds the derived stats.
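
To spell out the split described above, a hedged sketch (field names are illustrative; from_measurements mirrors its use elsewhere in this PR):

import statistics
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimingResult:
    # One data point: a single benchmark iteration.
    time_to_first_token_seconds: Optional[float]
    latency_seconds: float


@dataclass
class BenchmarkStatistics:
    # Derived stats over all iterations of a scenario.
    name: str
    mean: float
    std: float

    @classmethod
    def from_measurements(cls, name, values):
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        return cls(name=name, mean=statistics.mean(values), std=std)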

def __init__(self, sample_interval: float = 0.1, logger: logging.Logger = None):
self.sample_interval = sample_interval
self.logger = logger or logging.getLogger(__name__)
self.monitoring = False
Member:

Is using an Event not preferable here?
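
For context, a minimal sketch of the Event-based alternative being suggested (class and method names are illustrative):

import logging
import threading


class HardwareMonitor:  # hypothetical name
    def __init__(self, sample_interval: float = 0.1, logger: logging.Logger = None):
        self.sample_interval = sample_interval
        self.logger = logger or logging.getLogger(__name__)
        self._stop = threading.Event()  # replaces the plain `monitoring` bool

    def _sample_loop(self):
        while not self._stop.is_set():
            # ... collect a utilization sample here ...
            self._stop.wait(self.sample_interval)  # interruptible sleep

    def stop(self):
        self._stop.set()  # thread-safe, and wakes the sampling loop immediately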

return self._default_prompt

@property
def model_type(self) -> str:
Member:

What's the purpose of model_type?

Contributor Author:

Leftover from another iteration, removed it. :)


if ttft_measurements:
ttft_stats = BenchmarkStatistics.from_measurements("time_to_first_token", ttft_measurements)
scenario_results["measurements"]["time_to_first_token"] = asdict(ttft_stats)
Member:

Suggested change
scenario_results["measurements"]["time_to_first_token"] = asdict(ttft_stats)
scenario_results["measurements"]["time_to_first_token_seconds"] = asdict(ttft_stats)

if tokens_per_sec_measurements:
self.logger.info(f"Throughput: {tps_stats.mean:.2f}±{tps_stats.std:.2f} tokens/sec (mean±std)")
if tpot_measurements:
self.logger.info(f"TPOT: {tpot_stats.mean:.4f}±{tpot_stats.std:.4f}s/token (mean±std)")
Member:

Is TPOT clear/widely used?

logger.info(f"Created {len(scenarios)} benchmark scenarios")

# Create runner and execute benchmarks
runner = BenchmarkRunner(logger, output_dir)
Member:

Will there be cases where we run multiple Benchmarks with a single BenchmarkRunner?

@ahadnagy (Contributor Author) commented Sep 2, 2025:

BenchmarkRunner will always run the benchmark for a single model, but all the scenarios for that model.
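
In other words (a sketch; run_scenario is a hypothetical method name):

# One runner per model, iterating over all of that model's scenarios.
runner = BenchmarkRunner(logger, output_dir)
all_results = [runner.run_scenario(scenario) for scenario in scenarios]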

@ahadnagy ahadnagy requested a review from McPatate September 2, 2025 18:55
import torch


class WithGPU(TypedDict):
Member:

I didn't necessarily mean to take my suggestion literally, it was the essence of it that was important 😄

Suggested change
class WithGPU(TypedDict):
class GPUMetrics(TypedDict):

may be better for this one

Contributor Author:

GPUMetrics sounds a bit better, yes. Done.

gpu_monitoring_status: str


class NoGPU(TypedDict):
Member:

That one I don't know, this should be good enough

Contributor Author:

Kept this one, seems to capture the essence well. :D
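
Taken together, a sketch of how the two shapes might combine (gpu_utilization_mean is an illustrative field; gpu_monitoring_status appears in the snippet above):

from typing import TypedDict, Union


class GPUMetrics(TypedDict):
    gpu_utilization_mean: float  # illustrative metric field
    gpu_monitoring_status: str


class NoGPU(TypedDict):
    gpu_monitoring_status: str


GPUMonitoringResult = Union[GPUMetrics, NoGPU]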




@McPatate McPatate (Member) left a comment

lgtm overall!

Do you have the space ready? 👀

@ahadnagy (Contributor Author) commented Sep 3, 2025

Thank you! The space is coming with tomorrow's PR and GH Actions setup; will post it here as well :)

@ahadnagy ahadnagy merged commit f22ec7f into huggingface:main Sep 3, 2025
14 checks passed
@ahadnagy ahadnagy mentioned this pull request Sep 5, 2025