1 change: 1 addition & 0 deletions benchmark_v2/.gitignore
@@ -0,0 +1 @@
benchmark_results/
98 changes: 98 additions & 0 deletions benchmark_v2/README.md
@@ -0,0 +1,98 @@
# Benchmarking v2

A comprehensive benchmarking framework for transformer models that supports multiple execution modes (eager, compiled, kernelized), detailed performance metrics collection, and a structured output format.


## Quick Start

### Running All Benchmarks

```bash
# Run all benchmarks with default settings
python run_benchmarks.py

# Specify output directory
python run_benchmarks.py --output-dir my_results

# Run with custom parameters
python run_benchmarks.py \
    --warmup-iterations 5 \
    --measurement-iterations 10 \
    --num-tokens-to-generate 200
```

**Member** (on `--warmup-iterations`): Nice to have this as a flag!

### Running Specific Benchmarks

```bash
# Include only specific benchmarks
python run_benchmarks.py --include llama

# Exclude specific benchmarks
python run_benchmarks.py --exclude old_benchmark
```

## Output Format

Results are saved as JSON files with the following structure:

```json
{
  "model_name": "llama_2_7b",
  "benchmark_scenarios": [
    {
      "scenario_name": "eager_variant",
      "metadata": {
        "timestamp": "2025-01-XX...",
        "commit_id": "abc123...",
        "hardware_info": {
          "gpu_name": "NVIDIA A100",
          "gpu_memory_total": 40960,
          "cpu_count": 64
        },
        "config": {
          "variant": "eager",
          "warmup_iterations": 3,
          "measurement_iterations": 5
        }
      },
      "measurements": {
        "latency": {
          "mean": 2.45,
          "median": 2.43,
          "std": 0.12,
          "min": 2.31,
          "max": 2.67,
          "p95": 2.61,
          "p99": 2.65
        },
        "time_to_first_token": {
          "mean": 0.15,
          "std": 0.02
        },
        "tokens_per_second": {
          "mean": 87.3,
          "unit": "tokens/sec"
        }
      },
      "gpu_metrics": {
        "gpu_utilization_mean": 85.2,
        "gpu_memory_used_mean": 12450
      }
    }
  ]
}
```

**Member** (on `measurements`): Would it make sense to have individual events (or aggregates over small time buckets) to be able to plot the data on top of having variance / means?

**Member:** Maybe in later version anyway.

**Contributor Author:** I'm all for doing this in a later version. This is fine as a first step, but we can definitely stat nerd the thing.

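To consume these files programmatically, something along the lines of the sketch below should work. It assumes the default `benchmark_results/` output directory (the one ignored by the `.gitignore` above) and the field names shown in the example; `summarize_results` is a hypothetical helper, not part of the framework.

```python
import json
from pathlib import Path


def summarize_results(results_dir: str = "benchmark_results") -> None:
    """Print a one-line summary per scenario for every result file in results_dir.

    Hypothetical helper: assumes the JSON layout shown in the example above.
    """
    for path in sorted(Path(results_dir).glob("*.json")):
        with open(path) as f:
            report = json.load(f)
        print(f"== {report['model_name']} ({path.name}) ==")
        for scenario in report["benchmark_scenarios"]:
            latency = scenario["measurements"]["latency"]
            tps = scenario["measurements"]["tokens_per_second"]
            print(
                f"  {scenario['scenario_name']}: "
                f"latency {latency['mean']:.2f}s (p95 {latency['p95']:.2f}s), "
                f"{tps['mean']:.1f} {tps['unit']}"
            )


if __name__ == "__main__":
    summarize_results()
```
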
### Debug Mode

```bash
python run_benchmarks.py --log-level DEBUG
```

## Contributing

To add new benchmarks:

1. Create a new file in `benches/`
2. Implement the `ModelBenchmark` interface
3. Add a runner function (`run_<benchmark_name>` or `run_benchmark`)
4. Run `run_benchmarks.py`, which should pick up the new benchmark through its runner function (a skeleton is sketched below)
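
For orientation, here is a rough skeleton of what a new entry in `benches/` could look like, modeled on `benches/llama.py` further down in this diff. `MyModelBenchmark`, `run_my_model`, and `my-org/my-model` are placeholder names, and the exact set of methods to override is dictated by the `ModelBenchmark` base class rather than by this sketch.

```python
# benches/my_model.py -- illustrative skeleton only; names below are placeholders.
import logging
from typing import Any, Dict, List

from benchmark_framework import BenchmarkRunner, ModelBenchmark


class MyModelBenchmark(ModelBenchmark):
    """Minimal benchmark with a single eager scenario."""

    def get_scenario_configs(self) -> List[Dict[str, Any]]:
        return [
            {"variant": "eager", "compile_mode": None, "use_cache": True, "description": "Eager execution"},
        ]

    def get_default_generation_config(self) -> Dict[str, Any]:
        return {"do_sample": False, "max_new_tokens": None}  # max_new_tokens is set per scenario


def run_my_model(logger: logging.Logger, output_dir: str, **kwargs):
    """Runner entry point following the `run_<benchmark_name>` convention above."""
    benchmark = MyModelBenchmark(logger)
    scenarios = benchmark.create_scenarios(
        model_id=kwargs.get("model_id", "my-org/my-model"),
        warmup_iterations=kwargs.get("warmup_iterations", 3),
        measurement_iterations=kwargs.get("measurement_iterations", 5),
        num_tokens_to_generate=kwargs.get("num_tokens_to_generate", 100),
    )
    runner = BenchmarkRunner(logger, output_dir)
    results = runner.run_benchmark(benchmark, scenarios, commit_id=kwargs.get("commit_id"))
    return runner.save_results("my-model", results) if results else None
```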
1 change: 1 addition & 0 deletions benchmark_v2/benches/__init__.py
@@ -0,0 +1 @@
# Benchmark implementations directory
156 changes: 156 additions & 0 deletions benchmark_v2/benches/llama.py
@@ -0,0 +1,156 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import logging
from typing import Dict, Any, List

from benchmark_framework import ModelBenchmark

import torch

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
**Member:** Is this necessary now that we use xet for most repos?

**Contributor Author:** Not really, removing it.

os.environ["TOKENIZERS_PARALLELISM"] = "1"
torch.set_float32_matmul_precision("high")

class LLaMABenchmark(ModelBenchmark):
"""Simplified LLaMA model benchmark implementation using the ModelBenchmark base class."""

def __init__(self, logger: logging.Logger):
super().__init__(logger)
self._default_prompt = "Why dogs are so cute?" # Custom prompt for LLaMA



def get_scenario_configs(self) -> List[Dict[str, Any]]:
"""
Get LLaMA-specific scenario configurations.

Returns:
List of scenario configuration dictionaries
"""
return [
# Eager variants
{"variant": "eager", "compile_mode": None, "use_cache": True, "description": "Eager execution with cache"},

# Compiled variants
{"variant": "compiled", "compile_mode": "max-autotune", "use_cache": True, "description": "Compiled with max autotune"},

# Kernelized variant (if available)
{"variant": "kernelized", "compile_mode": "max-autotune", "use_cache": True, "description": "Kernelized execution"},
]

def _is_kernelization_available(self) -> bool:
"""Check if kernelization is available for LLaMA."""
try:
from kernels import Mode, kernelize
return True
except ImportError:
self.logger.debug("Kernelization not available: kernels module not found")
return False

def get_default_generation_config(self) -> Dict[str, Any]:
"""Get LLaMA-specific generation configuration."""
return {
"do_sample": False,
"top_p": 1.0,
"temperature": 1.0,
"repetition_penalty": 1.0,
"max_new_tokens": None, # Will be set per scenario
}

def get_model_init_kwargs(self, config) -> Dict[str, Any]:
"""Get LLaMA-specific model initialization kwargs."""
from benchmark_framework import BenchmarkConfig
return {
"torch_dtype": getattr(torch, config.torch_dtype),
"attn_implementation": config.attn_implementation,
"use_cache": True,
}

def get_default_torch_dtype(self) -> str:
"""Get default torch dtype for LLaMA."""
return "float16" # LLaMA works well with float16

def get_default_device(self) -> str:
"""Get default device for LLaMA."""
return "cuda" # LLaMA prefers CUDA


def run_llama(logger, output_dir, **kwargs):
"""
Run LLaMA benchmark with the given configuration.

Args:
logger: Logger instance
output_dir: Output directory for results
**kwargs: Additional configuration options

Returns:
Path to output file if successful
"""
from benchmark_framework import BenchmarkRunner

# Extract parameters with defaults
model_id = kwargs.get('model_id', 'meta-llama/Llama-2-7b-hf')
warmup_iterations = kwargs.get('warmup_iterations', 3)
measurement_iterations = kwargs.get('measurement_iterations', 5)
num_tokens_to_generate = kwargs.get('num_tokens_to_generate', 100)
include_sdpa_variants = kwargs.get('include_sdpa_variants', True)
device = kwargs.get('device', 'cuda')
torch_dtype = kwargs.get('torch_dtype', 'float16')
batch_size = kwargs.get('batch_size', 1)
commit_id = kwargs.get('commit_id', None)

logger.info(f"Starting LLaMA benchmark for model: {model_id}")
logger.info(f"Configuration: warmup={warmup_iterations}, measurement={measurement_iterations}, tokens={num_tokens_to_generate}")

try:
# Create benchmark instance
benchmark = LLaMABenchmark(logger)

# Create scenarios
scenarios = benchmark.create_scenarios(
model_id=model_id,
warmup_iterations=warmup_iterations,
measurement_iterations=measurement_iterations,
num_tokens_to_generate=num_tokens_to_generate,
include_sdpa_variants=include_sdpa_variants,
device=device,
torch_dtype=torch_dtype,
batch_size=batch_size
)

logger.info(f"Created {len(scenarios)} benchmark scenarios")

# Create runner and execute benchmarks
runner = BenchmarkRunner(logger, output_dir)
**Member:** Will there be cases where we run multiple Benchmarks with a single BenchmarkRunner?

**Contributor Author** (@ahadnagy, Sep 2, 2025): BenchmarkRunner will always run the benchmark for a single model, but all the scenarios for that model.

results = runner.run_benchmark(benchmark, scenarios, commit_id=commit_id)

if not results:
logger.warning("No successful benchmark results")
return None

# Save results
model_name = model_id.split('/')[-1] # Extract model name from ID
output_file = runner.save_results(model_name, results)

logger.info(f"LLaMA benchmark completed successfully. Results saved to: {output_file}")
return output_file

except Exception as e:
logger.error(f"LLaMA benchmark failed: {e}")
import traceback
logger.debug(traceback.format_exc())
raise
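
For completeness, a standalone invocation of `run_llama` could look roughly like the sketch below; `run_benchmarks.py` is the intended entry point, so the logging setup, output directory, and parameter values here are illustrative only, and the import assumes the script runs from `benchmark_v2/` so that `benches` is importable.

```python
# Illustrative standalone driver; normally run_benchmarks.py handles this.
import logging

from benches.llama import run_llama

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llama_benchmark")

output_file = run_llama(
    logger,
    output_dir="benchmark_results",
    model_id="meta-llama/Llama-2-7b-hf",
    warmup_iterations=3,
    measurement_iterations=5,
    num_tokens_to_generate=100,
)
print(f"Results written to: {output_file}")
```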