Benchmarking V2: framework impl #40486
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
cc @ydshieh maybe? not sure who to ping for daily benchmark runs
@McPatate would be a better person :-)
Did a first pass, will continue tomorrow.
Awesome work! 🔥
# Run with custom parameters
python run_benchmarks.py \
    --warmup-iterations 5 \
Nice to have this as a flag!
"measurement_iterations": 5 | ||
} | ||
}, | ||
"measurements": { |
Would it make sense to have individual events (or aggregates over small time buckets) to be able to plot the data on top of having variance / means?
Maybe in a later version anyway.
I'm all for doing this in a later version. This is fine as a first step, but we can definitely stat nerd the thing.
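For reference, one possible shape for that later version (a sketch only; the keys and values below are purely illustrative, not what this PR emits):

```python
# Hypothetical shape: per-iteration samples stored next to the aggregate stats,
# so downstream tooling could plot distributions instead of only mean/std.
measurements = {
    "time_to_first_token_seconds": {
        "mean": 0.142,
        "std": 0.011,
        "samples": [0.151, 0.139, 0.138, 0.144, 0.140],  # one entry per measurement iteration
    }
}
```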
benchmark_v2/run_benchmarks.py
Outdated
format='[%(levelname)s - %(asctime)s] %(name)s: %(message)s',
handlers=[
    logging.StreamHandler(sys.stdout),
    logging.FileHandler(f'benchmark_run_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log')
Not sure we really need a file logger, maybe enable this with a cmd line arg?
It proved useful for development; I'll put it behind a cmd flag, disabled by default.
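A minimal sketch of what that could look like (the `--log-to-file` flag name is an assumption, not the final interface):

```python
import argparse
import logging
import sys
from datetime import datetime

parser = argparse.ArgumentParser()
parser.add_argument("--log-to-file", action="store_true", help="Also write logs to a timestamped file")
args = parser.parse_args()

# Always log to stdout; only add the file handler when explicitly requested.
handlers = [logging.StreamHandler(sys.stdout)]
if args.log_to_file:
    handlers.append(
        logging.FileHandler(f'benchmark_run_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log')
    )
logging.basicConfig(
    format="[%(levelname)s - %(asctime)s] %(name)s: %(message)s",
    handlers=handlers,
)
```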
benchmark_v2/run_benchmarks.py
Outdated
    logger: logging.Logger
) -> str:
    """Generate a summary report of all benchmark runs."""
    summary_file = os.path.join(output_dir, "benchmark_summary.json")
Perhaps we should add a timestamp to the file name so multiple runs won't overwrite the existing content.
Also, not sure a file is needed, stdout is ok imo
On second thought, I'm not even sure this file is useful.
I kept it at the end with a timestamp.
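A minimal sketch of the timestamped variant (the file name pattern and `output_dir` value are assumptions):

```python
import os
from datetime import datetime

output_dir = "benchmark_results"  # assumed location, not taken from the PR
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
# Each run gets its own summary file, so repeated runs don't overwrite earlier results.
summary_file = os.path.join(output_dir, f"benchmark_summary_{timestamp}.json")
```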
import torch

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
Is this necessary now that we use xet for most repos?
Not really, removing it.
benchmark_v2/benchmark_framework.py
Outdated
import torch


class CUDATimer:
`class GPUTimer:` instead of `class CUDATimer:`, perhaps? `CUDATimer` sounds a bit narrow given CUDA is optional.
Or something like `ArchAwareTimer`, idk 😄
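For what it's worth, a device-aware timer along those lines could be sketched like this (the fallback behavior and attribute names are assumptions, not the PR's implementation):

```python
import time
import torch

class GPUTimer:
    """Times a code region with CUDA events when a GPU is available, perf_counter otherwise."""

    def __enter__(self):
        if torch.cuda.is_available():
            self._start_event = torch.cuda.Event(enable_timing=True)
            self._end_event = torch.cuda.Event(enable_timing=True)
            self._start_event.record()
        else:
            self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        if torch.cuda.is_available():
            self._end_event.record()
            torch.cuda.synchronize()
            # CUDA events report elapsed time in milliseconds.
            self.elapsed_seconds = self._start_event.elapsed_time(self._end_event) / 1000.0
        else:
            self.elapsed_seconds = time.perf_counter() - self._start
```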
benchmark_v2/benchmark_framework.py
Outdated
time_to_first_token: Optional[float] = None
latency: float = 0.0
I think it could be neat to have the unit in the name like so:
time_to_first_token: Optional[float] = None  ->  time_to_first_token_seconds: Optional[float] = None
latency: float = 0.0  ->  latency_seconds: float = 0.0
for clarity, like below
or at least a doc string for each parameter
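A hedged sketch of what that could look like, with units in the names and a comment per field (the field set is illustrative, not the PR's final one):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TimingResult:
    # Seconds from the start of generation until the first token is emitted.
    time_to_first_token_seconds: Optional[float] = None
    # End-to-end latency of the generation call, in seconds.
    latency_seconds: float = 0.0
    # Decoding throughput over the whole call, in tokens per second.
    tokens_per_second: Optional[float] = None
```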
time_to_first_token: Optional[float] = None
latency: float = 0.0
tokens_per_second: Optional[float] = None
time_per_output_token_seconds: Optional[float] = None
Isn't this the same as latency?
Ok just googled what everything is, I think ITL (inter-token latency) is clearer than TPOT, but from what I can see it looks like TPOT is more widely used.
Do you think it's ok for us to use ITL instead?
I'm fine with ITL. Personally, I'm a tok/sec guy, but we also have that. :)
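For reference, this is roughly how the three metrics discussed here relate under the usual definitions (a sketch; the PR's exact formulas may differ):

```python
from typing import Optional, Tuple

def derive_metrics(
    latency_seconds: float, time_to_first_token_seconds: float, num_generated_tokens: int
) -> Tuple[float, Optional[float]]:
    # Throughput over the whole generation call.
    tokens_per_second = num_generated_tokens / latency_seconds
    # Inter-token latency (ITL, a.k.a. TPOT): decode time spread over the tokens after the first.
    inter_token_latency_seconds = (
        (latency_seconds - time_to_first_token_seconds) / (num_generated_tokens - 1)
        if num_generated_tokens > 1
        else None
    )
    return tokens_per_second, inter_token_latency_seconds
```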
@dataclass
class TimingResult:
Is this for one event or is this storing averages?
This is for storing the results of a complete benchmark scenario. This is the data that gets serialized into the JSON output. (and this is where we could add the time-series data later)
In that case perhaps we should add _mean[_] to field names of this class? I assume the values are means.
Pardon, late evening brainfart. `TimingResult` represents the data point of a benchmark iteration, while `BenchmarkStatistics` holds the derived stats.
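In other words, roughly this split (a hedged sketch; only `from_measurements`, `mean` and `std` appear in the snippets above, the rest is assumed):

```python
import statistics
from dataclasses import dataclass
from typing import List

@dataclass
class BenchmarkStatistics:
    """Aggregate stats derived from the per-iteration TimingResult values."""
    name: str
    mean: float
    std: float

    @classmethod
    def from_measurements(cls, name: str, measurements: List[float]) -> "BenchmarkStatistics":
        # Each entry in `measurements` comes from one TimingResult, i.e. one benchmark iteration.
        return cls(
            name=name,
            mean=statistics.mean(measurements),
            std=statistics.stdev(measurements) if len(measurements) > 1 else 0.0,
        )
```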
benchmark_v2/benchmark_framework.py
Outdated
def __init__(self, sample_interval: float = 0.1, logger: logging.Logger = None):
    self.sample_interval = sample_interval
    self.logger = logger or logging.getLogger(__name__)
    self.monitoring = False
Is using an `Event` not preferable here?
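A minimal sketch of the `Event`-based variant, assuming the monitor samples from a background thread (`sample_interval` comes from the diff above; the class and method names are assumed):

```python
import threading

class GPUMonitor:
    def __init__(self, sample_interval: float = 0.1):
        self.sample_interval = sample_interval
        self._stop_event = threading.Event()

    def _monitor_loop(self):
        # Event.wait doubles as the sleep between samples and exits promptly once stop() is called.
        while not self._stop_event.wait(self.sample_interval):
            ...  # collect a GPU utilization / memory sample here

    def start(self):
        self._stop_event.clear()
        self._thread = threading.Thread(target=self._monitor_loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()
```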
benchmark_v2/benchmark_framework.py
Outdated
    return self._default_prompt

@property
def model_type(self) -> str:
What's the purpose of `model_type`?
Leftover from another iteration, removed it. :)
benchmark_v2/benchmark_framework.py
Outdated
if ttft_measurements:
    ttft_stats = BenchmarkStatistics.from_measurements("time_to_first_token", ttft_measurements)
    scenario_results["measurements"]["time_to_first_token"] = asdict(ttft_stats)
scenario_results["measurements"]["time_to_first_token"] = asdict(ttft_stats) | |
scenario_results["measurements"]["time_to_first_token_seconds"] = asdict(ttft_stats) |
benchmark_v2/benchmark_framework.py
Outdated
if tokens_per_sec_measurements:
    self.logger.info(f"Throughput: {tps_stats.mean:.2f}±{tps_stats.std:.2f} tokens/sec (mean±std)")
if tpot_measurements:
    self.logger.info(f"TPOT: {tpot_stats.mean:.4f}±{tpot_stats.std:.4f}s/token (mean±std)")
Is TPOT clear/widely used?
logger.info(f"Created {len(scenarios)} benchmark scenarios") | ||
|
||
# Create runner and execute benchmarks | ||
runner = BenchmarkRunner(logger, output_dir) |
Will there be cases where we run multiple `Benchmark`s with a single `BenchmarkRunner`?
`BenchmarkRunner` will always run the benchmark for a single model, but all the scenarios for that model.
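So usage would look roughly like this (a hedged sketch; only `BenchmarkRunner(logger, output_dir)` appears in the snippet above, the `run` method name is hypothetical):

```python
# One runner per model; it executes every scenario defined for that model.
runner = BenchmarkRunner(logger, output_dir)
for scenario in scenarios:
    scenario_results = runner.run(scenario)  # hypothetical method name
```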
benchmark_v2/benchmark_framework.py
Outdated
import torch


class WithGPU(TypedDict):
I didn't necessarily mean to take my suggestion literally, it was the essence of it that was important 😄
`class GPUMetrics(TypedDict):` may be better than `class WithGPU(TypedDict):` for this one.
GPUMetrics sounds a bit better, yes. Done.
gpu_monitoring_status: str


class NoGPU(TypedDict):
That one I don't know, this should be good enough
Kept this one, seems to capture the essence well. :D
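For context, the two result shapes discussed in this thread could look roughly like this (only `gpu_monitoring_status` appears in the diff; the other fields are assumptions):

```python
from typing import TypedDict

class GPUMetrics(TypedDict):
    gpu_utilization_mean: float      # assumed field
    gpu_memory_used_mb_mean: float   # assumed field
    gpu_monitoring_status: str

class NoGPU(TypedDict):
    gpu_monitoring_status: str       # e.g. "disabled" when no GPU is present (assumed value)
```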
lgtm overall!
Do you have the space ready? 👀
Thank you! The space is coming with tomorrow's PR and GH Actions setup, will post it here as well :)
What does this PR do?
This PR is another iteration of reworking the benchmarking flow in Transformers. The goal is to have a flow similar to the one for Diffusers: daily reports to HF Datasets and visualization in Gradio Spaces.
This PR focuses on the framework fundamentals; export to Datasets, GH Actions, and support for more models will come in follow-ups.
From the wishlist, the first iteration includes the following
I put everything into a _v2 folder so we can keep the existing framework intact until this stabilizes a bit.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.