Description
Today I'm monitoring the infra health only via the HUD by filtering jobs with "-perf": https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=-perf&mergeLF=true
I'm wondering if there is a better way to monitor the health, with more detailed metrics. It could be something like https://hud.pytorch.org/metrics, where I can see the historical runs and success rate of the benchmark jobs, split by nightly vs. on-demand runs, along with frequently failing jobs, hotspot devices, etc.
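To make the ask more concrete, here is a rough sketch of the aggregations I have in mind. It assumes the job runs are already available as simple records; the `JobRun` fields, the helper names, and the sample values are all hypothetical and do not reflect any existing HUD schema or API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRun:
    # Hypothetical record shape, not a real HUD schema.
    name: str        # job name, e.g. something ending in "-perf"
    trigger: str     # "nightly" or "on-demand"
    device: str      # device the benchmark ran on
    succeeded: bool  # True if the job finished successfully


def success_rate_by_trigger(runs):
    """Return {trigger: fraction of runs with that trigger that succeeded}."""
    totals, passed = Counter(), Counter()
    for run in runs:
        totals[run.trigger] += 1
        passed[run.trigger] += run.succeeded  # bool counts as 0 or 1
    return {trigger: passed[trigger] / totals[trigger] for trigger in totals}


def top_failures(runs, key, n=3):
    """Return the n most common values of attribute `key` among failed runs."""
    failures = Counter(getattr(run, key) for run in runs if not run.succeeded)
    return failures.most_common(n)
```

With records like these, `success_rate_by_trigger(runs)` would cover the nightly vs. on-demand split, `top_failures(runs, "name")` the frequently failing jobs, and `top_failures(runs, "device")` the hotspot devices.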