diff --git a/docs/comparison-analysis.md b/docs/comparison-analysis.md
new file mode 100644
index 000000000..528e172c9
--- /dev/null
+++ b/docs/comparison-analysis.md
@@ -0,0 +1,65 @@
+# Comparison Analysis
+
+The following is a detailed explanation of the process undertaken to automate the analysis of test results for two artifacts of interest (artifacts A and B).
+
+This analysis can also be done by hand, using the [comparison page](https://perf.rust-lang.org/compare.html) and entering the two artifacts of interest in the form at the top.
+
+## The goal
+
+The goal of the analysis is to determine whether artifact B represents a performance improvement, regression, or mixed result relative to artifact A. Typically artifact B will be based on artifact A with a certain pull request's changes applied, but this is not required.
+
+Performance analysis is typically used to determine whether a pull request has introduced performance regressions or improvements.
+
+## What is being compared?
+
+At the core of comparison analysis is the collection of test results for the two artifacts being compared. For each test case, the statistics for the two artifacts are compared and a relative change percentage is obtained using the following formula:
+
+```
+100 * ((statisticForArtifactB - statisticForArtifactA) / statisticForArtifactA)
+```
+
+## High-level analysis description
+
+Analysis of the changes is performed in order to determine whether artifact B represents a performance change over artifact A. At a high level, the analysis takes the following form:
+
+How many _significant_ test results indicate performance changes, and what is the magnitude of those changes (i.e., how large are the changes regardless of the direction of change)?
+
+* If there are improvements and regressions with magnitude of medium or above, then the comparison is mixed.
+* If there are only improvements or only regressions, then the comparison is labeled with that kind.
+* If one kind of change is of medium or above magnitude (and the other kind is not), then the comparison is mixed if 15% or more of the total changes are of the other (low magnitude) kind. For example:
+  * given 20 regressions (with at least 1 of medium magnitude) and all improvements of low magnitude, the comparison is only mixed if there are 4 or more improvements.
+  * given 5 regressions (with at least 1 of medium magnitude) and all improvements of low magnitude, the comparison is only mixed if there is at least 1 improvement.
+* If both kinds of changes are all of low magnitude, then the comparison is mixed unless 90% or more of the total changes are of one kind. For example:
+  * given 20 changes of different kinds all of low magnitude, the result is mixed unless only 2 or fewer of the changes are of one kind.
+  * given 5 changes of different kinds all of low magnitude, the result is always mixed.
+
+Whether we actually _report_ an analysis or not depends on the context and how _confident_ we are in the summary of the results (see below for an explanation of how confidence is derived). For example, in pull request performance "try" runs, we report a performance change if we are confident that the results are at least "probably relevant", while for the triage report, we only report if we are confident the results are "definitely relevant".
+
+### What makes a test result significant?
+
+A test result is significant if the relative change percentage meets some threshold.
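+For illustration, the significance check described in this section can be sketched in Rust roughly as follows. This is a minimal sketch rather than the site's actual implementation (the real logic is `TestResultComparison::is_significant` in `site/src/comparison.rs`); the function names are illustrative, the thresholds are taken as parameters, and the example values in `main` are simply the ones quoted below:
+
+```
+// Illustrative sketch only; see `TestResultComparison::is_significant` in
+// site/src/comparison.rs for the real logic.
+
+/// The relative change from artifact A to artifact B, as a percentage.
+fn relative_change_percent(a: f64, b: f64) -> f64 {
+    100.0 * (b - a) / a
+}
+
+/// Whether a test result counts as significant, given thresholds (in percent)
+/// for non-dodgy and dodgy test cases (see below for what "dodgy" means).
+fn is_significant(a: f64, b: f64, is_dodgy: bool, threshold: f64, dodgy_threshold: f64) -> bool {
+    let threshold = if is_dodgy { dodgy_threshold } else { threshold };
+    relative_change_percent(a, b).abs() > threshold
+}
+
+fn main() {
+    // A 0.5% regression (1000 -> 1005) clears a 0.1% threshold for a non-dodgy test case...
+    assert!(is_significant(1000.0, 1005.0, false, 0.1, 1.0));
+    // ...but not the stricter 1% threshold applied to a dodgy test case.
+    assert!(!is_significant(1000.0, 1005.0, true, 0.1, 1.0));
+}
+```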
+What the threshold is depends on whether the test case is "dodgy" or not (see below for an examination of "dodginess"). For dodgy test cases, the threshold is set at 1%. For non-dodgy test cases, the threshold is set to 0.1%.
+
+### What makes a test case "dodgy"?
+
+A test case is "dodgy" if it shows signs of either being noisy or highly variable.
+
+To determine noise and high variability, the previous 100 test results for the test case in question are examined by calculating relative delta changes between adjacent test results. This is done with the following formula (where `testResult1` is the test result immediately preceding `testResult2`):
+
+```
+(testResult2 - testResult1) / testResult1
+```
+
+Any relative delta change that is above a threshold (currently 0.1) is considered "significant" for the purposes of dodginess detection.
+
+A highly variable test case is one where more than a certain percentage (currently 5%) of relative delta changes are significant. The reasoning is that test cases should only display significant relative delta changes a small percentage of the time.
+
+A noisy test case is one where the average of all the non-significant relative delta changes is still above some threshold (0.001). The reasoning is that non-significant changes should, on average, be very close to 0; if they are not close to zero, then the test case is noisy.
+
+### How is confidence in whether a test analysis is "relevant" determined?
+
+The confidence in whether a test analysis is relevant depends on the number of significant test results and their magnitude (how large a change is regardless of the direction of the change).
+
+The actual algorithm for determining confidence may change, but in general the following rules apply:
+* Definitely relevant: any number of very large changes, a small number of large and/or medium changes, or a large number of small changes.
+* Probably relevant: any number of large changes, more than 1 medium change, or a smaller but still substantial number of small changes.
+* Maybe relevant: anything that does not fit into the above two categories ends up in this category.
diff --git a/site/src/api.rs b/site/src/api.rs
index eacf9d993..b3d27a556 100644
--- a/site/src/api.rs
+++ b/site/src/api.rs
@@ -138,7 +138,6 @@ pub mod bootstrap {
 }
 
 pub mod comparison {
-    use crate::comparison;
     use collector::Bound;
     use database::Date;
     use serde::{Deserialize, Serialize};
@@ -153,13 +152,12 @@ pub mod comparison {
     #[derive(Debug, Clone, Serialize)]
     pub struct Response {
-        /// The variance data for each benchmark, if any.
-        pub variance: Option<HashMap<String, comparison::BenchmarkVariance>>,
 
         /// The names for the previous artifact before `a`, if any.
         pub prev: Option<String>,
 
-        pub a: ArtifactData,
-        pub b: ArtifactData,
+        pub a: ArtifactDescription,
+        pub b: ArtifactDescription,
+        pub comparisons: Vec<Comparison>,
 
         /// The names for the next artifact after `b`, if any.
         pub next: Option<String>,
@@ -169,15 +167,25 @@ pub mod comparison {
         pub is_contiguous: bool,
     }
 
-    /// A serializable wrapper for `comparison::ArtifactData`.
     #[derive(Debug, Clone, Serialize)]
-    pub struct ArtifactData {
+    pub struct ArtifactDescription {
         pub commit: String,
         pub date: Option<Date>,
         pub pr: Option<u32>,
-        pub data: HashMap<String, Vec<(String, f64)>>,
         pub bootstrap: HashMap<String, u64>,
     }
+
+    /// A serializable wrapper for `comparison::TestResultComparison`.
+ #[derive(Debug, Clone, Serialize)] + pub struct Comparison { + pub benchmark: String, + pub profile: String, + pub scenario: String, + pub is_significant: bool, + pub is_dodgy: bool, + pub historical_statistics: Option>, + pub statistics: (f64, f64), + } } pub mod status { diff --git a/site/src/comparison.rs b/site/src/comparison.rs index 0bf7c8b0c..c5a1b25a4 100644 --- a/site/src/comparison.rs +++ b/site/src/comparison.rs @@ -12,7 +12,7 @@ use collector::Bound; use log::debug; use serde::Serialize; -use std::collections::HashMap; +use std::collections::{HashMap, HashSet}; use std::error::Error; use std::hash::Hash; use std::sync::Arc; @@ -37,6 +37,7 @@ pub async fn handle_triage( let mut report = HashMap::new(); let mut before = start.clone(); + let mut num_comparisons = 0; loop { let comparison = match compare_given_commits( before, @@ -56,6 +57,7 @@ pub async fn handle_triage( break; } }; + num_comparisons += 1; log::info!( "Comparing {} to {}", comparison.b.artifact, @@ -77,7 +79,7 @@ pub async fn handle_triage( } let end = end.unwrap_or(next); - let report = generate_report(&start, &end, report).await; + let report = generate_report(&start, &end, report, num_comparisons).await; Ok(api::triage::Response(report)) } @@ -96,91 +98,182 @@ pub async fn handle_compare( let prev = comparison.prev(&master_commits); let next = comparison.next(&master_commits); let is_contiguous = comparison.is_contiguous(&*conn, &master_commits).await; + let comparisons = comparison + .statistics + .into_iter() + .map(|comparison| api::comparison::Comparison { + benchmark: comparison.benchmark.to_string(), + profile: comparison.profile.to_string(), + scenario: comparison.scenario.to_string(), + is_dodgy: comparison.is_dodgy(), + is_significant: comparison.is_significant(), + historical_statistics: comparison.variance.map(|v| v.data), + statistics: comparison.results, + }) + .collect(); Ok(api::comparison::Response { - variance: comparison.benchmark_variances.map(|b| b.data), prev, a: comparison.a.into(), b: comparison.b.into(), + comparisons, next, is_contiguous, }) } -async fn populate_report(comparison: &Comparison, report: &mut HashMap>) { +async fn populate_report( + comparison: &Comparison, + report: &mut HashMap, Vec>, +) { if let Some(summary) = ComparisonSummary::summarize_comparison(comparison) { - if let Some(direction) = summary.direction() { - let entry = report.entry(direction).or_default(); - - entry.push(summary.write(comparison).await) + let confidence = summary.confidence(); + if confidence.is_atleast_probably_relevant() { + if let Some(direction) = summary.direction() { + let entry = report + .entry(confidence.is_definitely_relevant().then(|| direction)) + .or_default(); + + entry.push(summary.write(comparison).await) + } } } } -pub struct ComparisonSummary<'a> { - hi: Option>, - lo: Option>, +pub struct ComparisonSummary { + /// Significant comparisons ordered by magnitude + comparisons: Vec, } -impl ComparisonSummary<'_> { - pub fn summarize_comparison<'a>(comparison: &'a Comparison) -> Option> { - let mut benchmarks = comparison.get_benchmarks(); +impl ComparisonSummary { + pub fn summarize_comparison(comparison: &Comparison) -> Option { + let mut comparisons = comparison + .get_individual_comparisons() + .filter(|c| c.is_significant()) + .cloned() + .collect::>(); // Skip empty commits, sometimes happens if there's a compiler bug or so. 
- if benchmarks.len() == 0 { + if comparisons.len() == 0 { return None; } - let cmp = |b1: &BenchmarkComparison, b2: &BenchmarkComparison| { - b1.log_change() - .partial_cmp(&b2.log_change()) + let cmp = |b1: &TestResultComparison, b2: &TestResultComparison| { + b2.relative_change() + .abs() + .partial_cmp(&b1.relative_change().abs()) .unwrap_or(std::cmp::Ordering::Equal) }; - let lo = benchmarks - .iter() - .enumerate() - .min_by(|&(_, b1), &(_, b2)| cmp(b1, b2)) - .filter(|(_, c)| c.is_significant() && !c.is_increase()) - .map(|(i, _)| i); - let lo = lo.map(|lo| benchmarks.remove(lo)); - let hi = benchmarks - .iter() - .enumerate() - .max_by(|&(_, b1), &(_, b2)| cmp(b1, b2)) - .filter(|(_, c)| c.is_significant() && c.is_increase()) - .map(|(i, _)| i); - let hi = hi.map(|hi| benchmarks.remove(hi)); + comparisons.sort_by(cmp); - Some(ComparisonSummary { hi, lo }) + Some(ComparisonSummary { comparisons }) } /// The direction of the changes pub fn direction(&self) -> Option { - let d = match (&self.hi, &self.lo) { - (None, None) => return None, - (Some(b), None) => b.direction(), - (None, Some(b)) => b.direction(), - (Some(a), Some(b)) if a.is_increase() == b.is_increase() => a.direction(), - _ => Direction::Mixed, - }; - Some(d) - } - - /// The changes ordered by their signficance (most significant first) - pub fn ordered_changes(&self) -> Vec<&BenchmarkComparison<'_>> { - match (&self.hi, &self.lo) { - (None, None) => Vec::new(), - (Some(b), None) => vec![b], - (None, Some(b)) => vec![b], - (Some(a), Some(b)) - if b.log_change() - .abs() - .partial_cmp(&a.log_change().abs()) - .unwrap_or(std::cmp::Ordering::Equal) - == std::cmp::Ordering::Greater => - { - vec![b, a] + if self.comparisons.len() == 0 { + return None; + } + + let (regressions, improvements): (Vec<&TestResultComparison>, _) = + self.comparisons.iter().partition(|c| c.is_regression()); + + if regressions.len() == 0 { + return Some(Direction::Improvement); + } + + if improvements.len() == 0 { + return Some(Direction::Regression); + } + + let total_num = self.comparisons.len(); + let regressions_ratio = regressions.len() as f64 / total_num as f64; + + let has_medium_and_above_regressions = regressions + .iter() + .any(|c| c.magnitude().is_medium_or_above()); + let has_medium_and_above_improvements = improvements + .iter() + .any(|c| c.magnitude().is_medium_or_above()); + match ( + has_medium_and_above_improvements, + has_medium_and_above_regressions, + ) { + (true, true) => return Some(Direction::Mixed), + (true, false) => { + if regressions_ratio >= 0.15 { + Some(Direction::Mixed) + } else { + Some(Direction::Improvement) + } + } + (false, true) => { + if regressions_ratio < 0.85 { + Some(Direction::Mixed) + } else { + Some(Direction::Regression) + } + } + (false, false) => { + if regressions_ratio >= 0.1 && regressions_ratio <= 0.9 { + Some(Direction::Mixed) + } else if regressions_ratio <= 0.1 { + Some(Direction::Improvement) + } else { + Some(Direction::Regression) + } + } + } + } + + pub fn relevant_changes<'a>(&'a self) -> [Option<&TestResultComparison>; 2] { + match self.direction() { + Some(Direction::Improvement) => [self.largest_improvement(), None], + Some(Direction::Regression) => [self.largest_regression(), None], + Some(Direction::Mixed) => [self.largest_improvement(), self.largest_regression()], + None => [None, None], + } + } + + pub fn largest_improvement(&self) -> Option<&TestResultComparison> { + self.comparisons + .iter() + .filter(|s| !s.is_regression()) + .next() + } + + pub fn largest_regression(&self) -> 
Option<&TestResultComparison> { + self.comparisons.iter().filter(|s| s.is_regression()).next() + } + + pub fn confidence(&self) -> ComparisonConfidence { + let mut num_small_changes = 0; + let mut num_medium_changes = 0; + let mut num_large_changes = 0; + let mut num_very_large_changes = 0; + for c in self.comparisons.iter() { + match c.magnitude() { + Magnitude::Small => num_small_changes += 1, + Magnitude::Medium => num_medium_changes += 1, + Magnitude::Large => num_large_changes += 1, + Magnitude::VeryLarge => num_very_large_changes += 1, + } + } + + match ( + num_medium_changes, + num_large_changes, + num_very_large_changes, + ) { + (_, _, vl) if vl > 0 => ComparisonConfidence::DefinitelyRelevant, + (m, l, _) if m + (l * 2) > 4 => ComparisonConfidence::DefinitelyRelevant, + (m, l, _) if m > 1 || l > 0 => ComparisonConfidence::ProbablyRelevant, + (m, _, _) => { + if (m * 2 + num_small_changes) > 10 { + ComparisonConfidence::ProbablyRelevant + } else { + ComparisonConfidence::MaybeRelevant + } } - (Some(a), Some(b)) => vec![a, b], } } @@ -200,7 +293,7 @@ impl ComparisonSummary<'_> { let end = &comparison.b.artifact; let link = &compare_link(start, end); - for change in self.ordered_changes() { + for change in self.relevant_changes().iter().filter_map(|s| *s) { write!(result, "- ").unwrap(); change.summary_line(&mut result, Some(link)) } @@ -208,6 +301,25 @@ impl ComparisonSummary<'_> { } } +/// The amount of confidence we have that a comparison actually represents a real +/// change in the performance characteristics. +#[derive(Clone, Copy, Debug)] +pub enum ComparisonConfidence { + MaybeRelevant, + ProbablyRelevant, + DefinitelyRelevant, +} + +impl ComparisonConfidence { + pub fn is_definitely_relevant(self) -> bool { + matches!(self, Self::DefinitelyRelevant) + } + + pub fn is_atleast_probably_relevant(self) -> bool { + matches!(self, Self::DefinitelyRelevant | Self::ProbablyRelevant) + } +} + /// Compare two bounds on a given stat /// /// Returns Ok(None) when no data for the end bound is present @@ -250,11 +362,30 @@ async fn compare_given_commits( let mut responses = ctxt.statistic_series(query.clone(), aids).await?; let conn = ctxt.conn().await; + let statistics_for_a = statistics_from_series(&mut responses); + let statistics_for_b = statistics_from_series(&mut responses); + + let variances = BenchmarkVariances::calculate(ctxt, a.clone(), master_commits, stat).await?; + let statistics = statistics_for_a + .into_iter() + .filter_map(|(test_case, a)| { + statistics_for_b + .get(&test_case) + .map(|&b| TestResultComparison { + benchmark: test_case.0, + profile: test_case.1, + scenario: test_case.2, + variance: variances + .as_ref() + .and_then(|v| v.data.get(&test_case).cloned()), + results: (a, b), + }) + }) + .collect(); Ok(Some(Comparison { - benchmark_variances: BenchmarkVariances::calculate(ctxt, a.clone(), master_commits, stat) - .await?, - a: ArtifactData::consume_one(&*conn, a.clone(), &mut responses, master_commits).await, - b: ArtifactData::consume_one(&*conn, b.clone(), &mut responses, master_commits).await, + a: ArtifactDescription::for_artifact(&*conn, a.clone(), master_commits).await, + b: ArtifactDescription::for_artifact(&*conn, b.clone(), master_commits).await, + statistics, })) } @@ -280,39 +411,30 @@ fn previous_commits( prevs } -/// Data associated with a specific artifact +/// Detailed description of a specific artifact #[derive(Debug, Clone)] -pub struct ArtifactData { +pub struct ArtifactDescription { /// The artifact in question pub artifact: ArtifactId, /// 
The pr of the artifact if known pub pr: Option, - /// Benchmark data in the form "$crate-$profile" -> Vec<("$cache", nanoseconds)> - /// - /// * $profile refers to the flavor of compilation like debug, doc, opt(timized), etc. - /// * $cache refers to how much of the compilation must be done and how much is cached - /// (e.g., "incr-unchanged" == compiling with full incremental cache and no code having changed) - pub data: HashMap>, /// Bootstrap data in the form "$crate" -> nanoseconds pub bootstrap: HashMap, } -impl ArtifactData { +type StatisticsMap = HashMap; +type TestCase = (Benchmark, Profile, Scenario); + +impl ArtifactDescription { /// For the given `ArtifactId`, consume the first datapoint in each of the given `SeriesResponse` /// /// It is assumed that the provided `ArtifactId` matches the artifact id of the next data /// point for all of `SeriesResponse`. If this is not true, this function will panic. - async fn consume_one<'a, T>( + async fn for_artifact( conn: &dyn database::Connection, artifact: ArtifactId, - series: &mut [selector::SeriesResponse], master_commits: &[collector::MasterCommit], - ) -> Self - where - T: Iterator)>, - { - let data = data_from_series(series); - + ) -> Self { let bootstrap = conn .get_bootstrap(&[conn.artifact_id(&artifact).await]) .await; @@ -347,41 +469,35 @@ impl ArtifactData { Self { pr, artifact, - data, bootstrap, } } } -fn data_from_series( - series: &mut [selector::SeriesResponse], -) -> HashMap> +fn statistics_from_series(series: &mut [selector::SeriesResponse]) -> StatisticsMap where T: Iterator)>, { - let mut data = HashMap::new(); + let mut stats: StatisticsMap = HashMap::new(); for response in series { let (_, point) = response.series.next().expect("must have element"); - let point = if let Some(pt) = point { - pt + let value = if let Some(v) = point { + v } else { continue; }; - data.entry(format!( - "{}-{}", - response.path.get::().unwrap(), - response.path.get::().unwrap(), - )) - .or_insert_with(Vec::new) - .push((response.path.get::().unwrap().to_string(), point)); + let benchmark = *response.path.get::().unwrap(); + let profile = *response.path.get::().unwrap(); + let scenario = *response.path.get::().unwrap(); + stats.insert((benchmark, profile, scenario), value); } - data + stats } -impl From for api::comparison::ArtifactData { - fn from(data: ArtifactData) -> Self { - api::comparison::ArtifactData { +impl From for api::comparison::ArtifactDescription { + fn from(data: ArtifactDescription) -> Self { + api::comparison::ArtifactDescription { commit: match data.artifact.clone() { ArtifactId::Commit(c) => c.sha, ArtifactId::Tag(t) => t, @@ -392,7 +508,6 @@ impl From for api::comparison::ArtifactData { None }, pr: data.pr, - data: data.data, bootstrap: data.bootstrap, } } @@ -400,12 +515,10 @@ impl From for api::comparison::ArtifactData { // A comparison of two artifacts pub struct Comparison { - /// Data on how variable benchmarks have historically been - /// - /// Is `None` if we cannot determine historical variance - pub benchmark_variances: Option, - pub a: ArtifactData, - pub b: ArtifactData, + pub a: ArtifactDescription, + pub b: ArtifactDescription, + /// Statistics based on test case + pub statistics: HashSet, } impl Comparison { @@ -437,36 +550,16 @@ impl Comparison { next_commit(&self.b.artifact, master_commits).map(|c| c.sha.clone()) } - fn get_benchmarks<'a>(&'a self) -> Vec> { - let mut result = Vec::new(); - for (bench_name, a) in self.a.data.iter() { - if bench_name.ends_with("-doc") { - continue; - } - if let Some(b) 
= self.b.data.get(bench_name) { - for (cache_state, a) in a.iter() { - if let Some(b) = b.iter().find(|(cs, _)| cs == cache_state).map(|(_, b)| b) { - result.push(BenchmarkComparison { - bench_name, - cache_state, - results: (a.clone(), b.clone()), - }) - } - } - } - } - - result + fn get_individual_comparisons(&self) -> impl Iterator { + self.statistics.iter().filter(|b| b.profile != Profile::Doc) } } /// A description of the amount of variance a certain benchmark is historically /// experiencing at a given point in time. pub struct BenchmarkVariances { - /// Variance data on a per benchmark basis - /// Key: $benchmark-$profile-$cache - /// Value: `BenchmarkVariance` - pub data: HashMap, + /// Variance data on a per test case basis + pub data: HashMap<(Benchmark, Profile, Scenario), BenchmarkVariance>, } impl BenchmarkVariances { @@ -495,23 +588,18 @@ impl BenchmarkVariances { .statistic_series(query, previous_commits.clone()) .await?; - let mut variance_data: HashMap = HashMap::new(); + let mut variance_data: HashMap<(Benchmark, Profile, Scenario), BenchmarkVariance> = + HashMap::new(); for _ in previous_commits.iter() { - let series_data = data_from_series(&mut previous_commit_series); - for (bench_and_profile, data) in series_data { - for (cache, val) in data { - variance_data - .entry(format!("{}-{}", bench_and_profile, cache)) - .or_default() - .push(val); - } + for (test_case, stat) in statistics_from_series(&mut previous_commit_series) { + variance_data.entry(test_case).or_default().push(stat); } } if variance_data.len() < Self::MIN_PREVIOUS_COMMITS { return Ok(None); } - for (bench, results) in variance_data.iter_mut() { + for ((bench, _, _), results) in variance_data.iter_mut() { debug!("Calculating variance for: {}", bench); results.calculate_description(); } @@ -532,8 +620,8 @@ impl BenchmarkVariance { const SIGNFICANT_DELTA_THRESHOLD: f64 = 0.01; /// The percentage of significant changes that we consider too high const SIGNFICANT_CHANGE_THRESHOLD: f64 = 5.0; - /// The percentage of change that constitutes noisy data - const NOISE_THRESHOLD: f64 = 0.1; + /// The ratio of change that constitutes noisy data + const NOISE_THRESHOLD: f64 = 0.001; fn push(&mut self, value: f64) { self.data.push(value); @@ -567,19 +655,26 @@ impl BenchmarkVariance { ); if percent_significant_changes > Self::SIGNFICANT_CHANGE_THRESHOLD { - self.description = - BenchmarkVarianceDescription::HighlyVariable(percent_significant_changes); + self.description = BenchmarkVarianceDescription::HighlyVariable; return; } let delta_mean = non_significant.iter().map(|(&d, _)| d).sum::() / (non_significant.len() as f64); - let percent_change = (delta_mean / results_mean) * 100.0; - debug!("Percent change: {:.3}%", percent_change); - if percent_change > Self::NOISE_THRESHOLD { - self.description = BenchmarkVarianceDescription::Noisy(percent_change); + let ratio_change = delta_mean / results_mean; + debug!("Ratio change: {:.3}", ratio_change); + if ratio_change > Self::NOISE_THRESHOLD { + self.description = BenchmarkVarianceDescription::Noisy; } } + + /// Whether we can trust this benchmark or not + fn is_dodgy(&self) -> bool { + matches!( + self.description, + BenchmarkVarianceDescription::Noisy | BenchmarkVarianceDescription::HighlyVariable + ) + } } #[derive(Debug, Clone, Copy, Serialize)] @@ -588,14 +683,10 @@ pub enum BenchmarkVarianceDescription { Normal, /// A highly variable benchmark that produces many significant changes. /// This might indicate a benchmark which is very sensitive to compiler changes. 
- /// - /// Cotains the percentage of significant changes. - HighlyVariable(f64), + HighlyVariable, /// A noisy benchmark which is likely to see changes in performance simply between /// compiler runs. - /// - /// Contains the percent change that happens on average - Noisy(f64), + Noisy, } impl Default for BenchmarkVarianceDescription { @@ -629,44 +720,70 @@ pub fn next_commit<'a>( } } -// A single comparison based on benchmark and cache state -#[derive(Debug)] -pub struct BenchmarkComparison<'a> { - bench_name: &'a str, - cache_state: &'a str, +// A single comparison between two test results +#[derive(Debug, Clone)] +pub struct TestResultComparison { + benchmark: Benchmark, + profile: Profile, + scenario: Scenario, + variance: Option, results: (f64, f64), } -const SIGNIFICANCE_THRESHOLD: f64 = 0.01; -impl BenchmarkComparison<'_> { - fn log_change(&self) -> f64 { - let (a, b) = self.results; - (b / a).ln() - } +impl TestResultComparison { + /// The amount of relative change considered significant when + /// the test case is not dodgy + const SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD: f64 = 0.002; + + /// The amount of relative change considered significant when + /// the test case is dodgy + const SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD_DODGY: f64 = 0.008; - fn is_increase(&self) -> bool { + fn is_regression(&self) -> bool { let (a, b) = self.results; b > a } fn is_significant(&self) -> bool { - // This particular (benchmark, cache) combination frequently varies - if self.bench_name.starts_with("coercions-debug") - && self.cache_state == "incr-patched: println" - { - self.relative_change().abs() > 2.0 + self.relative_change().abs() > self.signifcance_threshold() + } + + fn signifcance_threshold(&self) -> f64 { + if self.is_dodgy() { + Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD_DODGY + } else { + Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD + } + } + + fn magnitude(&self) -> Magnitude { + let mag = self.relative_change().abs(); + let threshold = self.signifcance_threshold(); + if mag < threshold * 3.0 { + Magnitude::Small + } else if mag < threshold * 10.0 { + Magnitude::Medium + } else if mag < threshold * 25.0 { + Magnitude::Large } else { - self.log_change().abs() > SIGNIFICANCE_THRESHOLD + Magnitude::VeryLarge } } + fn is_dodgy(&self) -> bool { + self.variance + .as_ref() + .map(|v| v.is_dodgy()) + .unwrap_or(false) + } + fn relative_change(&self) -> f64 { let (a, b) = self.results; (b - a) / a } fn direction(&self) -> Direction { - if self.log_change() > 0.0 { + if self.relative_change() > 0.0 { Direction::Regression } else { Direction::Improvement @@ -675,24 +792,13 @@ impl BenchmarkComparison<'_> { pub fn summary_line(&self, summary: &mut String, link: Option<&str>) { use std::fmt::Write; - let magnitude = self.log_change().abs(); - let size = if magnitude > 0.10 { - "Very large" - } else if magnitude > 0.05 { - "Large" - } else if magnitude > 0.01 { - "Moderate" - } else if magnitude > 0.005 { - "Small" - } else { - "Very small" - }; + let magnitude = self.magnitude(); let percent = self.relative_change() * 100.0; write!( summary, "{} {} in {}", - size, + magnitude, self.direction(), match link { Some(l) => format!("[instruction counts]({})", l), @@ -703,14 +809,32 @@ impl BenchmarkComparison<'_> { writeln!( summary, " (up to {:.1}% on `{}` builds of `{}`)", - percent, self.cache_state, self.bench_name + percent, self.scenario, self.benchmark ) .unwrap(); } } +impl std::cmp::PartialEq for TestResultComparison { + fn eq(&self, other: &Self) -> bool { + self.benchmark == other.benchmark + 
&& self.profile == other.profile + && self.scenario == other.scenario + } +} + +impl std::cmp::Eq for TestResultComparison {} + +impl std::hash::Hash for TestResultComparison { + fn hash(&self, state: &mut H) { + self.benchmark.hash(state); + self.profile.hash(state); + self.scenario.hash(state); + } +} + // The direction of a performance change -#[derive(PartialEq, Eq, Hash)] +#[derive(PartialEq, Eq, Hash, Debug)] pub enum Direction { Improvement, Regression, @@ -728,10 +852,38 @@ impl std::fmt::Display for Direction { } } +/// The relative size of a performance change +#[derive(Debug)] +enum Magnitude { + Small, + Medium, + Large, + VeryLarge, +} + +impl Magnitude { + fn is_medium_or_above(&self) -> bool { + !matches!(self, Self::Small) + } +} + +impl std::fmt::Display for Magnitude { + fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result { + let s = match self { + Self::Small => "Small", + Self::Medium => "Moderate", + Self::Large => "Large", + Self::VeryLarge => "Very large", + }; + f.write_str(s) + } +} + async fn generate_report( start: &Bound, end: &Bound, - mut report: HashMap>, + mut report: HashMap, Vec>, + num_comparisons: usize, ) -> String { fn fmt_bound(bound: &Bound) -> String { match bound { @@ -742,9 +894,14 @@ async fn generate_report( } let start = fmt_bound(start); let end = fmt_bound(end); - let regressions = report.remove(&Direction::Regression).unwrap_or_default(); - let improvements = report.remove(&Direction::Improvement).unwrap_or_default(); - let mixed = report.remove(&Direction::Mixed).unwrap_or_default(); + let regressions = report + .remove(&Some(Direction::Regression)) + .unwrap_or_default(); + let improvements = report + .remove(&Some(Direction::Improvement)) + .unwrap_or_default(); + let mixed = report.remove(&Some(Direction::Mixed)).unwrap_or_default(); + let unlabeled = report.remove(&None).unwrap_or_default(); let untriaged = match github::untriaged_perf_regressions().await { Ok(u) => u .iter() @@ -768,6 +925,8 @@ TODO: Summary Triage done by **@???**. Revision range: [{first_commit}..{last_commit}](https://perf.rust-lang.org/?start={first_commit}&end={last_commit}&absolute=false&stat=instructions%3Au) +{num_comparisons} comparisons made in total +{num_def_relevant} definitely relevant comparisons and {num_prob_relevant} probably relevant comparisons {num_regressions} Regressions, {num_improvements} Improvements, {num_mixed} Mixed; ??? of them in rollups @@ -783,6 +942,15 @@ Revision range: [{first_commit}..{last_commit}](https://perf.rust-lang.org/?star {mixed} +#### Probably changed + +The following is a list of comparisons which *probably* represent real performance changes, +but we're not 100% sure. Please move things from this category into the categories +above for changes you think *are* definitely relevant and file an issue for each so that +we can consider how to change our heuristics. 
+ +{unlabeled} + #### Untriaged Pull Requests {untriaged} @@ -795,12 +963,16 @@ TODO: Nags date = chrono::Utc::today().format("%Y-%m-%d"), first_commit = start, last_commit = end, + num_comparisons = num_comparisons, + num_def_relevant = regressions.len() + improvements.len() + mixed.len(), + num_prob_relevant = unlabeled.len(), num_regressions = regressions.len(), num_improvements = improvements.len(), num_mixed = mixed.len(), regressions = regressions.join("\n\n"), improvements = improvements.join("\n\n"), mixed = mixed.join("\n\n"), + unlabeled = unlabeled.join("\n\n"), untriaged = untriaged ) } diff --git a/site/src/github.rs b/site/src/github.rs index 2cb1f6d87..0a59e74a4 100644 --- a/site/src/github.rs +++ b/site/src/github.rs @@ -606,7 +606,7 @@ async fn categorize_benchmark( const DISAGREEMENT: &str = "If you disagree with this performance assessment, \ please file an issue in [rust-lang/rustc-perf](https://github.com/rust-lang/rustc-perf/issues/new)."; let (summary, direction) = match ComparisonSummary::summarize_comparison(&comparison) { - Some(s) if s.direction().is_some() => { + Some(s) if s.confidence().is_atleast_probably_relevant() => { let direction = s.direction().unwrap(); (s, direction) } @@ -630,7 +630,7 @@ async fn categorize_benchmark( "This change led to significant {} in compiler performance.\n", category ); - for change in summary.ordered_changes() { + for change in summary.relevant_changes().iter().filter_map(|c| *c) { write!(result, "- ").unwrap(); change.summary_line(&mut result, None) } diff --git a/site/static/compare.html b/site/static/compare.html index bd84caa62..800fd20fd 100644 --- a/site/static/compare.html +++ b/site/static/compare.html @@ -525,28 +525,23 @@

Comparing {{stat}} between { return true; } } - function toVariants(name) { + function toVariants(results) { let variants = []; - for (let d of data.a.data[name]) { - const key = d[0]; - const datumA = d[1]; - const datumB = data.b.data[name]?.find(x => x[0] == key)?.[1]; - if (!datumB) { - continue; - } + for (let r of results) { + const scenarioName = r.scenario; + const datumA = r.statistics[0]; + const datumB = r.statistics[1]; + const isSignificant = r.is_significant; let percent = (100 * (datumB - datumA) / datumA); - let isDodgy = false; - if (data.variance) { - let variance = data.variance[name + "-" + key]; - isDodgy = (variance?.description?.type ?? "Normal") != "Normal"; - } - if (shouldShowBuild(key)) { + let isDodgy = r.is_dodgy; + if (shouldShowBuild(scenarioName)) { variants.push({ - casename: key, + casename: scenarioName, datumA, datumB, percent, isDodgy, + isSignificant }); } } @@ -555,25 +550,33 @@

Comparing {{stat}} between { } let benches = - Object.keys(data.a.data). - filter(n => filter.name && filter.name.trim() ? n.includes(filter.name.trim()) : true). - map(name => { - const variants = toVariants(name).filter(v => filter.showOnlySignificant ? isSignificant(v.percent, v.isDodgy) : true); - const pcts = variants.map(field => parseFloat(field.percent)); - const maxPct = Math.max(...pcts).toFixed(1); - const minPct = Math.min(...pcts).toFixed(1); - if (variants.length > 0) { - variants[0].first = true; - } - - return { - name, - variants, - maxPct, - minPct, - }; - }). - filter(b => b.variants.length > 0); + data.comparisons. + filter(n => filter.name && filter.name.trim() ? n.benchmark.includes(filter.name.trim()) : true). + reduce((accum, next) => { + accum[next.benchmark + "-" + next.profile] ||= []; + accum[next.benchmark + "-" + next.profile].push(next); + return accum; + }, {}); + benches = Object.entries(benches). + map(c => { + const name = c[0]; + const comparison = c[1]; + const variants = toVariants(comparison).filter(v => filter.showOnlySignificant ? v.isSignificant : true); + const pcts = variants.map(field => parseFloat(field.percent)); + const maxPct = Math.max(...pcts).toFixed(1); + const minPct = Math.min(...pcts).toFixed(1); + if (variants.length > 0) { + variants[0].first = true; + } + + return { + name, + variants, + maxPct, + minPct, + }; + }). + filter(b => b.variants.length > 0); const largestChange = a => Math.max(Math.abs(a.minPct), Math.abs(a.maxPct)); benches.sort((a, b) => largestChange(b) - largestChange(a)); @@ -634,29 +637,25 @@

Comparing {{stat}} between { return findQueryParam("stat") || "instructions:u"; }, summary() { - const filtered = Object.fromEntries(this.benches.map(b => [b.name, Object.fromEntries(b.variants.map(v => [v.casename, true]))])); + // Create object with each test case that is not filtered out as a key + const filtered = Object.fromEntries(this.benches.flatMap(b => b.variants.map(v => [b.name + "-" + v.casename, true]))); const newCount = { regressions: 0, improvements: 0, unchanged: 0 } let result = { all: { ...newCount }, filtered: { ...newCount } } - for (let benchmarkAndProfile of Object.keys(this.data.a.data)) { - for (let d of this.data.a.data[benchmarkAndProfile]) { - const scenario = d[0]; - const datumA = d[1]; - const datumB = this.data.b.data[benchmarkAndProfile]?.find(x => x[0] == scenario)?.[1]; - if (!datumB) { - continue; - } - let percent = 100 * ((datumB - datumA) / datumA); - const isDodgy = this.isDodgy(benchmarkAndProfile, scenario); - if (percent > 0 && isSignificant(percent, isDodgy)) { - result.all.regressions += 1; - result.filtered.regressions += filtered[benchmarkAndProfile]?.[scenario] || 0; - } else if (percent < 0 && isSignificant(percent, isDodgy)) { - result.all.improvements += 1; - result.filtered.improvements += filtered[benchmarkAndProfile]?.[scenario] || 0; - } else { - result.all.unchanged += 1; - result.filtered.unchanged += filtered[benchmarkAndProfile]?.[scenario] || 0; - } + for (let d of this.data.comparisons) { + const testCase = testCaseString(d) + const scenario = d.scenario; + const datumA = d.statistics[0]; + const datumB = d.statistics[1]; + let percent = 100 * ((datumB - datumA) / datumA); + if (percent > 0 && d.is_significant) { + result.all.regressions += 1; + result.filtered.regressions += filtered[testCase] || 0; + } else if (percent < 0 && d.is_significant) { + result.all.improvements += 1; + result.filtered.improvements += filtered[testCase] || 0; + } else { + result.all.unchanged += 1; + result.filtered.unchanged += filtered[testCase] || 0; } } return result; @@ -719,26 +718,9 @@

Comparing {{stat}} between { } return result; }, - isDodgy(benchmarkAndProfile, scenario) { - let variance = this.data.variance; - if (!variance) { - return false; - } - variance = this.data.variance[benchmarkAndProfile + "-" + scenario]; - return (variance?.description?.type ?? "Normal") != "Normal"; - }, }, }); - function isSignificant(percent, isNoisy) { - const perAbs = Math.abs(percent); - if (isNoisy) { - return perAbs > 1.0; - } else { - return perAbs >= 0.2; - } - } - function toggleFilters(id, toggle) { let styles = document.getElementById(id).style; let indicator = document.getElementById(toggle); @@ -753,6 +735,10 @@

Comparing {{stat}} between { toggleFilters("filters-content", "filters-toggle-indicator"); toggleFilters("search-content", "search-toggle-indicator"); + function testCaseString(testCase) { + return testCase.benchmark + "-" + testCase.profile + "-" + testCase.scenario; + } + document.getElementById("filters-toggle").onclick = (e) => { toggleFilters("filters-content", "filters-toggle-indicator"); };