Add clickbench parquet based queries to sql_planner benchmark #13103

Omega359 · 2024-10-25T03:44:14Z

Which issue does this PR close?

This change uses the benchmarks/data/hits_partitioned/ data, which must be downloaded via the bench.sh script before the sql_planner benchmark runs. If this is not reasonable, we can either switch to the itty bitty clickbench_hits_10.parquet file or create some other clickbench files (perhaps in the testing folder).

Rationale for this change

Extends the SQL planning benchmark with additional tests that cover planning against an actual table (not memory based), including sorting.

What changes are included in this PR?

sql planner benchmark only.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

alamb

Thank you @Omega359 -- I think this is a great addition. I have two suggestions I think are worth considering, but I don't think they are required ❤️

BTW here is a preview flamegraph (lots of cool things there)

alamb · 2024-10-25T14:18:14Z

datafusion/core/benches/sql_planner.rs

+
+    let clickbench_ctx = register_clickbench_hits_table();
+
+    for (i, sql) in clickbench_queries.iter().enumerate() {


Since the physical planning benchmark also includes logical planning, and most use cases include both logical and physical planning, I think the logical-planning-only benchmarks are largely redundant.

However, I realize this PR just follows the existing pattern. Maybe we should remove all the "logical planning" benchmarks 🤔

I can comment them out and if anyone wants to see just those they can edit the code 💭

alamb · 2024-10-25T14:25:16Z

datafusion/core/benches/sql_planner.rs

 use std::sync::Arc;
 use test_utils::tpcds::tpcds_schemas;
 use test_utils::tpch::tpch_schemas;
 use test_utils::TableDef;
 use tokio::runtime::Runtime;

+const CLICKBENCH_DATA_PATH: &str = "../../benchmarks/data/hits_partitioned/";


I think this assumes the script is run from datafusion/core (what cargo does)

However, that meant when I tried to run the benchmark binary directly it failed like this:

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ target/release/deps/sql_planner-d64e21551189f776 --bench physical_plan_clickbench_q1
Gnuplot not found, using plotters backend
thread 'main' panicked at datafusion/core/benches/sql_planner.rs:121:38:
benchmarks/data/hits_partitioned/ could not be loaded. Please run 'benchmarks/bench.sh data clickbench_partitioned' prior to running this benchmark: Os { code: 2, kind: NotFound, message: "No such file or directory" }

Any chance you could make the script test both locations

../../benchmarks/data/hits_partitioned

benchmarks/data/hits_partitioned?

Commented out most logical_plan tests & updated code to allow for running from either cargo or via target/release/deps/sql_planner-xyz
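For readers following along, here is a minimal sketch of the "check both locations" idea discussed above. It is not the code from this PR: the constant name `CANDIDATE_DATA_PATHS` and the helper `first_existing` are hypothetical, and only the two relative paths and the bench.sh hint come from the conversation.

```rust
use std::path::{Path, PathBuf};

/// Candidate data locations (illustrative constant, not the PR's own name).
/// The first matches a working directory of datafusion/core (what cargo uses),
/// the second matches running the benchmark binary from the repository root.
const CANDIDATE_DATA_PATHS: [&str; 2] = [
    "../../benchmarks/data/hits_partitioned",
    "benchmarks/data/hits_partitioned",
];

/// Returns the first candidate that exists when resolved against `base`.
/// Hypothetical helper, shown only to illustrate the lookup order.
fn first_existing(base: &Path, candidates: &[&str]) -> Option<PathBuf> {
    candidates
        .iter()
        .map(|candidate| base.join(candidate))
        .find(|path| path.exists())
}

fn main() {
    match first_existing(Path::new("."), &CANDIDATE_DATA_PATHS) {
        Some(path) => println!("using ClickBench data at {}", path.display()),
        None => eprintln!(
            "ClickBench data not found; run 'benchmarks/bench.sh data clickbench_partitioned' first"
        ),
    }
}
```

Trying the candidates in order keeps the cargo case first while still letting the binary in target/release/deps run from the repository root.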

alamb

Thanks @Omega359 -- this looks good to me

I tested it locally and it worked great

target/release/deps/sql_planner-d64e21551189f776  --bench physical_plan_clickbench_q37

alamb · 2024-10-26T11:00:20Z

datafusion/core/benches/sql_planner.rs

@@ -235,9 +274,15 @@ fn criterion_benchmark(c: &mut Criterion) {
        "q16", "q17", "q18", "q19", "q20", "q21", "q22",
    ];

+    let benchmarks_path = if PathBuf::from(BENCHMARKS_PATH_1).exists() {


alamb · 2024-10-26T11:00:39Z

datafusion/core/benches/sql_planner.rs

-            }
-        })
-    });
+    // c.bench_function("logical_plan_tpch_all", |b| {


maybe we could just delete it entirely?

alamb · 2024-10-26T11:35:10Z

datafusion/core/benches/sql_planner.rs

+
+    let queries_file =
+        File::open(format!("{benchmarks_path}queries/clickbench/queries.sql")).unwrap();
+    let extended_file =


I was confused at first about what ClickBench Q48 was (as there are only 42 queries) -- but this now makes sense.

physical_plan_clickbench_q48 time: [2.6437 ms 2.6674 ms 2.6943 ms]
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

It would probably be less confusing if this were called physical_plan_clickbench_extended_q5 or whatever, to align with the naming of the suites.
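A rough sketch of the naming suggestion above, numbering each query within its own suite instead of continuing one running index across both query files. The helper `benchmark_names`, the 1-based numbering, and the query counts in `main` are assumptions for illustration only; they are not taken from sql_planner.rs.

```rust
/// Builds benchmark ids of the form `physical_plan_<suite>_q<N>`, where N
/// restarts at 1 for every suite (hypothetical helper; numbering is assumed).
fn benchmark_names(suite: &str, query_count: usize) -> Vec<String> {
    (1..=query_count)
        .map(|n| format!("physical_plan_{suite}_q{n}"))
        .collect()
}

fn main() {
    // Illustrative counts only; the real suites have more queries.
    let mut names = benchmark_names("clickbench", 3);
    names.extend(benchmark_names("clickbench_extended", 2));

    for name in &names {
        println!("{name}");
    }
}
```

With per-suite numbering, a name like physical_plan_clickbench_extended_q2 identifies both the query file and the position within it, which is what the comment above asks for.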

alamb · 2024-10-26T11:38:30Z

I think this is a nice improvement as is -- maybe we can keep improving things in follow-on PRs

thanks again @Omega359

Add clickbench parquet based queries to sql_planner benchmark.

d7ae4fd

github-actions bot added the core Core DataFusion crate label Oct 25, 2024

Cargo fmt.

74425d3

Omega359 marked this pull request as ready for review October 25, 2024 13:56

alamb approved these changes Oct 25, 2024

Omega359 added 2 commits October 25, 2024 17:08

Commented out most logical_plan tests & updated code to allow for run…

ad5a86e

…ning from either cargo or via target/release/deps/sql_planner-xyz

Merge remote-tracking branch 'origin/main' into feature/sql_planner_w…

f4c6bb7

…ith_clickbench # Conflicts: # datafusion/core/benches/sql_planner.rs

alamb approved these changes Oct 26, 2024

alamb merged commit 412ca4e into apache:main Oct 26, 2024
24 checks passed

Omega359 deleted the feature/sql_planner_with_clickbench branch October 26, 2024 19:17

alamb mentioned this pull request Oct 29, 2024

Oct 28, 2024: This week in DataFusion #13167

Closed

3 tasks
