-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
I measured a performance of execution plan creation (and optimization) using default physical planner.
0.40.0 vs current main (84ac4f9).
To measure used the following test (put it into core/src/physical_planner.rs):
#[tokio::test]
async fn execution_plan_creation_bench() {
let ctx = SessionContext::new();
let mut fields = vec![];
let mut columns = vec![];
const COLUMNS_NUM: usize = 200;
for i in 0..COLUMNS_NUM {
fields.push(Field::new(format!("attribute{}", i), datafusion::arrow::datatypes::DataType::Int32, true));
columns.push(Int32Array::from(vec![1]));
}
let batch = RecordBatch::try_new(
SchemaRef::new(Schema::new(fields)),
columns
.into_iter()
.map(|it| Arc::new(it) as Arc<dyn Array>)
.collect(),
)
.unwrap();
ctx.register_batch("t", batch).unwrap();
let mut aggregates = String::new();
for i in 0..COLUMNS_NUM {
if i > 0 {
aggregates.push_str(", ");
}
aggregates.push_str(format!("MAX(attribute{})", i).as_str());
}
/*
SELECT max(attr0), ..., max(attrN) FROM t
*/
let query = format!("SELECT {} FROM t", aggregates);
let mut res = 0;
let statement = ctx.state().sql_to_statement(&query, "Generic").unwrap();
let plan = ctx.state().statement_to_plan(statement).await.unwrap();
let planner = DefaultPhysicalPlanner::default();
let mut duration_sum = 0.0;
let mut durtaion_count = 0;
for _ in 0..500 {
let creation_start = std::time::Instant::now();
let phys_plan = planner
.create_physical_plan(&plan, &ctx.state())
.await
.unwrap();
duration_sum += creation_start.elapsed().as_micros() as f64;
durtaion_count += 1;
res += phys_plan.children().len();
}
println!("mean = {}", duration_sum / (durtaion_count as f64));
/* Not important. */
println!("res [not important, just to not discard] = {}", res);
}
The query in the test is SELECT f(a0), ..., f(a_N) FROM t
, where f - some aggregate function.
The command to test:
cargo test --release --package datafusion --lib -- physical_planner::tests::execution_plan_creation_bench --exact --show-output
On the branch-40 I got the result:
mean = 1003.368
On the current main (84ac4f9) I got:
mean = 1286.524
But what is more interesting, when I increase the number of columns in the query (200 -> 1000), got:
branch-40:
mean = 8207.86
main (84ac4f9):
mean = 9836.384
So, the difference increases (it seems that the dependency is linear) with columns number growth.
I create this issue to investigate: is such degradation inevitable? Or maybe there is a an obvious optimization that solves the problem.