Fix HashJoinExec sideways information passing for partitioned queries #17197

adriangb · 2025-08-14T18:13:58Z

The bug originated because I did not account for each partition building a build side hash table independently.
Having grokked the code in more detail I now understand how this happens.
The approach I've taken here to fix the bug is to wait until all partitions have finished their build side (keeping track of the bounds for each partition) and only then calculating the overall bounds and pushing them down as filters to the scans.
I don't see how we could do much better: since we only have 1 filter to push down (expressions are not aware of partitions) we have to make that filter represent the entire build side (i.e. we can't push down a filter per partition).

adriangb · 2025-08-14T18:15:15Z

I think we can support this but need to do some thinking / writing code

adriangb · 2025-08-15T03:44:50Z

@nuno-faria @jonathanc-n would you mind reviewing this PR?

jonathanc-n · 2025-08-15T05:05:51Z

In the case of hash join spilling this might be a bit difficult. I'm planning on putting out a proposal for hash join spilling in the next few days. To give you a quick rundown the idea is to essentially partition the data and when there is a need to spill we can write some partitions to disk, and after the first hash join exec is done, it will read spilled batches from disk and run again.

Since we can't use the entire batch to compute the bounds upfront I was thinking we could call min_batch/max_batch for every as the batches are being loaded in.

Due to this reason I think the first hash spilling pull request won't include support for filter pushdown, or we can discuss this once I open the proposal

jonathanc-n · 2025-08-15T05:12:07Z

Its a bit late for me so I'll take a look at the code tomorrow. It'd be interesting if we can somehow have partition aware expressions, even though it doesn't quite make sense, maybe we can instead can the filter behave differently with certain execution plans (like hash join) -> create a runtime filter 🤷 . I'll give some more thought tomorrow

nuno-faria · 2025-08-15T09:13:35Z

datafusion/sqllogictest/test_files/push_down_filter.slt

+query III rowsort
+select *
+from t1
+join t2 on t1.k = t2.k
+where v = 1 or v = 10000000;
+----
+1 1 1
+10000000 10000000 10000000


I think we should add a few retries of this query just to be sure, since it could randomly work before the fix.

nuno-faria · 2025-08-15T09:30:54Z

Thanks @adriangb for looking into it.

I found some issues with the fix.

It appears to be causing a regression with the regular dynamic filter pushdown. Now more rows are being returned than what is supposed to. For example:

copy (select i as k from generate_series(1, 10000000) as t(i)) to 't1.parquet';
copy (select i as k, i as v from generate_series(1, 10000000) as t(i)) to 't2.parquet';
create external table t1 stored as parquet location 't1.parquet';
create external table t2 stored as parquet location 't2.parquet';

explain analyze select *
from t1
join t2 on t1.k = t2.k
where v = 1;

-- in each run, the number of returned rows by t1 is different
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ]
    output_rows=6854272
    
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ]
    output_rows=8951424

DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ]
    output_rows=7902848

DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ]
    output_rows=8951424

DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ]
    output_rows=7902848

On main, the number of rows returned by t1 is the minimum expected:

DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ]
    output_rows=20480

DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ]
    output_rows=20480
    
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ]
    output_rows=20480

DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ]
    output_rows=20480
    
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ]
    output_rows=20480

With partitioned joins, the dynamic filter is always true, while I think it should be the minimum/maximum of all partial filters. For example:

explain analyze select *
from t1
join t2 on t1.k = t2.k
where v = 1 or v = 2;

-- all rows are returned
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ true ]
    output_rows=10000000

I think in this case it should be DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 2 ].

However, if we had v=1 or v=10000000 the filter in this case would be [ k@0 >= 1 AND k@0 <= 10000000 ], which would also return all rows. Wouldn't it be feasible to build a filter by combining the different partition expressions with OR? Like so: [ (k@0 >= 1 AND k@0 <= 1) OR (k@0 >= 10000000 AND k@0 <= 10000000) ].

adriangb · 2025-08-15T18:40:09Z

@nuno-faria did you review 5d0d6a4 by chance? I've completely changed the approach since then and I can't reproduce the issues you describe.

adriangb · 2025-08-15T18:43:02Z

@kosiew since you're working on these bits as well, could you help review this PR?

nuno-faria · 2025-08-16T11:10:42Z

@adriangb I tested again and still see the issues. I'm testing with datafusion-cli using the debug mode, in 481f7f9.

❯ git log -1
commit 481f7f9382dd83d7d5416cdfb1bf52d26d7cec40 (HEAD -> fix-hash-join-partitioned)
Author: Adrian Garcia Badaracco <[email protected]>
Date:   Fri Aug 15 13:44:49 2025 -0500

    remove field, run test multiple times

❯ cargo run
DataFusion CLI v49.0.0
> copy (select i as k from generate_series(1, 10000000) as t(i)) to 't1.parquet';
+----------+
| count    |
+----------+
| 10000000 |
+----------+
1 row(s) fetched.
Elapsed 4.200 seconds.

> copy (select i as k, i as v from generate_series(1, 10000000) as t(i)) to 't2.parquet';
+----------+
| count    |
+----------+
| 10000000 |
+----------+
1 row(s) fetched.
Elapsed 4.345 seconds.

> create external table t1 stored as parquet location 't1.parquet';
0 row(s) fetched.
Elapsed 0.005 seconds.

> create external table t2 stored as parquet location 't2.parquet';
0 row(s) fetched.
Elapsed 0.006 seconds.

Here are a few runs showing the first issue:

> explain analyze select *
from t1
join t2 on t1.k = t2.k
where v = 1;
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | ProjectionExec: expr=[k@2 as k, k@0 as k, v@1 as v], metrics=[output_rows=1, elapsed_compute=6.711µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                   |   CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=2.824202ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|                   |     HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(k@0, k@0)], filter=[k@0 >= 1 AND k@0 <= 1], metrics=[output_rows=1, elapsed_compute=1.043652101s, build_input_batches=1, build_input_rows=1, input_batches=709, input_rows=5805696, output_batches=709, build_mem_used=89, build_time=839.3µs, join_time=1.0427621s]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|                   |       CoalescePartitionsExec, metrics=[output_rows=1, elapsed_compute=63.8µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                   |         CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=43.9µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                   |           FilterExec: v@1 = 1, metrics=[output_rows=1, elapsed_compute=569.911µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                   |             DataSourceExec: file_groups={12 groups: [[t2.parquet:0..2133322], [t2.parquet:2133322..4266644], [t2.parquet:4266644..6399966], [t2.parquet:6399966..8533288], [t2.parquet:8533288..10666610], ...]}, projection=[k, v], file_type=parquet, predicate=v@1 = 1, pruning_predicate=v_null_count@2 != row_count@3 AND v_min@0 <= 1 AND 1 <= v_max@1, required_guarantees=[v in (1)]                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   | , metrics=[output_rows=20480, elapsed_compute=12ns, batches_splitted=0, bytes_scanned=412894, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=20480, page_index_rows_pruned=1028096, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=9, bloom_filter_eval_time=252.512µs, metadata_load_time=3.801512ms, page_index_eval_time=329.512µs, row_pushdown_eval_time=24ns, statistics_eval_time=1.223112ms, time_elapsed_opening=11.4854ms, time_elapsed_processing=46.3598ms, time_elapsed_scanning_total=35.9475ms, time_elapsed_scanning_until_data=32.8674ms] |
|                   |       DataSourceExec: file_groups={12 groups: [[t1.parquet:0..1066678], [t1.parquet:1066678..2133356], [t1.parquet:2133356..3200034], [t1.parquet:3200034..4266712], [t1.parquet:4266712..5333390], ...]}, projection=[k], file_type=parquet, predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ], pruning_predicate=k_null_count@1 != row_count@2 AND k_max@0 >= 1 AND k_null_count@1 != row_count@2 AND k_min@3 <= 1, required_guarantees=[]                                                                                                                                                                                                                                                                                                                                                           |
|         here ->   | , metrics=[output_rows=5805696, elapsed_compute=12ns, batches_splitted=0, bytes_scanned=7485222, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=0, page_index_rows_pruned=1048576, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=3, bloom_filter_eval_time=3.62µs, metadata_load_time=3.269812ms, page_index_eval_time=218.712µs, row_pushdown_eval_time=24ns, statistics_eval_time=443.22µs, time_elapsed_opening=6.0144ms, time_elapsed_processing=814.7442ms, time_elapsed_scanning_total=2.0536666s, time_elapsed_scanning_until_data=119.6913ms]     |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.434 seconds.

> explain analyze select *
from t1
join t2 on t1.k = t2.k
where v = 1;
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | ProjectionExec: expr=[k@2 as k, k@0 as k, v@1 as v], metrics=[output_rows=1, elapsed_compute=8.611µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                   |   CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=3.987404ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                   |     HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(k@0, k@0)], filter=[k@0 >= 1 AND k@0 <= 1], metrics=[output_rows=1, elapsed_compute=1.458327901s, build_input_batches=1, build_input_rows=1, input_batches=899, input_rows=7360512, output_batches=899, build_mem_used=89, build_time=974µs, join_time=1.4572807s]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|                   |       CoalescePartitionsExec, metrics=[output_rows=1, elapsed_compute=52.6µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   |         CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=29.1µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|                   |           FilterExec: v@1 = 1, metrics=[output_rows=1, elapsed_compute=586.111µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|                   |             DataSourceExec: file_groups={12 groups: [[t2.parquet:0..2133322], [t2.parquet:2133322..4266644], [t2.parquet:4266644..6399966], [t2.parquet:6399966..8533288], [t2.parquet:8533288..10666610], ...]}, projection=[k, v], file_type=parquet, predicate=v@1 = 1, pruning_predicate=v_null_count@2 != row_count@3 AND v_min@0 <= 1 AND 1 <= v_max@1, required_guarantees=[v in (1)]                                                                                                                                                                                                                                                                                                                                                                                                                        |
|                   | , metrics=[output_rows=20480, elapsed_compute=12ns, batches_splitted=0, bytes_scanned=412894, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=20480, page_index_rows_pruned=1028096, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=9, bloom_filter_eval_time=130.412µs, metadata_load_time=4.664712ms, page_index_eval_time=198.612µs, row_pushdown_eval_time=24ns, statistics_eval_time=1.181212ms, time_elapsed_opening=12.7488ms, time_elapsed_processing=46.6822ms, time_elapsed_scanning_total=35.0437ms, time_elapsed_scanning_until_data=31.9659ms]  |
|                   |       DataSourceExec: file_groups={12 groups: [[t1.parquet:0..1066678], [t1.parquet:1066678..2133356], [t1.parquet:2133356..3200034], [t1.parquet:3200034..4266712], [t1.parquet:4266712..5333390], ...]}, projection=[k], file_type=parquet, predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ], pruning_predicate=k_null_count@1 != row_count@2 AND k_max@0 >= 1 AND k_null_count@1 != row_count@2 AND k_min@3 <= 1, required_guarantees=[]                                                                                                                                                                                                                                                                                                                                                            |
|          here ->  | , metrics=[output_rows=7360512, elapsed_compute=12ns, batches_splitted=0, bytes_scanned=9474601, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=20480, page_index_rows_pruned=2076672, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=1, bloom_filter_eval_time=1.023µs, metadata_load_time=2.509312ms, page_index_eval_time=578.312µs, row_pushdown_eval_time=24ns, statistics_eval_time=164.423µs, time_elapsed_opening=5.3009ms, time_elapsed_processing=1.1821071s, time_elapsed_scanning_total=2.937489s, time_elapsed_scanning_until_data=188.8519ms] |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.496 seconds.

> explain analyze select *
from t1
join t2 on t1.k = t2.k
where v = 1;
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | ProjectionExec: expr=[k@2 as k, k@0 as k, v@1 as v], metrics=[output_rows=1, elapsed_compute=6.711µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|                   |   CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=6.169003ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|                   |     HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(k@0, k@0)], filter=[k@0 >= 1 AND k@0 <= 1], metrics=[output_rows=1, elapsed_compute=2.217662901s, build_input_batches=1, build_input_rows=1, input_batches=1221, input_rows=10000000, output_batches=1221, build_mem_used=89, build_time=711.7µs, join_time=2.2169098s]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|                   |       CoalescePartitionsExec, metrics=[output_rows=1, elapsed_compute=51.6µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|                   |         CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=26.6µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                   |           FilterExec: v@1 = 1, metrics=[output_rows=1, elapsed_compute=585.411µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|                   |             DataSourceExec: file_groups={12 groups: [[t2.parquet:0..2133322], [t2.parquet:2133322..4266644], [t2.parquet:4266644..6399966], [t2.parquet:6399966..8533288], [t2.parquet:8533288..10666610], ...]}, projection=[k, v], file_type=parquet, predicate=v@1 = 1, pruning_predicate=v_null_count@2 != row_count@3 AND v_min@0 <= 1 AND 1 <= v_max@1, required_guarantees=[v in (1)]                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                   | , metrics=[output_rows=20480, elapsed_compute=12ns, batches_splitted=0, bytes_scanned=412894, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=20480, page_index_rows_pruned=1028096, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=9, bloom_filter_eval_time=129.712µs, metadata_load_time=4.275612ms, page_index_eval_time=206.112µs, row_pushdown_eval_time=24ns, statistics_eval_time=1.155312ms, time_elapsed_opening=13.5415ms, time_elapsed_processing=47.6242ms, time_elapsed_scanning_total=35.1671ms, time_elapsed_scanning_until_data=32.096ms] |
|                   |       DataSourceExec: file_groups={12 groups: [[t1.parquet:0..1066678], [t1.parquet:1066678..2133356], [t1.parquet:2133356..3200034], [t1.parquet:3200034..4266712], [t1.parquet:4266712..5333390], ...]}, projection=[k], file_type=parquet, predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ], pruning_predicate=k_null_count@1 != row_count@2 AND k_max@0 >= 1 AND k_null_count@1 != row_count@2 AND k_min@3 <= 1, required_guarantees=[]                                                                                                                                                                                                                                                                                                                                                          |
|         here ->   | , metrics=[output_rows=10000000, elapsed_compute=12ns, batches_splitted=0, bytes_scanned=12781310, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=0, page_index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, bloom_filter_eval_time=1.123µs, metadata_load_time=2.642712ms, page_index_eval_time=4.312µs, row_pushdown_eval_time=24ns, statistics_eval_time=131.323µs, time_elapsed_opening=4.6513ms, time_elapsed_processing=1.8171676s, time_elapsed_scanning_total=4.5157559s, time_elapsed_scanning_until_data=260.5798ms]        |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.549 seconds.

On main it always returns 20480 rows for t1:

+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | ProjectionExec: expr=[k@2 as k, k@0 as k, v@1 as v], metrics=[output_rows=1, elapsed_compute=6.611µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|                   |   CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=31.7µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|                   |     HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(k@0, k@0)], filter=[k@0 >= 1 AND k@0 <= 1], metrics=[output_rows=1, elapsed_compute=4.247301ms, build_input_batches=1, build_input_rows=1, input_batches=3, input_rows=20480, output_batches=3, build_mem_used=89, build_time=673.7µs, join_time=3.555ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                   |       CoalescePartitionsExec, metrics=[output_rows=1, elapsed_compute=66.4µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                   |         CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=28.7µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|                   |           FilterExec: v@1 = 1, metrics=[output_rows=1, elapsed_compute=601.011µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|                   |             DataSourceExec: file_groups={12 groups: [[t2.parquet:0..2133322], [t2.parquet:2133322..4266644], [t2.parquet:4266644..6399966], [t2.parquet:6399966..8533288], [t2.parquet:8533288..10666610], ...]}, projection=[k, v], file_type=parquet, predicate=v@1 = 1, pruning_predicate=v_null_count@2 != row_count@3 AND v_min@0 <= 1 AND 1 <= v_max@1, required_guarantees=[v in (1)]                                                                                                                                                                                                                                                                                                                                                                                                                     |
|                   | , metrics=[output_rows=20480, elapsed_compute=12ns, batches_splitted=0, bytes_scanned=412894, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=20480, page_index_rows_pruned=1028096, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=9, bloom_filter_eval_time=91.312µs, metadata_load_time=5.554412ms, page_index_eval_time=173.912µs, row_pushdown_eval_time=24ns, statistics_eval_time=1.118912ms, time_elapsed_opening=15.7229ms, time_elapsed_processing=49.715ms, time_elapsed_scanning_total=35.0119ms, time_elapsed_scanning_until_data=31.9417ms] |
|                   |       DataSourceExec: file_groups={12 groups: [[t1.parquet:0..1066678], [t1.parquet:1066678..2133356], [t1.parquet:2133356..3200034], [t1.parquet:3200034..4266712], [t1.parquet:4266712..5333390], ...]}, projection=[k], file_type=parquet, predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ], pruning_predicate=k_null_count@1 != row_count@2 AND k_max@0 >= 1 AND k_null_count@1 != row_count@2 AND k_min@3 <= 1, required_guarantees=[]                                                                                                                                                                                                                                                                                                                                                         |
|       here ->     | , metrics=[output_rows=20480, elapsed_compute=12ns, batches_splitted=0, bytes_scanned=206447, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=20480, page_index_rows_pruned=1028096, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=9, bloom_filter_eval_time=134.212µs, metadata_load_time=5.683512ms, page_index_eval_time=172.312µs, row_pushdown_eval_time=24ns, statistics_eval_time=1.148112ms, time_elapsed_opening=12.28ms, time_elapsed_processing=29.6494ms, time_elapsed_scanning_total=21.961ms, time_elapsed_scanning_until_data=16.4331ms]  |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

And there is the second issue, where DynamicFilterPhysicalExpr is true:

> explain analyze select *
from t1
join t2 on t1.k = t2.k
where v = 1 or v = 2;
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | ProjectionExec: expr=[k@2 as k, k@0 as k, v@1 as v], metrics=[output_rows=2, elapsed_compute=11.31µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|                   |   CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=2, elapsed_compute=7.552604ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|                   |     HashJoinExec: mode=Partitioned, join_type=Inner, on=[(k@0, k@0)], metrics=[output_rows=2, elapsed_compute=431.141012ms, build_input_batches=2, build_input_rows=2, input_batches=1175, input_rows=10000000, output_batches=1368, build_mem_used=908, build_time=11.9206ms, join_time=408.0587ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|                   |       CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=2, elapsed_compute=50.702µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|                   |         RepartitionExec: partitioning=Hash([k@0], 12), input_partitions=12, metrics=[fetch_time=51.8786ms, repartition_time=75.211µs, send_time=53.142µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                   |           CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=2, elapsed_compute=28.3µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|                   |             FilterExec: v@1 = 1 OR v@1 = 2, metrics=[output_rows=2, elapsed_compute=1.082711ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|                   |               DataSourceExec: file_groups={12 groups: [[t2.parquet:0..2133322], [t2.parquet:2133322..4266644], [t2.parquet:4266644..6399966], [t2.parquet:6399966..8533288], [t2.parquet:8533288..10666610], ...]}, projection=[k, v], file_type=parquet, predicate=v@1 = 1 OR v@1 = 2, pruning_predicate=v_null_count@2 != row_count@3 AND v_min@0 <= 1 AND 1 <= v_max@1 OR v_null_count@2 != row_count@3 AND v_min@0 <= 2 AND 2 <= v_max@1, required_guarantees=[v in (1, 2)]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|                   | , metrics=[output_rows=20480, elapsed_compute=12ns, batches_splitted=0, bytes_scanned=412894, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=20480, page_index_rows_pruned=1028096, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=9, bloom_filter_eval_time=167.712µs, metadata_load_time=7.794612ms, page_index_eval_time=234.612µs, row_pushdown_eval_time=24ns, statistics_eval_time=1.390112ms, time_elapsed_opening=15.7157ms, time_elapsed_processing=49.7691ms, time_elapsed_scanning_total=35.6062ms, time_elapsed_scanning_until_data=32.055ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                   |       CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=10000000, elapsed_compute=88.506601ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|                   |         RepartitionExec: partitioning=Hash([k@0], 12), input_partitions=12, metrics=[fetch_time=1.961983s, repartition_time=1.462319702s, send_time=193.006324ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                   |           DataSourceExec: file_groups={12 groups: [[t1.parquet:0..1066678], [t1.parquet:1066678..2133356], [t1.parquet:2133356..3200034], [t1.parquet:3200034..4266712], [t1.parquet:4266712..5333390], ...]}, projection=[k], file_type=parquet, predicate=DynamicFilterPhysicalExpr [ true ], metrics=[output_rows=10000000, elapsed_compute=12ns, batches_splitted=0, bytes_scanned=12781310, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=0, page_index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, bloom_filter_eval_time=24ns, metadata_load_time=2.152412ms, page_index_eval_time=3.512µs, row_pushdown_eval_time=24ns, statistics_eval_time=24ns, time_elapsed_opening=3.8613ms, time_elapsed_processing=1.9387351s, time_elapsed_scanning_total=3.9634373s, time_elapsed_scanning_until_data=251.6241ms]                                                                                                                                                                                                                                                                               |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.501 seconds.

adriangb · 2025-08-16T17:31:48Z

@nuno-faria thank you for your patience. I can reproduce now, I must have been on the wrong commit or something. My laptop started having issues a couple days ago so I've been resetting it, will probably need to get a new one, and must have gotten confused about what commit I was on or something.

In any case, I think I found the root cause the both issues: empty partitions (with no rows in the hash table / build side) do not report bounds -> were not being counted towards cross partition synchronization -> dynamic filters were not being built at the right time / at all. I think d2b8da5 should fix this.

Now with the following q.sql:

copy (select i as k from generate_series(1, 10000000) as t(i)) to 't1.parquet';
copy (select i as k, i as v from generate_series(1, 10000000) as t(i)) to 't2.parquet';
create external table t1 stored as parquet location 't1.parquet';
create external table t2 stored as parquet location 't2.parquet';

explain analyze select *
from t1
join t2 on t1.k = t2.k
where v = 1;

When I run cargo run -p datafusion-cli -- -f q.sql I consistently get metrics=[output_rows=20480.

And with this query:

explain analyze select *
from t1
join t2 on t1.k = t2.k
where v = 1 or v = 2;

I consistently get DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 2 ]

nuno-faria · 2025-08-18T08:17:18Z

@adriangb It appears issue 2 is fixed, it has the correct filter now.

Issue 1 still appears on my end, the number of rows collected from t1 keeps changing:

DataSourceExec t2.parquet,  output_rows=20480
DataSourceExec t1.parquet, output_rows=7902848 -- should be 20480

DataSourceExec t2.parquet,  output_rows=20480
DataSourceExec t1.parquet, output_rows=10000000 -- should be 20480

DataSourceExec t2.parquet,  output_rows=20480
DataSourceExec t1.parquet, output_rows=9437184 -- should be 20480

Note that t2 is filtered correctly and returns 20480, but t1 does not appear to be fully filtered by predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ]. This the result when I manually apply the filter:

+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1, elapsed_compute=22.8µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                   |   FilterExec: k@0 = 1, metrics=[output_rows=1, elapsed_compute=542.011µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|                   |     DataSourceExec: file_groups={12 groups: [[t1.parquet:0..1066678], [t1.parquet:1066678..2133356], [t1.parquet:2133356..3200034], [t1.parquet:3200034..4266712], [t1.parquet:4266712..5333390], ...]}, projection=[k], file_type=parquet, predicate=k@0 = 1, pruning_predicate=k_null_count@2 != row_count@3 AND k_min@0 <= 1 AND 1 <= k_max@1, required_guarantees=[k in (1)]                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                   | , metrics=[output_rows=20480, elapsed_compute=12ns, batches_splitted=0, bytes_scanned=206447, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=0, num_predicate_creation_errors=0, page_index_rows_matched=20480, page_index_rows_pruned=1028096, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=9, bloom_filter_eval_time=132.912µs, metadata_load_time=4.262112ms, page_index_eval_time=188.312µs, row_pushdown_eval_time=24ns, statistics_eval_time=1.258112ms, time_elapsed_opening=10.551ms, time_elapsed_processing=27.7574ms, time_elapsed_scanning_total=18.0588ms, time_elapsed_scanning_until_data=16.2569ms] |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

adriangb · 2025-08-18T12:48:32Z

Ah I see I was looking at the wrong 20480.

I wonder if this is just a consequence of synchronization: there are no guarantees about the order in which the build side hash tables are built, and we don't push down the filter until they're all done, so e.g. one partition may stream several batches before the filter even kicks in, and that's going to be non-deterministic. I think it's a non issue in the real world because small queries are already going to be fast and large queries will stream enough batches that the filter will eventually kick in and be pushed down.

But I need to confirm this hypothesis.

kosiew

Good tackle of a tricky problem!
Left a question and a typo fix suggestion.

kosiew · 2025-08-18T14:15:02Z

datafusion/physical-plan/src/joins/hash_join.rs

+        let expected_calls = match partition_mode {
+            // Each output partition accesses shared build data
+            PartitionMode::CollectLeft => {
+                right_child.output_partitioning().partition_count()


Isn't this is fragile?

right_child.output_partitioning().partition_count() reflects the plan at planning time. If any Repartition/Coalesce/EnforceDistribution/etc. gets inserted between planning and execution (or an upstream operator behaves differently at runtime), the actual number of streams created at execute() can differ. That can break coordination that relies on a compile-time partition count (like your SharedBoundsAccumulator.total_partitions).

We could defer creation of this up to dynamic filter pushdown optimizer pass (the last one) or we could try and defer it into ExecutionPlan::execute. Does that resolve the concern?

Does 9fe1bf4 work you think?

I agree the change so that method is only called from HashJoinExec::execute, after any optimizer passes are run, should resolve the concern

datafusion/physical-plan/src/joins/hash_join.rs

adriangb · 2025-08-18T18:42:19Z

@nuno-faria I was able to confirm this is just a race between updating the filter and starting work on the probe side: https://github.com/pydantic/datafusion/compare/fix-hash-join-partitioned...pydantic:datafusion:demo-race?expand=1

On that branch I get consistently get 20480 output rows on both sides of the join.

But I don't think that is a good approach long term. The code is more complex and there are potentials for deadlocks. The way the current code is structured even if there are bugs it should never be slower than not having dynamic filters. They just may take a couple batches to kick in, they won't help small queries on a local SSD (like we've been testing here) much but will help massively for large queries on slower storage, etc.

nuno-faria · 2025-08-19T08:07:25Z

@nuno-faria I was able to confirm this is just a race between updating the filter and starting work on the probe side: https://github.com/pydantic/datafusion/compare/fix-hash-join-partitioned...pydantic:datafusion:demo-race?expand=1

I see, that makes sense. I think in theory we would need to have something like what Postgres does and determine at plan time that following the parameterized path would be the best approach, which would be quite complex.

But I don't think that is a good approach long term. The code is more complex and there are potentials for deadlocks. The way the current code is structured even if there are bugs it should never be slower than not having dynamic filters. They just may take a couple batches to kick in, they won't help small queries on a local SSD (like we've been testing here) much but will help massively for large queries on slower storage, etc.

Agreed.

I did some tests and found that the number of rows is kept to the minimuim when the number of partitions is set to 1. On a simple join query this makes it more than 20x faster than DuckDB.

adriangb · 2025-08-19T11:30:29Z

On a simple join query this makes it more than 20x faster than DuckDB

That's crazy!

I think all we need now is an approval to merge this.

@kosiew not sure if you're a commiter, otherwise @xudong963?

jonathanc-n

Sorry for the delay, this LGTM. nice catch on that race

jonathanc-n · 2025-08-19T17:59:30Z

datafusion/physical-plan/src/joins/hash_join.rs

+            PartitionMode::Auto => 1,
+        };
+        Self {
+            bounds: Mutex::new(Vec::new()),


Can preallocate bounds with_capacity()

adriangb · 2025-08-19T22:43:15Z

I plan to merge this once CI passes

kosiew

Observation:
The fix collects every partition’s local (min, max) pairs and merges them into a single “global” (min, max) per join column. In SharedBoundsAccumulator::merge_bounds, the loop walks all stored bounds and computes global_min/global_max for each column, discarding the original partition-specific ranges.
Later, create_filter_from_bounds converts each global pair into col ≥ min AND col ≤ max, then conjoins these ranges across columns.

Because the merged predicate is always a single contiguous interval, any disjoint conditions (e.g. v = 1 OR v = 10000000) get collapsed into one broad range. In that example, bounds (1, 1) and (10000000, 10000000) become (1, 10000000), yielding the filter k ≥ 1 AND k ≤ 10000000—a superset containing nearly all rows. While this fixes the earlier correctness bug (the filter won’t exclude valid results), the pushdown loses selectivity for highly selective OR predicates, making it less effective for pruning.

If we tracked multiple ranges per column lets the accumulator union all build-side information into a single expression like
k BETWEEN 1 AND 1 OR k BETWEEN 10000000 AND 10000000.
This remains a single dynamic filter; it simply captures non-contiguous values more precisely than a broad min/max interval.

Not suggesting for the above to be included in this PR.
Do you think a follow up issue is warranted to explore further refining this?

kosiew · 2025-08-20T09:50:08Z

datafusion/physical-plan/src/joins/hash_join.rs

+        let bounds = self.bounds.lock();
+        let all_bounds: Vec<_> = bounds.iter().collect();
+
+        if all_bounds.is_empty() {
+            return None;
+        }


Can we operate directly on the locked bounds and eliminate an unnecessary intermediate collection (all_bounds) to compute column-wise minima and maxima?

I've refactored this to also address #17197 (review) as well, that's a great point

kosiew · 2025-08-20T10:03:13Z

datafusion/physical-plan/src/joins/hash_join.rs

+    dynamic_filter: Arc<DynamicFilterPhysicalExpr>,
+    /// Shared bounds accumulator for coordinating dynamic filter updates across partitions
+    /// Lazily initialized at execution time to use actual runtime partition counts
+    bounds_accumulator: Arc<OnceLock<Arc<SharedBoundsAccumulator>>>,


The double Arc in Arc<OnceLock<Arc>> layering increases complexity and obscures ownership expectations

Can we replace with OnceLock<Arc> to eliminate the outer Arc wrapper?

I've refactored this away 😄

kosiew · 2025-08-20T10:13:34Z

datafusion/physical-plan/src/joins/hash_join.rs

+            if completed == total_partitions {
+                if let Some(merged_bounds) = self.bounds_accumulator.merge_bounds() {
+                    let filter_expr = self.create_filter_from_bounds(merged_bounds)?;
+                    dynamic_filter.update(filter_expr)?;
+                }
+            }


The plan can be rerun whenever the same ExecutionPlan object is executed more than once—e.g., if a caller issues multiple collect/execute invocations on a prepared physical plan or reuses the plan across query runs.

I think we should reset the bounds_accumulator to clear stale data before exiting this if block.

Maybe we could implement https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#method.reset_state

reset_state is already implemented for HashJoinExec and it does reset HashJoinExec::bounds_accumulator. This bit of code we are commenting on is within HashJoinStream which is created inside of HashJoinExec::execute

alamb · 2025-08-20T11:44:33Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fix-hash-join-partitioned (7d54247) to 1d4d74b diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb

Thank you @adriangb @kosiew @nuno-faria and @jonathanc-n for your work on this PR 🙏

I think @kosiew comment about resetting the state should be addressed prior to merge.

The rest of the comments I think could be done as a follow on PR (or never)

I also noticed that this module is getting huge so filed a suggestion for a follow up

#17260

Thank you all for your work to push DataFusion's joins along

alamb · 2025-08-20T11:17:53Z

datafusion/physical-plan/src/joins/hash_join.rs

+/// partition executions.
+struct SharedBoundsAccumulator {
+    /// Bounds from completed partitions.
+    bounds: Mutex<Vec<Vec<(ScalarValue, ScalarValue)>>>,


as a follow on, it would be nice to put this structure into its own struct or something -- figuring out what Vec<Vec<(ScalarValue, ScalarValue)>> represents is a bit 🤯

alamb · 2025-08-20T11:35:11Z

datafusion/physical-plan/src/joins/hash_join.rs

+        let expected_calls = match partition_mode {
+            // Each output partition accesses shared build data
+            PartitionMode::CollectLeft => {
+                right_child.output_partitioning().partition_count()


I agree the change so that method is only called from HashJoinExec::execute, after any optimizer passes are run, should resolve the concern

alamb · 2025-08-20T11:40:20Z

datafusion/physical-plan/src/joins/hash_join.rs

+            // Store bounds in the accumulator - this runs once per partition
+            if let Some(bounds) = &left_data.bounds {
+                // Only push actual bounds if they exist
+                self.bounds_accumulator.bounds.lock().push(bounds.clone());


I found the mixed use of the fields directly from self.bounds_accumulator and methods (try_merge) somewhat unclear. As a follow on it might make the code easier to understand if we moved more of the logic for the shared bounds into the BoundsAccumulator structure

Let's see how 01757fa looks

alamb · 2025-08-20T11:41:16Z

datafusion/physical-plan/src/joins/hash_join.rs

+            if completed == total_partitions {
+                if let Some(merged_bounds) = self.bounds_accumulator.merge_bounds() {
+                    let filter_expr = self.create_filter_from_bounds(merged_bounds)?;
+                    dynamic_filter.update(filter_expr)?;
+                }
+            }


Maybe we could implement https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#method.reset_state

alamb · 2025-08-20T11:42:12Z

datafusion/physical-plan/src/joins/hash_join.rs

+        bounds: Vec<(ScalarValue, ScalarValue)>,
+    ) -> Result<Arc<dyn PhysicalExpr>> {
+        // Create range predicates for each join key
+        let mut predicates = Vec::with_capacity(bounds.len());


this is another good candidate to move into the BoundsAccumulator

alamb · 2025-08-20T11:45:30Z

datafusion/sqllogictest/test_files/push_down_filter.slt

+
+# The failure before https://github.com/apache/datafusion/pull/17197 was non-deterministic and random
+# So we'll run the same query a couple of times just to have more certainty it's fixed
+# Sorry about the spam in this slt test...


alamb · 2025-08-20T11:50:50Z

BTW it would also be amazing if we could figure out how to test this more thoroughly (maybe unit tests?)

alamb · 2025-08-20T12:25:22Z

🤖: Benchmark completed

Details

Comparing HEAD and fix-hash-join-partitioned
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ fix-hash-join-partitioned ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  1928.82 ms │                1953.44 ms │    no change │
│ QQuery 1     │   670.49 ms │                 704.53 ms │ 1.05x slower │
│ QQuery 2     │  1343.68 ms │                1329.13 ms │    no change │
│ QQuery 3     │   621.76 ms │                 614.28 ms │    no change │
│ QQuery 4     │  1264.83 ms │                1299.55 ms │    no change │
│ QQuery 5     │ 13761.54 ms │               13794.64 ms │    no change │
│ QQuery 6     │  2319.28 ms │                2307.32 ms │    no change │
│ QQuery 7     │  1816.92 ms │                1909.51 ms │ 1.05x slower │
└──────────────┴─────────────┴───────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                        │ 23727.31ms │
│ Total Time (fix-hash-join-partitioned)   │ 23912.38ms │
│ Average Time (HEAD)                      │  2965.91ms │
│ Average Time (fix-hash-join-partitioned) │  2989.05ms │
│ Queries Faster                           │          0 │
│ Queries Slower                           │          2 │
│ Queries with No Change                   │          6 │
│ Queries with Failure                     │          0 │
└──────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ fix-hash-join-partitioned ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.24 ms │                   2.22 ms │     no change │
│ QQuery 1     │    28.73 ms │                  29.31 ms │     no change │
│ QQuery 2     │    74.48 ms │                  73.69 ms │     no change │
│ QQuery 3     │    88.41 ms │                  93.67 ms │  1.06x slower │
│ QQuery 4     │   592.15 ms │                 603.53 ms │     no change │
│ QQuery 5     │   850.39 ms │                 866.50 ms │     no change │
│ QQuery 6     │     2.41 ms │                   2.22 ms │ +1.09x faster │
│ QQuery 7     │    32.25 ms │                  32.76 ms │     no change │
│ QQuery 8     │   845.96 ms │                 852.18 ms │     no change │
│ QQuery 9     │  1159.63 ms │                1152.01 ms │     no change │
│ QQuery 10    │   225.07 ms │                 225.01 ms │     no change │
│ QQuery 11    │   253.34 ms │                 253.07 ms │     no change │
│ QQuery 12    │   856.63 ms │                 866.35 ms │     no change │
│ QQuery 13    │  1196.94 ms │                1206.72 ms │     no change │
│ QQuery 14    │   798.08 ms │                 782.83 ms │     no change │
│ QQuery 15    │   776.45 ms │                 806.46 ms │     no change │
│ QQuery 16    │  1582.29 ms │                1599.05 ms │     no change │
│ QQuery 17    │  1566.25 ms │                1597.46 ms │     no change │
│ QQuery 18    │  2784.12 ms │                2793.57 ms │     no change │
│ QQuery 19    │    78.74 ms │                  80.14 ms │     no change │
│ QQuery 20    │  1168.36 ms │                1187.26 ms │     no change │
│ QQuery 21    │  1343.94 ms │                1333.46 ms │     no change │
│ QQuery 22    │  2218.21 ms │                2207.64 ms │     no change │
│ QQuery 23    │  7502.29 ms │                7519.60 ms │     no change │
│ QQuery 24    │   405.82 ms │                 415.48 ms │     no change │
│ QQuery 25    │   282.15 ms │                 279.70 ms │     no change │
│ QQuery 26    │   394.81 ms │                 399.27 ms │     no change │
│ QQuery 27    │  1582.75 ms │                1576.27 ms │     no change │
│ QQuery 28    │ 11827.77 ms │               11913.63 ms │     no change │
│ QQuery 29    │   524.03 ms │                 510.61 ms │     no change │
│ QQuery 30    │   770.10 ms │                 777.93 ms │     no change │
│ QQuery 31    │   765.31 ms │                 775.63 ms │     no change │
│ QQuery 32    │  2384.12 ms │                2372.96 ms │     no change │
│ QQuery 33    │  3128.35 ms │                3174.63 ms │     no change │
│ QQuery 34    │  3136.57 ms │                3188.06 ms │     no change │
│ QQuery 35    │  1226.86 ms │                1228.97 ms │     no change │
│ QQuery 36    │   127.28 ms │                 122.78 ms │     no change │
│ QQuery 37    │    54.33 ms │                  51.01 ms │ +1.06x faster │
│ QQuery 38    │   123.56 ms │                 120.90 ms │     no change │
│ QQuery 39    │   201.26 ms │                 194.84 ms │     no change │
│ QQuery 40    │    43.53 ms │                  43.29 ms │     no change │
│ QQuery 41    │    37.82 ms │                  39.48 ms │     no change │
│ QQuery 42    │    32.75 ms │                  32.75 ms │     no change │
└──────────────┴─────────────┴───────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                        │ 53076.55ms │
│ Total Time (fix-hash-join-partitioned)   │ 53384.93ms │
│ Average Time (HEAD)                      │  1234.34ms │
│ Average Time (fix-hash-join-partitioned) │  1241.51ms │
│ Queries Faster                           │          2 │
│ Queries Slower                           │          1 │
│ Queries with No Change                   │         40 │
│ Queries with Failure                     │          0 │
└──────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ fix-hash-join-partitioned ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │  99.42 ms │                  94.85 ms │ no change │
│ QQuery 2     │  20.80 ms │                  20.35 ms │ no change │
│ QQuery 3     │  32.74 ms │                  31.71 ms │ no change │
│ QQuery 4     │  17.92 ms │                  18.16 ms │ no change │
│ QQuery 5     │  48.78 ms │                  48.26 ms │ no change │
│ QQuery 6     │  11.77 ms │                  11.97 ms │ no change │
│ QQuery 7     │  85.12 ms │                  84.55 ms │ no change │
│ QQuery 8     │  24.28 ms │                  24.65 ms │ no change │
│ QQuery 9     │  52.09 ms │                  53.90 ms │ no change │
│ QQuery 10    │  39.79 ms │                  39.77 ms │ no change │
│ QQuery 11    │  37.90 ms │                  39.60 ms │ no change │
│ QQuery 12    │  29.79 ms │                  30.59 ms │ no change │
│ QQuery 13    │  25.39 ms │                  25.17 ms │ no change │
│ QQuery 14    │   9.61 ms │                   9.71 ms │ no change │
│ QQuery 15    │  18.62 ms │                  19.05 ms │ no change │
│ QQuery 16    │  17.40 ms │                  17.53 ms │ no change │
│ QQuery 17    │  95.99 ms │                  95.06 ms │ no change │
│ QQuery 18    │ 177.17 ms │                 183.31 ms │ no change │
│ QQuery 19    │  23.60 ms │                  23.43 ms │ no change │
│ QQuery 20    │  30.85 ms │                  31.07 ms │ no change │
│ QQuery 21    │ 139.06 ms │                 136.92 ms │ no change │
│ QQuery 22    │  13.68 ms │                  13.94 ms │ no change │
└──────────────┴───────────┴───────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                        │ 1051.76ms │
│ Total Time (fix-hash-join-partitioned)   │ 1053.56ms │
│ Average Time (HEAD)                      │   47.81ms │
│ Average Time (fix-hash-join-partitioned) │   47.89ms │
│ Queries Faster                           │         0 │
│ Queries Slower                           │         0 │
│ Queries with No Change                   │        22 │
│ Queries with Failure                     │         0 │
└──────────────────────────────────────────┴───────────┘

2010YOUY01 · 2025-08-20T14:06:53Z

datafusion/physical-plan/src/joins/hash_join.rs

+///
+/// ## Thread Safety
+///
+/// All fields use atomic operations or mutexes to ensure correct coordination between concurrent


Letting fields guarded by fine-grained mutex/atomic looks a bit dangerous, the implementation has also to ensure the correct update order (in this case first update bounds, then update completed_partitions), it's correct now but the future change might forget this very subtle restriction.
How about wrapping the whole struct with a big lock? I think it won't cause performance issue.

Good call, 360253e

adriangb · 2025-08-20T18:43:36Z

@alamb just out of curiosity - would you be able to kick off some more benches? I wonder if the change of how we structure the dynamic filter makes a difference

adriangb · 2025-08-20T20:00:56Z

Here's results from my local machine:

Comparing main and fix-hash-join-partitioned
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ fix-hash-join-partitioned ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    0.93 ms │                   0.94 ms │     no change │
│ QQuery 1     │   11.23 ms │                  10.93 ms │     no change │
│ QQuery 2     │   35.08 ms │                  34.38 ms │     no change │
│ QQuery 3     │   34.85 ms │                  34.81 ms │     no change │
│ QQuery 4     │  274.53 ms │                 270.18 ms │     no change │
│ QQuery 5     │  350.38 ms │                 349.15 ms │     no change │
│ QQuery 6     │    0.91 ms │                   0.82 ms │ +1.11x faster │
│ QQuery 7     │   13.06 ms │                  12.46 ms │     no change │
│ QQuery 8     │  335.96 ms │                 322.05 ms │     no change │
│ QQuery 9     │  477.25 ms │                 473.04 ms │     no change │
│ QQuery 10    │   85.16 ms │                  85.73 ms │     no change │
│ QQuery 11    │   95.87 ms │                  98.19 ms │     no change │
│ QQuery 12    │  324.21 ms │                 323.77 ms │     no change │
│ QQuery 13    │  468.87 ms │                 477.11 ms │     no change │
│ QQuery 14    │  306.01 ms │                 301.96 ms │     no change │
│ QQuery 15    │  307.36 ms │                 303.33 ms │     no change │
│ QQuery 16    │  735.54 ms │                 732.70 ms │     no change │
│ QQuery 17    │  726.69 ms │                 707.14 ms │     no change │
│ QQuery 18    │ 1768.84 ms │                1527.52 ms │ +1.16x faster │
│ QQuery 19    │   28.60 ms │                  26.75 ms │ +1.07x faster │
│ QQuery 20    │  735.84 ms │                 672.82 ms │ +1.09x faster │
│ QQuery 21    │  805.33 ms │                 748.53 ms │ +1.08x faster │
│ QQuery 22    │ 1265.32 ms │                1168.32 ms │ +1.08x faster │
│ QQuery 23    │ 4507.52 ms │                3785.36 ms │ +1.19x faster │
│ QQuery 24    │  134.22 ms │                  99.76 ms │ +1.35x faster │
│ QQuery 25    │  168.16 ms │                 148.06 ms │ +1.14x faster │
│ QQuery 26    │  118.92 ms │                  98.44 ms │ +1.21x faster │
│ QQuery 27    │ 1131.20 ms │                 836.35 ms │ +1.35x faster │
│ QQuery 28    │ 6888.21 ms │                6774.13 ms │     no change │
│ QQuery 29    │  357.40 ms │                 324.16 ms │ +1.10x faster │
│ QQuery 30    │  317.28 ms │                 302.51 ms │     no change │
│ QQuery 31    │  316.73 ms │                 313.78 ms │     no change │
│ QQuery 32    │ 1575.35 ms │                1695.26 ms │  1.08x slower │
│ QQuery 33    │ 1859.20 ms │                1551.85 ms │ +1.20x faster │
│ QQuery 34    │ 1839.58 ms │                1759.96 ms │     no change │
│ QQuery 35    │  498.48 ms │                 488.93 ms │     no change │
│ QQuery 36    │   62.53 ms │                  62.01 ms │     no change │
│ QQuery 37    │   24.15 ms │                  24.41 ms │     no change │
│ QQuery 38    │   61.68 ms │                  61.55 ms │     no change │
│ QQuery 39    │  103.08 ms │                 103.62 ms │     no change │
│ QQuery 40    │   18.36 ms │                  16.91 ms │ +1.09x faster │
│ QQuery 41    │   16.02 ms │                  16.60 ms │     no change │
│ QQuery 42    │   13.88 ms │                  13.36 ms │     no change │
└──────────────┴────────────┴───────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                        │ 29199.78ms │
│ Total Time (fix-hash-join-partitioned)   │ 27159.63ms │
│ Average Time (main)                      │   679.06ms │
│ Average Time (fix-hash-join-partitioned) │   631.62ms │
│ Queries Faster                           │         14 │
│ Queries Slower                           │          1 │
│ Queries with No Change                   │         28 │
│ Queries with Failure                     │          0 │
└──────────────────────────────────────────┴────────────┘
--------------------
Benchmark h2o_join.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ fix-hash-join-partitioned ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │    0.82 ms │                   0.66 ms │ +1.25x faster │
│ QQuery 2     │   26.58 ms │                  26.48 ms │     no change │
│ QQuery 3     │   25.73 ms │                  25.71 ms │     no change │
│ QQuery 4     │   21.92 ms │                  22.10 ms │     no change │
│ QQuery 5     │ 6678.28 ms │                6568.61 ms │     no change │
└──────────────┴────────────┴───────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                        │ 6753.34ms │
│ Total Time (fix-hash-join-partitioned)   │ 6643.54ms │
│ Average Time (main)                      │ 1350.67ms │
│ Average Time (fix-hash-join-partitioned) │ 1328.71ms │
│ Queries Faster                           │         1 │
│ Queries Slower                           │         0 │
│ Queries with No Change                   │         4 │
│ Queries with Failure                     │         0 │
└──────────────────────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ fix-hash-join-partitioned ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 469.22 ms │                 451.54 ms │     no change │
│ QQuery 2     │  93.19 ms │                  89.56 ms │     no change │
│ QQuery 3     │ 245.37 ms │                 245.78 ms │     no change │
│ QQuery 4     │ 202.82 ms │                 193.28 ms │     no change │
│ QQuery 5     │ 348.62 ms │                 339.03 ms │     no change │
│ QQuery 6     │ 127.61 ms │                 126.03 ms │     no change │
│ QQuery 7     │ 502.29 ms │                 483.22 ms │     no change │
│ QQuery 8     │ 417.75 ms │                 378.14 ms │ +1.10x faster │
│ QQuery 9     │ 634.01 ms │                 562.24 ms │ +1.13x faster │
│ QQuery 10    │ 331.93 ms │                 311.05 ms │ +1.07x faster │
│ QQuery 11    │  75.21 ms │                  67.87 ms │ +1.11x faster │
│ QQuery 12    │ 197.60 ms │                 184.62 ms │ +1.07x faster │
│ QQuery 13    │ 356.67 ms │                 335.05 ms │ +1.06x faster │
│ QQuery 14    │ 178.51 ms │                 167.70 ms │ +1.06x faster │
│ QQuery 15    │ 257.11 ms │                 232.88 ms │ +1.10x faster │
│ QQuery 16    │  65.33 ms │                  58.28 ms │ +1.12x faster │
│ QQuery 17    │ 650.07 ms │                 617.45 ms │ +1.05x faster │
│ QQuery 18    │ 985.80 ms │                 885.19 ms │ +1.11x faster │
│ QQuery 19    │ 284.72 ms │                 266.58 ms │ +1.07x faster │
│ QQuery 20    │ 233.45 ms │                 223.05 ms │     no change │
│ QQuery 21    │ 699.61 ms │                 652.03 ms │ +1.07x faster │
│ QQuery 22    │  83.26 ms │                  80.68 ms │     no change │
└──────────────┴───────────┴───────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                        │ 7440.16ms │
│ Total Time (fix-hash-join-partitioned)   │ 6951.27ms │
│ Average Time (main)                      │  338.19ms │
│ Average Time (fix-hash-join-partitioned) │  315.97ms │
│ Queries Faster                           │        13 │
│ Queries Slower                           │         0 │
│ Queries with No Change                   │         9 │
│ Queries with Failure                     │         0 │
└──────────────────────────────────────────┴───────────┘

alamb · 2025-08-20T20:03:45Z

🚀

adriangb · 2025-08-20T20:25:54Z

Looks like extended tests are failing: https://github.com/apache/datafusion/actions/runs/17108179889/job/48522491130

adriangb · 2025-08-20T21:01:10Z

I've put a fix up here: #17265

…apache#17197) (cherry picked from commit 64bc58d)

* Enable physical filter pushdown for hash joins (apache#16954) (cherry picked from commit b10f453) * Add ExecutionPlan::reset_state (apache#17028) * Add ExecutionPlan::reset_state Co-authored-by: Robert Ream <[email protected]> * Update datafusion/sqllogictest/test_files/cte.slt * Add reference * fmt * add to upgrade guide * add explain plan, implement in more plans * fmt * only explain --------- Co-authored-by: Robert Ream <[email protected]> * Add dynamic filter (bounds) pushdown to HashJoinExec (apache#16445) (cherry picked from commit ff77b70) * Push dynamic pushdown through CooperativeExec and ProjectionExec (apache#17238) (cherry picked from commit 4bc0696) * Fix dynamic filter pushdown in HashJoinExec (apache#17201) (cherry picked from commit 1d4d74b) * Fix HashJoinExec sideways information passing for partitioned queries (apache#17197) (cherry picked from commit 64bc58d) * disallow pushdown of volatile functions (apache#16861) * dissallow pushdown of volatile PhysicalExprs * fix * add FilteredVec helper to handle filter / remap pattern (#34) * checkpoint: Address PR feedback in https://github.com/apach... * add FilteredVec to consolidate handling of filter / remap pattern * lint * Add slt test for pushing volatile predicates down (#35) --------- Co-authored-by: Andrew Lamb <[email protected]> (cherry picked from commit 94e8548) * fix bounds accumulator reset in HashJoinExec dynamic filter pushdown (apache#17371) --------- Co-authored-by: Adrian Garcia Badaracco <[email protected]> Co-authored-by: Robert Ream <[email protected]> Co-authored-by: Jack Kleeman <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>

github-actions bot added sqllogictest SQL Logic Tests (.slt) physical-plan Changes to the physical-plan crate labels Aug 14, 2025

adriangb changed the title ~~disable HashJoinExec sideways information passing for partitioned queries~~ Fix HashJoinExec sideways information passing for partitioned queries Aug 14, 2025

nuno-faria reviewed Aug 15, 2025

View reviewed changes

kosiew reviewed Aug 18, 2025

View reviewed changes

adriangb requested a review from kosiew August 18, 2025 19:25

adriangb force-pushed the fix-hash-join-partitioned branch from 058953a to 279399b Compare August 19, 2025 13:07

jonathanc-n approved these changes Aug 19, 2025

View reviewed changes

kosiew reviewed Aug 20, 2025

View reviewed changes

alamb mentioned this pull request Aug 20, 2025

Split datafusion/physical-plan/src/joins/hash_join.rs into smaller modules #17260

Closed

alamb approved these changes Aug 20, 2025

View reviewed changes

2010YOUY01 reviewed Aug 20, 2025

View reviewed changes

adriangb added 16 commits August 20, 2025 13:47

Fix HashJoinExec sideways information passing for partitioned queries

a1a4eef

cleanup

51a3d31

pre-allocate

c93530b

make scope unreachable

e4baf7b

refactor

52cfc85

Rename method

212d2a6

fix locking

a00a84d

cleanup

6d25632

fixes

6e8e9f0

lint

c58cf60

revert upsie

bb48510

update snap

409540f

fmt

81b321f

add new test

879bae3

remove unused debug field

5b8c6a5

fmt

57dbec9

adriangb force-pushed the fix-hash-join-partitioned branch from 506d533 to 57dbec9 Compare August 20, 2025 18:47

adriangb merged commit 64bc58d into apache:main Aug 20, 2025
27 checks passed

adriangb deleted the fix-hash-join-partitioned branch August 20, 2025 19:21

adriangb mentioned this pull request Aug 20, 2025

Fix HashJoinExec test snapshot under force_hash_collisions=true #17265

Merged

jonathanc-n mentioned this pull request Aug 20, 2025

PROPOSAL Hash Join Spilling Proposal #17267

Open

LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Sep 2, 2025

Fix HashJoinExec sideways information passing for partitioned queries (…

2188d26

…apache#17197) (cherry picked from commit 64bc58d)

nuno-faria mentioned this pull request Sep 6, 2025

Dynamic filters blog post (rev 2) apache/datafusion-site#103

Merged

LiaCastaneda mentioned this pull request Sep 12, 2025

Consumer receives duplicate bound predicates when join mode is CollectLeft #17541

Closed

Fix HashJoinExec sideways information passing for partitioned queries #17197

Fix HashJoinExec sideways information passing for partitioned queries #17197

Uh oh!

Conversation

adriangb commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adriangb commented Aug 14, 2025

Uh oh!

adriangb commented Aug 15, 2025

Uh oh!

jonathanc-n commented Aug 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonathanc-n commented Aug 15, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nuno-faria commented Aug 15, 2025

Uh oh!

adriangb commented Aug 15, 2025

Uh oh!

adriangb commented Aug 15, 2025

Uh oh!

nuno-faria commented Aug 16, 2025

Uh oh!

adriangb commented Aug 16, 2025

Uh oh!

nuno-faria commented Aug 18, 2025

Uh oh!

adriangb commented Aug 18, 2025

Uh oh!

kosiew left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

adriangb commented Aug 18, 2025

Uh oh!

nuno-faria commented Aug 19, 2025

Uh oh!

adriangb commented Aug 19, 2025

Uh oh!

jonathanc-n left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adriangb commented Aug 19, 2025

Uh oh!

kosiew left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Aug 20, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

adriangb commented Aug 14, 2025 •

edited

Loading

jonathanc-n commented Aug 15, 2025 •

edited

Loading

kosiew left a comment •

edited

Loading