Consumer receives duplicate bound predicates when join mode is CollectLeft #17541

@LiaCastaneda

Describe the bug

I see duplicated OR clauses on the DynamicPhysicalExpr that the consumer receives, for an execution plan like this:


ProjectionExec: expr=[c0@0 as c0, c1@1 as c1, c2@2 as c2]
  CoalescePartitionsExec: fetch=5
    CoalesceBatchesExec: target_batch_size=8192, fetch=5
      HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(c0@0, c32@32)]
        CoalesceBatchesExec: target_batch_size=8192
          FilterExec: c0@0 IS NOT NULL
            DataSourceExec: partitions=1, partition_sizes=[1]
        RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1
          CooperativeExec
            DataSourceExec: partitions=1

The bounds predicates arrive as 16 identical disjuncts, apparently one per (right-side) output partition:

(
  ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
  OR ("c32" >= 'db-01' AND "c32" <= 'keb-03')
)

This is probably related to this comment. I wrote some logic in the consumer node to deduplicate the predicates, but it seems worth handling in DataFusion itself.
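
For reference, the consumer-side workaround can be sketched roughly as follows. This is a minimal, standalone illustration and not DataFusion code: `Bounds`, `dedup_disjuncts`, and `render` are hypothetical names, and the sketch assumes each disjunct has already been extracted into a simple (column, min, max) triple rather than operating on real physical expressions.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for one per-partition bounds predicate:
/// `column >= min AND column <= max`. Not a DataFusion type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Bounds {
    column: String,
    min: String,
    max: String,
}

/// Drop repeated disjuncts while keeping first-seen order.
fn dedup_disjuncts(disjuncts: Vec<Bounds>) -> Vec<Bounds> {
    let mut seen = HashSet::new();
    disjuncts
        .into_iter()
        // `insert` returns false when the value was already present.
        .filter(|b| seen.insert(b.clone()))
        .collect()
}

/// Render the remaining disjuncts as a SQL-like OR expression.
fn render(disjuncts: &[Bounds]) -> String {
    disjuncts
        .iter()
        .map(|b| {
            format!(
                "(\"{c}\" >= '{lo}' AND \"{c}\" <= '{hi}')",
                c = b.column,
                lo = b.min,
                hi = b.max
            )
        })
        .collect::<Vec<_>>()
        .join(" OR ")
}
```

Applied to the 16 identical entries shown above, this leaves a single range check on "c32". It only removes exact duplicates; overlapping but non-identical ranges would need separate merging logic.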

Following the code, in CollectLeft mode the number of output predicates is derived from the right side's partition count. But if I understand correctly, CollectLeft collects the left side into a single partition, so in theory every right-side partition will see the same bounds?
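
To make that reasoning concrete, here is a toy model of the behaviour as I understand it. It is not taken from the DataFusion source: `key_bounds`, `bounds_as_reported`, and `bounds_once` are hypothetical helpers, and the build-side keys used with them are illustrative. The assumption is that the build side is collected once, so its min/max are fixed before any probe partition runs.

```rust
/// Min/max of the join keys on the build (left) side, if there are any.
fn key_bounds<'a>(keys: &[&'a str]) -> Option<(&'a str, &'a str)> {
    let min = keys.iter().copied().min()?;
    let max = keys.iter().copied().max()?;
    Some((min, max))
}

/// Toy model of what is observed: the single collected build side yields
/// one bounds entry per probe (right-side) partition, all of them equal.
fn bounds_as_reported<'a>(
    build_keys: &[&'a str],
    probe_partitions: usize,
) -> Vec<(&'a str, &'a str)> {
    match key_bounds(build_keys) {
        Some(bounds) => vec![bounds; probe_partitions],
        None => Vec::new(),
    }
}

/// What would seem sufficient in CollectLeft under this model:
/// a single entry, independent of the probe partition count.
fn bounds_once<'a>(build_keys: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    key_bounds(build_keys).into_iter().collect()
}
```

With 16 probe partitions, `bounds_as_reported` returns 16 copies of the same pair, while `bounds_once` returns one. Both describe the same filter once the entries are ORed together, which is why the duplication looks avoidable in this mode.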

To Reproduce

No response

Expected behavior

No response

Additional context

No response

Labels: bug