-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Unnest Correlated Subquery #17110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Unnest Correlated Subquery #17110
Conversation
Decorrelate SubqueriesComplete decorrelation consists of three parts:
1. Rewrite Dependent JoinRewrite dependent join will rewrite plan_nodes containing subquery expressions into dependent joins, to facilitate subsequent processing of dependent joins, with the ultimate goal of eliminating dependent joins. DependentJoinRewriter structure: pub struct DependentJoinRewriter {
// ID of the currently traversed node, incremented on each f_down(), decremented on each f_up().
current_id: usize,
// Depth of the current DependentJoin,
// incremented by 1 when encountering nested DependentJoin in f_down(),
// decremented by 1 when processing DependentJoin in f_up().
subquery_depth: usize,
// current_id -> Node
// Maintains an IndexMap structure that stores all plan_nodes currently traversed.
nodes: IndexMap<usize, Node>,
// all the node ids from root to the current node
stack: Vec<usize>,
// track for each column, the nodes plan that reference to its within the tree
all_outer_ref_columns: IndexMap<Column, Vec<ColumnAccess>>,
} It performs pre-order traversal of the operator tree, then rewrites the traversed operators through f_down() and f_up(). It assigns a unique current_id to each plan_node when traversing. 1.1 f_downf_down checks whether there are Subqueries in the expressions of each plan_node, and marks plan_nodes containing Subqueries as is_dependent_join_node. 2. Decorrelate Dependent JoinDecorrelate dependent join performs decorrelation on the results generated by rewrite dependent join. Decorrelate dependent join is basically implemented according to the code in the paper, but with slight differences. The entry function is In the decorrelate() function, it performs pre-order traversal of the plan_node tree until it finds a DependentJoin: fn decorrelate(plan: &LogicalPlan, ...) -> Result<LogicalPlan> {
if let LogicalPlan::DependentJoin(djoin) = plan {
// handle dependent join
} else {
Ok(plan
.clone()
.map_children(|n| Ok(Transformed::yes(self.decorrelate(&n, true, 0)?)))?
.data)
}
} Processing DependentJoin consists of several major steps:
DependentJoinDecorrelator structure: pub struct DependentJoinDecorrelator {
// All correlated column information for the current depth.
domains: IndexSet<CorrelatedColumnInfo>,
// outer table column -> delim scan table column
correlated_map: IndexMap<Column, Column>,
// Whether the current DependentJoin is a nested DependentJoin
is_initial: bool,
// all correlated columns in current depth and downward
correlated_columns: Vec<CorrelatedColumnInfo>,
// check if we have to replace any COUNT aggregates into "CASE WHEN X IS NULL THEN 0 ELSE COUNT END"
// store a mapping between a expr and its original index in the logplan output
replacement_map: IndexMap<String, Expr>,
// if during the top down traversal, we observe any operator that requires
// joining all rows from the lhs with nullable rows on the rhs
any_join: bool,
delim_scan_id: usize,
// All columns of the previously constructed delim scan.
dscan_cols: Vec<Column>,
} 2.1. Handle Left ChildFirst handle the left child, divided into two cases:
Then based on the mapping relationship in self.correlated_map, rewrite outer table columns to delim scan table columns. 2.1.1. Non-Nesting DependentJoinFor non-nested DependentJoin, it recursively calls decorrelate(left), with the purpose of inheriting the information of the current DependentJoinDecorrelator. Since it's a non-nesting DependentJoin, no changes need to be made to the correlated column information. 2.1.2. Nesting DependentJoinFor Nesting DependentJoin, it determines whether the LHS has correlated columns at the current depth (corresponding to accessing in the paper). The logic for determining the existence of correlated columns is in detect_correlated_expressions, which traverses the plan, finds all outer ref expressions for each plan_node, and checks whether they are in the domains of the current DecorrelateDependentJoin. The domains record all correlated columns of the current depth.
2.2. Handle Right ChildProcessing the RHS constructs a new DependentJoinDecorrelator. The domains of the new DependentJoinDecorrelator are parent correlated columns of current level + dependent join of current level. By inheriting the correlated columns of the parent node, it handles multi-level dependent joins. Since the RHS of DependentJoin must have correlated columns, the right child is directly processed through push_down_dependent_join. 2.3. Join ConditionIn delim_join_conditions, different join types are constructed based on the subquery type in DependentJoin, and then Join Conditions of correlated column IsNotDistinctFrom delim scan column are constructed. 2.4 Unnest FunctionAll Unnest logic is implemented in push_down_dependent_join, with different processing for each type of plan_node. 3. Eliminate Duplicated DelimGetThe above two steps completed the decorrelation operation and introduced two new logical operators Construct a new use case to illustrate how the work at this stage is carried out: SELECT s.name
FROM students s
WHERE s.grade > (
-- subquery 1
SELECT AVG(e1.score)
FROM exams e1
WHERE e1.sid = s.id
AND e1.score > (
-- subquery 2
SELECT MIN(e2.score)
FROM exams e2
WHERE e2.sid = s.id
AND e2.type = 'final'
)
); After decorrelating the above SQL, the result is generated:
First, all DelimJoins and DelimGets under DelimJoins are collected as candidates. For the joins array under each candidate, they are sorted by depth from largest to smallest, finding the DelimGet with the deepest depth.
Process the deduplication of DelimGet under each candidate separately. Taking candidate1 as an example:
The situation is divided into two categories:
|
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?