
Conversation

gianm
Contributor

@gianm gianm commented Jun 2, 2017

This is a follow-up to #4207, but taking a different approach. The line count is higher but that's mostly because this patch also adds important functionality (in particular, post-aggregation projections, and a lot of new functions) that the other patch didn't have.

The motivation is to make more kinds of queries possible. With the changes in this patch, these are all possible in both native queries and Druid SQL queries:

  • Grouping by functions of multiple columns (e.g. group by concat(lastName, ", ", firstName))
  • Filtering on functions of multiple columns (e.g. filter where x > y)
  • Post-aggregation projections on dimensions (e.g. group by mydim but include strlen(mydim) as a post-aggregation)
  • Ordering by post-aggregation projections (e.g. group by mydim but order by lower(mydim))
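As a sketch, the four capabilities above might be written in Druid SQL like this (table, column, and exact function names are illustrative, not taken verbatim from the patch):

```sql
-- Group by a function of multiple columns
SELECT CONCAT(lastName, ', ', firstName) AS fullName, COUNT(*) AS cnt
FROM mytable
GROUP BY CONCAT(lastName, ', ', firstName);

-- Filter on a function of multiple columns
SELECT COUNT(*) AS cnt FROM mytable WHERE x > y;

-- Post-aggregation projection on a dimension
SELECT mydim, CHARACTER_LENGTH(mydim) AS len FROM mytable GROUP BY mydim;

-- Order by a post-aggregation projection
SELECT mydim FROM mytable GROUP BY mydim ORDER BY LOWER(mydim);
```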

Main changes:

  • Use expressions as a projection layer for anything that can't be
    expressed using traditional Druid extractionFns. Sometimes they're
    embedded directly (like "expression" filters, builtin aggregators,
    or "expression" post-aggregators). Sometimes they're referenced
    through virtual columns (like dimensionSpecs, which can't innately
    reference functions of more than one column without the virtual
    column layer).
  • Add many new functions and operators, taking advantage of the
    expression capability (see the querying/sql.md doc).
  • Improve consistency of constant reduction and of casting by
    using Druid expressions for this instead of Calcite's RexExecutor.
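For example, the virtual-column and expression-filter layer described above might surface in a native groupBy query roughly like this (dataSource and column names are hypothetical, and the exact JSON fields may differ across Druid versions):

```json
{
  "queryType": "groupBy",
  "dataSource": "mytable",
  "intervals": ["2017-01-01/2018-01-01"],
  "granularity": "all",
  "virtualColumns": [
    {
      "type": "expression",
      "name": "fullName",
      "expression": "concat(lastName, ', ', firstName)"
    }
  ],
  "filter": {
    "type": "expression",
    "expression": "x > y"
  },
  "dimensions": ["fullName"],
  "aggregations": [
    { "type": "count", "name": "rows" }
  ]
}
```

Here the dimensionSpec refers to the virtual column `fullName`, since dimensionSpecs cannot innately reference functions of more than one column.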

@gianm gianm added this to the 0.10.2 milestone Jun 2, 2017
@gianm gianm force-pushed the sql-expression branch 2 times, most recently from 4afdb10 to 005accc Compare June 2, 2017 20:36
@fjy
Contributor

fjy commented Jun 2, 2017

👍 from my side for the design. I didn't look at the implementation, but I trust that it works.

@gianm gianm force-pushed the sql-expression branch from 005accc to cbf15d1 Compare June 2, 2017 20:44
@gianm
Contributor Author

gianm commented Jun 2, 2017

Note: this could be broken up into separate patches, but I wrote them together, and it would take extra effort to pull them apart since they are somewhat interdependent and share a motivation. I hope reviewing them together isn't too bad, but if that's not possible, it could be broken into one patch for each of:

  • adjusting expression null handling
  • adding the ExprMacroTable concept
  • adding string return for "expression" post-aggregator and virtual column
  • adding the "expression" filter
  • adding the new expression functions and macros
  • each of the two query changes
  • all the SQL changes together

If people feel that is necessary then I'd at least like to have the overall design reviewed first before breaking up the code.

@leventov
Member

leventov commented Jun 2, 2017

I vote for breaking this PR into multiple smaller PRs, because I'm not going to review this PR as a whole, but I would review some of the smaller parts.

The overall design looks good to me.

@jihoonson
Contributor

I feel this design is much better than #4207. And I also vote for breaking this PR into smaller ones. It is easy to miss some details when reviewing a large patch as a whole.

@b-slim
Contributor

b-slim commented Jun 5, 2017

Great proposal, and +1 on splitting this!

@gianm
Contributor Author

gianm commented Jun 5, 2017

Thanks @fjy @leventov @jihoonson @b-slim for reviewing the overall proposal. I will split it into smaller PRs.

@gianm gianm added the WIP label Jun 5, 2017
@gianm
Contributor Author

gianm commented Jun 5, 2017

Broke out one PR into #4365.

@gianm gianm force-pushed the sql-expression branch from f9f9a19 to a512ae1 Compare June 6, 2017 00:37
@gianm
Contributor Author

gianm commented Jun 6, 2017

Broke out #4366 and #4367 too.

@gianm gianm force-pushed the sql-expression branch 7 times, most recently from d2d0d48 to ea7566f Compare June 9, 2017 23:25
@gianm gianm mentioned this pull request Jun 14, 2017
@gianm
Contributor Author

gianm commented Jun 14, 2017

Broke out #4405.

@gianm
Contributor Author

gianm commented Jun 15, 2017

Broke out #4406.

@gianm
Contributor Author

gianm commented Jun 22, 2017

Broke out #4442.

@leventov leventov modified the milestones: 0.11.0, 0.10.2 Jun 26, 2017
@gianm gianm removed the WIP label Jun 28, 2017
@gianm
Contributor Author

gianm commented Jun 28, 2017

@fjy @jihoonson @leventov @b-slim this PR is finally down to one commit that only touches the SQL module. I removed the WIP label since it's ready for review again.

@gianm
Contributor Author

gianm commented Jun 29, 2017

Resolved conflicts by rebase since it looks like nobody has started reviewing yet.

@jihoonson
Contributor

@gianm thanks. I'll review today.

Contributor

@jihoonson jihoonson left a comment

@gianm great work! I have two more comments.

  • I tested a simple query like select sum(cast(l_linenumber, integer)) from druid.lineitem but the result was 0 even though select cast(l_linenumber, integer) from druid.lineitem returned a valid result. l_linenumber is defined as a string dimension. What is more weird is that I cannot reproduce this in unit tests. Do you have any idea?
  • Most of the work in this patch revolves around using Druid expressions as an internal projection layer for SQL. I guess this is to avoid adding a huge amount of code to cover SQL's flexible expressiveness. However, this means a SQL query is sometimes parsed into Calcite's AST, and then some parts of the AST are converted back into Druid expression strings. This causes additional parsing overhead, and if the query is exceptionally large (a few KB to MB), the overhead can be significant. There will be applications that use large queries if we support joins. What do you think about this?

The ORDER BY clause refers to columns that are present after execution of GROUP BY. It can be used to order the results
based on either grouping expressions or aggregated values. It can only be used together with GROUP BY.
Contributor

Interesting. Do we have to loosen this restriction in the future?

Contributor Author

I'm not aware of any plans to support ORDER BY for non-aggregation queries, but it could be done in principle.

and both will evaluate to true if `col` contains an empty string. Similarly, the expression `COALESCE(col1, col2)` will
return `col2` if `col1` is an empty string. While the `COUNT(*)` aggregator counts all rows, the `COUNT(expr)`
aggregator will count the number of rows where expr is neither null nor the empty string. String columns in Druid are
NULLable. Numeric columns are NOT NULL; if you query a numeric column that is not present in all segments of your Druid
Contributor

Numeric columns are NOT NULL;

I'm not sure what this means. Does this mean that null numeric values are internally represented by 0 instead of null?

Contributor Author

It means that numeric columns in a Druid table are reported as BIGINT NOT NULL or FLOAT NOT NULL. Of course, some segments might not have the column, and for those segments it's treated as if the column was present and all zeroes.

Contributor

Hmm. It's still not clear to me. Numeric columns are NOT NULL because null numeric values are cast to 0? FLOAT NOT NULL sounds like it works like the NOT NULL constraint.

Contributor Author

@gianm gianm Jun 30, 2017

I guess it's a matter of how you want to think about it.

What is really happening in the Druid runtime is that for columns that the SQL layer believes are numeric, a LongColumnSelector or FloatColumnSelector will get created. If the column doesn't actually exist in a particular segment, the StorageAdapter will generate a selector that returns all zeroes.

Maybe you could think of that as "casting null to zero". I'm not sure if that's the right way to think about it or not.

Does that make sense?

Contributor

Hmm, yeah. I think nulls in SQL mean missing values, so casting nulls to zero makes sense.
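A quick sketch of the null-handling behavior under discussion (hypothetical table and column names):

```sql
-- Druid treats the empty string like null here:
SELECT COALESCE(col1, col2) FROM mytable;  -- returns col2 when col1 = ''
SELECT COUNT(*) FROM mytable;              -- counts all rows
SELECT COUNT(col1) FROM mytable;           -- counts rows where col1 is neither null nor ''
```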

The following table describes how SQL types map onto Druid types during query runtime. Casts between two SQL types
that have the same Druid runtime type will have no effect, other than exceptions noted in the table. Casts between two
SQL types that have different Druid runtime types will generate a runtime cast in Druid. If a value cannot be properly
cast to another value, as in `CAST('foo' AS BIGINT)`, the runtime will substitute a default value.
Contributor

Is this default value used for null values?

Contributor Author

@gianm gianm Jun 30, 2017

Yes. I added a sentence to clarify this:

NULL values cast to non-nullable types will also be substituted with a default value (for example, nulls cast to numbers will be converted to zeroes).
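To illustrate the substitution with a hypothetical table name (the behavior follows the doc text quoted above):

```sql
-- 'foo' cannot be parsed as a number, so the runtime substitutes
-- the default value for BIGINT, which is 0
SELECT CAST('foo' AS BIGINT) FROM mytable;
```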

}
}

// Verify that all names are properly namespaced.
Contributor

Would you elaborate more on namespace? Is it related to the prefix check below?

Contributor Author

Yes, it is. I expanded the comment a bit.

return null;
}

return DruidExpression.fromFunctionCall(
Contributor

Why is the simpleExtraction null here unlike in TimeFloorOperatorConversion.applyTimestampFloor()?

Contributor Author

Since we have an extractionFn that can do floor, but not ceil. I added a comment about this.

import java.util.List;
import java.util.stream.Collectors;

public class TimeFloorOperatorConversion implements SqlOperatorConversion
Contributor

Why are there two operatorConversions for the floor operator, unlike ceil? Is it necessary to add another operatorConversion to handle dynamic granularity for the ceil operator?

Contributor Author

It's a special case because for time_floor, we can use an extractionFn if the granularity is known up front (i.e. if it's a literal). There is no such extractionFn for time_ceil. I added a comment about this.

}

MATH_TYPES = builder.build();
builder.put(SqlTypeName.BOOLEAN, ExprType.LONG);
Contributor

I guess this is because we currently don't support three-valued logic and it will make the future conversion easier. Am I right? If so, please add some comments for this.

Contributor Author

Yeah, booleans are treated as two valued in Druid expressions. I'll add a comment.

}

@Override
public boolean isSortByOrdinal()
Contributor

Is this described somewhere in documents? If not, please add it.

Contributor Author

It's described in Calcite's documentation. I'll add a comment anyway.

Contributor

Ah, I mean the documentation, not the javadoc. Maybe sql.md?

Contributor Author

Oh, I see. Sure, I'll add that.

}

@Override
public boolean isSortByAlias()
Contributor

Is this described somewhere in documents? If not, please add it.

Contributor Author

It's described in Calcite's documentation. I'll add a comment anyway.

import java.math.BigDecimal;
import java.util.List;

public class DruidRexExecutor implements RexExecutor
Contributor

Would you add some description?

Contributor Author

Sure, I added a javadoc.

@gianm
Contributor Author

gianm commented Jun 30, 2017

@jihoonson, thanks for your review! About your top level comments:

I tested a simple query like select sum(cast(l_linenumber, integer)) from druid.lineitem but the result was 0 even though select cast(l_linenumber, integer) from druid.lineitem returned a valid result. l_linenumber is defined as a string dimension. What is more weird is that I cannot reproduce this in unit tests. Do you have any idea?

Good catch… I'll add a test case for this. My guess is it's because the SQL layer has some code to detect when a cast is "unnecessary" to pass down to Druid, since it will be coerced at runtime anyway. In that case the Druid query uses a direct field access, which is often faster. I believe this is actually not ok when casting string to number for an aggregator, and that's where the problem comes from.

Most of the work in this patch revolves around using Druid expressions as an internal projection layer for SQL. I guess this is to avoid adding a huge amount of code to cover SQL's flexible expressiveness. However, this means a SQL query is sometimes parsed into Calcite's AST, and then some parts of the AST are converted back into Druid expression strings. This causes additional parsing overhead, and if the query is exceptionally large (a few KB to MB), the overhead can be significant. There will be applications that use large queries if we support joins. What do you think about this?

I think it's probably ok, since if a query is very large due to lots of joins, then I bet none of that will make it into Druid expressions anyway. If it's very large due to lots of complex expressions, then yes that will have to be parsed twice, but it should be dwarfed by execution overhead. (complex expressions are probably not cheap to execute at runtime)

@gianm
Contributor Author

gianm commented Jul 5, 2017

@jihoonson, I pushed an updated patch, including a fix and test case for the problem with casting that you noticed.

@jihoonson
Contributor

@gianm thanks for the update. Changes look good, but would you check the inspection failure?

@gianm
Contributor Author

gianm commented Jul 6, 2017

They seem spurious, as they're in sections of the code that I haven't changed. I tried restarting the TeamCity build.

@jihoonson
Contributor

@gianm yeah, it's fine now. The latest patch looks good to me.

@gianm
Contributor Author

gianm commented Jul 6, 2017

Thanks for the review @jihoonson!

@fjy @leventov @b-slim, any further comments?

@leventov
Member

leventov commented Jul 6, 2017

I'm not planning to review.
