Description
Introduction
A weekly-ish summary of interesting things happening in DataFusion. Note this is not a complete list (it is what I remember / can find). Please leave comments on this ticket about things that I may have missed or you think should get wider attention by the community.
Side note: I am depressed with the number of great PRs that are open, but waiting on someone to help push them along. I spent some time trying to summarize them / listing them below in hopes of getting others excited. I purposly listed them with the ones that need more help at the top (and ones I am helping at the bottom)
Ongoing Projects
There are several substantial projects in various states. It would be great to get some more community eyes on these PRs -- both to help review, as well as to help figure out which to prioritize
Google Summer Of Code (@oznur-synnada)
We are hosting a Google Summer of Code project which has brought many new people to the community
async
user defined functions (@goldmedal )
Imagine calling llm functions or network from functions
select ... from my_table where ask_gpt(city, 'this city is in asia');
-- fetch data from a remote URL
select wget(url) from (select distinct url from log);
Better user defined function interface (@Blizzara @shehabgamin @jayzhan211 )
- more infor: Deprecate and eventually remove
ScalarUDF::invoke_batch
#14652 - chore: remove deprecated variants of UDF's invoke (invoke, invoke_no_args, invoke_batch) #15123 / fix: mark ScalarUDFImpl::invoke_batch as deprecated #15049
- @Omega359 adding config information to the functions: feat: Add ConfigOptions to ScalarFunctionArgs #13527
🔥 Spark Functions (@andygrove , @shehabgamin )
A bunch of DataFusion users (Sail, @Omega359 , Comet, etc) want to have spark compatibile functions. We are working on getting the basics in place so we can collaborate / maintain such a library togeter.
- More info [DISCUSSION] Add separate crate to cover spark builtin functions #5600
- Initial PR: feat: Add
datafusion-spark
crate #15168
Hardening sorting larger-than-memory datasets (@2010YOUY01 @Kontinuation @zhuqi-lucas )
It seems like more and more people are (re) sorting large datasets (seems common for reorganization).
- epic: [EPIC] A collection of tickets for improving sorting larger than memory datasets / spilling sorts #15271
- PRs like feat: Add config
max_temp_directory_size
to limit max disk usage for spilling queries #14975
porting tests to use insta (@blaginin)
Imagine: update expected tests as easily as sqllogictests (just run cargo insta review
) ❤
@blaginin setup the basic infrastructure, has filed a bunch of tickets, and rallied the community which is now hard at work cranking out the code
Expression pushdown @adriangb
Some file formats / systems can efficiently push down expression evaluation to the table format (e.g. Vortex, or json). DataFusion doesn't know how to do this yet, but it will!
- More info Support default values for columns in SchemaAdapter #15220
- Support Push down expression evaluation in
TableProviders
#14993
Metadata columns (@chenkovsky )
Imagine adding synthetic columns to your data source (like row number)
- more info feat: metadata columns #14057
Better integration with distributed tracing (@geoffreyclaude)
When using DataFusion in a distributed environment passing through context down to the IO is important for performance analysis. @geoffreyclaude has a PR up to help thread this down
- More info: Support "Tracing" / Spans #9415
- feat: introduce
JoinSetTracer
trait for tracing context propagation in spawned tasks #14547
New IO interface (@Xuanwo)
@Xuanwo is thinking of an API for IO in datafusion that is not tied to object_store
.
Better Error Messages (@eliaperantoni )
Imagine error messages that showed you where in the query the problem was 🤯
Changing default mapping VARCHAR
--> Utf8View
(rather than Utf8
)
Imagine CREATE TABLE foo(x varchar)
will use Utf8View
for x.
Predicate pushdown by default (@XiangpengHao)
Long standing feature in parquet reader. This gets 10-20% performance improvement for some queries
- More info: Enable parquet filter pushdown by default #3463
- Experimental parquet decoder with first-class selection pushdown support arrow-rs#6921
- Blog: https://blog.xiangpeng.systems/posts/parquet-pushdown
Beautiful expalin plans (@irenjj)
Imagine: duckdb style explain plans:
> create table foo(x int) as values (1);
0 row(s) fetched.
Elapsed 0.013 seconds.
> explain format tree select x from foo where x > 5;
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ CoalesceBatchesExec │ |
| | │ -------------------- │ |
| | │ target_batch_size: │ |
| | │ 8192 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ FilterExec │ |
| | │ -------------------- │ |
| | │ predicate: x > 5 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ DataSourceExec │ |
| | │ -------------------- │ |
| | │ bytes: 112 │ |
| | │ format: memory │ |
| | │ rows: 1 │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.004 seconds.
TPCH data generator (@clflushopt )
Imagine (with the correct column names):
-- Generate TPCH directly from datafusion-cli
select * from tpch_table('lineitem', 1)
+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+------------+------------+------------+-------------------+-----------+--------------------------------------------+-----------+
| column_1 | column_2 | column_3 | column_4 | column_5 | column_6 | column_7 | column_8 | column_9 | column_10 | column_11 | column_12 | column_13 | column_14 | column_15 | column_16 | column_17 |
+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+------------+------------+------------+-------------------+-----------+--------------------------------------------+-----------+
| 5007686 | 192006 | 4526 | 4 | 22 | 24156.0 | 0.09 | 0.01 | A | F | 1994-02-19 | 1994-03-23 | 1994-03-04 | NONE | MAIL | quickly permanent excuses according to the | NULL |
| 5007686 | 32944 | 7951 | 5 | 20 | 37538.8 | 0.04 | 0.02 | A | F | 1994-03-04 | 1994-03-27 | 1994-03-14 | NONE
...
Also, I am going to scratch an itch I have had for 10+ years and generate tpch data with ALL THE CORES really fast so I don't have to wait around anymore. FYI @lmwnshn
More info:
- Make it easier to run TPCH queries with datafusion-cli #14608
- https://github.com/clflushopt/tpchgen-rs
Looking to get more involved? Please help review code! 🎣
DataFusion has a long history of community members contributing in all aspects of the project. Reviewing PRs is an especially great way to get introduced to the project, help the community and grow your own knowledge -- researching and understanding the code enough to review PRs also often inspires additional ideas for improvements.
We have docs about reviews. TLDR is: look for test coverage, if the change is understandable and well documented, and if the code can be improved. When you think the PR looks good to merge, try @
mentioning one of the committers.
Help wanted
- I would love to see the community offer additional help performance testing, triaging bugs helping to make DataFusion a more stable foundation for building systems
Please feel leave your own comments on this ticket if you are looking for help
Community
- Weekly Call
- Slack/Discord: info links
Upcoming meetups:
- Help schedule some!