Description
Is your feature request related to a problem or challenge?
Introduction
This ticket is my weekly-ish summary of interesting things happening in DataFusion. Note this is not a complete list (it is what I remember / can find). Please leave comments on this ticket about things that I may have missed or you think should get wider attention by the community. Follow on to #13970
Reminder: find new content (and please post some!) on the Concepts, Readings, Events page
Community Highlights
- The recording and slides are available from the 2025 Jan 24 Amsterdam meetup: DISCUSSION: January 2025 DataFusion Meetup in Amsterdam / CIDR 2025 #12988
- We are victims of our own success. At the time of writing there are over 50 PRs in various states of review; check out the list. The more help reviewing the better 🙏
- @comphead is working on a new frontpage: [EPIC] Redesign DataFusion main page #14389
- Lessons from CMU, courtesy of @lmwnshn [DISCUSSION] Lowering the barrier to new users (Lessons from-799 CMU Optimizer Class) #14373
- Papers we love NYC is reading the DataFusion paper this week
- @edmondop added a link to a job board: Added job board as a separate header in the documentation #14191
- @jonbjo noted new user funnel.io: doc: Add funnel.io to known users list #14316
Releases!
- DataFusion 45 Release candidate is available. I have a great feeling about this one thanks to all the help testing from @shehabgamin @kevinjqliu @Omega359 and others
- Arrow Minor release completed: Release arrow-rs / parquet minor version 54.1.0 (Jan 2025) arrow-rs-object-store#27 (among other things has even faster parquet reading)
Performance
DataFusion's core value proposition is great performance without having to re-implement it yourself
- @pmcgleenon ran the numbers, and DataFusion 44 is 🌶 on ClickBench: Update ClickBench benchmarks with DataFusion `44.0.0` #13983 (45 is even better)
- @XiangpengHao has a way to make parquet reading faster. We are looking for help testing. See Experimental parquet decoder with first-class selection pushdown support arrow-rs#6921
- @UBarney made reverse faster: Faster reverse() string function for ASCII-only case #14195
- @jatin510: Implemented `simplify` for the `starts_with` function to convert it into a LIKE expression. #14119
- @buraksenn, @ozankabak and @berkaysynnada contributed make AnalysisContext aware of empty sets to represent certainly false bounds #14279 and some sweet sort based optimizations
- @rluvaton made `array_agg` faster 🚀 perf(array-agg): add fast path for array agg for `merge_batch` #14299
- @Rachelint improved median a lot: Improve speed of `median` by implementing special `GroupsAccumulator` #13681 🚀
- And so did @2010YOUY01: perf: Improve `median` with no grouping by 2X #14399
- @pepijnve added feat: Speed up `struct` and `named_struct` using `invoke_with_args` #14276
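For anyone curious what the `starts_with` to LIKE rewrite mentioned above amounts to, here is a minimal Python sketch. It is not the code from #14119: the helper name `starts_with_to_like` is hypothetical, and it assumes backslash is the LIKE escape character. The point it illustrates is that the prefix must have its `%` and `_` characters escaped before a trailing `%` is appended, otherwise the LIKE pattern would match more than the original function call.

```python
def starts_with_to_like(column: str, prefix: str, escape: str = "\\") -> str:
    """Build a LIKE predicate equivalent to starts_with(column, prefix).

    Hypothetical helper for illustration only; not a DataFusion API.
    Assumes `escape` (backslash by default) is the LIKE escape character.
    """
    escaped = []
    for ch in prefix:
        # `%` and `_` are LIKE wildcards, and the escape character itself
        # must also be escaped so the prefix is matched literally.
        if ch in ("%", "_", escape):
            escaped.append(escape)
        escaped.append(ch)
    # Double single quotes so the pattern is a valid SQL string literal,
    # then append `%` to match any suffix.
    pattern = "".join(escaped).replace("'", "''") + "%"
    return f"{column} LIKE '{pattern}'"


print(starts_with_to_like("url", "https://"))
print(starts_with_to_like("name", "100%_"))
```

The second call shows why the escaping step matters: without it, the `%` and `_` inside the prefix would be treated as wildcards.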
Quality
Testing
- @wiedld added Logical and Physical plan invariants: Interface for physical plan invariant checking. #13986
- @himadripal added add tests to check precision loss fix #14284
- @logan-keede fixed `--complete`: fix: run sqllogictest with complete #14254
- @buraksenn added minor: add unit tests for monotonicity.rs #14307
- @duongcongtoai test: add regression test for unnesting dictionary encoded columns #14395
Bug Fixes
DataFusion is in the "we are finding all the corner case bugs now" phase of its life and people are now bashing them down
- @xudong963 fixed several limit pushdown bugs: fix: fetch is missed in the EnsureSorting #14192
- @jatin510 Add casting of `count` to `UInt64` in `array_repeat` function to ensure consistent integer type handling #14236
- @waynexia fix: add support for Decimal128 and Decimal256 types in interval arithmetic #14126
- @dhegberg fix: LogicalPlan::get_parameter_types fails to return all placeholders #14312
- @zhuqi-lucas fix: FULL OUTER JOIN and LIMIT produces wrong results #14338
- @zhuqi-lucas 🐛 🔨 fix: LimitPushdown rule uncorrect remove some GlobalLimitExec #14245
- @findepi Core: Fix incorrect searched CASE optimization #14349
- @findepi Core: Fix UNION field nullability tracking #14356
- @jkosh44 bug: Fix NULL handling in array_slice, introduce `NullHandling` enum to `Signature` #14289
- @cht42: Fix `null` input in `map_keys/values` #14401
- @Omega359 and I fixed a bunch of coercion stuff: Fix regression list Type Coercion List with inner type struct which has large/view types #14385, Update REGEXP_MATCH scalar function to support Utf8View #14449
- @zhuqi-lucas (again!) fix: Limits are not applied correctly #14418
Docs
Build time
- We are starting to look more seriously at build time: Build time regression #14256 (thanks @waynexia)
Cleanups 🧹
Now that we have a large, useful codebase it is also important to keep it neat and tidy, so we spend a non-trivial amount of time there too.
- Moved physical-optimizer into its own crate (finally!): thanks to @logan-keede @berkaysynnada and @buraksenn. See Move `EnforceDistribution` into `datafusion-physical-optimizer` crate #14190, move projection pushdown optimization logic to ExecutionPlan trait #14235, etc
- @Chen-Yuan-Lai is on a tear using NullBufferBuilder instead of BooleanBufferBuilder: refactor: switch BooleanBufferBuilder to NullBufferBuilder in correlation function #14181 et al
- @Kimahriman fixed null handling for `array_has`: Make scalar and array handling for array_has consistent #13683
- @logan-keede has been consolidating code: consolidation of examples: date_time_functions #14240
- @logan-keede also completed [Epic] Extract catalog functionality from the core to make it more modular #10782 🚧
Features
We can have nice things! (Error messages)
- @eliaperantoni added support for source code locations for errors: Add related source code locations to errors #13664 and has organized a project to add more support: [EPIC] Attach `Diagnostic` to more errors #14429
- We started publishing the `datafusion-sqllogictest` crate to help testing in `iceberg-rust`: Publishing `datafusion_sqllogictest` as a crate. #14229 (thanks to @liurenjie1024 for the great idea)
- @jayzhan211 unified advanced UDF argument handling: Introduce `return_type_from_args` for ScalarFunction. #14094
- @gatesn added support for `SUM` statistics: Add `ColumnStatistics::Sum` #14074
- @erenavsarogullari added Support arrays_overlap function (alias of `array_has_any`) #14217
- @timsaucer made FFI hopefully more usable with async code: FFI support for versions and alternate tokio runtimes #13937
- @davisp made insert work in FFI: Add `TableProvider::insert_into` into FFI Bindings #14391
Coming soon: Extension Types
Misc
- @Spaarsh gave us 👾 🚢 `<=>` in Support spaceship operator (`<=>`) support (alias for `IS NOT DISTINCT FROM`) #14187
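Since `<=>` is described above as an alias for `IS NOT DISTINCT FROM`, here is a small Python sketch of how that differs from plain `=`. It models SQL NULL as `None`; the function names `sql_eq` and `null_safe_eq` are hypothetical and only illustrate the standard SQL semantics, not DataFusion's implementation.

```python
from typing import Any, Optional


def sql_eq(a: Any, b: Any) -> Optional[bool]:
    """SQL `=`: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a: Any, b: Any) -> bool:
    """`a IS NOT DISTINCT FROM b`: two NULLs compare as equal, and the
    result is always true or false, never NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


# Compare the two operators on a few representative inputs.
for a, b in [(1, 1), (1, 2), (1, None), (None, None)]:
    print(f"a={a!r} b={b!r}  '=': {sql_eq(a, b)!r}  '<=>': {null_safe_eq(a, b)!r}")
```

The two only disagree when at least one input is NULL, which is exactly the case where a null-safe comparison is useful, for example in join keys that may contain NULLs.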
Looking to get more involved? Please help review code! 🎣
DataFusion has a long history of community members contributing in all aspects of the project. Reviewing PRs is an especially great way to get introduced to the project, help the community and grow your own knowledge -- researching and understanding the code enough to review PRs also often inspires additional ideas for improvements.
We have docs about reviews. TLDR: check for test coverage, whether the change is understandable and well documented, and whether the code can be improved. When you think the PR looks good to merge, try @ mentioning one of the committers.
Help wanted
- I would love to see the community offer additional help testing and triaging bugs, helping to make DataFusion a more stable foundation for building systems
Please feel free to leave your own comments on this ticket if you are looking for help
Community
- Weekly Call
- Slack/Discord: info links
Upcoming meetups:
- Help schedule some!