March 17, 2025: This week(s) in DataFusion #15269

## Introduction
A weekly-ish summary of interesting things happening in DataFusion. Note this is not a complete list (it is what I remember / can find). Please leave comments on this ticket about things that I may have missed or that you think should get wider attention from the community.

Side note: I am depressed by the number of great PRs that are open but waiting on someone to help push them along. I spent some time summarizing and listing them below in hopes of getting others excited. I purposely listed the ones that need more help at the top (and the ones I am helping with at the bottom).

# Ongoing Projects

There are several substantial projects in various states. It would be great to get some more community eyes on these PRs, both to help review them and to help figure out which to prioritize.

## Google Summer Of Code (@oznur-synnada)
We are hosting a Google Summer of Code project, which has brought many new people to the community.

- More info https://github.com/apache/datafusion/issues/14577 

## `async` user defined functions (@goldmedal )

Imagine calling LLMs or making network requests from within user defined functions:
```sql
select ... from my_table where ask_gpt(city, 'this city is in asia');
```

```sql
-- fetch data from a remote URL
select wget(url) from (select distinct url from log);
```

- More info https://github.com/apache/datafusion/pull/14837

## Better user defined function interface (@Blizzara @shehabgamin @jayzhan211 )

- More info: https://github.com/apache/datafusion/issues/14652
- PRs: https://github.com/apache/datafusion/pull/15123 / https://github.com/apache/datafusion/pull/15049
- @Omega359 is adding config information to the functions: https://github.com/apache/datafusion/pull/13527

## 🔥 Spark Functions (@andygrove , @shehabgamin )
A bunch of DataFusion users (Sail, @Omega359, Comet, etc.) want to have Spark-compatible functions. We are working on getting the basics in place so we can collaborate on / maintain such a library together.

- More info https://github.com/apache/datafusion/issues/5600
- Initial PR: https://github.com/apache/datafusion/pull/15168

## Hardening sorting larger-than-memory datasets  (@2010YOUY01 @Kontinuation @zhuqi-lucas )
It seems like more and more people are (re)sorting large datasets (this seems common when reorganizing data).

- Epic: https://github.com/apache/datafusion/issues/15271
- PRs like https://github.com/apache/datafusion/pull/14975
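
For anyone new to the problem: sorting data that does not fit in memory is usually done as an external merge sort. You sort memory-sized chunks, spill each sorted run to disk, and then stream a k-way merge over the runs. Here is a minimal Python sketch of that general idea. It is not DataFusion's implementation, and `external_sort` and `chunk_size` are hypothetical names I made up for illustration.

```python
import heapq
import os
import tempfile


def external_sort(values, chunk_size):
    """Yield the ints in `values` in sorted order, buffering at most
    `chunk_size` of them in memory at a time.

    Hypothetical helper for illustration only; not DataFusion's code.
    """
    run_paths = []
    chunk = []

    def spill(sorted_chunk):
        # Write one sorted run to a temporary file, one value per line.
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in sorted_chunk)
        run_paths.append(path)

    # Phase 1: build sorted runs that each fit in memory.
    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            spill(sorted(chunk))
            chunk = []
    if chunk:
        spill(sorted(chunk))

    # Phase 2: stream a k-way merge over the spilled runs.
    files = [open(p) for p in run_paths]
    try:
        streams = [(int(line) for line in f) for f in files]
        yield from heapq.merge(*streams)
    finally:
        # Close and delete the spill files once the merge is done.
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

The merge phase only holds one value per run at a time, which is why the hard parts in a real engine tend to be about memory accounting and spill behavior rather than the merge itself.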


## Porting tests to use insta (@blaginin)
Imagine updating expected test output as easily as with sqllogictests (just run `cargo insta review`) ❤
@blaginin set up the basic infrastructure, has filed a bunch of tickets, and rallied the community, which is now hard at work cranking out the code.

- More info: https://github.com/apache/datafusion/issues/15178
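
If you have not used snapshot testing before, the workflow is roughly: the test compares its output to a stored snapshot, a mismatch is saved as a pending file, and a review step decides whether the new output becomes the expected one. The Python sketch below only illustrates that workflow. It is not how insta is implemented, and `assert_snapshot`, the `accept` flag, and the file naming are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def assert_snapshot(name, actual, snapshot_dir, accept=False):
    """Compare `actual` against the stored snapshot called `name`.

    Hypothetical helper: a mismatch is saved as a pending file and reported.
    With accept=True the new output becomes the expected output, which
    plays the role of the review step.
    """
    snapshot_dir = Path(snapshot_dir)
    expected_path = snapshot_dir / f"{name}.snap"
    pending_path = snapshot_dir / f"{name}.snap.new"

    if expected_path.exists() and expected_path.read_text() == actual:
        # Output matches the snapshot; drop any stale pending file.
        pending_path.unlink(missing_ok=True)
        return
    if accept:
        # "Review" accepted: the new output is now the expectation.
        expected_path.write_text(actual)
        pending_path.unlink(missing_ok=True)
        return
    # Keep the new output around so it can be reviewed later.
    pending_path.write_text(actual)
    raise AssertionError(f"snapshot {name!r} differs; review {pending_path}")


# Usage: record a snapshot once, then later runs compare against it.
snapshots = tempfile.mkdtemp()
assert_snapshot("plan", "FilterExec: x > 5\n", snapshots, accept=True)
assert_snapshot("plan", "FilterExec: x > 5\n", snapshots)
```

The point is that nobody edits expected output by hand: a changed plan or result shows up as a pending file to accept or reject.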

## Expression pushdown (@adriangb)
Some file formats / systems can efficiently push down expression evaluation to the table format (e.g. Vortex or JSON). DataFusion doesn't know how to do this yet, but it will!

- More info https://github.com/apache/datafusion/issues/15220
- https://github.com/apache/datafusion/issues/14993


## Metadata columns (@chenkovsky )
Imagine adding synthetic columns to your data source (like row number)

- More info: https://github.com/apache/datafusion/pull/14057



## Better integration with distributed tracing (@geoffreyclaude)
When using DataFusion in a distributed environment, passing tracing context down to the IO layer is important for performance analysis. @geoffreyclaude has a PR up to help thread this through.

- More info: https://github.com/apache/datafusion/issues/9415
- https://github.com/apache/datafusion/pull/14547 


## New IO interface (@Xuanwo)
@Xuanwo is thinking of an API for IO in DataFusion that is not tied to `object_store`.

- More info https://github.com/apache/datafusion/issues/14854


## Better Error Messages (@eliaperantoni )
Imagine error messages that show you where in the query the problem is 🤯
- More info https://github.com/apache/datafusion/issues/14429

## Changing default mapping `VARCHAR` --> `Utf8View` (rather than `Utf8`)
Imagine `CREATE TABLE foo(x varchar)` using `Utf8View` for `x`.

- More info: https://github.com/apache/datafusion/issues/15096#issuecomment-2727513988

## Predicate pushdown by default (@XiangpengHao)
Predicate pushdown is a long-standing feature of the Parquet reader; the goal here is to turn it on by default. It gets a 10-20% performance improvement for some queries.

- More info: https://github.com/apache/datafusion/issues/3463
- https://github.com/apache/arrow-rs/pull/6921
- Blog: https://blog.xiangpeng.systems/posts/parquet-pushdown
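
For context, the rough idea behind pushing predicates into the reader is to decode the filter column first, work out which rows pass, and then decode only those rows from the remaining columns. Below is a toy Python sketch of that idea. It is not the arrow-rs implementation: "decoding" is modelled as reading values out of plain lists, and both function names are hypothetical.

```python
def scan_without_pushdown(columns, filter_col, predicate, projection):
    """Decode every needed column in full, then filter the rows."""
    needed = {filter_col, *projection}
    decoded = {name: list(columns[name]) for name in needed}
    keep = [i for i, v in enumerate(decoded[filter_col]) if predicate(v)]
    rows = [tuple(decoded[name][i] for name in projection) for i in keep]
    values_decoded = sum(len(col) for col in decoded.values())
    return rows, values_decoded


def scan_with_pushdown(columns, filter_col, predicate, projection):
    """Decode the filter column first, then only the selected rows of the rest."""
    filter_values = list(columns[filter_col])
    keep = [i for i, v in enumerate(filter_values) if predicate(v)]
    values_decoded = len(filter_values)

    decoded = {filter_col: filter_values}
    for name in projection:
        if name != filter_col:
            # Only the rows that passed the predicate are decoded.
            decoded[name] = {i: columns[name][i] for i in keep}
            values_decoded += len(keep)

    rows = [tuple(decoded[name][i] for name in projection) for i in keep]
    return rows, values_decoded


# Usage: `SELECT y FROM t WHERE x > 5` over a tiny two-column table.
table = {"x": [1, 7, 3, 9], "y": ["a", "b", "c", "d"]}
rows_a, cost_a = scan_without_pushdown(table, "x", lambda v: v > 5, ["y"])
rows_b, cost_b = scan_with_pushdown(table, "x", lambda v: v > 5, ["y"])
assert rows_a == rows_b and cost_b < cost_a
```

How much this helps depends on how selective the filter is and how expensive the other columns are to decode, which fits with the improvement showing up for some queries rather than all of them.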

## Beautiful explain plans (@irenjj)
Imagine DuckDB-style explain plans:

```sql
> create table foo(x int) as values (1);
0 row(s) fetched.
Elapsed 0.013 seconds.

> explain format tree select x from foo where x > 5;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │      predicate: x > 5     │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         bytes: 112        │ |
|               | │       format: memory      │ |
|               | │          rows: 1          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.004 seconds.
```

- More info: https://github.com/apache/datafusion/issues/14914

## TPCH data generator (@clflushopt )

Imagine (with the correct column names):
```sql
-- Generate TPCH directly from datafusion-cli
select * from tpch_table('lineitem', 1)
+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+------------+------------+------------+-------------------+-----------+--------------------------------------------+-----------+
| column_1 | column_2 | column_3 | column_4 | column_5 | column_6 | column_7 | column_8 | column_9 | column_10 | column_11  | column_12  | column_13  | column_14         | column_15 | column_16                                  | column_17 |
+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+------------+------------+------------+-------------------+-----------+--------------------------------------------+-----------+
| 5007686  | 192006   | 4526     | 4        | 22       | 24156.0  | 0.09     | 0.01     | A        | F         | 1994-02-19 | 1994-03-23 | 1994-03-04 | NONE              | MAIL      | quickly permanent excuses according to the | NULL      |
| 5007686  | 32944    | 7951     | 5        | 20       | 37538.8  | 0.04     | 0.02     | A        | F         | 1994-03-04 | 1994-03-27 | 1994-03-14 | NONE
...
```

Also, I am going to scratch an itch I have had for 10+ years and generate tpch data with *ALL THE CORES* really fast so I don't have to wait around anymore. FYI @lmwnshn

More info:
- https://github.com/apache/datafusion/issues/14608
- https://github.com/clflushopt/tpchgen-rs


# Looking to get more involved? Please help review code! 🎣

DataFusion has a long history of community members [contributing in all aspects of the project](https://datafusion.apache.org/contributor-guide/index.html).  Reviewing PRs is an especially great way to get introduced to the project, help the community and grow your own knowledge  -- researching and understanding the code enough to review PRs also often inspires additional ideas for improvements.

We have [docs about reviews](https://datafusion.apache.org/contributor-guide/index.html#reviewing-pull-requests). TL;DR: look for test coverage, check whether the change is understandable and well documented, and consider whether the code can be improved. When you think the PR looks good to merge, try `@` mentioning [one of the committers](https://projects.apache.org/committee.html?datafusion).

## Help wanted
- I would love to see the community offer additional help with performance testing and triaging bugs, helping to make DataFusion a more stable foundation for building systems

Please feel free to leave your own comments on this ticket if you are looking for help

## Community 
* [Weekly Call](https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit#heading=h.kpjkpncdmt1g)
* Slack/Discord: [info links](https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord) 

## Upcoming meetups:
* Help schedule some!


