[DISCUSSION] Making it easier to use DataFusion (lessons from GlareDB)

### Is your feature request related to a problem or challenge?

I recently watched the [Biting the Bullet: Rebuilding GlareDB from the Ground Up (Sean Smith)](https://www.youtube.com/watch?v=Sor3KZpmbHg&list=PLSE8ODhjZXjZc2AdXq_Lc1JS62R48UX2L&index=9) video from the CMU database series by @scsmithr. I found it very informative (thanks for the talk)

My high level summary of the talk is "we are replacing DataFusion with our own engine, rewritten from the the ground up". It sounds like GlareDB they are still using DataFusion as the new engine is not yet at feature parity

From my perspective, it seems that GlareDB's decision is a result of their judgement on the relative benefits vs estimated costs between
1. Developing a new engine by themselves (and the associated testing, maintenance, etc)
2. Working to integrate DataFusion / work with the community to help make it better

Of course, not every organization / project will have the same tradeoffs, but several of the challenges that @scsmithr described in the talk I think are worth discussing to make them better.

Here is my summary of those challenges:

# Effort required / Regressions during upgrades 
The general sentiment was that it took a long time to upgrade to DataFusion versions. This is something I have heard from others (@paveltiunov  for example from Cube). We also experience this at InfluxData

Also, he mentioned they had hit issues where queries that used to work did not after upgrade.

**Possible Improvements**
- [ ] More testing via additional test coverage (it would be great to see some explicit investment in this area)
- [x] https://github.com/apache/datafusion/issues/13470
- [ ] Discussion on being more deliberate with API changes / slowing down breaking
- [ ] Discuss / document ways to make it easier for users (e.g. @jacksonrnewhouse mentioned keeping to the higher levels made upgrades a lot easier)

# Dependency management

Glare DB uses Lance and [delta-rs](https://github.com/delta-io/delta-rs), which both use DataFusion, and currently this requires the version of DataFusion used by GlareDB to match the(he specifically cited https://github.com/delta-io/delta-rs/pull/2886 which *just* finally merged)

For what it is worth, @matthewmturner  and I hit the same thing in dft (see https://github.com/datafusion-contrib/datafusion-dft/pull/150)

**Possible Improvements**
* I think the FFI work that @timsaucer  is working on (e.g. https://github.com/apache/datafusion/pull/12920 / https://github.com/delta-io/delta-rs/pull/3012) / some examples of how it can be used to integrate delta and iceberg and lance built with different DataFusion versions would be super helpful.

# WASM support
I don't think he said explicitly that DataFusion couldn't do WASM, but it seemed to be implied

Thanks to the work from @jonmmease @waynexia and others, DataFusion absolutely be compiled to WASM (check out this cool example from @XiangpengHao: https://parquet-viewer.haoxp.xyz/) but maybe it needs to be better documented / explained in a blog

- [ ] https://github.com/apache/datafusion/issues/13715

# Implicit assumptions in LogicalPlans
Another thing that was mentioned was the challenge of writing custom optimizer rules was challenging because there were implicit assumptions (e.g. that column names were unique)

**Possible Improvements**
- [ ] Better document what those assumptions are
- [ ] For example, https://github.com/apache/datafusion/issues/12736

# SQL Features
Repeated column names came up as an example feature that was missing in DataFusion.

```sql
select * from (values (1), (2)) v(a, a)
```


**Possible Improvements**
- [ ] I think @findepi is also thinking about this as part of  https://github.com/apache/datafusion/issues/12723 

# Distributed Planning and Execution

## Planning:
The talk mentioned that the DataFusion planner was linear in the way it resolved references and the GlareDB system needed to resolve references from a remote catalog. Their solution seems to have been to fork the DataFusion SQL planner and make it `async`

Another approach that is taken by the [`SessionContext::sql`](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.sql) Is:
1. Does  an initial pass through the parse tree to find all references (non async)
2. Then fetch all references (can be `async`)
3. Then does the planning (non async) with all the relevant references

I don't think this is particularly well documented

**Possible Improvements**
- [x] https://github.com/apache/datafusion/issues/13714


## Execution:
The talk mentioned several times that GlareDB runs distributed queries and found it challenging to use a different number of threads on different executors (it sounds like maybe they split the `ExecutionPlan`, which already has a target number of partitions baked in). It sounds like their solution was to write a new scheduler  / execution engine that didn't have the parallelism baked in

 Another potential way to achieve a different threads per execution node  is to do the distribution at the `LogicalPlan` level and then run the physical planner on each sub part of the LogicalPlan. 

**Possible Improvements**
- [ ] Maybe we could document that better / show an example of how to split / run a distributed plan

### Describe the solution you'd like

_No response_

### Describe alternatives you've considered

_No response_

### Additional context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[DISCUSSION] Making it easier to use DataFusion (lessons from GlareDB) #13525

Is your feature request related to a problem or challenge?

Effort required / Regressions during upgrades

Dependency management

WASM support

Implicit assumptions in LogicalPlans

SQL Features

Distributed Planning and Execution

Planning:

Execution:

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[DISCUSSION] Making it easier to use DataFusion (lessons from GlareDB) #13525

Description

Is your feature request related to a problem or challenge?

Effort required / Regressions during upgrades

Dependency management

WASM support

Implicit assumptions in LogicalPlans

SQL Features

Distributed Planning and Execution

Planning:

Execution:

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions