Skip to content

[DISCUSSION] Making it easier to use DataFusion (lessons from GlareDB) #13525

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

I recently watched the Biting the Bullet: Rebuilding GlareDB from the Ground Up (Sean Smith) video from the CMU database series by @scsmithr. I found it very informative (thanks for the talk)

My high level summary of the talk is "we are replacing DataFusion with our own engine, rewritten from the the ground up". It sounds like GlareDB they are still using DataFusion as the new engine is not yet at feature parity

From my perspective, it seems that GlareDB's decision is a result of their judgement on the relative benefits vs estimated costs between

  1. Developing a new engine by themselves (and the associated testing, maintenance, etc)
  2. Working to integrate DataFusion / work with the community to help make it better

Of course, not every organization / project will have the same tradeoffs, but several of the challenges that @scsmithr described in the talk I think are worth discussing to make them better.

Here is my summary of those challenges:

Effort required / Regressions during upgrades

The general sentiment was that it took a long time to upgrade to DataFusion versions. This is something I have heard from others (@paveltiunov for example from Cube). We also experience this at InfluxData

Also, he mentioned they had hit issues where queries that used to work did not after upgrade.

Possible Improvements

Dependency management

Glare DB uses Lance and delta-rs, which both use DataFusion, and currently this requires the version of DataFusion used by GlareDB to match the(he specifically cited delta-io/delta-rs#2886 which just finally merged)

For what it is worth, @matthewmturner and I hit the same thing in dft (see datafusion-contrib/datafusion-dft#150)

Possible Improvements

WASM support

I don't think he said explicitly that DataFusion couldn't do WASM, but it seemed to be implied

Thanks to the work from @jonmmease @waynexia and others, DataFusion absolutely be compiled to WASM (check out this cool example from @XiangpengHao: https://parquet-viewer.haoxp.xyz/) but maybe it needs to be better documented / explained in a blog

Implicit assumptions in LogicalPlans

Another thing that was mentioned was the challenge of writing custom optimizer rules was challenging because there were implicit assumptions (e.g. that column names were unique)

Possible Improvements

SQL Features

Repeated column names came up as an example feature that was missing in DataFusion.

select * from (values (1), (2)) v(a, a)

Possible Improvements

Distributed Planning and Execution

Planning:

The talk mentioned that the DataFusion planner was linear in the way it resolved references and the GlareDB system needed to resolve references from a remote catalog. Their solution seems to have been to fork the DataFusion SQL planner and make it async

Another approach that is taken by the SessionContext::sql Is:

  1. Does an initial pass through the parse tree to find all references (non async)
  2. Then fetch all references (can be async)
  3. Then does the planning (non async) with all the relevant references

I don't think this is particularly well documented

Possible Improvements

Execution:

The talk mentioned several times that GlareDB runs distributed queries and found it challenging to use a different number of threads on different executors (it sounds like maybe they split the ExecutionPlan, which already has a target number of partitions baked in). It sounds like their solution was to write a new scheduler / execution engine that didn't have the parallelism baked in

Another potential way to achieve a different threads per execution node is to do the distribution at the LogicalPlan level and then run the physical planner on each sub part of the LogicalPlan.

Possible Improvements

  • Maybe we could document that better / show an example of how to split / run a distributed plan

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions