Extension Types / User Defined Types #12644

Open
1 of 2 issues completed
@findepi

Description

Is your feature request related to a problem or challenge?

Currently DataFusion provides many built-in types, which are useful when building applications and query engines on top of DataFusion. However, even a plethora of built-in types is not enough. DataFusion lacks types that exist in other systems, which limits DataFusion's applicability as "LLVM for query engines".

For example, these types, commonly found in other systems, do not exist in DataFusion today:

  • char(n)
  • varchar(n)
  • timestamp with time zone (a pair of "point in time" + "time zone" information; found in Oracle, Trino, Snowflake, etc.)
    • DataFusion currently uses Arrow DataType, and the closest Arrow has is "timestamp(zone)", where every value is in the same zone
  • timestamp with local time zone (point in time without zone information; found in Spark, Hive, PostgreSQL)
    • DataFusion currently uses Arrow DataType, and the closest Arrow has is "timestamp(zone)" with, e.g., the UTC zone. However, a cast to varchar should behave differently for "timestamp(UTC)" and for "timestamp with local time zone"
  • time with time zone
  • JSON
    • DataFusion currently uses Arrow DataType, and the closest Arrow has is Utf8, potentially with some metadata information. Utf8 might be a perfect carrier type for JSON data, but "cast(json AS T)" and "cast(utf8 AS T)" are usually quite different operations
  • VARIANT (Open Variant Type for semi-structured data #10987)
  • geospatial Geometry types (Spatial data support #7859)
  • HLL (hyperloglog), digests (t-digest, q-digest, other statistical digests)
  • extensions for applications building on top of DF; including user defined types (UDT) ([Proposal] Support User-Defined Types (UDT) #7923)
    • ability to provide user-defined types is even broader than ability to provide extension types ("rust-defined types")
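
To make the JSON point above concrete, here is a minimal sketch of two logical types that share the same Utf8 carrier but cast to varchar differently. `LogicalType` and `cast_to_varchar` are hypothetical names, not DataFusion or arrow-rs APIs, and the JSON rule shown (unwrap a JSON string scalar, as some engines do) is an assumption for illustration; it does not handle escape sequences.

```rust
/// Hypothetical logical types that share one carrier (a UTF-8 string).
#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    Utf8,
    Json,
}

/// Cast a value to varchar, dispatching on the logical type
/// rather than on the carrier type (which is a string in both cases).
fn cast_to_varchar(logical: LogicalType, raw: &str) -> String {
    match logical {
        // A plain string casts to itself, quotes included.
        LogicalType::Utf8 => raw.to_string(),
        // Assumed rule: a JSON string scalar is unwrapped,
        // any other JSON document is returned as its JSON text.
        LogicalType::Json => {
            let text = raw.trim();
            if text.len() >= 2 && text.starts_with('"') && text.ends_with('"') {
                text[1..text.len() - 1].to_string()
            } else {
                text.to_string()
            }
        }
    }
}
```

With the same carrier bytes `"hello"` (quotes included), the Utf8 cast keeps the quotes while the JSON cast drops them, which is why the carrier type alone cannot select the right operation.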

Describe the solution you'd like

  1. Introduction of DataFusion's own type system
  2. Introduction of extensions to the DataFusion type system, allowing applications building on DataFusion to provide more types
    • extension types, like DataFusion's built-in types, need to use Arrow types as the "carrier type" for transporting data
    • Arrow type metadata woven into schema fields can be used to indicate the use of extension types to the client when data is returned to the user in Arrow form
    • for example, a "timestamp with time zone" type could be represented as Struct with two fields: point_in_time, time_zone
  3. Ability to dynamically find operations on types during function resolution or runtime
    • for example, CAST(array<T> AS varchar) needs to know how to perform cast(T AS varchar). It cannot delegate this logic fully to Arrow, because Arrow has no notion of extension types.
      • e.g. if "timestamp with time zone" uses a Struct as its carrier type, it still needs to define its own cast(... AS varchar). It cannot use the default cast(struct AS varchar).
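
The three points above can be sketched together as follows. This is an illustration under stated assumptions, not a proposed design: `Carrier`, `LogicalType`, `Value`, `TypeRegistry`, and the extension name `timestamp_with_time_zone` are all hypothetical and are not DataFusion or arrow-rs types. The only real-world detail used is the `ARROW:extension:name` field metadata key from the Arrow format. The rendered varchar form is a placeholder, not real timestamp formatting.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow carrier type (not the arrow-rs `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum Carrier {
    Int64,
    Utf8,
    Struct(Vec<(&'static str, Carrier)>),
}

/// Hypothetical logical type: a built-in, a named extension, or an array of T.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum LogicalType {
    BigInt,
    Extension(&'static str),
    Array(Box<LogicalType>),
}

/// Minimal runtime value for the sketch.
#[derive(Debug, Clone)]
enum Value {
    Int(i64),
    Str(String),
    Struct(Vec<Value>),
    List(Vec<Value>),
}

type CastToVarchar = fn(&Value) -> Option<String>;

/// Registry that could be consulted during function resolution or at runtime.
#[derive(Default)]
struct TypeRegistry {
    carriers: HashMap<LogicalType, Carrier>,
    casts: HashMap<LogicalType, CastToVarchar>,
}

impl TypeRegistry {
    fn register(&mut self, ty: LogicalType, carrier: Carrier, cast: CastToVarchar) {
        self.carriers.insert(ty.clone(), carrier);
        self.casts.insert(ty, cast);
    }

    /// Field metadata that advertises an extension type to Arrow clients.
    fn field_metadata(ty: &LogicalType) -> HashMap<String, String> {
        let mut metadata = HashMap::new();
        if let LogicalType::Extension(name) = ty {
            metadata.insert("ARROW:extension:name".to_string(), name.to_string());
        }
        metadata
    }

    /// cast(array<T> AS varchar) looks up cast(T AS varchar) in the registry
    /// instead of falling back to a carrier-level (struct) cast.
    fn cast_to_varchar(&self, ty: &LogicalType, value: &Value) -> Option<String> {
        match (ty, value) {
            (LogicalType::Array(elem), Value::List(items)) => {
                let parts: Option<Vec<String>> =
                    items.iter().map(|item| self.cast_to_varchar(elem, item)).collect();
                Some(format!("[{}]", parts?.join(", ")))
            }
            _ => self.casts.get(ty).and_then(|cast| cast(value)),
        }
    }
}

fn cast_bigint(value: &Value) -> Option<String> {
    match value {
        Value::Int(i) => Some(i.to_string()),
        _ => None,
    }
}

/// Type-specific cast for the Struct carrier {point_in_time, time_zone}.
fn cast_timestamp_tz(value: &Value) -> Option<String> {
    match value {
        Value::Struct(fields) => match fields.as_slice() {
            [Value::Int(point_in_time), Value::Str(zone)] => {
                Some(format!("{point_in_time} {zone}"))
            }
            _ => None,
        },
        _ => None,
    }
}

fn example_registry() -> TypeRegistry {
    let mut registry = TypeRegistry::default();
    registry.register(LogicalType::BigInt, Carrier::Int64, cast_bigint);
    registry.register(
        LogicalType::Extension("timestamp_with_time_zone"),
        Carrier::Struct(vec![("point_in_time", Carrier::Int64), ("time_zone", Carrier::Utf8)]),
        cast_timestamp_tz,
    );
    registry
}
```

Calling `cast_to_varchar` on `example_registry()` with an array of the extension type renders each element through `cast_timestamp_tz`, while a type with no registered cast yields `None`, which a planner could surface as a resolution error.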

Describe alternatives you've considered

Everything is built-in

DataFusion could provide all types needed by applications building on top of DataFusion as built-in DataFusion types.
This would be the easiest to implement, but could lead to scope creep for the project. It could also lead to conflicts where types look the same but the desired behavior differs between applications building on top of DataFusion. For example, Oracle's and Trino's "timestamp with time zone" can represent political zones, while Snowflake's allows only fixed offsets.

No-op

Not providing extension types. This would limit DataFusion's applicability.
DataFusion cannot be considered "LLVM for query engines" if it cannot serve as an engine, or potential engine, for existing popular query engines.

Additional context

The need to create extension types was raised in the [Proposal] Decouple logical from physical types

However, the introduction of DataFusion's own types does not require the introduction of extension types.
Extension types are complex enough (especially given their impact on functions) that they deserve their own roadmap issue.

Extension types clearly affect functions, including function resolution and function runtime, so this relates to the Simple Functions initiative:

Having ExtensionType in arrow-rs could make the implementation simpler:
