Description
Is your feature request related to a problem or challenge?
This is based on the wonderful writeup from @2010YOUY01 in #7977
As previously discussed in #7110 and #7752, there are a few challenges with how ScalarFunctions are handled, notably that there are two distinct implementations: BuiltinScalarFunction and ScalarUDF.
Problems with BuiltinScalarFunction
- As more functions are added, the total footprint of DataFusion grows, even for those who don't need the specific functions. This also acts to limit the number of functions built into DataFusion
- The desired semantics may differ between users (e.g. many of the built in functions in DataFusion mirror PostgreSQL behavior, but some users wish to mimic Spark behavior)
- User defined functions are treated differently from built in functions in some ways (e.g. they can't have aliases)
- Adding a new built in function requires modifications in multiple places, which makes the barrier overly high. Built-in functions are implemented with the enum BuiltinScalarFunction, and function implementations like return_type() are large methods that match every enum variant.
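To make the last point concrete, here is a minimal sketch of the enum-plus-match pattern. The names BuiltinFn and DataType are simplified, hypothetical stand-ins and not DataFusion's actual definitions; the point is only that each new function means touching every such method.

```rust
// Hypothetical, simplified stand-ins; not DataFusion's real types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
    Utf8,
}

// One variant per built-in function.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BuiltinFn {
    Abs,
    Sqrt,
    Upper,
}

impl BuiltinFn {
    // Each per-function property is a method matching on every variant,
    // so adding a variant means editing all of these methods.
    fn name(self) -> &'static str {
        match self {
            BuiltinFn::Abs => "abs",
            BuiltinFn::Sqrt => "sqrt",
            BuiltinFn::Upper => "upper",
        }
    }

    // Returns None when the argument type is not supported.
    fn return_type(self, arg: DataType) -> Option<DataType> {
        match (self, arg) {
            (BuiltinFn::Abs, DataType::Int64) => Some(DataType::Int64),
            (BuiltinFn::Abs, DataType::Float64) => Some(DataType::Float64),
            (BuiltinFn::Sqrt, DataType::Int64 | DataType::Float64) => Some(DataType::Float64),
            (BuiltinFn::Upper, DataType::Utf8) => Some(DataType::Utf8),
            _ => None,
        }
    }
}
```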
Problems with ScalarUDF
- The current implementation of ScalarUDF is a struct, and does not cover all the functionality of the existing built-in functions
- Defining a new ScalarUDF requires constructing a struct imperatively, providing Arc function pointers (see examples/simple_udf.rs) for each part of the UDF. This is unfamiliar to Rust users, for whom dyn Trait objects are more common
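As a rough illustration of the imperative style described above, the sketch below builds a UDF from individual Arc function pointers. FnPointerUdf, ReturnTypeFn, ScalarFn and the f64-only signature are hypothetical simplifications, not the real ScalarUDF definition.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-in for a data type enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
}

// Each "part" of the UDF is a separate reference-counted function pointer.
type ReturnTypeFn = Arc<dyn Fn(&[DataType]) -> Option<DataType> + Send + Sync>;
type ScalarFn = Arc<dyn Fn(&[f64]) -> Vec<f64> + Send + Sync>;

// Hypothetical struct-of-function-pointers UDF.
struct FnPointerUdf {
    name: String,
    return_type: ReturnTypeFn,
    fun: ScalarFn,
}

// The caller assembles the struct piece by piece.
fn make_add_one() -> FnPointerUdf {
    let return_type: ReturnTypeFn =
        Arc::new(|_args: &[DataType]| Some(DataType::Float64));
    let fun: ScalarFn =
        Arc::new(|vals: &[f64]| vals.iter().map(|v| v + 1.0).collect::<Vec<f64>>());
    FnPointerUdf {
        name: "add_one".to_string(),
        return_type,
        fun,
    }
}
```

Nothing ties the separate closures together at the type level, which is part of why a single trait implementation tends to read more naturally in Rust.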
Describe the solution you'd like
I propose moving DataFusion to only use ScalarUDFs and removing BuiltinScalarFunction. This will ensure:
- ScalarUDFs have access to all the same functionality as "built in" functions.
- No function specific code will escape the planning phase
- DataFusion's core can remain focused, and external libraries of packages can be used to customize its use.
We will keep the existing ScalarUDF interface as much as possible, while also potentially providing an easier way to define them (ideally via a trait object).
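A trait-based definition could look roughly like the following. This is only a sketch under assumptions: the trait name ScalarFunctionImpl, its methods, and the f64-only invoke signature are hypothetical and chosen for brevity, not a proposal for the final API.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-in for a data type enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
}

// Hypothetical trait: everything about one function lives in one impl.
trait ScalarFunctionImpl: Send + Sync {
    fn name(&self) -> &str;
    fn return_type(&self, arg_types: &[DataType]) -> Option<DataType>;
    fn invoke(&self, vals: &[f64]) -> Vec<f64>;
    // Optional capability with a default, so simple functions stay short.
    fn aliases(&self) -> &[&'static str] {
        &[]
    }
}

struct Sqrt;

impl ScalarFunctionImpl for Sqrt {
    fn name(&self) -> &str {
        "sqrt"
    }

    // Accepts exactly one numeric argument.
    fn return_type(&self, arg_types: &[DataType]) -> Option<DataType> {
        match arg_types {
            [DataType::Int64 | DataType::Float64] => Some(DataType::Float64),
            _ => None,
        }
    }

    fn invoke(&self, vals: &[f64]) -> Vec<f64> {
        vals.iter().map(|v| v.sqrt()).collect()
    }

    fn aliases(&self) -> &[&'static str] {
        &["square_root"]
    }
}

// The engine would only ever see the trait object.
fn sqrt_udf() -> Arc<dyn ScalarFunctionImpl> {
    Arc::new(Sqrt)
}
```

Default methods such as aliases() are one way new capabilities could be added later without breaking existing implementations.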
Describe alternatives you've considered
#7977 describes introducing a new trait and unifying both ScalarUDF and BuiltinScalarFunction with this trait.
This approach also allows gradually migrating existing built-in functions to the new trait, while the old UDF interface create_udf() can remain unchanged.
However, I think it is a bigger change for users, and risks making the overall complexity of DataFusion worse. As demonstrated in #8046, it is also feasible to allow new ScalarUDFs to be defined using a trait while retaining backwards compatibility for existing ScalarUDF implementations.
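One way such backwards compatibility might work is a small adapter that wraps an old-style function pointer in the new trait, so existing constructors keep their shape. The sketch below is illustrative only: ScalarFunctionImpl, FnAdapter, Udf and create_udf_compat are hypothetical names, and values are simplified to f64.

```rust
use std::sync::Arc;

// Hypothetical trait-based interface (simplified to f64 values).
trait ScalarFunctionImpl: Send + Sync {
    fn name(&self) -> &str;
    fn invoke(&self, vals: &[f64]) -> Vec<f64>;
}

// Old-style function pointer, as existing callers would supply it.
type ScalarFn = Arc<dyn Fn(&[f64]) -> Vec<f64> + Send + Sync>;

// Adapter that implements the trait by delegating to the function pointer.
struct FnAdapter {
    name: String,
    fun: ScalarFn,
}

impl ScalarFunctionImpl for FnAdapter {
    fn name(&self) -> &str {
        &self.name
    }

    fn invoke(&self, vals: &[f64]) -> Vec<f64> {
        (self.fun)(vals)
    }
}

// The user-facing struct stays, but internally holds only a trait object.
struct Udf {
    inner: Arc<dyn ScalarFunctionImpl>,
}

impl Udf {
    fn from_impl(f: impl ScalarFunctionImpl + 'static) -> Self {
        Udf { inner: Arc::new(f) }
    }

    fn name(&self) -> &str {
        self.inner.name()
    }

    fn invoke(&self, vals: &[f64]) -> Vec<f64> {
        self.inner.invoke(vals)
    }
}

// Hypothetical analogue of an existing constructor: callers are unchanged.
fn create_udf_compat(name: &str, fun: ScalarFn) -> Udf {
    Udf::from_impl(FnAdapter {
        name: name.to_string(),
        fun,
    })
}
```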
Additional context
Proposed implementation steps:
- Prototype ScalarUDF interface changes (make the fields non pub): RFC: Make fields of ScalarUDF non pub #8039
- Prototype what registering external packages would look like (by making a prototype for some BuiltinScalarFunctions): RFC: Demonstrate what a function package might look like -- encoding expressions #8046
- Propose ScalarUDF API changes for real: Make fields of ScalarUDF, AggregateUDF and WindowUDF non pub #8079
- Minor: Remove redundant BuiltinScalarFunction::supports_zero_argument() #8059
- List additional feature gaps between built in functions and ScalarUDFs and close them
- Minor: Cleanup BuiltinScalarFunction's phys-expr creation #8114
- Unify Expr::AggregateFunction and Expr::AggregateUDF #8346
- Rename expr::window_function::WindowFunction to WindowFunctionDefintion for consistency #8347
- Implement monotonicity for ScalarUDF #8756
- Support Expr creation for ScalarUDF: Resolve function calls by name during planning #8157
- Add ScalarUDFs in missing function hints / suggested errors #9392
- Specialized / Pre-compiled / Prepared ScalarUDFs #8051
- Clean internal implementation of ScalarUDF to use ScalarUDFImpl (rather than the function pointers) #8712
- Minor: Cleanup BuiltinScalarFunction::return_type() #8088
- Implement aliasing for ScalarUDF: Implement Aliases for ScalarUDF #8348
- Implement trait based ScalarUDF: Implement trait based API for defining ScalarUDFs #8568
- [DISCUSS] organization of functions #9100
- Create a new datafusion-function crate with an initial set of functions as a model (see RFC: Demonstrate what a function package might look like -- encoding expressions #8046)
- Create tickets for extracting the remaining lists of packages into the datafusion_functions crate, file tickets to track them ([Epic] Port BuiltInFunctons to datafusion-functions-* crates #9285)
- [Epic] Port BuiltInFunctons to datafusion-functions-* crates #9285
- Add FunctionRegistry::register_udaf and FunctionRegistry::register_udwf #9074
- File follow on tickets for applying the same treatment to AggregateUDF ([Epic] Unify AggregateFunction Interface (remove built in list of AggregateFunctions), improve the system #8708) and WindowUDF ([Epic] Unify WindowFunction Interface (remove built in list of BuiltInWindowFunctions) #8709)
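
Several of the steps above (aliases, resolving function calls by name during planning, and function packages that register themselves) fit together roughly as sketched below. Registry, Udf, register_udf, resolve and register_math_package are hypothetical names used for illustration; they are not the actual FunctionRegistry API.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical, minimal UDF: a name, optional aliases, and an f64 kernel.
struct Udf {
    name: &'static str,
    aliases: Vec<&'static str>,
    fun: fn(f64) -> f64,
}

// Hypothetical registry keyed by every name a function answers to.
#[derive(Default)]
struct Registry {
    functions: HashMap<String, Arc<Udf>>,
}

impl Registry {
    // Registers the function under its primary name and each alias.
    fn register_udf(&mut self, udf: Udf) {
        let udf = Arc::new(udf);
        for alias in &udf.aliases {
            self.functions.insert(alias.to_string(), Arc::clone(&udf));
        }
        self.functions.insert(udf.name.to_string(), udf);
    }

    // Planning-time lookup: a call is resolved purely by name.
    fn resolve(&self, name: &str) -> Option<Arc<Udf>> {
        self.functions.get(name).cloned()
    }
}

// A "function package" is just code that registers a set of functions,
// so users can opt in to only the packages they need.
fn register_math_package(registry: &mut Registry) {
    registry.register_udf(Udf {
        name: "abs",
        aliases: vec![],
        fun: f64::abs,
    });
    registry.register_udf(Udf {
        name: "sqrt",
        aliases: vec!["square_root"],
        fun: f64::sqrt,
    });
}
```

With this shape the planner needs no function-specific code: an unknown name simply fails to resolve, and a package that is never registered adds nothing to the footprint.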