Closed
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
To make the interface more user-friendly, Expr.cast() should accept a Python type and attempt to convert it to the appropriate PyArrow data type. This is predicated upon pull request #750 being merged.
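A minimal sketch of how the conversion could work, not the actual implementation. The helper names (`arrow_factory_name`, `to_arrow_type`) and the choice of defaults (`int` to `int64`, `float` to `float64`, and so on) are assumptions made for illustration; only the `pyarrow` factory functions themselves (`pa.bool_`, `pa.int64`, `pa.float64`, `pa.string`, `pa.binary`) are existing API.

```python
from typing import Any

# Hypothetical default mapping from Python built-in types to the names of
# pyarrow factory functions. The chosen widths are an assumption.
_PYTHON_TO_ARROW_FACTORY = {
    bool: "bool_",
    int: "int64",
    float: "float64",
    str: "string",
    bytes: "binary",
}


def arrow_factory_name(py_type: type) -> str:
    """Return the name of the pyarrow factory for a Python built-in type."""
    try:
        return _PYTHON_TO_ARROW_FACTORY[py_type]
    except KeyError:
        raise TypeError(
            f"no default Arrow type for Python type {py_type!r}"
        ) from None


def to_arrow_type(to: Any):
    """Return a pyarrow.DataType for a Python type or an existing DataType."""
    import pyarrow as pa  # imported lazily; only needed for the real conversion

    if isinstance(to, pa.DataType):
        return to  # already an Arrow type, pass it through unchanged
    return getattr(pa, arrow_factory_name(to))()
```

With a helper like this, `Expr.cast(float)` could resolve its argument through `to_arrow_type` before casting, while calls that already pass a PyArrow type would keep working unchanged.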
Describe the solution you'd like
See the example below from @datapythonista.
Describe alternatives you've considered
The alternative is to leave the API as is, which remains workable.
Additional context
The following example, which bundles several enhancement requests, includes the desired use case (see the `.cast(float)` call):
import datafusion
from datafusion import col, lit, functions as f
import pyarrow

# something like this would be implemented internally, so users can call `datafusion.read_*`
def _read_parquet(*args, **kwargs):
    ctx = datafusion.SessionContext()
    return ctx.read_parquet(*args, **kwargs)

datafusion.read_parquet = _read_parquet  # creating an alias of `read_*` functions so users don't need to know about `SessionContext` when the defaults are fine

df = (datafusion.read_parquet("buildings.parquet")
    .filter(  # `.filter()` accepting multiple conditions (which will be an AND) instead of having to use `&` with its operator precedence problems
        col("is_offplan") == False,
        col("rooms") >= 2,  # `.lit(2)` not being required, and Python literals working with operators
    )
    .aggregate(
        [col("area_name_en")],
        [f.mean(col("has_parking").cast(float))],  # `.cast()` accepting Python types, which would be internally converted to the PyArrow equivalent
    )
    .select(
        col("area_name_en").alias("Area"),
        col("AVG(has_parking)").alias("Percentage of buildings with parking"),  # removing the default `?table?` in column names, the column name was "AVG(?table?.has_parking)"
    )
)
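The multi-condition `.filter()` mentioned in the comments above could be sketched as a fold over `&`. This is only an illustration: `combine_with_and` is a hypothetical helper name, and it works on any objects that support the `&` operator.

```python
from functools import reduce
import operator


def combine_with_and(*predicates):
    """Fold any number of predicates into one using the `&` operator."""
    if not predicates:
        raise ValueError("at least one predicate is required")
    # reduce applies `&` left to right: ((p0 & p1) & p2) & ...
    return reduce(operator.and_, predicates)
```

A variadic `filter(*predicates)` could then pass `combine_with_and(*predicates)` to the existing single-predicate filter, so callers would not need to parenthesize each comparison around `&` themselves.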