Description
Describe the bug
When using a pyarrow.dataset as your source and performing a DataFrame count operation, you get an error.
To Reproduce
You can point the snippet below at any Parquet file.
from datafusion import SessionContext
import pyarrow.dataset as ds
ctx = SessionContext()
file_path = "/some-path/datafusion-python/examples/tpch/data/lineitem.parquet"
pyarrow_dataset = ds.dataset([file_path])
ctx.register_dataset("pyarrow_dataset", pyarrow_dataset)
df = ctx.table("pyarrow_dataset").select("l_orderkey", "l_partkey", "l_linenumber")
df.limit(3).show()
df.count()
This generates the following output. The show() call is only there to demonstrate that the file is read correctly.
DataFrame()
+------------+-----------+--------------+
| l_orderkey | l_partkey | l_linenumber |
+------------+-----------+--------------+
| 1 | 155190 | 1 |
| 1 | 67310 | 2 |
| 1 | 63700 | 3 |
+------------+-----------+--------------+
Traceback (most recent call last):
File "/Users/tsaucer/src/personal/arrow_rs_dataset_read/count_dataset_read.py", line 16, in <module>
df.count()
File "/Users/tsaucer/src/personal/datafusion-python/python/datafusion/dataframe.py", line 507, in count
return self.df.count()
^^^^^^^^^^^^^^^
Exception: External error: Arrow error: External error: ArrowException: Invalid argument error: must either specify a row count or at least one column
Expected behavior
count() should return the number of rows in this dataset.
A workaround is to aggregate and count:
from datafusion import col, functions as f
df.aggregate([], [f.count(col("l_orderkey"))]).show()
Additional context
In my investigation, I found that we register Arrow datasets by creating a TableProvider in src/dataset.rs, and the execution calls then happen in src/dataset_exec.rs.
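The wording of the error hints at one possible cause, which is an assumption on my part and not something confirmed above: count() may request a projection with zero columns, and a batch with no columns cannot infer its own length unless a row count is passed explicitly. The sketch below is a pure-Python toy model of that situation. ToyBatch, make_batch, and project are hypothetical names for illustration only; they are not the Arrow or DataFusion API, and this is not the code in src/dataset_exec.rs.

```python
from dataclasses import dataclass


@dataclass
class ToyBatch:
    """Hypothetical stand-in for a record batch; not the Arrow API."""
    columns: dict
    num_rows: int


def make_batch(columns, row_count=None):
    """Build a ToyBatch, inferring the row count from the columns if possible."""
    if columns:
        lengths = {len(values) for values in columns.values()}
        if len(lengths) != 1:
            raise ValueError("all columns must have the same length")
        inferred = lengths.pop()
        if row_count is not None and row_count != inferred:
            raise ValueError("row_count does not match column length")
        return ToyBatch(columns, inferred)
    if row_count is None:
        # Mirrors the wording of the error in the traceback above.
        raise ValueError("must either specify a row count or at least one column")
    return ToyBatch({}, row_count)


def project(batch, names, keep_row_count):
    """Select columns by name; optionally carry the source row count along."""
    selected = {name: batch.columns[name] for name in names}
    return make_batch(selected, batch.num_rows if keep_row_count else None)


source = make_batch({"l_orderkey": [1, 1, 1], "l_partkey": [155190, 67310, 63700]})

# Projecting at least one column works whether or not the row count is passed.
assert project(source, ["l_orderkey"], keep_row_count=False).num_rows == 3

# An empty projection only works when the row count is passed explicitly.
assert project(source, [], keep_row_count=True).num_rows == 3
```

In this toy model, project(source, [], keep_row_count=False) raises the same message as the traceback. That would also be consistent with the workaround succeeding, since f.count(col("l_orderkey")) keeps one column in the projection.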