
Error on dataframe count using arrow dataset #800

Closed
@timsaucer

Describe the bug
Calling count() on a DataFrame whose source is a pyarrow.dataset raises an error.

To Reproduce
Point the snippet below at any Parquet file.

from datafusion import SessionContext
import pyarrow.dataset as ds

ctx = SessionContext()
file_path = "/some-path/datafusion-python/examples/tpch/data/lineitem.parquet"
pyarrow_dataset = ds.dataset([file_path])

# Register the pyarrow dataset as a table and select a few columns.
ctx.register_dataset("pyarrow_dataset", pyarrow_dataset)
df = ctx.table("pyarrow_dataset").select("l_orderkey", "l_partkey", "l_linenumber")

df.limit(3).show()  # works: prints the first three rows
df.count()          # fails with the exception shown below

This produces the following output. The show() call is there to demonstrate that the file is read correctly.

DataFrame()
+------------+-----------+--------------+
| l_orderkey | l_partkey | l_linenumber |
+------------+-----------+--------------+
| 1          | 155190    | 1            |
| 1          | 67310     | 2            |
| 1          | 63700     | 3            |
+------------+-----------+--------------+
Traceback (most recent call last):
  File "/Users/tsaucer/src/personal/arrow_rs_dataset_read/count_dataset_read.py", line 16, in <module>
    df.count()
  File "/Users/tsaucer/src/personal/datafusion-python/python/datafusion/dataframe.py", line 507, in count
    return self.df.count()
           ^^^^^^^^^^^^^^^
Exception: External error: Arrow error: External error: ArrowException: Invalid argument error: must either specify a row count or at least one column

Expected behavior
count() should return the number of rows in this dataset.

A workaround is to aggregate and count a column:

from datafusion import col, functions as f
df.aggregate([], [f.count(col("l_orderkey"))]).show()

Additional context
From my investigation: Arrow datasets are registered by creating a TableProvider in src/dataset.rs, and the execution calls then happen in src/dataset_exec.rs.
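The error text suggests one possible cause, which is an assumption here and not something confirmed in this report: count() needs no column values, so the scan may request zero columns, and a record batch built with no columns and no explicit row count has no way to know its length. The toy model below is plain Python, not the arrow-rs or datafusion-python API; ToyBatch and make_batch are hypothetical names used only to illustrate that validation rule.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyBatch:
    """Hypothetical stand-in for a record batch (not a real Arrow type)."""
    columns: Dict[str, List[int]]
    num_rows: int


def make_batch(columns: Dict[str, List[int]],
               row_count: Optional[int] = None) -> ToyBatch:
    # With no columns and no explicit row count the length is unknowable.
    if not columns and row_count is None:
        raise ValueError(
            "must either specify a row count or at least one column"
        )
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    # Infer the length from the columns, or fall back to the explicit count.
    inferred = lengths.pop() if lengths else row_count
    if row_count is not None and inferred != row_count:
        raise ValueError("row count does not match column length")
    return ToyBatch(columns, inferred)


# A projected batch with columns carries its own length.
with_columns = make_batch({"l_orderkey": [1, 1, 1]})

# A zero-column batch is only valid when the row count is passed along.
zero_columns = make_batch({}, row_count=3)
```

Under that assumption, a fix would need to carry the row count alongside any zero-column batch produced by the dataset scan.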

Labels: bug (Something isn't working)
