improve structure, list user benefits more clearly, add faq #26

Merged · 3 commits · Jul 6, 2023
132 changes: 92 additions & 40 deletions web/pandas/pdeps/0010-required-pyarrow-dependency.md
This PDEP proposes that:

- … environment when pandas is imported. This will ensure that only one warning is raised and users can
easily silence it if necessary (see the sketch after this list). This warning will point to the feedback issue.
- Starting in pandas 3.0, the default type inferred for string data will be `ArrowDtype` with `pyarrow.string`
instead of `object`. Additionally, the dtypes listed below will be inferred as their PyArrow equivalents instead of being stored as `object`.
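
For instance, the proposed warning could be silenced with the standard library's `warnings` machinery. A sketch follows; the message filter is illustrative, and should match whatever text the actual warning uses:

```python
import warnings

# Hypothetical filter for the proposed FutureWarning about PyArrow becoming required.
# Set it BEFORE importing pandas, since the warning is raised at import time.
warnings.filterwarnings("ignore", message=".*[Pp]yarrow.*", category=FutureWarning)

import pandas as pd  # imported after the filter, so no warning surfaces
```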

This will bring **immediate benefits to users**, as well as opening the door to significant further
benefits in the future.

## Background

accelerate PyArrow-backed data in pandas, notably string and datetime types.

As of pandas 2.0, users can feasibly use PyArrow as an alternative data representation to NumPy, with advantages such as:

1. Consistent `NA` support for all data types;
2. Broader support of data types such as `decimal`, `date` and nested types;
3. Better interoperability with other dataframe libraries based on Arrow.
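
For illustration, here is a minimal sketch of these advantages using the opt-in `ArrowDtype` API available since pandas 2.0 (assuming `pyarrow` is installed):

```python
from decimal import Decimal

import pandas as pd
import pyarrow as pa

# 1. Consistent NA support: missing values are pd.NA for every Arrow-backed dtype.
ser = pd.Series([1.5, None], dtype=pd.ArrowDtype(pa.float64()))
print(ser[1] is pd.NA)  # True

# 2. Broader type support, e.g. a fixed-precision decimal column.
dec = pd.Series(
    [Decimal("1.10"), Decimal("2.20")],
    dtype=pd.ArrowDtype(pa.decimal128(10, 2)),
)
print(dec.dtype)  # decimal128(10, 2)[pyarrow]
```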

## Motivation

While all the functionality described in the previous paragraph is currently optional, PyArrow is already
deeply integrated into many areas of pandas. Our roadmap notes that pandas strives for better Apache Arrow
interoperability [^1], and many projects [^2], within and beyond the Python ecosystem, are adopting or
interacting with the Arrow format. Making PyArrow a required dependency therefore provides an additional
signal of confidence in the Arrow ecosystem, as well as improving pandas' interoperability with it.

### Immediate User Benefit 1: pyarrow strings

Currently, when users pass string data into pandas constructors without specifying a data type, the resulting data type
is `object`, which has severe memory and performance implications.
With pyarrow string support available since 1.2.0, requiring pyarrow for 3.0 will allow pandas to default
the inferred type to the more efficient pyarrow string type.

```python
In [1]: import pandas as pd

In [2]: pd.Series(["a"]).dtype
# Today:
Out[2]: dtype('O')

# Proposed default in pandas 3.0:
Out[2]: string[pyarrow]
```


Dask developers investigated performance and memory of pyarrow strings [here](https://www.coiled.io/blog/pyarrow-strings-in-dask-dataframes),
and found them to be a significant improvement over the current `object` dtype: in their benchmark, pyarrow strings required roughly a third of the original memory.

Little demo:
```python
import string
import random

import pandas as pd

# Setup sketch: one million random strings, held as object dtype and as pyarrow strings.
ser_object = pd.Series(
    ["".join(random.choices(string.ascii_letters, k=10)) for _ in range(1_000_000)]
)
ser_string = ser_object.astype("string[pyarrow]")

In[4]: %timeit ser_string.str.startswith("a")
11 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Another advantage is I/O: PyArrow engines in pandas can provide a significant speedup. Currently, the data
are cast to NumPy dtypes, which requires a round trip when converting back to PyArrow strings explicitly,
and this hinders performance.

### Immediate User Benefit 2: Nested Datatypes

Currently, if you try storing `dict`s in a pandas `Series`, you will again get the inefficient `object` dtype:
```python
In [6]: pd.Series([{'a': 1, 'b': 2}, {'a': 2, 'b': 99}])
Out[6]:
0 {'a': 1, 'b': 2}
1 {'a': 2, 'b': 99}
dtype: object
```

If `pyarrow` were required, this could be auto-inferred as `pyarrow.struct`, which again
would bring memory and performance improvements (see the sketch below).
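
For illustration, the pyarrow-backed representation is already available today when requested explicitly; the proposal would simply make it the inferred default (a sketch, assuming pandas >= 2.0 with `pyarrow` installed):

```python
import pandas as pd
import pyarrow as pa

# Opt in to a pyarrow struct dtype explicitly; the proposal would infer this automatically.
ser = pd.Series(
    [{"a": 1, "b": 2}, {"a": 2, "b": 99}],
    dtype=pd.ArrowDtype(pa.struct([("a", pa.int64()), ("b", pa.int64())])),
)
print(ser.dtype)  # struct<a: int64, b: int64>[pyarrow]
```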

### Immediate User Benefit 3: Interoperability

Other Arrow-backed dataframe libraries are growing in popularity. Having the same memory representation
would improve interoperability with them, as operations such as:
```python
import pandas as pd
import polars as pl

df = pd.DataFrame(
{
'a': ['one', 'two'],
'b': [{'name': 'Billy', 'age': 3}, {'name': 'Bob', 'age': 4}],
}
)
pl.from_pandas(df)
```
could be zero-copy. Users who rely on multiple dataframe libraries would be able to
switch between them more easily.

### Future User Benefits

Requiring PyArrow would simplify related development within pandas and potentially improve upon NumPy-based
functionality that is better served by PyArrow, including:

- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations

- NumPy object dtype will be avoided as much as possible. This means that every dtype that has a PyArrow equivalent is inferred automatically as such. This includes:
- decimal
- binary
- nested types (list or dict data)
- strings
- time
- date

#### Developer benefits

First, this would simplify development of pyarrow-backed datatypes, as it would avoid
optional dependency checks.
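
As a rough sketch (illustrative code, not pandas' actual internals), this is the kind of guard that a hard requirement would make unnecessary:

```python
# Today, pyarrow-dependent code paths need a fallback:
try:
    import pyarrow as pa

    HAS_PYARROW = True
except ImportError:
    HAS_PYARROW = False


def infer_arrow_type(values: list):
    # With PyArrow required, the flag and the fallback branch could be deleted.
    if not HAS_PYARROW:
        return None  # caller falls back to NumPy object dtype
    return pa.array(values).type
```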

Second, it could potentially remove redundant functionality:
- fastparquet engine in `read_parquet`;
- potentially simplifying the `read_csv` logic (needs more investigation);
- factorization;
- datetime/timezone ops.

## Drawbacks

Including PyArrow would naturally increase the installation size of pandas. For example, when installing pandas
and PyArrow using pip from wheels, numpy and pandas require about `70MB`, and PyArrow requires an additional `120MB`.
This increase in installation size would have negative implications for using pandas in space-constrained development
or deployment environments such as AWS Lambda.

Additionally, if a user is installing pandas in an environment where wheels are not available through a `pip install` or `conda install`,
Lastly, pandas development and releases will need to be mindful of PyArrow's development and release cadence. For example, when
supporting a newly released Python version, pandas will also need to be mindful of PyArrow's wheel support for that Python version
before releasing a new pandas version.

## F.A.Q.

**Q: Why can't pandas just use numpy string and numpy void datatypes instead of pyarrow string and pyarrow struct?**

**A**: NumPy strings aren't yet available, whereas pyarrow strings are. The NumPy void datatype would differ from pyarrow struct,
and would not bring the same interoperability benefit with other Arrow-based dataframe libraries.

**Q: Are all pyarrow dtypes ready? Isn't it too soon to make them the default?**

**A**: They will likely be ready by 3.0; however, we're not making them the default (yet).
For example, `pd.Series([1, 2, 3])` will continue to be auto-inferred to be
`np.int64`. We will only change the default for dtypes which currently have no `numpy`-backed equivalent and which are
stored as `object` dtype, such as strings and nested datatypes.
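
Concretely, the intended before/after (the pandas 3.0 behavior shown is the proposal, not current behavior):

```python
import pandas as pd

pd.Series([1, 2, 3]).dtype    # int64 today, and still int64 in 3.0
pd.Series(["a", "b"]).dtype   # object today; string[pyarrow] proposed for 3.0
pd.Series([{"a": 1}]).dtype   # object today; a pyarrow struct proposed for 3.0
```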

### PDEP-10 History

- 17 April 2023: Initial version