improve structure, list user benefits more clearly, add faq #26

Merged · 3 commits · Jul 6, 2023
132 changes: 92 additions & 40 deletions web/pandas/pdeps/0010-required-pyarrow-dependency.md
This PDEP proposes that:

- … environment when pandas is imported. This will ensure that only one warning is raised and users can
easily silence it if necessary (see the sketch after this list). This warning will point to the feedback issue.
- Starting in pandas 3.0, the default type inferred for string data will be `ArrowDtype` with `pyarrow.string`
instead of `object`. Additionally, the dtypes listed below will be inferred as their PyArrow equivalents instead of being stored as `object`.
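
For instance, the proposed warning could be silenced with the standard library's `warnings` machinery. A sketch follows; the message filter is illustrative, and should match whatever text the actual warning uses:

```python
import warnings

# Hypothetical filter for the proposed FutureWarning about PyArrow becoming required.
# Set it BEFORE importing pandas, since the warning is raised at import time.
warnings.filterwarnings("ignore", message=".*[Pp]yarrow.*", category=FutureWarning)

import pandas as pd  # imported after the filter, so no warning surfaces
```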

This will bring **immediate benefits to users**, as well as opening the door to significant further
benefits in the future.

## Background

accelerate PyArrow-backed data in pandas, notably string and datetime types.

As of pandas 2.0, users can feasibly use PyArrow as an alternative data representation to NumPy, with advantages such as:

1. Consistent `NA` support for all data types;
2. Broader support of data types such as `decimal`, `date` and nested types;
3. Better interoperability with other dataframe libraries based on Arrow.
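
For illustration, here is a minimal sketch of these advantages using the opt-in `ArrowDtype` API available since pandas 2.0 (assuming `pyarrow` is installed):

```python
from decimal import Decimal

import pandas as pd
import pyarrow as pa

# 1. Consistent NA support: missing values are pd.NA for every Arrow-backed dtype.
ser = pd.Series([1.5, None], dtype=pd.ArrowDtype(pa.float64()))
print(ser[1] is pd.NA)  # True

# 2. Broader type support, e.g. a fixed-precision decimal column.
dec = pd.Series(
    [Decimal("1.10"), Decimal("2.20")],
    dtype=pd.ArrowDtype(pa.decimal128(10, 2)),
)
print(dec.dtype)  # decimal128(10, 2)[pyarrow]
```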

## Motivation

While all the functionality described in the previous paragraph is currently optional, PyArrow is already
deeply integrated into many areas of pandas. Our roadmap notes that pandas strives for better Apache Arrow
interoperability [^1], and many projects [^2], within and beyond the Python ecosystem, are adopting or
interacting with the Arrow format. Making PyArrow a required dependency therefore provides an additional
signal of confidence in the Arrow ecosystem, as well as improving pandas' interoperability with it.

### Immediate User Benefit 1: pyarrow strings

Currently, when users pass string data into pandas constructors without specifying a data type, the resulting data type
is `object`, which has severe memory and performance implications.
With pyarrow string support available since 1.2.0, requiring pyarrow for 3.0 will allow pandas to default
the inferred type to the more efficient pyarrow string type.

```python
In [1]: import pandas as pd

In [2]: pd.Series(["a"]).dtype
# Today:
Out[2]: dtype('O')

# Proposed default in pandas 3.0:
Out[2]: string[pyarrow]
```


Dask developers investigated performance and memory of pyarrow strings [here](https://www.coiled.io/blog/pyarrow-strings-in-dask-dataframes),
and found them to be a significant improvement over the current `object` dtype: in their benchmark, pyarrow strings required roughly a third of the original memory.

Little demo:
```python
import string
import random

import pandas as pd

# Setup sketch: one million random strings, held as object dtype and as pyarrow strings.
ser_object = pd.Series(
    ["".join(random.choices(string.ascii_letters, k=10)) for _ in range(1_000_000)]
)
ser_string = ser_object.astype("string[pyarrow]")

In[4]: %timeit ser_string.str.startswith("a")
11 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Another advantage is I/O: PyArrow engines in pandas can provide a significant speedup. Currently, the data
are cast to NumPy dtypes, which requires a round trip when converting back to PyArrow strings explicitly,
and this hinders performance.

### Immediate User Benefit 2: Nested Datatypes

Currently, if you try storing `dict`s in a pandas `Series`, you will again get the inefficient `object` dtype:
```python
In [6]: pd.Series([{'a': 1, 'b': 2}, {'a': 2, 'b': 99}])
Out[6]:
0 {'a': 1, 'b': 2}
1 {'a': 2, 'b': 99}
dtype: object
```

If `pyarrow` were required, this could be auto-inferred as `pyarrow.struct`, which again
would bring memory and performance improvements (see the sketch below).
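
For illustration, the pyarrow-backed representation is already available today when requested explicitly; the proposal would simply make it the inferred default (a sketch, assuming pandas >= 2.0 with `pyarrow` installed):

```python
import pandas as pd
import pyarrow as pa

# Opt in to a pyarrow struct dtype explicitly; the proposal would infer this automatically.
ser = pd.Series(
    [{"a": 1, "b": 2}, {"a": 2, "b": 99}],
    dtype=pd.ArrowDtype(pa.struct([("a", pa.int64()), ("b", pa.int64())])),
)
print(ser.dtype)  # struct<a: int64, b: int64>[pyarrow]
```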

### Immediate User Benefit 3: Interoperability

Other Arrow-backed dataframe libraries are growing in popularity. Having the same memory representation
would improve interoperability with them, as operations such as:
```python
import pandas as pd
import polars as pl

df = pd.DataFrame(
{
'a': ['one', 'two'],
'b': [{'name': 'Billy', 'age': 3}, {'name': 'Bob', 'age': 4}],
}
)
pl.from_pandas(df)
```
could be zero-copy. Users who rely on multiple dataframe libraries would be able to
switch between them more easily.

### Future User Benefits

Requiring PyArrow would simplify related development within pandas and potentially improve upon NumPy-based
functionality that is better served by PyArrow, including:

- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations

- NumPy object dtype will be avoided as much as possible. This means that every dtype that has a PyArrow equivalent is inferred automatically as such. This includes:
- decimal
- binary
- nested types (list or dict data)
- strings
- time
- date

#### Developer benefits

First, this would simplify development of pyarrow-backed datatypes, as it would avoid
optional dependency checks.
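
As a rough sketch (illustrative code, not pandas' actual internals), this is the kind of guard that a hard requirement would make unnecessary:

```python
# Today, pyarrow-dependent code paths need a fallback:
try:
    import pyarrow as pa

    HAS_PYARROW = True
except ImportError:
    HAS_PYARROW = False


def infer_arrow_type(values: list):
    # With PyArrow required, the flag and the fallback branch could be deleted.
    if not HAS_PYARROW:
        return None  # caller falls back to NumPy object dtype
    return pa.array(values).type
```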

Second, it could potentially remove redundant functionality:
- fastparquet engine in `read_parquet`;
- potentially simplifying the `read_csv` logic (needs more investigation);
- factorization;
- datetime/timezone ops.

## Drawbacks

Including PyArrow would naturally increase the installation size of pandas. For example, when installing pandas
and PyArrow using pip from wheels, numpy and pandas require about `70MB`, and PyArrow requires an additional `120MB`.
This increase in installation size would have negative implications for using pandas in space-constrained development
or deployment environments such as AWS Lambda.

Additionally, if a user is installing pandas in an environment where wheels are not available through a `pip install` or `conda install`,
Lastly, pandas development and releases will need to be mindful of PyArrow's development and release cadence. For example, when
supporting a newly released Python version, pandas will also need to be mindful of PyArrow's wheel support for that Python version
before releasing a new pandas version.

## F.A.Q.

**Q: Why can't pandas just use numpy string and numpy void datatypes instead of pyarrow string and pyarrow struct?**

**A**: NumPy strings aren't yet available, whereas pyarrow strings are. The NumPy void datatype would differ from pyarrow struct,
and would not bring the same interoperability benefit with other Arrow-based dataframe libraries.

**Q: Are all pyarrow dtypes ready? Isn't it too soon to make them the default?**

**A**: They will likely be ready by 3.0; however, we're not making them the default (yet).
For example, `pd.Series([1, 2, 3])` will continue to be auto-inferred to be
`np.int64`. We will only change the default for dtypes which currently have no `numpy`-backed equivalent and which are
stored as `object` dtype, such as strings and nested datatypes.
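
Concretely, the intended before/after (the pandas 3.0 behavior shown is the proposal, not current behavior):

```python
import pandas as pd

pd.Series([1, 2, 3]).dtype    # int64 today, and still int64 in 3.0
pd.Series(["a", "b"]).dtype   # object today; string[pyarrow] proposed for 3.0
pd.Series([{"a": 1}]).dtype   # object today; a pyarrow struct proposed for 3.0
```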

### PDEP-10 History

- 17 April 2023: Initial version