
[SPARK-34463][PYSPARK][DOCS] Document caveats of Arrow selfDestruct #31738


Closed · wants to merge 2 commits
9 changes: 9 additions & 0 deletions python/docs/source/user_guide/arrow_pandas.rst
@@ -410,3 +410,12 @@ described in `SPARK-29367 <https://issues.apache.org/jira/browse/SPARK-29367>`_
``pandas_udf``\s or :meth:`DataFrame.toPandas` with Arrow enabled. More information about the Arrow IPC change can
be read on the Arrow 0.15.0 release `blog <https://arrow.apache.org/blog/2019/10/06/0.15.0-release/#columnar-streaming-protocol-change-since-0140>`_.

Setting Arrow ``self_destruct`` for memory savings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since Spark 3.2, the Spark configuration ``spark.sql.execution.arrow.pyspark.selfDestruct.enabled`` can be used to enable PyArrow's ``self_destruct`` feature, which can save memory when creating a Pandas DataFrame via ``toPandas`` by freeing Arrow-allocated memory while building the Pandas DataFrame.
This option is experimental, and some operations may fail on the resulting Pandas DataFrame due to immutable backing arrays.
Typically, you would see the error ``ValueError: buffer source array is read-only``.
Member


Would it be good to say a workaround is to make a copy of the column(s) used in the operation? I suppose they could just disable the setting in most cases, though.

Member Author


Probably, but still worth a brief mention.

Newer versions of Pandas may fix these errors by improving support for such cases.
You can work around this error by copying the column(s) beforehand.
Additionally, this conversion may be slower because it is single-threaded.
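For illustration (not part of the PR's documentation text), here is a minimal sketch of enabling the option and applying the copy workaround described above. The session settings and the ``spark.range`` example data are assumptions chosen for the sketch, not taken from the PR:

```python
from pyspark.sql import SparkSession

# Hypothetical session setup: both Arrow flags are set here purely for illustration.
spark = (
    SparkSession.builder
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .config("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")
    .getOrCreate()
)

# Convert with Arrow; self_destruct frees Arrow buffers as the pandas blocks are built.
pdf = spark.range(1_000_000).toPandas()

# Some pandas operations may raise
#   ValueError: buffer source array is read-only
# because columns can be backed by immutable memory. Copying the affected
# column(s) first gives pandas a writable buffer, as the doc suggests.
ids = pdf["id"].copy()
ids += 1
```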
Contributor


Could we explicitly say which pandas versions trigger the bug?

Currently my tests show that pandas versions > 1.0.5 trigger the bug.

Member Author


I think I haven't fully explained the nature of this: it's not any single issue in Pandas, nor is it specific to any particular version. Rather, depending on how each Pandas operation was implemented underneath, it may or may not have been declared to accept an immutable backing array, independently of whether that operation could be implemented on an immutable array. So whether you see this depends on exactly what you do with the DataFrame, and there is no single version range we can list or single issue we can link to. Indeed, you could see this error even without this Arrow option enabled; it's just much less likely, since there are few cases where Arrow can perform a zero-copy conversion then.
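To make the point that failures depend on the operation rather than on one pandas version, here is a small, purely illustrative sketch (the array, column names, and operations are made up; it does not reproduce any specific failure):

```python
import numpy as np
import pandas as pd

# Simulate a DataFrame backed by immutable memory, similar to what a
# zero-copy Arrow-to-pandas conversion can produce.
arr = np.arange(8, dtype=np.float64).reshape(4, 2)
arr.setflags(write=False)
pdf = pd.DataFrame(arr, columns=["a", "b"], copy=False)

print(pdf["a"].sum())   # many operations work fine on read-only buffers

# Whether a given operation instead raises
#   ValueError: buffer source array is read-only
# depends on how that operation declares its buffers internally, which varies
# across operations and pandas versions. Copying restores a writable buffer:
writable = pdf["a"].copy()
writable += 1.0
```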

@@ -2035,8 +2035,8 @@ object SQLConf {

   val ARROW_PYSPARK_SELF_DESTRUCT_ENABLED =
     buildConf("spark.sql.execution.arrow.pyspark.selfDestruct.enabled")
-      .doc("When true, make use of Apache Arrow's self-destruct and split-blocks options " +
-        "for columnar data transfers in PySpark, when converting from Arrow to Pandas. " +
+      .doc("(Experimental) When true, make use of Apache Arrow's self-destruct and split-blocks " +
+        "options for columnar data transfers in PySpark, when converting from Arrow to Pandas. " +
         "This reduces memory usage at the cost of some CPU time. " +
         "This optimization applies to: pyspark.sql.DataFrame.toPandas " +
         "when 'spark.sql.execution.arrow.pyspark.enabled' is set.")