[SPARK-34463][PYSPARK][DOCS] Document caveats of Arrow selfDestruct #31738
@@ -410,3 +410,12 @@ described in `SPARK-29367 <https://issues.apache.org/jira/browse/SPARK-29367>`_

``pandas_udf``\s or :meth:`DataFrame.toPandas` with Arrow enabled. More information about the Arrow IPC change can
be read on the Arrow 0.15.0 release `blog <https://arrow.apache.org/blog/2019/10/06/0.15.0-release/#columnar-streaming-protocol-change-since-0140>`_.

Setting Arrow ``self_destruct`` for memory savings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since Spark 3.2, the Spark configuration ``spark.sql.execution.arrow.pyspark.selfDestruct.enabled`` can be used to enable PyArrow's ``self_destruct`` feature, which can save memory when creating a Pandas DataFrame via ``toPandas`` by freeing Arrow-allocated memory while the Pandas DataFrame is built.
This option is experimental, and some operations may fail on the resulting Pandas DataFrame due to immutable backing arrays.
Typically, you would see the error ``ValueError: buffer source array is read-only``.
Newer versions of Pandas may fix these errors by improving support for such cases.
You can work around this error by copying the affected column(s) beforehand.
Additionally, this conversion may be slower because it is single-threaded.
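To make the documented behaviour concrete, here is a minimal, hypothetical sketch of using the option (the table shape and column names are made up, and it assumes a running SparkSession):

```python
from pyspark.sql import SparkSession

# Minimal sketch (not part of the patch): enable Arrow and the
# experimental self_destruct conversion before calling toPandas().
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")

df = spark.range(1_000_000).selectExpr("id", "id * 2 AS doubled")
pdf = df.toPandas()  # Arrow-allocated memory is freed as blocks are converted

# Some operations on the result may raise
# ``ValueError: buffer source array is read-only``; copying the affected
# column beforehand, as the docs suggest, gives Pandas writable memory.
pdf["doubled"] = pdf["doubled"].copy()
```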
Review comment: Could we explicitly say which version of pandas will trigger the bug? Currently my tests show that pandas versions > 1.0.5 trigger it.

Reply: I think I haven't fully explained the nature of this - it's not any single issue in Pandas, nor is it specific to any particular version. Rather, depending on how each Pandas operation was implemented underneath, it may or may not have been declared to accept an immutable backing array, independently of whether that operation could be implemented on an immutable array. So whether you see this depends on what exactly you do with the DataFrame, and there's no one version range we can list or one issue we can link to. Indeed, you could see this error even without this Arrow option enabled; it's just much less likely, since there are few cases in which Arrow can perform a zero-copy conversion then.
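As a small, hypothetical sketch of the zero-copy behaviour being described, the same read-only result can be seen in plain PyArrow (the ``split_blocks``/``self_destruct`` arguments mirror what the Spark option enables; the data is made up):

```python
import pyarrow as pa

# A no-null float64 column can be handed to Pandas zero-copy; the
# resulting column then shares Arrow's immutable memory, so operations
# that try to write to it can fail regardless of the Pandas version.
table = pa.table({"value": pa.array([1.0, 2.0, 3.0])})
pdf = table.to_pandas(split_blocks=True, self_destruct=True)

arr = pdf["value"].to_numpy()
print(arr.flags.writeable)  # False: a view of Arrow-owned memory
```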
Review comment: Would it be good to say that a workaround is to make a copy of the column(s) used in the operation? I suppose they could just disable the setting in most cases, though.
Reply: Probably, but it's still worth a brief mention.