
[SPARK-34463][PYSPARK][DOCS] Document caveats of Arrow selfDestruct #31738


Closed · wants to merge 2 commits
9 changes: 9 additions & 0 deletions python/docs/source/user_guide/arrow_pandas.rst
@@ -410,3 +410,12 @@ described in `SPARK-29367 <https://issues.apache.org/jira/browse/SPARK-29367>`_
``pandas_udf``\s or :meth:`DataFrame.toPandas` with Arrow enabled. More information about the Arrow IPC change can
be read on the Arrow 0.15.0 release `blog <https://arrow.apache.org/blog/2019/10/06/0.15.0-release/#columnar-streaming-protocol-change-since-0140>`_.

Setting Arrow ``self_destruct`` for memory savings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since Spark 3.2, the Spark configuration ``spark.sql.execution.arrow.pyspark.selfDestruct.enabled`` can be used to enable PyArrow's ``self_destruct`` feature, which can save memory when creating a Pandas DataFrame via ``toPandas`` by freeing Arrow-allocated memory while building the Pandas DataFrame.
This option is experimental, and some operations may fail on the resulting Pandas DataFrame due to immutable backing arrays.
Typically, you would see the error ``ValueError: buffer source array is read-only``.
Member


Would it be good to say a workaround is to make a copy of the column(s) used in the operation? I suppose they could just disable the setting in most cases, though.

Member Author


Probably, but still worth a brief mention.

Newer versions of Pandas may fix these errors by improving support for such cases.
You can work around this error by copying the column(s) beforehand.
Additionally, this conversion may be slower because it is single-threaded.
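For illustration (not part of the PR's documentation text), here is a minimal sketch of enabling the option and applying the copy workaround described above. The session settings and the ``spark.range`` example data are assumptions chosen for the sketch, not taken from the PR:

```python
from pyspark.sql import SparkSession

# Hypothetical session setup: both Arrow flags are set here purely for illustration.
spark = (
    SparkSession.builder
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .config("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")
    .getOrCreate()
)

# Convert with Arrow; self_destruct frees Arrow buffers as the pandas blocks are built.
pdf = spark.range(1_000_000).toPandas()

# Some pandas operations may raise
#   ValueError: buffer source array is read-only
# because columns can be backed by immutable memory. Copying the affected
# column(s) first gives pandas a writable buffer, as the doc suggests.
ids = pdf["id"].copy()
ids += 1
```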
Contributor


Could we explicitly say which pandas versions trigger the bug?

Currently my tests show that pandas versions > 1.0.5 trigger the bug.

Member Author


I think I haven't fully explained the nature of this: it's not any single issue in Pandas, nor is it specific to any particular version. Rather, depending on how each Pandas operation was implemented underneath, it may or may not have been declared to accept an immutable backing array, independently of whether that operation could be implemented on an immutable array. So whether you see this depends on exactly what you do with the DataFrame, and there is no single version range we can list or single issue we can link to. Indeed, you could see this error even without this Arrow option enabled; it's just much less likely, since there are few cases where Arrow can perform a zero-copy conversion then.
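To make the point that failures depend on the operation rather than on one pandas version, here is a small, purely illustrative sketch (the array, column names, and operations are made up; it does not reproduce any specific failure):

```python
import numpy as np
import pandas as pd

# Simulate a DataFrame backed by immutable memory, similar to what a
# zero-copy Arrow-to-pandas conversion can produce.
arr = np.arange(8, dtype=np.float64).reshape(4, 2)
arr.setflags(write=False)
pdf = pd.DataFrame(arr, columns=["a", "b"], copy=False)

print(pdf["a"].sum())   # many operations work fine on read-only buffers

# Whether a given operation instead raises
#   ValueError: buffer source array is read-only
# depends on how that operation declares its buffers internally, which varies
# across operations and pandas versions. Copying restores a writable buffer:
writable = pdf["a"].copy()
writable += 1.0
```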

@@ -2035,8 +2035,8 @@ object SQLConf {

   val ARROW_PYSPARK_SELF_DESTRUCT_ENABLED =
     buildConf("spark.sql.execution.arrow.pyspark.selfDestruct.enabled")
-      .doc("When true, make use of Apache Arrow's self-destruct and split-blocks options " +
-        "for columnar data transfers in PySpark, when converting from Arrow to Pandas. " +
+      .doc("(Experimental) When true, make use of Apache Arrow's self-destruct and split-blocks " +
+        "options for columnar data transfers in PySpark, when converting from Arrow to Pandas. " +
         "This reduces memory usage at the cost of some CPU time. " +
         "This optimization applies to: pyspark.sql.DataFrame.toPandas " +
         "when 'spark.sql.execution.arrow.pyspark.enabled' is set.")