Observed in #208:

```
FAILED tests/test_common.py::test_asarray_copy[dask.array] - assert False
+ where False = <function test_asarray_copy.<locals>.<lambda> at 0x7ff78986c220>(dask.array<getitem, shape=(), dtype=float32, chunksize=(), chunktype=numpy.ndarray> == 0.0)
```
Note that it
- fails on the Python 3.11 job (https://github.com/data-apis/array-api-compat/actions/runs/12174769944/job/33957309550?pr=208), which installs dask 2024.12 (released 4 Dec 2024), and
- passes on the Python 3.9 job (https://github.com/data-apis/array-api-compat/actions/runs/12174769944/job/33957305292?pr=208), which installs dask 2024.08.
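A minimal reproducer of the behaviour change (a hypothetical script, not part of the test suite, exercising `da.from_array` on a numpy array directly):

```python
import numpy as np
import dask.array as da

x = np.zeros(3, dtype=np.float32)
d = da.from_array(x, chunks=-1)  # dask >= 2024.12 copies x here
x[0] = 1.0                       # mutate the original numpy buffer
# prints 0.0 if dask copied at graph-construction time,
# 1.0 if the buffer was shared
print(float(d[0].compute()))
```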
ev-br commented on Dec 5, 2024
The failing test is `tests/test_common.py::test_asarray_copy`, shown in the traceback above.
A prime suspect is dask/dask#11524.
ev-br commented on Dec 5, 2024
This is indeed a breaking change in dask 2024.12: `dask.array.from_array` (which `asarray` calls) now does

```python
if is_arraylike(x) and hasattr(x, "copy"):
    x = x.copy()
```
Here `x` is a numpy array which we constructed from an `array.array` instance at https://github.com/data-apis/array-api-compat/blame/main/array_api_compat/dask/array/_aliases.py#L142. So the meaning of `array_api_compat.dask.array.asarray(..., copy=None)` changes: there is no way it can reuse the existing buffer anymore, and copy=None is equivalent to copy=True.
Does this sound right @asmeurer?
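For reference, this is the numpy-level buffer reuse that copy=None used to allow; a small illustrative sketch:

```python
import array
import numpy as np

buf = array.array("f", [0.0, 0.0, 0.0])
x = np.asarray(buf)   # reuses buf's memory via the buffer protocol; no copy
buf[0] = 1.0
print(x[0])           # 1.0 -- x and buf share the same writable buffer
```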
ev-br commented on Dec 5, 2024

Also ping @lithomas1: would you be able to weigh in on the dask.array changes?
lithomas1 commented on Dec 5, 2024
Yeah, part of the reason we wrap `asarray` is because dask's `asarray` doesn't have a copy keyword. (I do a manual copy in there for copy=True.)
I think your analysis is right that dask is always doing a copy now.
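A simplified sketch of such a wrapper (hypothetical; the real `array_api_compat` implementation at the link above differs):

```python
import numpy as np
import dask.array as da

def asarray(obj, copy=None):
    # Hypothetical, simplified copy-aware wrapper around da.from_array;
    # not the actual array_api_compat code.
    x = np.asarray(obj)   # zero-copy for buffer-protocol inputs
    if copy:
        x = x.copy()      # copy=True: manual copy, as described above
    # copy=None should reuse the buffer when possible, but as of dask
    # 2024.12 from_array copies array-likes unconditionally, so None
    # effectively behaves like True.
    return da.from_array(x)
```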
dcherian commented on Dec 5, 2024
cc @phofl for awareness
ev-br commented on Dec 5, 2024
So I suppose the main question for this project is whether there's a supported way going forward to convert a numpy array (or anything that supports the buffer protocol) into a `dask.array` without copying, so that the data is shared and remains writeable.

crusaderky commented on Dec 5, 2024
I disagree.
The definition of copy=None is to reuse the existing buffer if possible, and copy otherwise.
"If possible" should be something left to the library developer's discretion, so a change of behaviour there should not be treated as a breaking change. I believe instead that the test is over-eager in expecting a specific outcome.
As @phofl and @fjetter correctly observe on dask/dask#11524, it is extremely unhealthy to store large arrays inside the dask graph.
`asarray` and `from_array` are meant to be used strictly for small data - e.g. kilobytes to a few megabytes - whereas to load gigabytes one must always use dask's compute-time data loading methods (`from_zarr` etc.). So it makes perfect sense to disallow sharing the buffer between these small data arrays and the dask graph, as it gives no material performance benefit and is a major source of trouble, e.g. when the input array is a smaller view of something worth gigabytes, or when the user accidentally (or deliberately!!!) modifies the input data after using it to build the graph.
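For illustration, the compute-time loading pattern looks roughly like this (`load_chunk` is a hypothetical stand-in for a real reader such as `from_zarr`):

```python
import numpy as np
import dask.array as da
from dask import delayed

def load_chunk():
    # Stand-in for reading one chunk from zarr/HDF5/etc.; the data is
    # materialized at compute time and never stored inside the graph.
    return np.zeros((1000, 1000))

chunk = da.from_delayed(delayed(load_chunk)(), shape=(1000, 1000), dtype=float)
print(chunk.sum().compute())   # the data exists only while computing
```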
lithomas1 commented on Dec 6, 2024

Hm, not sure I completely understand here why this can't be possible.
In the linked issue, there is this comment:
This suggests that dask has an internal API somewhere that can create a dask `Array` from a numpy array or something else in a zero-copy manner. Maybe we can ask for that to be exposed instead?
This'd be pretty useful in the case where I know I won't use the original array I create the dask array from anymore after passing it into `asarray`.

crusaderky commented on Dec 6, 2024
It's not impossible, it's just very, very unhealthy to let something at the root of your dask graph point to an array that could be updated by the user later on.
It's an antipattern to use `from_array`/`asarray` to embed large arrays in the graph to begin with, as it carries all sorts of performance issues and will very likely kill off or at least hamstring the scheduler and/or the worker processes. So the wastefulness of always deep-copying an array that must be small (<10 MB) becomes inconsequential.