Skip to content

test_asarray_copy[dask.array] fails on CI with dask==2024.12 #209

Closed
@ev-br

Description

@ev-br

Observed in #208 :

FAILED tests/test_common.py::test_asarray_copy[dask.array] - assert False
 +  where False = <function test_asarray_copy.<locals>.<lambda> at 0x7ff78986c220>(dask.array<getitem, shape=(), dtype=float32, chunksize=(), chunktype=numpy.ndarray> == 0.0)

Note that it

Activity

ev-br

ev-br commented on Dec 5, 2024

@ev-br
MemberAuthor

The failing test is

       # Use the standard library array to test the buffer protocol

       # ... snip ...
    
        a = array.array('f', [1.0])
        b = asarray(a, copy=None)
        assert is_lib_func(b)
        a[0] = 0.0
        if library == 'cupy':
            # A copy is required for libraries where the default device is not CPU
            assert all(b[0] == 1.0)
        else:
           assert all(b[0] == 0.0)      # <<<<< here

A prime suspect is dask/dask#11524.

ev-br

ev-br commented on Dec 5, 2024

@ev-br
MemberAuthor

This is indeed a breaking change in dask: 2024.12: dask.from_asarray does

3588 -> if is_arraylike(x) and hasattr(x, "copy"):
3589 x = x.copy()

Here x is a numpy array which we constructed from an array.array instance at https://github.com/data-apis/array-api-compat/blame/main/array_api_compat/dask/array/_aliases.py#L142

So the meaning of array_api_compat.dask.array.asarray(..., copy=None) changes: there is no way it could reuse the existing buffer anymore, and copy=None is equivalent to copy=True.

Does this sound right @asmeurer ?

ev-br

ev-br commented on Dec 5, 2024

@ev-br
MemberAuthor

Also ping @lithomas1 : would be able to weigh in on the dask.array changes?

lithomas1

lithomas1 commented on Dec 5, 2024

@lithomas1
Contributor

Yeah, part of the reason we wrap asarray is because dask's asarray doesn't have a copy keyword.
(I do a manual copy in there for copy=True)

I think your analysis is right that dask is always doing a copy now.

dcherian

dcherian commented on Dec 5, 2024

@dcherian

cc @phofl for awareness

ev-br

ev-br commented on Dec 5, 2024

@ev-br
MemberAuthor

So I suppose the main question for this project is whether there's a supported way going forward to convert a numpy array (or anything that supports the buffer protocol) into a dask.array without copying, so that the data is shared and remains writeable.

crusaderky

crusaderky commented on Dec 5, 2024

@crusaderky
Contributor

This is indeed a breaking change in dask

I disagree.
The definition of copy=None is

If None, the function must reuse existing memory buffer if possible and copy otherwise.

"if possible" should be something left at the library developer's discretion, so a change of behaviour there should not be treated as a breaking change. I believe instead that the test is over-eager in expecting a specific outcome.

As @phofl and @fjetter correctly observe on dask/dask#11524, it is extremely unhealthy to store large arrays inside the dask graph. asarray and from_array are meant to be used strictly for small data - e.g. kilobytes to a few megabytes - whereas to load gigabytes one must always use dask's compute-time data loading methods (from_zarr etc.). So it makes perfect sense to disallow sharing the buffer between these small data arrays and the dask graph, as it gives no material performance benefit and is a major source of trouble, e.g. when the input array is a smaller view of something worth gigabytes or when the user accidentally (or deliberately !!!!!) modifies the input data after using it to build the graph.

lithomas1

lithomas1 commented on Dec 6, 2024

@lithomas1
Contributor

Hm, not sure I completely understand here why this can't be possible.

In the linked issue, there is this comment:

There are other reasons that are more complex to explain (TLDR if we use a specialized API we can make certain assumption we can't make for the generic from_array)

This suggests that dask has an internal API somewhere that can create a dask Array from a numpy array or something else in a zero-copy many. Maybe we can ask for that to be exposed instead?

This'd be pretty useful in the case where I know I won't use the original array I create the dask array from anymore after passing it into asarray.

crusaderky

crusaderky commented on Dec 6, 2024

@crusaderky
Contributor

It's not impossible, it's just very, very unhealthy to let something at the root of your dask graph point to an array that could be updated by the user later on.

It's an antipattern to use from_array/asarray to embed large arrays in the graph to begin with, as it carries all sorts of performance issues and will very likely kill off or at least hamstring the scheduler and/or the worker processes. So the wastefulness of always deep-copying an array that must be small (<10 MB) becomes inconsequential.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Participants

      @ev-br@dcherian@crusaderky@lithomas1

      Issue actions

        `test_asarray_copy[dask.array]` fails on CI with dask==2024.12 · Issue #209 · data-apis/array-api-compat