Use "native" types when possible #319

Closed
orenbenkiki opened this issue Jun 2, 2023 · 9 comments
Labels
enhancement New feature or request

Comments

@orenbenkiki

Problem

Currently, when passing arrays between Python and Julia, the code always uses a wrapper object (PyArray, ArrayValue, etc.), even if the data happens to be contiguous in memory. This is a problem because not all code works well (or at all) with these wrappers. In an ideal world with strict interfaces this wouldn't be a problem, but both Python and Julia are somewhat lax about array interfaces, so in practice things sometimes break (or just run slowly).

As just one trivial example, this will crash:

# Julia

function give_vector()::Vector
    return vec([1.0 2.0 3.0])
end
# Python

vector = in_julia.give_vector()
assert vector.__class__.__name__ == "VectorValue"
assert str(vector) == "[1.0, 2.0, 3.0]"
vector += 1

This fails with TypeError: unsupported operand type(s) for +=: 'VectorValue' and 'int'.

The point isn't this specific missing operation (though fixing it would be nice); the point is that, try as we may, we'll never make VectorValue a 100% drop-in replacement for ndarray.
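The failure mode is easy to reproduce without Julia at all. Here is a minimal pure-Python stand-in (the VectorValueLike class is hypothetical, for illustration only): a wrapper that forwards indexing and printing but defines no arithmetic behaves exactly like the example above.

```python
class VectorValueLike:
    """Hypothetical stand-in for a wrapper type: forwards indexing
    and printing, but defines no arithmetic operators."""

    def __init__(self, data):
        self._data = list(data)

    def __getitem__(self, index):
        return self._data[index]

    def __str__(self):
        return str(self._data)


vector = VectorValueLike([1.0, 2.0, 3.0])
assert str(vector) == "[1.0, 2.0, 3.0]"

try:
    vector += 1  # no __add__/__iadd__, so Python raises TypeError
except TypeError as error:
    print(error)  # unsupported operand type(s) for +=: 'VectorValueLike' and 'int'
```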

Solution

When converting arrays between Julia and Python, if the data is contiguous, then use frombuffer to wrap the memory with an ndarray for Python, or use jl_ptr_to_array to wrap the memory with an Array for Julia to use. If, however, the data is not contiguous in memory, keep the current behavior of returning a wrapper type.
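The proposed dispatch can be sketched with only the Python standard library, using memoryview as a stand-in for the zero-copy wrapper and its contiguous flag as a stand-in for the proposed check (in the actual proposal, frombuffer / jl_ptr_to_array would do the wrapping):

```python
from array import array


def wrap_if_contiguous(buffer):
    """Return a zero-copy view if the data is contiguous, or None to
    signal that a (slower) wrapper type would be needed instead."""
    view = memoryview(buffer)
    return view if view.contiguous else None


data = array("d", [1.0, 2.0, 3.0])  # a contiguous C buffer of float64
view = wrap_if_contiguous(data)
assert view is not None
view[0] = 10.0          # writes through: the memory is shared, not copied
assert data[0] == 10.0

strided = memoryview(array("d", range(6)))[::2]  # every other element
assert wrap_if_contiguous(strided) is None       # not contiguous: fall back
```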

Alternatives

This can, of course, be done manually, which is probably what I will do in my code for now. That said, even if this weren't the default behavior, it would be nice to provide some convenience functions to make it easier to achieve.

Additional context

Here is some example code which worked for me to demonstrate the feasibility of the solution:

# Julia

function as_ndarray(array::Array)::PyArray
    np = pyimport("numpy")
    return PyArray(np.frombuffer(array))
end

function from_ndarray(ndarray::PyArray)::Array
    return unsafe_wrap(Array, pointer(ndarray), size(ndarray))
end

function give_numpy_vector()::PyVector
    return as_ndarray(give_vector())
end

function modify_vector(vector::AbstractVector)::Nothing
    vector[1] += 10
    return nothing
end

function modify_vector(vector::PyVector)::Nothing
    modify_vector(from_ndarray(vector))
    return nothing
end
# Python

vector = in_julia.give_vector()
assert vector.__class__.__name__ == "VectorValue"
assert str(vector) == "[1.0, 2.0, 3.0]"
in_julia.modify_vector(vector)
assert str(vector) == "[11.0, 2.0, 3.0]"

vector = in_julia.give_numpy_vector()
assert vector.__class__.__name__ == "ndarray"
assert str(vector) == "[1. 2. 3.]"
in_julia.modify_vector(vector)
assert str(vector) == "[11.  2.  3.]"

Naturally this is just a proof of concept and doesn't deal with issues such as proper mapping of dtype, testing whether the data is actually contiguous, etc.

One point worth noting: the fact that Array "likes" to be column-major while ndarray "likes" to be row-major is not a showstopper here. I have plenty of Python code that explicitly works with column-major arrays (because that's what's needed for efficiency), and Julia has PermutedDimsArray, which can likewise deal with row-major order. Dealing with layout is always the developer's responsibility (summing rows of column-major data will be much slower than summing rows of the same data in row-major order).
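The index arithmetic behind the two layouts is simple enough to show with a flat buffer; this sketch just demonstrates that both orders describe the same matrix, and only the access pattern (and therefore cache behavior) differs:

```python
from array import array

ROWS, COLS = 2, 3

# The same logical 2x3 matrix stored two ways in a flat buffer:
row_major = array("d", [1, 2, 3,
                        4, 5, 6])           # element (i, j) at i * COLS + j
col_major = array("d", [1, 4, 2, 5, 3, 6])  # element (i, j) at j * ROWS + i


def get(buffer, i, j, order):
    """Read element (i, j) from a flat buffer in the given order."""
    return buffer[i * COLS + j] if order == "C" else buffer[j * ROWS + i]


# Both layouts hold the same matrix; only the index math differs.
assert all(
    get(row_major, i, j, "C") == get(col_major, i, j, "F")
    for i in range(ROWS)
    for j in range(COLS)
)
```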

As a side note, the built-in implementation in both numpy and Julia for converting data between these layouts is slow as molasses for no good reason, so I had to provide my own implementation to get reasonable performance. But that's mostly irrelevant to the issue I raised here.

@orenbenkiki orenbenkiki added the enhancement New feature or request label Jun 2, 2023
@cjdoris
Collaborator

cjdoris commented Jun 2, 2023

When passing an array from Julia to Python, if you want it to be a numpy array, you can simply do numpy.asarray(the_array). This will be a non-copying view of the original array where possible (i.e. the underlying array supports pointer and strides and the element type corresponds to a numpy dtype).

PythonCall/JuliaCall aims to be agnostic to particular Python packages like numpy, so automatically converting to ndarray is not preferred. Instead, we rely on interfaces. In particular, a wrapped Julia array will satisfy both the Python buffer protocol and the Numpy array interface - this is what enables the no-copy conversion above with numpy.asarray. Furthermore a wrapped Julia vector will satisfy the Python sequence interface, so it behaves a lot like a list. It intentionally does not provide anything else, because there is no standard for what else an array should support - so if you want the ndarray interface then you wrap the object with numpy.asarray yourself.
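This interface-based approach can be illustrated without numpy at all. Any object implementing the buffer protocol can be viewed and mutated with zero copying, which is the same mechanism numpy.asarray uses on a wrapped Julia array (the array module object below is just a stand-in for such a wrapper):

```python
from array import array

# Stand-in for a wrapped Julia vector: any buffer-protocol object works.
julia_like = array("d", [1.0, 2.0, 3.0])

view = memoryview(julia_like)  # zero-copy view over the same memory
view[1] = 99.0
assert julia_like[1] == 99.0   # mutation is visible through the original

# A sequence-style interface also falls out of the protocol:
assert len(view) == 3
assert list(view) == [1.0, 99.0, 3.0]
```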

Conversely, Python arrays become PyArray which is not much more than a pointer to the data. It should be as fast as Array - if you find any cases where it is slower then please raise an issue. PythonCall does not use unsafe_wrap(Array, ...) because there is no way to ensure the data outlives the resulting Array - i.e. the numpy array could get garbage-collected, freeing the underlying memory, leading to a segfault when accessing the Array. But you are free to use unsafe_wrap(Array, ...) yourself if you are careful.

@orenbenkiki
Author

Thanks for the clarifications!

I assume the numpy.asarray could be on either the Julia or Python side?

And I take it that PyArray is therefore a DenseArray and a strided array, and would work well with Strided.jl and similar packages as well as Array does?

It would be nice if these points were mentioned in https://cjdoris.github.io/PythonCall.jl/stable/compat/

@orenbenkiki
Author

I get your point about not depending on specific Python packages, and I see how calling numpy.asarray is a good solution when passing Julia arrays to Python - one gets a "real" ndarray and all the optimizations that follow.

But it seems there's no equivalent in the other direction. Getting data from Python wraps it in a PyArray, which is not a "real" Array. Since PyArray is a general-case wrapper, it isn't a DenseArray even when it could be, which means losing performance when using it.

Here's a concrete example of the issue:

using BenchmarkTools
using Strided
using PythonCall
a = rand(4000, 4000);
pa = PyArray(a);
b = similar(a);
pb = PyArray(b);
@btime b .= (a .+ a) ./ 2;
@btime pb .= (a .+ a) ./ 2;
@btime @strided b .= (a .+ a) ./ 2;
@btime @strided pb .= (pa .+ pa) ./ 2;

Gives me the results:

julia> using BenchmarkTools

julia> using Strided

julia> using PythonCall

julia> a = rand(4000, 4000);

julia> pa = PyArray(a);

julia> b = similar(a);

julia> pb = PyArray(b);

julia> @btime b .= (a .+ a) ./ 2;
  18.413 ms (4 allocations: 128 bytes)

julia> @btime pb .= (a .+ a) ./ 2;
  18.694 ms (4 allocations: 128 bytes)

julia> @btime @strided b .= (a .+ a) ./ 2;
  8.684 ms (17 allocations: 1.06 KiB)

julia> @btime @strided pb .= (pa .+ pa) ./ 2;
ERROR: MethodError: no method matching StridedView(::PyArray{Float64, 2, true, true, Float64})

One way to solve it is to keep PyArray for the general case, and create a new PyDenseArray to use only in the common case that the memory layout allows for it. Since PyDenseArray will be a subtype of DenseArray, any Julia optimizations that kick in for DenseArray will be usable for it (e.g. Strided.jl).

@orenbenkiki orenbenkiki reopened this Jun 3, 2023
@cjdoris
Collaborator

cjdoris commented Jun 5, 2023

A PyArray satisfies the Julia Base notion of being a strided array, namely it supports strides and unsafe_convert(Ptr{T}, ::PyArray{T}):

julia> using PythonCall

julia> @py import numpy as np

julia> x = np.random.randn(3,4)
Python:
array([[-0.65880647,  0.99940984, -0.45758152,  1.60004262],
       [ 0.82161821, -1.41043098,  0.62529857,  0.37647898],
       [ 0.10345882,  0.37506426,  0.12554197, -1.03708004]])

julia> y = PyArray(x)
3×4 PyArray{Float64, 2}:
 -0.658806   0.99941   -0.457582   1.60004
  0.821618  -1.41043    0.625299   0.376479
  0.103459   0.375064   0.125542  -1.03708

julia> pointer(y)
Ptr{Float64} @0x000002817d41eac0

julia> Base.unsafe_convert(Ptr{Float64}, y)
Ptr{Float64} @0x000002817d41eac0

julia> strides(y)
(4, 1)

I'm not familiar with Strided.jl, but maybe there is some integration work to do to make PyArray work with it.

PyArray is not a dense array in general - the data is strided but not necessarily contiguous.
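The strided-but-not-contiguous distinction can be seen with the standard library's memoryview, whose strides and contiguous attributes mirror the Base strides contract described above (assuming 8-byte float64 elements):

```python
from array import array

dense = memoryview(array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]))
assert dense.contiguous
assert dense.strides == (8,)        # stride equals itemsize: dense data

every_other = dense[::2]            # still strided, no longer contiguous
assert every_other.strides == (16,)
assert not every_other.contiguous
assert list(every_other) == [0.0, 2.0, 4.0]
```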

@cjdoris
Collaborator

cjdoris commented Jun 5, 2023

One way to solve it is to keep PyArray for the general case, and create a new PyDenseArray to use only in the common case that the memory layout allows for it.

That would increase complexity quite a bit! Presumably Strided.jl relies on some trait somewhere that we could overload. Maybe in ArrayInterface.jl?

I assume the numpy.asarray could be on either the Julia or Python side?

Yep.

@orenbenkiki
Author

orenbenkiki commented Jun 5, 2023

Yes, they do have a trait they call "StridedView" - in https://github.com/Jutho/StridedViews.jl - which in theory could be implemented for PyArray. That would be nice, but it is just one example of one library. If, on the other hand, there were a PyDenseArray for the (extremely common) case that the data is actually dense (the strides work out), then this would work for "everything" that takes advantage of the standard DenseArray trait (restricted though this may be).

As far as complexity goes, the implementations of PyArray and PyDenseArray would be "mostly identical", so perhaps some creative use of include or something would keep the code DRY?

Edit: added a link.

@orenbenkiki
Author

Another implementation strategy, hopefully simpler: since DenseArray is an abstract type, it should be possible to write a DenseView type which has an Array field (using jl_ptr_to_array), as well as a second field of type Any to keep whatever-is-holding-the-data alive for the GC. Declare DenseView to be a DenseArray and forward the relevant methods. Then it would just be a matter of wrapping PyArray with a DenseView (obviously, only if the strides work out). "In theory", this should allow anything that expects a DenseArray to "just work"...?

@orenbenkiki
Author

I'm getting good results with the following. I can use it to wrap a PyArray and get a DenseArray for which all sorts of optimizations kick in. Since the general problem is not specific to PythonCall, I'm closing this issue. Right now this is buried inside my (as yet unpublished) package - I'm not sure it deserves its own package...?

module AsDense

using LinearAlgebra
using Base: @propagate_inbounds

export as_dense_if_possible
export as_dense_or_copy
export as_dense_or_fail

struct DenseView{T, N} <: DenseArray{T, N}
    parent::AbstractArray{T, N}
end

@inline Base.IndexStyle(::Type{<:DenseView}) = Base.IndexLinear()
@inline Base.length(dense_view::DenseView) = length(dense_view.parent)
@inline Base.size(dense_view::DenseView) = size(dense_view.parent)
@inline Base.strides(dense_view::DenseView) = strides(dense_view.parent)
@inline Base.stride(dense_view::DenseView, axis::Integer) = stride(dense_view.parent, axis)
@inline Base.elsize(::Type{<:DenseView{T}}) where {T} =
    Base.isbitstype(T) ? sizeof(T) : (Base.isbitsunion(T) ? Base.bitsunionsize(T) : sizeof(Ptr))
@inline Base.parent(dense_view::DenseView) = dense_view.parent
@propagate_inbounds @inline Base.getindex(dense_view::DenseView, indices::Integer...) = getindex(dense_view.parent, indices...)
@propagate_inbounds @inline Base.setindex!(dense_view::DenseView, value::Any, indices::Integer...) = setindex!(dense_view.parent, value, indices...)
@propagate_inbounds @inline Base.view(dense_view::DenseView, indices::Integer...) = as_dense_if_possible(view(dense_view.parent, indices...))
@inline Base.similar(dense_view::DenseView, dimensions::Integer...) = as_dense_if_possible(similar(dense_view.parent, dimensions...))
@inline Base.similar(dense_view::DenseView, type::Type, dimensions::Integer...) = as_dense_if_possible(similar(dense_view.parent, type, dimensions...))
@inline Base.reshape(dense_view::DenseView, dimensions::Integer...) = as_dense_if_possible(reshape(dense_view.parent, dimensions...))
@inline LinearAlgebra.transpose(dense_view::DenseView) = DenseView(transpose(dense_view.parent))

"""
    function as_dense_if_possible(array::AbstractArray{T, N})::AbstractArray{T, N} where {T, N}

Given any `array`, return a `DenseArray` sharing the same memory if possible, or the original `array` otherwise.

This will return any `DenseArray` as-is without creating a wrapper.
"""
function as_dense_if_possible(array::AbstractArray{T, N})::AbstractArray{T, N} where {T, N}
    if array isa DenseArray
        return array
    end
    try
        array_strides = strides(array)
        array_sizes = size(array)
        if (array_strides[1] == 1 && array_strides[2] == array_sizes[1]) ||
           (array_strides[1] == array_sizes[2] && array_strides[2] == 1)  # only seems untested
            return DenseView(array)
        else
            return array  # untested
        end
    catch  # e.g. a MethodError when the array type does not support strides  # untested
        return array
    end
end

"""
    function as_dense_or_copy(array::AbstractArray{T, N})::DenseArray{T, N} where {T, N}

Given any `array`, return a `DenseArray` sharing the same memory if possible, or a dense `Array` copy of the data.

This will return any `DenseArray` as-is without creating a wrapper or doing any copying.
"""
function as_dense_or_copy(array::AbstractArray{T, N})::DenseArray{T, N} where {T, N}
    as_dense = as_dense_if_possible(array)
    if as_dense isa DenseArray
        return as_dense
    else
        return Array(as_dense)
    end
end

"""
    function as_dense_or_fail(array::AbstractArray{T, N})::DenseArray{T, N} where {T, N}

Given any `array`, return a `DenseArray` sharing the same memory if possible, or fail with an error message.

This will return any `DenseArray` as-is without creating a wrapper or doing any copying.
"""
function as_dense_or_fail(array::AbstractArray{T, N})::DenseArray{T, N} where {T, N}
    as_dense = as_dense_if_possible(array)
    if as_dense isa DenseArray
        return as_dense
    else
        error("the array: $(typeof(array)) is not dense")
    end
end

end # module

@cjdoris
Collaborator

cjdoris commented Jun 15, 2023

That DenseView would be a nice thing to package up.
