Skip to content

Numpy improvements #383

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

Numpy improvements #383

wants to merge 4 commits into from

Conversation

nevion
Copy link
Contributor

@nevion nevion commented Sep 4, 2016

It is clear pains were taken to avoid bringing in numpy's c-api, but I don't think this was the correct choice - going at it pure python half-defeats the productivity promises pybind11 gives and is something others including myself wish worked better.

Instead we should embrace the c-api, which greatly simplifies typical array usage and improves performance by not jumping needlessly back into python land. The cost is a warning that we might get fixed up stream having to do with the _import_module symbol, and calling that function in the appropriate (initialization) places.

Various other improvements were made such as allowing "mapped" arrays and supporting array creation flags.

@aldanor
Copy link
Member

aldanor commented Sep 4, 2016

@nevion

  • I've already implemented direct access to NumPy attributes such as ndim/size/data etc and will be finished/merged reasonably soon I hope, see this branch: NumPy arrays rework aldanor/pybind11#2 -- note that this doesn't require including numpy.h which is a pretty big win, it'd be nice to keep it that way
  • I don't think you need ::empty() constructor since that's the default behaviour of array constructors (if a pointer is not passed). That being said, ::zeros could be added. I can add this plus a few things from this PR like retrieving an element at index to my branch.
  • Re: copy flag, the implementation here is unsafe since it may result in owner of the pointer being destroyed before the NumPy array. A proper solution would be to supply the optional owner object (which may be also set to something like py::none to indicate you don't care about ownership but copy wouldn't be made). This will be implemented in my branch shortly, too.
  • Re: flags being passed to the array, this is already implemented on the type level, see NumPy arrays rework aldanor/pybind11#2.

(also see discussion: #338)

@aldanor
Copy link
Member

aldanor commented Sep 4, 2016

So, to summarize, almost everything in this branch except for the dtype_record_builder will be implemented in aldanor#2; if you don't mind I could bring a few extra things from this PR like data_at as well, and make shape and strides return pointers rather than vectors so it's zero cost.

dtype_record_builder seems useful. Maybe nest it under dtype itself? (Like dtype::record_builder, or dtype::make_structured or something). Also, itemsize could be optional, NumPy assumes tight packing if you don't provide itemsize and infers it from the furthest element.

@nevion
Copy link
Contributor Author

nevion commented Sep 4, 2016

@aldanor I didn't see your work when I first started toying around a few days ago... I hate it when this happens. Happens too often.

Not including numpy.h has a cost of more python calls.... and I got around to thinking tonight - why do we need to avoid the c-api like the plague again? Perhaps we can fix these reasons upstream. Do we just not want to find the include - which is available from numpy.get_include? Then we get to rip out all of this facade library-in-a-library repeating core declarations that we're betting will not change. The only other real wart I see is the complaint from the api about the _import function not being used and that a user would need to call import_array() somewhere prior to use.... like their init function.

I'd still rather keep empty mind you, principle of least surprise. Perhaps we can just let it construct the object directly though, if it bothers you so to call the function explicitly.

copy flag being unsafe... it's purpose is to let you cross the casm into a ndarray for handing around to routines when you need a python handle to it temporarily. This happens all the time in image processing like with cv::Mat_ (images and number data, sound familiar?), Eigen::Map, most C++ image processing libraries. I cannot find it a reasonable argument to say NIMBY. This is not a new concept and is very important in some contexts. You have to explicitly say copy= false for it go down this code path too, btw - so it's opt-in for the experts.

@aldanor
Copy link
Member

aldanor commented Sep 4, 2016

@nevion No worries. It would be a complete PITA to rebase my branch of top of this one though if it gets merged (particularly so since some things are done differently, like flags for array), so I'd much prefer to do it all in one place. That being said, there's a few benign non-conflicting things here like the dtype record builder.


As for numpy.h API -- I'm not sure which Python calls are you talking about? The NumPy API that's exposed through Python is just raw function pointers, and they are cached after the first call to NumPy, so there is no overhead in using it.

I agree it's a bit clunky in the current state; and if we choose to replace it in favor of just including NumPy headers directly, then the existing code should also be changed to use that.

@wjakob are there any other points for/against importing NumPy API at runtime through Python that I'm missing?

// There's not too much to gain though, aside from a few replaceable macros and PyArray / PyArray_Descr object definitions (which could all be done without including numpy.h -- see the implementation in my branch)


Setting copy = false does not ensure the data behind the pointer will not be copied. NumPy doesn't guarantee that and there's no ensure_no_copy flag. The inverse is enforceable via NPY_ENSURECOPY.

One of the common use cases is to create NumPy "views" where data is not owned by NumPy but is a member pointer of a C++ object, in which case we can bind the lifetime of the object to that of the view, e.g.:

.def("get_view", [](Foo& self) { 
    return array(self.size, self.ptr, /* owner = */ py::cast(self)); 
})

In some cases when you are sure that lifetimes won't go bad (e.g. create a numpy view with the lifetime of the current scope), we can allow passing e.g. None, to indicate that we want a view, but there's no explicit Python owner (this is the use case you had in mind?):

auto view = array(size, ptr, py::none());

Would this work for you?

Both of these would disable copying. Note, however, it's not guaranteed that the data won't be copied. I thought of having an "ensure no copy" flag, which would verify that the resulting data pointer is the same as the data pointer passed; it might be useful in some cases.

@wjakob
Copy link
Member

wjakob commented Sep 4, 2016

Hi,

the goal of all this work with function pointers is to avoid a super-annoying pybind11->NumPy dependency. In this way, we can distribute modules for various python versions that will work no matter what NumPy release is installed (and importantly, NumPy is not needed at all at compile time). At runtime, the NumPy import must obviously happen at some point, but that is fine. I think the most sensible thing at this point is to port the missing bits to @aldanor's branch.

Cheers,
Wenzel

@nevion
Copy link
Contributor Author

nevion commented Sep 4, 2016

While its a sort of cool feature, why is having numpy at compile time when including the ndarray a bad thing?

The same mostly applies if the bet is right that things are unchanging/compatible.

What is the super annoying part of including the numpy api - most anyone only have a system installation and you could override. The facade library inside is betting on a modest amount of ABI not changing. If the underlying assumption is true on both of them, function should also work. Further flags, and structure definitions + macros under these assumptions means the numpy library gets you out of the inner facade library.

I was thinking of trying to fix the common complaints people have with the ndarray c-api - so if you have a complaint, please voice it. I can tell you from years of boost.python, including anything with ndarray meant including and initializing the c-api for me.

@wjakob
Copy link
Member

wjakob commented Sep 5, 2016

It's reasonably simple to mirror the (very small) portion of the stable API that we are using within pybind11--that's why it is done here to cut down on a compile-time dependency. Another positive aspect is that it's always very clear what parts of the API we are using, and that compiled projects will run with a variety of older NumPy releases.

Whether reducing compile-time dependencies is generally a useful thing to do is obviously subjective. As someone who deploys binaries for a large number of pybind11-based projects to different machines, it is important to me.

@nevion
Copy link
Contributor Author

nevion commented Sep 5, 2016

@aldanor numpy does not make a copy of data if you hand it in, you don't need a flag for that, it is the behavior that it has had for a long time. If you google around you'll find this is what people have done when they want to use it. If they haven't stated it as a guarantee, then it simply fell by the wayside in the many evolutions of numpy. Source routine: PyArray_NewFromDescr_int in ctors.c

My suggestion is not to build unsteady wrapping that does it's own thing rather than follow numpy's behavior - focus on being a way to use the library, not to change it's behavior.

Even creating dtypes with specified fields/names/offsets/dtypes is undocumented behavior.

@nevion
Copy link
Contributor Author

nevion commented Sep 5, 2016

@wjakob you are not the first to make that decision but it's creating a self-propagating issue which you and other "users" of numpy on the c-side undergo. The savings from not having the compile time dependency are mostly not there - you're already on a developers machine with this library at compile time, who has python-devel installed and if they're using numpy, it's a non issue to get numpy-devel installed, or is preinstalled in the case of distributions like anaconda.

My first thought was simply separating out the import declarations they make which cause warnings. The ABI + API stability is of the library is something that will have to be question with upstream.

But rather than cause trouble for you for now by breaking compatibility with old stuff, lets think of that as the short term fix and a better long term one with the c-api, at least to the benefit of other projects. What do you think could/should be done to benefit and eventually include into your own project?

Perhaps we can fix this for the next API expansion - numpy is not slowing down on development anytime soon and I'd hate to see the facade, enumerations and flags, keep growing.

@wjakob
Copy link
Member

wjakob commented Sep 5, 2016

I prefer the current approach for now but would be willing to reconsider if NumPy at some point includes a compact and reasonably clean header file that is meant for "consumers" of the library (vs NumPy itself).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants