-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Use PyCapsule for internal datetime functions #51525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Of note - this PR is huge, but it was really difficult finding an intermediate point that wasn't unnecessarily convoluted. Generally, the changes here can be abstracted as:
|
pandas/_libs/__init__.py
Outdated
# Below import needs to happen first to ensure pandas top level | ||
# module gets monkeypatched with the pandas_datetime_CAPI | ||
# see pandas_datetime_exec in pd_datetime.c | ||
import pandas._libs.pandas_datetime as pandas_datetime # noqa # isort: skip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a fan of this, but capsules currently cannot be attached to anything but a top level package. See python/cpython#6898 which I think would solve this, but for now this has to be attached to the top level namespace and really come before anything else to avoid circular import errors
npy_int32 hrs, min, sec, ms, us, ns, seconds, microseconds, nanoseconds; | ||
} pandas_timedeltastruct; | ||
|
||
const npy_datetimestruct _AS_MIN_DTS = {1969, 12, 31, 23, 59, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were originally declared with extern
linkage and defined within the source file, which was included during compilation of each additional library. I think this was a bit heavy handed, so went with a direct definition here in the header with this refactor
@@ -337,7 +335,7 @@ def run(self): | |||
if os.environ.get("PANDAS_CI", "0") == "1": | |||
extra_compile_args.append("-Werror") | |||
if debugging_symbols_requested: | |||
extra_compile_args.append("-g") | |||
extra_compile_args.append("-g3") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit tangential but g3 provides visibility into macros with gcc, which is useful for this PR
setup.py
Outdated
"sources": ["pandas/_libs/tslibs/src/datetime/np_datetime.c"], | ||
# TODO: xstrtod symbol visibility is really murky here | ||
"include": klib_include, | ||
"sources": ["pandas/_libs/src/parser/tokenizer.c"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the np_datetime.c source yielded complaints from the linker about not being able to find xstrtod
, which is defined in tokenizer.c
; a more Python approach may be to make tokenizer a capsule in another PR
return -1; | ||
} | ||
|
||
if (PyModule_AddObject(pandas, "pandas_datetime_CAPI", capsule) < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where the monkeypatching of pandas actually occurs
PyErr_NoMemory(); | ||
return -1; | ||
} | ||
capi->npy_datetimestruct_to_datetime = npy_datetimestruct_to_datetime; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The macros in the header file dispatch to these struct members, which in turn reference static functions defined within this module
@MarcoGorelli @Dr-Irv would either of you be able to try this on windows and see if it fixes parallel compilation? I think there is still some risk of file clashing due to the |
return -1; | ||
} | ||
|
||
if (PyModule_AddObject(pandas, "_pandas_datetime_CAPI", capsule) < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self - CPython documentation prefers PyModule_AddObjectRef
over this. Maybe worth trying instead of using datetime.h pattern:
https://docs.python.org/3/c-api/module.html#c.PyModule_AddObject
@WillAyd test via |
Gave it a shot, and have the same errors. |
Hmm that's unfortunate. That was after a clean right? |
Yes |
Are there any tangible impacts? perf? build size/time? will this be easier to maintain (for those of us who don't know about capsule stuff)? I'm generally happy to defer to you on [gestures broadly at C files], but the pinning of attributes to the pandas namespace is pretty ugly, and im not a fan of the import_datetime/import_array/import_pandas_datetime pattern (which is easy for beginners to mess up and get segfaults) Should we just say "screw it" and move the relevant code to cython? |
Assuming this helps parallel compilation should help build time a lot, and yea will reduce the size of some of the libraries, but not expecting by a noticable amount. For the PR that you tried to replace the dynamic Python attribute access calls with C struct level calls this will definitely help I agree about the pandas namespace monkeypatch. Really suboptimal but unfortunately seems to be a limitation of the language for now... As far as converting to Cython I personally think it would be pretty awkward to do that, but if someone wanted to try its certainly up for discussion. |
@Dr-Irv any chance you can share the build log and exact error message? maybe worth trying a |
Did the
(pandas-dev) c:\Code\pandas_dev\pandas>python setup.py clean
C:\Anaconda3\envs\pandas-dev\lib\site-packages\setuptools\config\pyprojecttoml.py:108: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
warnings.warn(msg, _BetaConfiguration)
running clean
(pandas-dev) c:\Code\pandas_dev\pandas>python setup.py build_ext -j 4 Creating library build\te m p .Cwriena-taimndg 64-clpibryatrhyo nb-u3i8l\dR\etleemaps.ew\ipna-nadmads6\4_-cplyithobns-\ar3r8\Raeyls.cepa3s8e-\wpiann_daamds6\4_.lliibbs \ahnads hoibnjge.cctp 3b8u-iwlidn_taemmdp.w6i4n.-amld64i-cpbyth ona-n3d8\ Roebljeeacste \bpuainldda\st\e_mlpi.bwsi\na-rarmady6s4.-ccpp3y8t-hwoinn-_3a8m\dR6e4l.eeaxspe |
Awesome very helpful. In that log I see this as the error:
Does anything look out of the ordinary with your file permissions on |
Ignore that question - I don't think it has to do with your file permissions. Let me see what I can do to get that to work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments @jbrockmendel and @lithomas1 helped simplify this a good deal. I think this is mergable in its current state. The main follow up I want to do is get rid of the wrapper functions in parsers.pyx, but trying to make an MRE of those is going to take some time
#include "pd_parser.h" | ||
#include "src/parser/io.h" | ||
|
||
static int to_double(char *item, double *p_value, char sci, char decimal, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and floatify were moved from parse_helper.h
- seem to more naturally fit in a module than a header
@@ -28,41 +28,6 @@ This file is derived from NumPy 1.7. See NUMPY_LICENSE.txt | |||
#include "np_datetime.h" | |||
|
|||
|
|||
const npy_datetimestruct _AS_MIN_DTS = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were moved to the header file. The extern
declaration puts the onus on the linker to resolve these symbols, which doesn't play nicely with setuptools. I am not aware of any downside to placing as static const
in the header; upside is it plays nicer with our build system
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also just put them directly in the cython file. IIRC i only put them here originally bc doing so is less verbose than instantiating them in cython.
Any other thoughts on this? |
The sdist build is failing (I tagged this with the "Build" label a while back). |
Ah cool - didn't realize the tag changed CI. Let me take a look |
All green looks like I needed to update the MANIFEST. In a follow up I think it would be nice to more logically restructure our non-Cython files into src/include/lib folders to match a traditional project layout |
Any other feedback on this? I think can enable parallel builds back in CI and finish working on the performance improvements @jbrockmendel started in the datetime functions as quick follow ups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's give this a try.
if @lithomas1 is on board to handle the lottery scenario, im in. |
Thanks @jbrockmendel @lithomas1 for review |
This reverts commit 251b512.
closes #47305
Our datetime functions are used to support other extensions. We've historically managed this by compiling the source files from our datetime modules into the target library, but that doesn't play nicely with parallel compilation and makes a bit of a mess of our symbol management
The documented way to handle an internal C API for other extension modules is to use a capsule; this is similar to how the datetime import works in python so this is heavily modeled after that source code
This also should help @jbrockmendel work in #51368.
The capsule itself has its own quirks, but hoping it represents a better way to manage this