Skip to content

Add bindings for POSIX regexes #1719

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 13, 2020
Merged

Add bindings for POSIX regexes #1719

merged 1 commit into from
Apr 13, 2020

Conversation

Minoru
Copy link
Contributor

@Minoru Minoru commented Apr 4, 2020

regex.h is part of POSIX, so I believe it's in the scope of this crate.

Function definitions are the same for BSDs and Linux-lilkes, but I can't put them in src/unix/mod.rs since I believe not all Unix systems adhere to POSIX.

This is a draft because I want to see and fix test failures caused by wrong structure sizes. I only tested this on Linux locally.

@rust-highfive
Copy link

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @JohnTitor (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@Minoru
Copy link
Contributor Author

Minoru commented Apr 4, 2020

I've got a couple questions to maintainers.

regex.h provides a struct, regex_t, which isn't meant to be tampered with; POSIX does not define the fields of this struct. I chose to represent this with an opaque type, i.e. an alias to an array of fixed size. This also lets me avoid adding more types, since regex_t implementations sometimes contain other structs.

Now, there are two problems:

  1. my bindings aren't strictly the same as #include in C, because they hide some details;
  2. I don't know the size of regex_t on all supported platforms. CI only gives me a part of the picture, since it doesn't test e.g. FreeBSD i686, or NetBSD on any architecture.

Regarding problem 1: is it okay if I hide some stuff like this? As far as I understand, turning an array into a struct later would be an API change.

Regarding problem 2: should I only add definitions for targets for which I know the size for sure, or can I add blanket definitions and hope that people run tests on their platforms and will fix any inconsistencies? I'm not willing to chase the source for all OSes, and not willing to spin up VMs to find out the size experimentally.

@bors
Copy link
Contributor

bors commented Apr 5, 2020

☔ The latest upstream changes (presumably #1699) made this pull request unmergeable. Please resolve the merge conflicts.

@Minoru
Copy link
Contributor Author

Minoru commented Apr 5, 2020

I rebased onto current master, but I'll wait for maintainer's reply before pushing it. No point wasting CI resources when I already know what the result would be.

@JohnTitor
Copy link
Member

Sorry for the delay!

regex.h provides a struct, regex_t, which isn't meant to be tampered with; POSIX does not define the fields of this struct. I chose to represent this with an opaque type, i.e. an alias to an array of fixed size. This also lets me avoid adding more types, since regex_t implementations sometimes contain other structs.

If it's an opaque struct, I think an empty enum is suitable, like:

#[cfg_attr(feature = "extra_traits", derive(Debug))]
pub enum fpos64_t {} // FIXME: fill this out with a struct

Regarding problem 2: should I only add definitions for targets for which I know the size for sure, or can I add blanket definitions and hope that people run tests on their platforms and will fix any inconsistencies? I'm not willing to chase the source for all OSes, and not willing to spin up VMs to find out the size experimentally.

If it's unsure, I'd favor the former, small PR is better anyway :)

@Minoru
Copy link
Contributor Author

Minoru commented Apr 12, 2020

Thanks for the feedback! I pushed some changes as a separate commit, so it's easier to see what changed.

If it's an opaque struct, I think an empty enum is suitable

Unfortunately, regex_t is allocated directly by users, so we can't use an empty enum here. It has to be either an array or a struct. I kept using arrays in my latest commit; is this okay?

The fpos64_t example gaves me hope that it's okay to use arrays; if someone needs their internals, another PR can add that.

@Minoru
Copy link
Contributor Author

Minoru commented Apr 12, 2020

Arrays don't work nearly as well as I hoped. By moving regex_t down the module hierarchy, I'm forcing myself to move functions there as well. That'll lead to code duplication. Bad.

I'll try to do this "the proper way" and provide structure definitions (for musl, glibc, and BSDs).

@Minoru Minoru marked this pull request as ready for review April 12, 2020 17:26
@Minoru
Copy link
Contributor Author

Minoru commented Apr 12, 2020

Doing this the "proper" way turned out way easier than I anticipated! Thank you for the guidance, @JohnTitor.

There is no need to review individual commits — I expect this will be squashed into a single one upon merge. If that's not how things are done around here, tell me and I'll force-push the squashed version. Don't want to do it proactively because that'd be a waste of CI time IMHO.

@Minoru
Copy link
Contributor Author

Minoru commented Apr 12, 2020

Oh, for reference, here are the headers I used while working on this:

@JohnTitor
Copy link
Member

Looks good! Then I'll wait for fixups.

@Minoru
Copy link
Contributor Author

Minoru commented Apr 13, 2020

Force-pushed a squashed commit.

@Minoru
Copy link
Contributor Author

Minoru commented Apr 13, 2020

Um, GitHub says the checks are successful, but SemVer jobs actually failed, both on Linux and on macOS. The only difference I see relative to the previous build is that today's build use Nightly 3712e11a8 2020-04-12, whereas yesterday's used e82734e56 2020-04-11.

@JohnTitor
Copy link
Member

Semver check depends on nightly rustc API and the failure is unlrelated to this PR, I'll fix it later :)

Copy link
Member

@JohnTitor JohnTitor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

@JohnTitor JohnTitor merged commit f90bd77 into rust-lang:master Apr 13, 2020
@Minoru Minoru deleted the feature/regex branch April 13, 2020 14:40
@Minoru Minoru mentioned this pull request Apr 13, 2020
@semarie
Copy link
Contributor

semarie commented Apr 14, 2020

FYI, it broke the build on OpenBSD. REG_ENOSYS is undefined here. I will look to address it by moving the NetBSD definition where it should be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants