Skip to content

Expanded RegEx Support? #2432

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
hcldan opened this issue Mar 3, 2023 · 18 comments
Open

Expanded RegEx Support? #2432

hcldan opened this issue Mar 3, 2023 · 18 comments

Comments

@hcldan
Copy link
Contributor

hcldan commented Mar 3, 2023

@pvdrz
Following on the heels of #2328 ... I've changed to start using the new option and it's working great but I ran into a snag...

Some of the older c libs that I'm using have structs that don't always follow proper naming patterns.
Rather than having to exclude them from the bindgen, I'd like to just filter them from having a derive applied to them.

Here's a sample of the RegEx I was trying to use:
https://regex101.com/r/pjeUkB/1

Basically, If the struct starts with "My" and doesn't contain Foo in it, then apply the derive.
This doesn't seem to work, and seems to silently ignore the regex. Am I doing this wrong?

--with-derive-custom-struct '^My((?!Foo).)+$'=MyStructTrait

This doesn't seem to match anything. I have verified that this does though:

--with-derive-custom-struct '^My.+$'=MyStructTrait

But it matches the ones I don't want as well.
Does the regex impl in use not support lookahead or negative lookahead?

I would have expected the regex compile to fail or something.

Bindgen Invocation

--with-derive-custom-struct '^My((?!Foo).)+$'=MyStructTrait

Actual Results

No structs are labeled with a MyStructTrait derive.

Expected Results

I expect to see:

#[derive(MyStructTrait)]
struct MyStruct {}

struct MyFoo {}
@pvdrz
Copy link
Contributor

pvdrz commented Mar 6, 2023

@hcldan
Copy link
Contributor Author

hcldan commented Mar 6, 2023

I did not think so. I did try with and without the ^$ and the regex works in either case as long as I'm not using the negative lookahead

@hcldan
Copy link
Contributor Author

hcldan commented Mar 6, 2023

I'll try again without them.

@hcldan
Copy link
Contributor Author

hcldan commented Mar 6, 2023

removing the ^ and $ had no effect, the regex still matches nothing now.

@pvdrz
Copy link
Contributor

pvdrz commented Mar 6, 2023

Hmmm ok yeah I get what you mean. Sadly regex does not support negative lookaheads: rust-lang/regex#127. Apparently other crates like fancy-regex support this syntax.

That being said, I think changing the regex crate could have some side effects that we would have to investigate: Performance, breaking changes, etc. So, we would need to discuss this more broadly.

For the time being a workaround would be to use the ParseCallbacks API and write the logic you want by hand.

@hcldan
Copy link
Contributor Author

hcldan commented Mar 6, 2023

or maybe conditionally use fancy-regex? :) sigh... I'm so close to not needing to use the parsecallbacks

@hcldan
Copy link
Contributor Author

hcldan commented Mar 6, 2023

at a minimum... should this panic if it tries to compile a regex with a feature we don't support (hopefully yet)?

@pvdrz
Copy link
Contributor

pvdrz commented Mar 6, 2023

I agree

at a minimum... should this panic if it tries to compile a regex with a feature we don't support (hopefully yet)?

Yeah I agree, right now we emit a warn log entry but most people won't see it.

@pvdrz
Copy link
Contributor

pvdrz commented Mar 8, 2023

@hcldan I have news!

So regarding diagnostics, I'm working on a little something to actually display warnings and I'll have something ready soon.
image

About the actual issue you're having I think we might abuse the fact that we have the experimental feature to test this by replacing regex by another crate when this feature is enabled.

That being said I'm not sure if fancy-regex is the right crate just because it doesn't offer the same API as the regex crate. In particular, it doesn't have a regex set type which we heavily rely on for two reasons:

  • First, it let us know which pattern inside the set was matched and with that we can tell the user if a pattern was unused.
  • Second, it is as efficient as a single regex.
    Reason 1 means that we cannot achieve the same by concatenating patterns using |. Reason 2 means that we cannot use a Vec<Regex> instead as it would be less performant.

If we could solve these issues, we could move forward.

@hcldan
Copy link
Contributor Author

hcldan commented Mar 8, 2023

I was hoping to read in that ticket you linked that they would add support for lookaround via a crate feature...
Oh well...

@hcldan
Copy link
Contributor Author

hcldan commented Mar 8, 2023

for the first bullet... wouldn't looking at the CaptureNames and correlating them to the CaptureMatches tell you what part of the regex matched or didn't?
image

@pvdrz
Copy link
Contributor

pvdrz commented Mar 8, 2023

maybe? It all depends on the order/labeling of the groups. If someone passes the regexes Foo and Bar and we build something like (Foo)|(Bar) then we know which group matched what. But if they pass (Foo) and Bar then we get ((Foo))|(Bar) and we might get confused deciding which groups are ours and which groups aren't.

Even if groups are named it could be that users name their groups for some odd reason. We would have to use some obfuscated enough names to do so.

@hcldan
Copy link
Contributor Author

hcldan commented Mar 8, 2023

Where are the regexes being smooshed together?

I can certainly understand that people want performant regex, and I would not mind at all if the features that make it potentially less-performant were roped off into a feature...

It's a hack but I'd be fine with doc that says: "don't use names like this" or something.
Hmm... so what do you do today to group everything up? Smash all the regexes together and process each cli arg target against the 1 regex?

Is it really that much less performant to run n different regexes than 1 regex that's a concatenation of all n?
I mean I would believe that you are correct, but is that cost actually prohibitive? I would think that the larger concern would be the worst-case performance of a look-around.

@hcldan
Copy link
Contributor Author

hcldan commented Mar 8, 2023

Alternatively.... do you know of a way that I can write a derive macro to fail silently, without an error and not actually implementing the trait, based on some logic I have in the macro? Can I fail without a panic and just have the type not be derived?

It's certainly not ideal to have it labelled as #[derive(Thing)] and have it not actually be derived...

@pvdrz
Copy link
Contributor

pvdrz commented Mar 8, 2023

For the record, I'm not saying that we shouldn't have this. Just that we have to figure out the best way to handle it.

Hmm... so what do you do today to group everything up? Smash all the regexes together and process each cli arg target against the 1 regex?

No, we use regex::RegexSet which does that for us. Given that fancy-regex does not have anything like it, we would need to do some smooshing ourselves and that's where the capture groups might become a headache.

Is it really that much less performant to run n different regexes than 1 regex that's a concatenation of all n? I mean I would believe that you are correct, but is that cost actually prohibitive? I would think that the larger concern would be the worst-case performance of a look-around.

I'd say yes. The extreme scenario would be having two expressions that only differ in the last symbol. Their automaton would be the same except for the last transition. If I understand correctly, if you had something like Foo1 and Foo2 in the same RegexSet it would be compiled to something akin to Foo[12] which removes the cost of matching Foo twice. The issue is that these RegexSets are used everytime bindgen sees an ident of the right kind (functions, types, vars, etc) so any performance hit there could impact the overall speed of bindgen siginifcantly.

bindgen already has a way to measure how much time is spent on each step. Maybe we can measure how bad is it.

Alternatively.... do you know of a way that I can write a derive macro to fail silently, without an error and not actually implementing the trait, based on some logic I have in the macro? Can I fail without a panic and just have the type not be derived?

Well you would have to define your own derive macro crate for that. But If I were to do that, I'd do the name handling there instead of relying it to bindgen.

@hcldan
Copy link
Contributor Author

hcldan commented Mar 8, 2023

For the record, I'm not saying that we shouldn't have this. Just that we have to figure out the best way to handle it.

Oh, I know. I very much appreciate the discussion here. I'm just exploring options.

@hcldan
Copy link
Contributor Author

hcldan commented Mar 14, 2023

Well you would have to define your own derive macro crate for that. But If I were to do that, I'd do the name handling there instead of relying it to bindgen.

I went down that path. I'm not thrilled of having a derive trait that just doesn't do it. I found a way to emit warnings, but I still feel uneasy about devs seeing a derive and it not having those traits. Oh well. I'm unblocked, and while it would be cool to have lookarounds for this project, I get why it might be an issue. I guess this issue can just fill the need to emit an error/warning for patterns that aren't supported.

@pvdrz
Copy link
Contributor

pvdrz commented Mar 14, 2023

We're working already on the diagnostics part. Yeah if we can figure out a way to make fancy-regex work maybe we can give it a try.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants