Commit c250f8b

Include information about native regexps.
1 parent 61b0230 commit c250f8b

1 file changed: +37 -39 lines changed
active/0000-regexps.md

Lines changed: 37 additions & 39 deletions
@@ -90,7 +90,9 @@ exposes byte indices.
 ## Word boundaries, word characters and Unicode
 
 At least Python and D define word characters, word boundaries and space
-characters with Unicode character classes. I propose we do the same.
+characters with Unicode character classes. My implementation does the same
+by augmenting the standard Perl character classes `\d`, `\s` and `\w` with
+corresponding Unicode categories.
 
 ## Leftmost-first
 
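To make the Unicode-aware classes in this hunk concrete, here is a minimal sketch using only the Rust standard library. The function names are hypothetical, and the `char` predicates are an approximation chosen for illustration (for instance, `char::is_numeric` covers the Nd, Nl and No categories, which may be broader than what the implementation uses for `\d`). It is not the code from the commit.

```rust
/// `\d` (assumption: approximated by std's Unicode numeric predicate).
pub fn is_digit(c: char) -> bool {
    c.is_numeric()
}

/// `\s` (assumption: approximated by the Unicode White_Space property).
pub fn is_space(c: char) -> bool {
    c.is_whitespace()
}

/// `\w`: Unicode letters and numbers, plus the underscore as in Perl.
pub fn is_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// `\b` at byte offset `i`: true when exactly one neighbour is a word
/// character. Panics if `i` is not on a UTF-8 character boundary.
pub fn is_word_boundary(text: &str, i: usize) -> bool {
    let before = text[..i].chars().next_back().map_or(false, is_word);
    let after = text[i..].chars().next().map_or(false, is_word);
    before != after
}
```

With these definitions, the position between `h` and `é` in `héllo` is not a word boundary, whereas an ASCII-only `\w` would report one there.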
@@ -147,31 +149,39 @@ an expression.
 ## The `regexp!` macro
 
 With syntax extensions, it's possible to write an `regexp!` macro that compiles
-an expression when a Rust program is compiled. In my case, it seemed simplest
-to compile it to *static* data. For example:
+an expression when a Rust program is compiled. This includes translating the
+matching algorithm to Rust code specific to the expression given. This "ahead
+of time" compiling results in a performance increase. Namely, it elides all
+heap allocation.
 
-    static re: Regexp = regexp!("a*");
+I've called these "native" regexps, whereas expressions compiled at runtime are
+"dynamic" regexps. The public API need not impose this distinction on users,
+other than requiring the use of a syntax extension to construct a native
+regexp. For example:
 
-At first this seemed difficult to accommodate, but it turned out to be
-relatively easy with a type like this:
+    let re = regexp!("a*");
 
-    pub enum MaybeStatic<T> {
-        Dynamic(Vec<T>),
-        Static(&'static [T]),
-    }
+After construction, `re` is indistinguishable from an expression created
+dynamically:
+
+    let re = Regexp::new("a*").unwrap();
+
+In particular, both have the same type. This is accomplished with a
+representation resembling:
 
-Another option is for the `regexp!` macro to produce a non-static value, but I
-found this difficult to do with zero-runtime cost. Either way, the ability to
-statically declare a regexp is pretty cool I think.
+    enum MaybeNative {
+        Dynamic(~[Inst]),
+        Native(fn(MatchKind, &str, uint, uint) -> ~[Option<uint>]),
+    }
 
-Note that the syntax extension is the reason for the `regexp_macros` crate. It's
-very small and contains the macro registration function. I'm not sure how this
-fits into the Rust distribution, but my vote is to document the `regexp!` macro
-in the `regexp` crate and hide the `regexp_macros` crate from public
-documentation. (Or link it to the `regexp` crate.)
+This syntax extension requires a second crate, `regexp_macros`, where the
+`regexp!` macro is defined. Technically, this could be provided in the `regexp`
+crate, but this would introduce a runtime dependency on `libsyntax` for any use
+of the `regexp` crate.
 
-It seems like the `regexp!` macro will become a bit nicer to use once
-[#11640](https://github.com/mozilla/rust/issues/11640) is fixed.
+[@alexcrichton
+remarks](https://github.com/rust-lang/rfcs/pull/42#issuecomment-40320112)
+that this state of affairs is a wart that will be corrected in the future.
 
 ## Untrusted input
 
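The "same type for native and dynamic regexps" idea from the hunk above can be sketched in present-day Rust as follows. This is an illustration under stated assumptions, not the crate's code: the one-variant-per-feature `Inst` set, the simplified `NativeFn` signature, the constructors `Regexp::dynamic` and `Regexp::native`, and the hand-written `native_a_star` (standing in for what a macro might generate for `a*`) are all hypothetical, and matching is anchored at the start of the text to keep the example short.

```rust
/// Hypothetical, drastically reduced instruction set: just enough for `a*`.
#[derive(Clone, Copy, Debug)]
pub enum Inst {
    /// Greedily consume zero or more copies of a character.
    Star(char),
    /// Report a match ending at the current position.
    Match,
}

/// Simplified signature for generated matching code (an assumption of this
/// sketch): returns the (start, end) byte offsets of a match.
pub type NativeFn = fn(&str) -> Option<(usize, usize)>;

/// Either a program to interpret at runtime or a pointer to compiled code.
enum MaybeNative {
    Dynamic(Vec<Inst>),
    Native(NativeFn),
}

/// One public type, regardless of how the expression was compiled.
pub struct Regexp {
    prog: MaybeNative,
}

impl Regexp {
    /// Stand-in for runtime compilation: takes ready-made instructions.
    pub fn dynamic(insts: Vec<Inst>) -> Regexp {
        Regexp { prog: MaybeNative::Dynamic(insts) }
    }

    /// Stand-in for what a syntax extension would construct.
    pub fn native(f: NativeFn) -> Regexp {
        Regexp { prog: MaybeNative::Native(f) }
    }

    /// Callers cannot tell which variant is behind the value.
    pub fn find(&self, text: &str) -> Option<(usize, usize)> {
        match &self.prog {
            MaybeNative::Dynamic(insts) => interpret(insts, text),
            MaybeNative::Native(f) => f(text),
        }
    }
}

/// Tiny interpreter for the dynamic case, anchored at offset 0.
fn interpret(insts: &[Inst], text: &str) -> Option<(usize, usize)> {
    let mut pos = 0;
    for inst in insts {
        match *inst {
            Inst::Star(c) => {
                while text[pos..].starts_with(c) {
                    pos += c.len_utf8();
                }
            }
            Inst::Match => return Some((0, pos)),
        }
    }
    None
}

/// Hand-written equivalent of code specialised for `a*`: no instruction
/// vector and no heap allocation.
pub fn native_a_star(text: &str) -> Option<(usize, usize)> {
    let end = text.bytes().take_while(|&b| b == b'a').count();
    Some((0, end))
}
```

Because both constructors return `Regexp`, values built either way can be stored in the same collection and passed to the same functions.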
@@ -234,11 +244,7 @@ Finally, it is always possible to persist without a regexp library.
 
 # Unresolved questions
 
-Firstly, I'm not entirely clear on how the `regexp_macros` crate will be handled.
-I gave a suggestion above, but I'm not sure if it's a good one. Is there any
-precedent?
-
-Secondly, the public API design is fairly simple and straight-forward with no
+The public API design is fairly simple and straight-forward with no
 surprises. I think most of the unresolved stuff is how the backend is
 implemented, which should be changeable without changing the public API (sans
 adding features to the syntax).
@@ -247,8 +253,8 @@ I can't remember where I read it, but someone had mentioned defining a *trait*
 that declared the API of a regexp engine. That way, anyone could write their
 own backend and use the `regexp` interface. My initial thoughts are
 YAGNI---since requiring different backends seems like a super specialized
-case---but I'm just hazarding a guess here. (If we go this route, then we'd
-probably also have to expose the regexp parser and AST and possibly the
+case---but I'm just hazarding a guess here. (If we go this route, then we
+might want to expose the regexp parser and AST and possibly the
 compiler and instruction set to make writing your own backend easier. That
 sounds restrictive with respect to making performance improvements in the
 future.)
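For readers wondering what the trait-based alternative discussed in this hunk might look like, here is a rough sketch. Everything in it is hypothetical: the trait name `Engine`, its methods, the toy `Literal` backend and the `count_matching_lines` helper are invented for illustration and are not part of the proposal.

```rust
/// Hypothetical backend interface; the name and methods are illustrative.
pub trait Engine {
    /// Leftmost match as (start, end) byte offsets, if any.
    fn find(&self, text: &str) -> Option<(usize, usize)>;

    /// Provided method built on `find`, so backends get it for free.
    fn is_match(&self, text: &str) -> bool {
        self.find(text).is_some()
    }
}

/// Toy backend that only understands literal substrings.
pub struct Literal(pub String);

impl Engine for Literal {
    fn find(&self, text: &str) -> Option<(usize, usize)> {
        text.find(self.0.as_str())
            .map(|start| (start, start + self.0.len()))
    }
}

/// User code written against the trait works with any backend.
pub fn count_matching_lines<E: Engine>(engine: &E, lines: &[&str]) -> usize {
    lines.iter().filter(|line| engine.is_match(line)).count()
}
```

The sketch also hints at the cost raised in the text: once third parties implement such a trait, its method signatures become part of the stable surface and constrain later changes to the built-in backend.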
@@ -263,19 +269,11 @@ For now, we could mark the API as `#[unstable]` or `#[experimental]`.
 
 I think most of the future work for this crate is to increase the performance,
 either by implementing different matching algorithms (e.g., a DFA) or by
-compiling a regular expression to native Rust code.
-
-With regard to native compilation, there are a few notes:
+improving the code generator that produces native regexps with `regexp!`.
 
-* If and when a DFA is implemented, care must be taken, as the size of the code
-required can grow rapidly.
-* Adding native compilation will very likely change the interface of the crate
-in a meaningful way, particularly if we want the interface to be consistent
-between natively compiled and dynamically compiled regexps. (i.e., Make
-`Regexp` a trait.)
+If and when a DFA is implemented, care must be taken when creating a code
+generator, as the size of the code required can grow rapidly.
 
 Other future work (that is probably more important) includes more Unicode
-support, specifically for simple case folding. Also, words and word boundaries
-should also be Unicode friendly, but I plan to have this done before I submit a
-PR.
+support, specifically for simple case folding.
 