@@ -90,7 +90,9 @@ exposes byte indices.
## Word boundaries, word characters and Unicode

At least Python and D define word characters, word boundaries and space
- characters with Unicode character classes. I propose we do the same.
+ characters with Unicode character classes. My implementation does the same
+ by augmenting the standard Perl character classes `\d`, `\s` and `\w` with
+ corresponding Unicode categories.
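
To illustrate what Unicode-aware character classes mean in practice, here is a minimal sketch that uses only the Rust standard library. The helper names (`is_word_char`, `is_space_char`, `is_word_boundary`) are hypothetical and are not taken from the implementation described here. The standard library predicates only approximate the Unicode categories a regexp engine would consult.

```rust
/// Hypothetical helper: approximates a Unicode-aware `\w` as
/// "alphanumeric or underscore" using std's Unicode predicates.
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Hypothetical helper: approximates a Unicode-aware `\s`.
fn is_space_char(c: char) -> bool {
    c.is_whitespace()
}

/// Reports whether a word boundary (`\b`) sits at byte offset `at`.
/// Assumes `at` lies on a UTF-8 character boundary; slicing panics otherwise.
fn is_word_boundary(text: &str, at: usize) -> bool {
    // Look at the character just before and just after the offset.
    let before = text[..at].chars().next_back().map_or(false, is_word_char);
    let after = text[at..].chars().next().map_or(false, is_word_char);
    // A boundary exists exactly when one side is a word character
    // and the other is not (or is the edge of the text).
    before != after
}
```

With ASCII-only classes, a character such as `δ` would not count as a word character, so a boundary would be reported between `δ` and `x` in `δx`. With the Unicode-aware predicates above, it is not.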

## Leftmost-first

@@ -147,31 +149,39 @@ an expression.
## The `regexp!` macro

With syntax extensions, it's possible to write an `regexp!` macro that compiles
- an expression when a Rust program is compiled. In my case, it seemed simplest
- to compile it to *static* data. For example:
+ an expression when a Rust program is compiled. This includes translating the
+ matching algorithm to Rust code specific to the expression given. This "ahead
+ of time" compiling results in a performance increase. Namely, it elides all
+ heap allocation.

-     static re: Regexp = regexp!("a*");
+ I've called these "native" regexps, whereas expressions compiled at runtime are
+ "dynamic" regexps. The public API need not impose this distinction on users,
+ other than requiring the use of a syntax extension to construct a native
+ regexp. For example:

- At first this seemed difficult to accommodate, but it turned out to be
- relatively easy with a type like this:
+     let re = regexp!("a*");

-     pub enum MaybeStatic<T> {
-         Dynamic(Vec<T>),
-         Static(&'static [T]),
-     }
+ After construction, `re` is indistinguishable from an expression created
+ dynamically:
+
+     let re = Regexp::new("a*").unwrap();
+
+ In particular, both have the same type. This is accomplished with a
+ representation resembling:

- Another option is for the `regexp!` macro to produce a non-static value, but I
- found this difficult to do with zero-runtime cost. Either way, the ability to
- statically declare a regexp is pretty cool I think.
+     enum MaybeNative {
+         Dynamic(~[Inst]),
+         Native(fn(MatchKind, &str, uint, uint) -> ~[Option<uint>]),
+     }
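
As a rough illustration of how one `Regexp` type can hide the native/dynamic distinction, here is a minimal sketch in present-day Rust syntax. Everything beyond the `MaybeNative` shape is an assumption made for the example: the toy `Inst` set, the simplified `fn(&str) -> Option<usize>` signature, and the names `from_insts`, `from_native`, `match_len`, `run` and `native_a_star` are all hypothetical. A real `regexp!` expansion would generate the native function; here it is written by hand.

```rust
/// Toy instruction set (hypothetical; no backtracking, far simpler than a
/// real regexp program).
enum Inst {
    /// Consume zero or more copies of the character.
    Star(char),
    /// Consume exactly one copy of the character.
    Char(char),
}

/// Either a program interpreted at runtime or a plain function pointer.
enum MaybeNative {
    Dynamic(Vec<Inst>),
    Native(fn(&str) -> Option<usize>),
}

/// The single public type; callers cannot tell which variant is inside.
pub struct Regexp {
    prog: MaybeNative,
}

impl Regexp {
    fn from_insts(insts: Vec<Inst>) -> Regexp {
        Regexp { prog: MaybeNative::Dynamic(insts) }
    }

    fn from_native(f: fn(&str) -> Option<usize>) -> Regexp {
        Regexp { prog: MaybeNative::Native(f) }
    }

    /// Byte length of the match anchored at the start of `text`, if any.
    fn match_len(&self, text: &str) -> Option<usize> {
        match &self.prog {
            MaybeNative::Dynamic(insts) => run(insts, text),
            MaybeNative::Native(f) => f(text),
        }
    }
}

/// Interprets the toy program against the start of `text`.
fn run(insts: &[Inst], text: &str) -> Option<usize> {
    let mut rest = text;
    for inst in insts {
        match inst {
            Inst::Star(c) => rest = rest.trim_start_matches(*c),
            Inst::Char(c) => rest = rest.strip_prefix(*c)?,
        }
    }
    Some(text.len() - rest.len())
}

/// Hand-written stand-in for the code a macro might generate for "a*".
fn native_a_star(text: &str) -> Option<usize> {
    Some(text.bytes().take_while(|&b| b == b'a').count())
}
```

Calling `match_len` on `Regexp::from_insts(vec![Inst::Star('a')])` and on `Regexp::from_native(native_a_star)` goes through the same method on the same type. The native variant stores only a function pointer, so it needs no heap-allocated program.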

- Note that the syntax extension is the reason for the `regexp_macros` crate. It's
- very small and contains the macro registration function. I'm not sure how this
- fits into the Rust distribution, but my vote is to document the `regexp!` macro
- in the `regexp` crate and hide the `regexp_macros` crate from public
- documentation. (Or link it to the `regexp` crate.)
+ This syntax extension requires a second crate, `regexp_macros`, where the
+ `regexp!` macro is defined. Technically, this could be provided in the `regexp`
+ crate, but this would introduce a runtime dependency on `libsyntax` for any use
+ of the `regexp` crate.

- It seems like the `regexp!` macro will become a bit nicer to use once
- [#11640](https://github.com/mozilla/rust/issues/11640) is fixed.
+ [@alexcrichton
+ remarks](https://github.com/rust-lang/rfcs/pull/42#issuecomment-40320112)
+ that this state of affairs is a wart that will be corrected in the future.

## Untrusted input

@@ -234,11 +244,7 @@ Finally, it is always possible to persist without a regexp library.

# Unresolved questions

- Firstly, I'm not entirely clear on how the `regexp_macros` crate will be handled.
- I gave a suggestion above, but I'm not sure if it's a good one. Is there any
- precedent?
-
- Secondly, the public API design is fairly simple and straight-forward with no
+ The public API design is fairly simple and straight-forward with no
surprises. I think most of the unresolved stuff is how the backend is
implemented, which should be changeable without changing the public API (sans
adding features to the syntax).
@@ -247,8 +253,8 @@ I can't remember where I read it, but someone had mentioned defining a *trait*
that declared the API of a regexp engine. That way, anyone could write their
own backend and use the `regexp` interface. My initial thoughts are
YAGNI---since requiring different backends seems like a super specialized
- case---but I'm just hazarding a guess here. (If we go this route, then we'd
- probably also have to expose the regexp parser and AST and possibly the
+ case---but I'm just hazarding a guess here. (If we go this route, then we
+ might want to expose the regexp parser and AST and possibly the
compiler and instruction set to make writing your own backend easier. That
sounds restrictive with respect to making performance improvements in the
future.)
@@ -263,19 +269,11 @@ For now, we could mark the API as `#[unstable]` or `#[experimental]`.

I think most of the future work for this crate is to increase the performance,
either by implementing different matching algorithms (e.g., a DFA) or by
- compiling a regular expression to native Rust code.
-
- With regard to native compilation, there are a few notes:
+ improving the code generator that produces native regexps with `regexp!`.

- * If and when a DFA is implemented, care must be taken, as the size of the code
-   required can grow rapidly.
- * Adding native compilation will very likely change the interface of the crate
-   in a meaningful way, particularly if we want the interface to be consistent
-   between natively compiled and dynamically compiled regexps. (i.e., Make
-   `Regexp` a trait.)
+ If and when a DFA is implemented, care must be taken when creating a code
+ generator, as the size of the code required can grow rapidly.

Other future work (that is probably more important) includes more Unicode
- support, specifically for simple case folding. Also, words and word boundaries
- should also be Unicode friendly, but I plan to have this done before I submit a
- PR.
+ support, specifically for simple case folding.
