-
Notifications
You must be signed in to change notification settings - Fork 484
Closed
Description
We have a lot of regexps and those in turn use a lot of memory. If you look at a heap profile, you can find something like this staring back at you:

I made the following change:
$ git diff
diff --git a/regex-automata/src/dfa/onepass.rs b/regex-automata/src/dfa/onepass.rs
index 700f2b1..d4517ab 100644
--- a/regex-automata/src/dfa/onepass.rs
+++ b/regex-automata/src/dfa/onepass.rs
@@ -39,6 +39,8 @@ configuring a one-pass DFA.
//
// Thus, in this crate, we call it a one-pass DFA.
+use std::eprintln;
+
use alloc::{vec, vec::Vec};
use crate::{
@@ -878,6 +880,12 @@ impl<'a> InternalBuilder<'a> {
self.dfa
.table
.extend(core::iter::repeat(Transition(0)).take(self.dfa.stride()));
+
+ eprintln!(
+ "len = {}, cap = {}",
+ self.dfa.table.len(),
+ self.dfa.table.capacity()
+ );
// The default empty value for 'PatternEpsilons' is sadly not all
// zeroes. Instead, a special sentinel is used to indicate that there
// is no pattern. So we need to explicitly set the pattern epsilons to
And then I compiled a random regexp I found with regex-cli
:
$ cargo run --package regex-cli --bin regex-cli -- debug onepass '^(?<v2set_furl__gs5>(?<v0set_furl_isperm__1>(?:http[s]?://foo\.example\.com(?:[:][0-9]+)?/(?<p1>.*))))'
Finished `dev` profile [optimized + debuginfo] target(s) in 0.02s
Running `target/debug/regex-cli debug onepass '^(?<v2set_furl__gs5>(?<v0set_furl_isperm__1>(?:http[s]?://foo\.example\.com(?:[:][0-9]+)?/(?<p1>.*))))'`
len = 64, cap = 64
len = 128, cap = 128
len = 192, cap = 256
len = 256, cap = 256
len = 320, cap = 512
len = 384, cap = 512
len = 448, cap = 512
len = 512, cap = 512
len = 576, cap = 1024
len = 640, cap = 1024
len = 704, cap = 1024
len = 768, cap = 1024
len = 832, cap = 1024
len = 896, cap = 1024
len = 960, cap = 1024
len = 1024, cap = 1024
len = 1088, cap = 2048
len = 1152, cap = 2048
len = 1216, cap = 2048
len = 1280, cap = 2048
len = 1344, cap = 2048
len = 1408, cap = 2048
len = 1472, cap = 2048
len = 1536, cap = 2048
len = 1600, cap = 2048
len = 1664, cap = 2048
len = 1728, cap = 2048
len = 1792, cap = 2048
len = 1856, cap = 2048
len = 1920, cap = 2048
len = 1984, cap = 2048
len = 2048, cap = 2048
len = 2112, cap = 4096
len = 2176, cap = 4096
len = 2240, cap = 4096
len = 2304, cap = 4096
parse time: 381.541µs
translate time: 140.625µs
compile nfa time: 743.584µs
compile one-pass DFA time: 369.459µs
memory: 18436
states: 36
pattern len: 1
alphabet len: 41
stride: 64
onepass::DFA(
D 000000:
000001: \x0F => 2 (S-0-2/A)
000002: \x18 => 3
000003: \x18 => 4
000004: \x15 => 5
000005: \x06 => 7, \x17 => 6
000006: \x06 => 7
000007: \x04 => 8
000008: \x04 => 9
000009: \r => 10
000010: \x14 => 11
000011: \x14 => 12
000012: \x03 => 13
000013: \x0C => 14
000014: \x1A => 15
000015: \x08 => 16
000016: \x12 => 17
000017: \x15 => 18
000018: \x11 => 19
000019: \x0C => 20
000020: \x03 => 21
000021: \n => 22
000022: \x14 => 23
000023: \x12 => 24
000024: \x04 => 34, \x06 => 25
000025: \x05 => 27
000026: \x1C => 30
000027: \x04 => 34, \x05 => 27
000028: \x1C-\x1E => 35
000029: \x1E => 28
000030: \x1C-\x1E => 28
000031: \x1C-\x1D => 28
000032: \x1D-\x1E => 30
000033: \x1C-\x1E => 30
* 000034 (0/S-1-3-4-5): \x00 => 35 (S-4), \x02-\x1B => 35 (S-4), ' ' => 28 (S-4), ! => 29 (S-4), \" => 30 (S-4), # => 31 (S-4), $ => 30 (S-4), % => 32 (S-4), & => 33 (S-4), \' => 26 (S-4)
* 000035 (0/S-1-3-5): \x00 => 35, \x02-\x1B => 35, ' ' => 28, ! => 29, \" => 30, # => 31, $ => 30, % => 32, & => 33, \' => 26
START(ALL): 1
state length: 36
pattern length: 1
)
The key lines are these ones:
len = 2048, cap = 2048
len = 2112, cap = 4096
len = 2176, cap = 4096
len = 2240, cap = 4096
len = 2304, cap = 4096
We extended the vec to 2304 entries, but the memory usage is dictated by capacity, which is 4096.
Is this an opportunity to reduce memory usage by right sizing the vec?
migueldemoura
Metadata
Metadata
Assignees
Labels
No labels