Use a trie to quickly skip templates in union creation #59759
Conversation
@typescript-bot test it
```ts
function hasAnyMatch(input: string) {
    for (const _ of iterateAllMatches(input)) {
        return true;
    }
    return false;
}
```
Unused in this PR but would be used in #59058.
Hey @jakebailey, I've packed this into an installable tgz. You can install it for testing by referencing it in your package.json and then running npm install. There is also a playground for this build and an npm module you can use.
@jakebailey, the perf run you requested failed. You can check the log here.
Hey @jakebailey, the results of running the DT tests are ready. Everything looks the same!
@jakebailey Here are the results of running the user tests with tsc. Everything looks good!
@jakebailey Here they are:

tsc Comparison Report - baseline..pr
System info: unknown
(Hosts, Scenarios, and Developer Information tables omitted)
@jakebailey Here are the results of running the top 400 repos with tsc. Everything looks good!
```ts
function set(prefix: string, suffix: string, fn: (value: T | undefined) => T | undefined) {
```
An interesting thing about building tries dynamically is that they can grow quite large in memory rather quickly when you do something like this and always create new trie nodes for every new match state. When you're matching a whole dictionary, for example, this can quickly become untenably large. All is not lost, however - there are a large number of sub-tries that are identical (for example, the `-es` suffix trie), and you can save quite a bit of space, at the cost of some time, by compressing the trie and unifying all those identical subtrees. So rather than having a trie like
```mermaid
stateDiagram-v2
1:_
2:_
3:_
4:_
5:_
6:_
7:_
8:_
9:_
10:_
11:_
12:_
[*] --> 1 : s
1 --> 2 : p
2 --> 3 : a
3 --> 4 : c
4 --> 5 : e
5 --> 6 : s
6 --> [*] : End of string
[*] --> 7 : p
7 --> 8 : l
8 --> 9 : a
9 --> 10 : c
10 --> 11 : e
11 --> 12 : s
12 --> [*] : End of string
```
you compress it like so
```mermaid
stateDiagram-v2
s:_
p:_
a:_
c:_
e:_
s`:_
p`:_
l:_
[*] --> s : s
s --> p : p
p --> a : a
a --> c : c
c --> e : e
e --> s` : s
s` --> [*] : End of string
[*] --> p` : p
p` --> l : l
l --> a : a
```
or, better yet,
```mermaid
stateDiagram-v2
s:_
p:_
p`:_
l:_
[*] --> s : s
s --> p : p
p --> [*] : aces
[*] --> p` : p
p` --> l : l
l --> [*] : aces
```
Which, a quick search seems to indicate, is referred to either as a "compressed trie" or just a "suffix tree". The extra memory efficiency, since JS is GC'd, can change the breakpoint where it's worth swapping between the naive scan and the trie-based structure.
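For illustration, here's a minimal sketch of that unification step, assuming a plain Map-based trie (`TrieNode`, `compress`, and `intern` are made-up names, not anything in this PR): structurally identical subtrees are interned under a canonical key, so e.g. every `-es` suffix subtree ends up as one shared node and the trie becomes a DAG.

```ts
// Illustrative sketch only: compress a trie by sharing structurally identical
// subtrees ("hash-consing"). All names here are invented for the example.
interface TrieNode {
    isEnd: boolean;
    children: Map<string, TrieNode>;
}

function compress(root: TrieNode): TrieNode {
    const interned = new Map<string, TrieNode>(); // canonical key -> shared node
    const ids = new Map<TrieNode, number>();      // shared node -> stable id
    let nextId = 0;

    function intern(node: TrieNode): TrieNode {
        const childKeys: string[] = [];
        for (const [edge, child] of node.children) {
            const shared = intern(child);
            node.children.set(edge, shared);      // rewire onto the shared subtree
            childKeys.push(`${edge}:${ids.get(shared)}`);
        }
        childKeys.sort();
        const key = `${node.isEnd ? 1 : 0}|${childKeys.join(",")}`;
        const existing = interned.get(key);
        if (existing) return existing;            // an identical subtree already exists
        ids.set(node, nextId++);
        interned.set(key, node);
        return node;
    }

    return intern(root);
}
```

Collapsing single-child chains into string-labeled edges on top of that is what gives the radix-tree shape in the last diagram.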
Yeah, I thought about compressing the trie into a proper radix tree, but wasn't sure if that would be worth it in the end (and the implementation does get more complicated). But I can give it a try.
The main problem is just that the edges being strings means that I don't know how to walk down them without iterating over each of them...
I guess radix trees ensure that the children are keyed on equal-length prefixes, so that would work.
However, I don't think merging branches together like in your example works out for our use case here; otherwise, as we walk down the graph, we'll merge together template literals that don't actually match (because they came from another branch) and not reduce the number of inferences that much.
A proper radix tree doesn't converge like this, so it would at least still be optimal, I think, so I may attempt that if I have a chance.
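For what it's worth, the usual way to walk string-labeled edges without iterating over all of them is to key the children by the first character of their edge label and compare the rest of the label in one `startsWith` call; a hypothetical sketch (none of these names come from the PR):

```ts
// Hypothetical radix-tree lookup: children are indexed by the first character of
// their edge label, so each step inspects exactly one outgoing edge.
interface RadixNode {
    isEnd: boolean;
    children: Map<string, { label: string; node: RadixNode }>; // key = label[0]
}

function contains(root: RadixNode, word: string): boolean {
    let node = root;
    let i = 0;
    while (i < word.length) {
        const edge = node.children.get(word[i]);
        if (!edge) return false;
        // Compare the rest of the edge label against the remaining input.
        if (!word.startsWith(edge.label, i)) return false;
        i += edge.label.length;
        node = edge.node;
    }
    return node.isEnd;
}
```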
For #59655
Right now, `removeStringLiteralsMatchedByTemplateLiterals` iterates over the entire combinatorial explosion of string literals and templates in a union to ensure that any string literal already captured by a template is removed. In #59655, we have 1,733 template literals and 626,376 string literals, which means quite literally a billion string-literal-to-template-literal inferences, all of which "fail" and don't end up removing anything.
My alternative PR #59705 simply says "no, this is too much work, error" like we do in subtype reduction. However, for this particular case, there are actually algorithmic changes we could make to go faster.
The thing that's actually taking the most time in #59655 is the fast path (!!!) where we bail early if the string doesn't start/end with the first/last part of the template type. We're doing some billion `startsWith`/`endsWith` calls, where a significant number of them share common prefixes/suffixes, such that we're scanning the same strings over and over and over.

There's a data structure which can help: the trie. We can put all of our template literals into one trie, then walk it for each string literal. This lets us check every template literal in one traversal, skipping impossible paths to greatly reduce the number of template literals we have to do a full inference on.
Since a trie is just a prefix tree, this PR introduces a data structure which is actually a prefix trie of suffix tries; when walking down the prefix trie looking for potential matches, each of those nodes also contains a suffix trie which is walked down as well. This prevents us from having bad perf when it's actually the suffixes that are long (and opens this to use in other contexts, see the end of this PR description).
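Very roughly, the shape of that two-way structure (again, illustrative names only; the real data structure in this PR differs): a template with prefix P and suffix S lives at the prefix-trie node for P, inside that node's suffix trie at the node for S read right to left. Matching walks the prefix trie forward over the string and, at every reachable node, walks its suffix trie backward from the end, so long suffixes are ruled out as cheaply as long prefixes:

```ts
// Rough, illustrative shape of a "prefix trie of suffix tries".
interface SuffixTrie<T> {
    values: T[];                          // templates whose suffix ends here
    children: Map<string, SuffixTrie<T>>; // keyed by characters, read right-to-left
}

interface PrefixTrie<T> {
    suffixes: SuffixTrie<T>;              // suffix trie for templates sharing this prefix
    children: Map<string, PrefixTrie<T>>;
}

function* matches<T>(root: PrefixTrie<T>, text: string): Generator<T> {
    let node: PrefixTrie<T> | undefined = root;
    for (let i = 0; node; i++) {
        // At the node for the prefix text[0..i), walk its suffix trie from the end
        // of the string, never overlapping the prefix we've already consumed.
        let suffix: SuffixTrie<T> | undefined = node.suffixes;
        for (let j = text.length - 1; suffix; j--) {
            yield* suffix.values;
            suffix = j >= i ? suffix.children.get(text[j]) : undefined;
        }
        node = i < text.length ? node.children.get(text[i]) : undefined;
    }
}
```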
For #59655, this nets:
This speeds up one of our slowest test cases:
TODO:
This two-way trie turns out to be exactly the data structure we'd need for #59058; our glob patterns are just a prefix and a suffix with a wildcard in the middle. The only difference is that while the template literal case needs to iterate through all possible matches, our file glob matching only needs to check if there's one match.
Though, paths/exports/etc are ordered, so it may still have to iterate all then sort by priority or something.
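To make the glob connection concrete, here's a hypothetical helper showing how a single-`*` pattern decomposes into exactly that prefix/suffix pair (names and the single-`*` restriction are assumptions for illustration, not taken from #59058):

```ts
// Hypothetical: a glob with a single "*" is just a prefix plus a suffix, so the
// same two-way trie could answer "does any pattern match this path?" by stopping
// at the first hit instead of enumerating every match.
function globToPrefixSuffix(glob: string): [prefix: string, suffix: string] | undefined {
    const star = glob.indexOf("*");
    if (star === -1 || glob.indexOf("*", star + 1) !== -1) return undefined; // not a single-* glob
    return [glob.slice(0, star), glob.slice(star + 1)];
}

// e.g. "./dist/*.js" -> prefix "./dist/", suffix ".js"; a path matches when it
// starts with the prefix, ends with the suffix, and is long enough for both.
function matchesSingleStarGlob(path: string, glob: string): boolean {
    const parts = globToPrefixSuffix(glob);
    if (!parts) return false;
    const [prefix, suffix] = parts;
    return path.length >= prefix.length + suffix.length
        && path.startsWith(prefix)
        && path.endsWith(suffix);
}
```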