veanes
Contributor

@veanes veanes commented Nov 3, 2021

Main updates:

  • Changed BDD table serialization to use byte[] instead of long[], which reduces the space needed to serialize these arrays. Overall this cuts space requirements by at least half.
  • Removed the table for \w; it is now derived from the 8 Unicode categories 0, 1, 2, 3, 4, 5, 8, 18.
  • Made the generation algorithm for the ignore-case BDD tables at least 2x faster, which matters if it is ever used dynamically. Further optimizations are probably possible; this change comes directly from better use of BDD operations.
  • Limited CharSetSolver._charPredTable to ASCII only. It is almost never used for non-ASCII characters, yet it took up 128kB to cover all Unicode chars for essentially no good reason.
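
To illustrate the \w bullet, here is a minimal Python sketch of deriving a word-character predicate from the eight categories instead of storing a dedicated table. It assumes the numbers are .NET `UnicodeCategory` enum values (0 = UppercaseLetter, 1 = LowercaseLetter, 2 = TitlecaseLetter, 3 = ModifierLetter, 4 = OtherLetter, 5 = NonSpacingMark, 8 = DecimalDigitNumber, 18 = ConnectorPunctuation). The names `WORD_CATEGORIES` and `is_word_char` are hypothetical, and Python's Unicode data version may differ from the one .NET uses, so this is not the code in the PR.

```python
import unicodedata

# Assumed mapping: .NET UnicodeCategory enum value -> Unicode general-category code.
WORD_CATEGORIES = {
    0: "Lu",   # UppercaseLetter
    1: "Ll",   # LowercaseLetter
    2: "Lt",   # TitlecaseLetter
    3: "Lm",   # ModifierLetter
    4: "Lo",   # OtherLetter
    5: "Mn",   # NonSpacingMark
    8: "Nd",   # DecimalDigitNumber
    18: "Pc",  # ConnectorPunctuation
}
WORD_CODES = frozenset(WORD_CATEGORIES.values())


def is_word_char(ch: str) -> bool:
    """Return True if ch falls in one of the eight categories that make up \\w."""
    return unicodedata.category(ch) in WORD_CODES
```

Because the predicate is just the union of the per-category predicates, any structure that already stores the categories (such as the per-category BDDs) can produce \w without a separate serialized table.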

@ghost ghost added community-contribution Indicates that the PR has been added by a community member area-System.Text.RegularExpressions and removed community-contribution Indicates that the PR has been added by a community member labels Nov 3, 2021
@ghost

ghost commented Nov 3, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: veanes
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@danmoseley
Member

#58828

Member

Are there cheaper ways to build up a BDD? Maybe the caching involved helps, but it seems like otherwise this is going to incrementally build up the BDD by creating 15 intermediate ones that are then thrown away?

Contributor Author

There are better ways, I think, but they would involve using e.g. a designated array and a non-object-based representation with its own memory management over that array.
However, this incremental build only happens once per ASCII character, so I think the cost is negligible.

@veanes veanes force-pushed the updateUnicodeBDDs branch from edd1b5a to f67d79c Compare November 3, 2021 18:56