linkmauve
Contributor

char::is_ascii_lowercase() only returns true for lowercase alphabetical characters, so very common domain characters like '.' miss out on this optimisation. Instead we use char::is_ascii() && !char::is_ascii_uppercase() to reach the expected outcome.

I have also added a test so that this does not regress.

This was found with this commit in the jid crate:
https://gitlab.com/xmpp-rs/xmpp-rs/-/merge_requests/205
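To illustrate the difference described above, here is a minimal sketch comparing the two predicates on a few characters. The function names `old_check` and `new_check` are hypothetical and only exist for this illustration; the control-character check from the real fast path is left out for brevity.

```rust
/// Predicate used by the old fast path (control-character check omitted).
/// `old_check` is a hypothetical name used only in this sketch.
fn old_check(c: char) -> bool {
    c.is_ascii_lowercase()
}

/// Predicate proposed in the description above.
/// `new_check` is a hypothetical name used only in this sketch.
fn new_check(c: char) -> bool {
    c.is_ascii() && !c.is_ascii_uppercase()
}

/// Returns the characters of `s` that the new predicate accepts
/// but the old one rejects.
fn newly_accepted(s: &str) -> Vec<char> {
    s.chars()
        .filter(|&c| new_check(c) && !old_check(c))
        .collect()
}
```

For example, `newly_accepted("example-1.org")` yields `'-'`, `'1'` and `'.'`, which are exactly the characters that previously forced such a domain onto the slow path. Note that the new predicate also accepts every other non-uppercase ASCII character, including the space.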

src/lib.rs Outdated
@@ -126,7 +126,7 @@ fn is_prohibited_bidirectional_text(s: &str) -> bool {
 pub fn nameprep(s: &str) -> Result<Cow<'_, str>, Error> {
     // fast path for ascii text
     if s.chars()
-        .all(|c| c.is_ascii_lowercase() && !tables::ascii_control_character(c))
+        .all(|c| c.is_ascii() && !c.is_ascii_uppercase() && !tables::ascii_control_character(c))
Owner

It looks like we also need to reject ASCII spaces, which are prohibited.

Contributor Author

I have simplified this expression, since ASCII domains only allow letters, digits, '-' and '.'.
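A minimal sketch of what such a simplified check could look like, assuming the fast path accepts only lowercase letters (uppercase ones would still need case mapping), digits, '-' and '.'. Both function names are hypothetical, and the code actually merged may be written differently.

```rust
use std::borrow::Cow;

/// Hypothetical helper: true for the characters assumed to be allowed
/// on the ASCII fast path (lowercase letters, digits, '.' and '-').
fn is_fast_path_domain_char(c: char) -> bool {
    matches!(c, 'a'..='z' | '0'..='9' | '.' | '-')
}

/// Hypothetical wrapper: returns the input unchanged (borrowed) when every
/// character qualifies for the fast path, and `None` when the caller
/// would have to fall back to the full stringprep processing.
fn nameprep_fast_path(s: &str) -> Option<Cow<'_, str>> {
    if s.chars().all(is_fast_path_domain_char) {
        Some(Cow::Borrowed(s))
    } else {
        None
    }
}
```

With this shape, spaces and control characters are rejected without a separate check, because they are simply not in the allowed set.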

@sfackler
Owner

The nodeprep fast path seems like it could be cleaned up as well - is_ascii_lowercase and ascii_control_character should be disjoint.
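The disjointness claim can be checked exhaustively over the ASCII range. The sketch below uses the standard library's `char::is_ascii_control` as a stand-in for `tables::ascii_control_character`; that the two cover the same characters is an assumption here, not something verified against the crate. The function name `lowercase_control_overlap` is hypothetical.

```rust
/// Collects every ASCII character that is both a lowercase letter and a
/// control character. `char::is_ascii_control` stands in (by assumption)
/// for the crate's `tables::ascii_control_character`.
fn lowercase_control_overlap() -> Vec<char> {
    (0u8..=127)
        .map(char::from)
        .filter(|c| c.is_ascii_lowercase() && c.is_ascii_control())
        .collect()
}
```

If the returned vector is empty, a character that passes `is_ascii_lowercase` can never be a control character, so the extra negated check in the fast path is redundant.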

char::is_ascii_lowercase() only returns true for alphabetical characters
which are lowercase, so also add digits, '.' and '-' which are the only
characters allowed in a non-IDN domain name.

I have also added a test so that this does not regress.

This was found with this merge request in the jid crate:
https://gitlab.com/xmpp-rs/xmpp-rs/-/merge_requests/205

Not every character is included: '!', ';', '=' and '?' are missing,
since each would add one branch for almost no usage in the wild.

From a very unscientific dataset formed by my personal roster + my
bookmarks, this reduces the time to parse all JIDs once from 128 µs to
35 µs.

The two expressions are equivalent, but the new one decreases the time
spent parsing full JIDs by 1.2% to 11.6%, depending on the size of their
resource.
@linkmauve linkmauve force-pushed the fix-nameprep-ascii-check branch from 9504a5d to 914d6ff July 15, 2023 23:30
@linkmauve
Contributor Author

linkmauve commented Jul 15, 2023

I have also cleaned up and optimised both nodeprep and resourceprep. The time spent parsing my entire roster and bookmarks has now been reduced from 127.9 µs to 35.2 µs on my i7-8700K, with only two entries using characters beyond the ASCII range. This will obviously differ from dataset to dataset, but it should already be a pretty nice improvement in almost all cases.

@sfackler sfackler merged commit 1da0b55 into sfackler:master Jul 16, 2023
@sfackler
Owner

Thanks!

@linkmauve linkmauve deleted the fix-nameprep-ascii-check branch July 16, 2023 17:44