Double-encoded IDNA labels don't roundtrip #603

TimothyGu · 2021-05-17T19:00:45Z

Consider xn--xn---epa. It appears that using the current domain to Unicode algorithm (as implemented by Node.js), this would get converted to xn--é. But applying domain to ASCII on xn--é would produce a Punycode decoding failure. It sounds like domain to Unicode (or even UTS 46) should return failure on such labels.

This is somewhat important because Firefox can create such double-encoded labels:

» new URL('http://xn--é').hostname
❮ "xn--xn---epa"
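
The failed round trip can be reproduced with a minimal sketch using Python's built-in `punycode` codec. The helper names `naive_to_unicode` and `strict_to_ascii` are hypothetical and do not implement UTS 46; they only model "strip the prefix and decode" and "refuse non-ASCII input that already carries the prefix".

```python
ACE_PREFIX = "xn--"

def naive_to_unicode(label: str) -> str:
    """Strip the ACE prefix and Punycode-decode, without validating the result."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label

def strict_to_ascii(label: str) -> str:
    """Encode a label, rejecting non-ASCII input that already has the ACE prefix."""
    if label.isascii():
        return label
    if label.lower().startswith(ACE_PREFIX):
        # The part after "xn--" would have to be valid Punycode, and it is not.
        raise ValueError("Punycode decoding failure: non-ASCII after ACE prefix")
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

decoded = naive_to_unicode("xn--xn---epa")
print(decoded)  # xn--é

# A lenient encoder that ignores the existing prefix reproduces the
# double-encoded label from the report.
lenient = ACE_PREFIX + decoded.encode("punycode").decode("ascii")
print(lenient)  # xn--xn---epa

try:
    strict_to_ascii(decoded)
except ValueError as err:
    print("to-ASCII failed:", err)
```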


rmisev · 2021-05-17T20:43:22Z

I think double-encoded labels must be considered invalid. I see two ways to achieve this:

1. Set CheckHyphens to true.
2. Or modify the UTS 46 validity criteria by adding the following criterion (say, after criterion 3):
   "If CheckHyphens = false, the label must not begin with xn--".
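
A rough sketch of how the two options would treat a decoded label. The function name `passes_hyphen_rules` is hypothetical; the `check_hyphens=True` branch paraphrases the existing UTS 46 hyphen criteria, and the other branch is only the criterion proposed in this comment.

```python
def passes_hyphen_rules(label: str, check_hyphens: bool) -> bool:
    """Apply only the hyphen-related validity checks to a single label."""
    if check_hyphens:
        # Existing UTS 46 criteria: no "--" in the third and fourth positions,
        # and no leading or trailing hyphen-minus.
        if label[2:4] == "--":
            return False
        return not (label.startswith("-") or label.endswith("-"))
    # Proposed extra criterion for CheckHyphens = false.
    return not label.lower().startswith("xn--")

for label in ("xn--é", "r3---sn-apo3qvuoxuxbt-j5pe", "example"):
    print(label, passes_hyphen_rules(label, True), passes_hyphen_rules(label, False))
```

Under this sketch both options reject xn--é, but only the second one keeps accepting labels such as r3---sn-apo3qvuoxuxbt-j5pe.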

sleevi · 2021-05-17T20:59:57Z

Setting CheckHyphens to true sounds like a good solution if we're saying "hosts are DNS" (#397). RFC 5890 sets aside the Reserved LDH labels, which include not only xn-- labels but all DNS labels with -- in the third and fourth positions.

Note that this would also seem to impact #438 depending on where things land.

TimothyGu · 2021-05-17T23:01:14Z

@rmisev Indeed! As evident in UTS 46, the CheckHyphens boolean was first introduced to allow YouTube labels of the form "r3---sn-apo3qvuoxuxbt-j5pe". (Previously the boolean was effectively always "true".) But I suppose the Unicode folks didn't consider the possibility of xn--, which now needs to be forbidden explicitly.

dnsguru · 2021-05-18T06:11:37Z

Domains in gTLDs are generally restricted from having dashes in the 3rd and 4th positions, /^[A-Za-z0-9][A-Za-z0-9]--/, if this helps. IDNA takes advantage of this and was the catalyst.
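
As a small illustration, the restriction can be written as a Python check with explicit A-Za-z ranges; a range spelled A-z would also match `[`, `\`, `]`, `^`, `_` and the backtick. The name `is_reserved_ldh` is hypothetical and the check looks only at the first four characters, not at the rest of the label.

```python
import re

# Two letters or digits followed by "--" in the third and fourth positions.
RESERVED_LDH = re.compile(r"^[A-Za-z0-9]{2}--")

def is_reserved_ldh(label: str) -> bool:
    """Return True if the label has the reserved 'xx--' shape."""
    return RESERVED_LDH.match(label) is not None

for label in ("xn--xn---epa", "r3---sn-apo3qvuoxuxbt-j5pe", "example", "a_--b"):
    print(label, is_reserved_ldh(label))
```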


annevk · 2022-12-09T12:12:10Z

@macchiati not sure what we should do here, but wanted to bring this to your attention.

valenting · 2022-12-09T12:27:44Z

> This is somewhat important because Firefox can create such double-encoded labels:
> » new URL('http://xn--é').hostname
> ❮ "xn--xn---epa"

That is no longer the case. Firefox now throws for new URL('http://xn--é')

annevk · 2022-12-09T12:40:09Z

@valenting does that follow from the specification in any way though?

valenting · 2022-12-09T13:10:10Z

The change happened as a consequence of Bug 1724233 - IDNA does not conform to RFC and is interpreted as a different hostname (https://bugzilla.mozilla.org/show_bug.cgi?id=1724233).

macchiati · 2022-12-09T15:16:48Z

From the bug report, it looks like this was just not following the spec.


annevk · 2023-01-10T12:30:15Z

I think @macchiati is correct.

https://www.unicode.org/reports/tr46/#ToASCII is where we start as that is what the URL Standard invokes.
https://www.unicode.org/reports/tr46/#Processing is what gets invoked first.
Step 4 there is the interesting one. In our case the input starts with xn--.
So we enter https://www.rfc-editor.org/rfc/rfc3492.html#section-6.2. Pseudo-code, great.
The fifth step there reads as follows:

consume all code points before the last delimiter (if there is one)
and copy them to output, fail on any non-basic code point
Now https://www.rfc-editor.org/rfc/rfc3492.html#section-5 explains what "basic" means here (not the greatest of terms), which suggests that é leads to an error here.
Now we go back and read https://www.unicode.org/reports/tr46/#ToASCII again and notice:

If an error was recorded in steps 1-4, then the operation has failed and a failure value is returned. No DNS lookup should be done.

We should add a WPT for this, but I think this case is adequately covered by the specification and CheckHyphens doesn't impact it one way or another.
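
The decoding step quoted above can be sketched roughly as follows. This is not the full RFC 3492 decoder: the hypothetical `split_at_last_delimiter` only separates the basic code points from the digits and applies the two failure conditions that matter for this issue.

```python
def split_at_last_delimiter(encoded: str) -> tuple[str, str]:
    """Split a Punycode string (ACE prefix already removed) into (basic, digits)."""
    # With no "-" present, rpartition returns ("", "", encoded).
    basic, _, digits = encoded.rpartition("-")
    for ch in basic:
        if ord(ch) >= 0x80:
            raise ValueError("non-basic code point before the last delimiter")
    for ch in digits:
        # Punycode digits are a-z, A-Z and 0-9; anything else has no digit value.
        if not (ch.isascii() and ch.isalnum()):
            raise ValueError("code point has no Punycode digit value")
    return basic, digits

print(split_at_last_delimiter("xn---epa"))  # ('xn--', 'epa')

for bad in ("é", "tešla", "é-a"):
    try:
        split_at_last_delimiter(bad)
    except ValueError as err:
        print(repr(bad), "rejected:", err)
```

In this sketch a lone é contains no delimiter, so it is rejected when read as a digit rather than while copying basic code points; either way the label fails.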

annevk · 2023-01-10T13:02:23Z

There is test coverage for this already in resources/toascii.json:

  {
    "comment": "Invalid Punycode (contains non-ASCII character)",
    "input": "xn--tešla",
    "output": null
  }

Per https://wpt.fyi/results/url/toascii.window.html Chromium-based browsers have some issues there for <a> and <area>, but otherwise things seem to be in order.

Closing this therefore, but please comment if my analysis was lacking somehow.

valenting · 2023-03-09T13:06:16Z

@annevk rust-url fuzzing has found another test case for IDNA that doesn't round-trip: http://a.xn--xn-----/

input	output	Live URL Viewer
http://a.xn--xn-----/	http://a.xn----/	link
http://a.xn----/	http://a.-/	link

Should we reopen this, or open a new issue?
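
One possible reading of the table, sketched with Python's built-in `punycode` codec: each pass removes one Punycode layer, because the part after the final hyphen is empty and decoding therefore only drops that hyphen. The helper `strip_one_layer` is hypothetical and performs no UTS 46 validation; whether an implementation actually behaves this way is an assumption.

```python
def strip_one_layer(label: str) -> str:
    """Remove the ACE prefix and Punycode-decode the remainder once."""
    if not label.startswith("xn--"):
        return label
    return label[4:].encode("ascii").decode("punycode")

steps = ["xn--xn-----"]
while steps[-1].startswith("xn--"):
    steps.append(strip_one_layer(steps[-1]))

print(steps)  # ['xn--xn-----', 'xn----', '-']
```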

annevk · 2023-03-09T14:12:21Z

I'd prefer a new issue. Both of those result in failure in WebKit so I'd appreciate it if you could go through the steps as I did in #603 (comment) to find out if this is an actual bug in the specification or if we should add these as tests.

TimothyGu mentioned this issue May 17, 2021

can't parse urls starting with xn-- #438

Closed

annevk added topic: idna topic: parser labels May 18, 2021

ghost mentioned this issue Nov 14, 2021

Consider switching to an Order Sorted Algebra model? alwinb/url-specification#13

Closed

annevk closed this as completed Jan 10, 2023

TimothyGu mentioned this issue Mar 9, 2023

More IDNA roundtrippability issues #760

Closed

rmisev mentioned this issue Nov 6, 2023

IdnaTestV2.json "xn--xn--a--gua.pt" test case problem #803

Closed
