Skip to content

Double-encoded IDNA labels don't roundtrip #603

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TimothyGu opened this issue May 17, 2021 · 13 comments
Closed

Double-encoded IDNA labels don't roundtrip #603

TimothyGu opened this issue May 17, 2021 · 13 comments

Comments

@TimothyGu
Copy link
Member

Consider xn--xn---epa. It appears that using the current domain to Unicode algorithm (as implemented by Node.js), this would get converted to xn--é. But applying domain to ASCII on xn--é would produce a Punycode decoding failure. It sounds like domain to Unicode (or even UTS 46) should return failure on such labels.

This is somewhat important because Firefox can create such double-encoded labels:

» new URL('http://xn--é').hostname
 "xn--xn---epa"
@rmisev
Copy link
Member

rmisev commented May 17, 2021

I think double-encoded labels must be considered invalid. I see two ways to achieve this:

  1. Set CheckHyphens to true
  2. Or modify UTS 46 validity criteria: add following criteria (suppose after 3):
    "If CheckHyphens = false, the label must not begin with xn--"

@sleevi
Copy link

sleevi commented May 17, 2021

Setting CheckHyphens to true sounds like a good solution if we're saying "hosts are DNS" (#397 ). RFC 5890 sets aside the Reserved LDH labels, which not only includes xn--, but all DNS labels with -- in the third/fourth characters.

Note that this would also seem to impact #438 depending on where things land.

@TimothyGu
Copy link
Member Author

TimothyGu commented May 17, 2021

@rmisev Indeed! As evident in UTS 46, the CheckHyphens boolean was first introduced to allow YouTube labels of form "r3---sn-apo3qvuoxuxbt-j5pe". (Previously the boolean was effectively always "true".) But I suppose the Unicode folks didn't consider the possibility of xn--, which now needs to be forbidden explicitly.

@dnsguru
Copy link

dnsguru commented May 18, 2021 via email

@annevk
Copy link
Member

annevk commented Dec 9, 2022

@macchiati not sure what we should do here, but wanted to bring this to your attention.

@valenting
Copy link
Collaborator

This is somewhat important because Firefox can create such double-encoded labels:

» new URL('http://xn--é').hostname
 "xn--xn---epa"

That is no longer the case. Firefox now throws for new URL('http://xn--é')

@annevk
Copy link
Member

annevk commented Dec 9, 2022

@valenting does that follow from the specification in any way though?

@valenting
Copy link
Collaborator

The change happened as a consequence of Bug 1724233 - IDNA does not conform to RFC and is interpreted as a different hostname.

@macchiati
Copy link

macchiati commented Dec 9, 2022 via email

@annevk
Copy link
Member

annevk commented Jan 10, 2023

I think @macchiati is correct.

  1. https://www.unicode.org/reports/tr46/#ToASCII is where we start as that is what the URL Standard invokes.
  2. https://www.unicode.org/reports/tr46/#Processing is what gets invoked first.
  3. Step 4 there is the interesting one. In our case the input starts with xn--.
  4. So we enter https://www.rfc-editor.org/rfc/rfc3492.html#section-6.2. Pseudo-code, great.
  5. The fifth step there reads as follows:

    consume all code points before the last delimiter (if there is one)
    and copy them to output, fail on any non-basic code point

  6. Now https://www.rfc-editor.org/rfc/rfc3492.html#section-5 explains what "basic" means here (not the greatest of terms), which suggests that é leads to an error here.
  7. Now we go back and read https://www.unicode.org/reports/tr46/#ToASCII again and notice:

    If an error was recorded in steps 1-4, then the operation has failed and a failure value is returned. No DNS lookup should be done.

We should add a WPT for this, but I think this case is adequately covered by the specification and CheckHyphens doesn't impact it one way or another.

@annevk
Copy link
Member

annevk commented Jan 10, 2023

There is test coverage for this already in resources/toascii.json:

  {
    "comment": "Invalid Punycode (contains non-ASCII character)",
    "input": "xn--tešla",
    "output": null
  }

Per https://wpt.fyi/results/url/toascii.window.html Chromium-based browsers have some issues there for <a> and <area>, but otherwise things seem to be in order.

Closing this therefore, but please comment if my analysis was lacking somehow.

@annevk annevk closed this as completed Jan 10, 2023
@valenting
Copy link
Collaborator

@annevk rust-url fuzzing has found another test case for IDNA that doesn't round-trip: http://a.xn--xn-----/

input output Live URL Viewer
http://a.xn--xn-----/ http://a.xn----/ link
http://a.xn----/ http://a.-/ link

Should we reopen this, or open a new issue?

@annevk
Copy link
Member

annevk commented Mar 9, 2023

I'd prefer a new issue. Both of those result in failure in WebKit so I'd appreciate it if you could go through the steps as I did in #603 (comment) to find out if this is an actual bug in the specification or if we should add these as tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

7 participants