Add test cases for other valid label separators in IDN hostnames #760

Merged
merged 3 commits into from
Feb 19, 2025

OptimumCode

@OptimumCode OptimumCode commented Feb 3, 2025

According to RFC 3490, Section 3.1, the following characters should be recognized as dots (label separators) in an IDN hostname:

  • U+002E (full stop) - regular dot
  • U+3002 (ideographic full stop)
  • U+FF0E (fullwidth full stop)
  • U+FF61 (halfwidth ideographic full stop)

This PR adds test cases for the remaining characters.
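
A minimal sketch of what recognizing these separators could look like, assuming an implementation normalizes the three non-ASCII dots to U+002E before splitting a hostname into labels. The names `LABEL_SEPARATORS` and `split_labels` are illustrative and not taken from any implementation discussed here.

```python
# Label separators listed in RFC 3490, Section 3.1.
LABEL_SEPARATORS = "\u002e\u3002\uff0e\uff61"

# Translation table mapping every separator to U+002E (full stop).
_TO_FULL_STOP = {ord(ch): "." for ch in LABEL_SEPARATORS}


def split_labels(hostname: str) -> list[str]:
    """Split a hostname on any of the four recognized dot characters."""
    return hostname.translate(_TO_FULL_STOP).split(".")


print(split_labels("a\u3002b"))          # ['a', 'b']
print(split_labels("a\uff0eb\uff61c"))   # ['a', 'b', 'c']
print(split_labels("\u3002"))            # ['', '']
```

A lone separator yields two empty labels, which is why a validator built this way would treat it the same as a single regular dot.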

Please, let me know if you have any questions.

Continues #759

@OptimumCode OptimumCode requested a review from a team as a code owner February 3, 2025 17:30
@OptimumCode
Author

Hi, @Julian. Could you please take a look when you have some free time?

@@ -336,6 +336,36 @@
"description": "single dot",
"data": ".",
"valid": false
},
{
"description": "single ideographic full stop (RFC 3490#3.1)",
Member

We have a field for referencing RFC sections in test cases now even though we haven't used it a lot yet -- can you have a look at that and complain if it's undocumented? You'll at least find it in the test case schema

Author

Sure, will take a look at that

Author

I looked through the documentation, and there is not much information about spec links. If you know JSON Schema, you can find out how a link should be defined by reading the test-schema.json file. I think it would be much easier to understand how to use links if the README had a dedicated section for links with examples.

Also, according to the test-schema.json schema, the specification block is allowed only at the top level, so it is impossible to add a spec link for a particular test (maybe the schema should be updated to allow spec links both at the top level and for an individual test). Should I create a ticket for that?

Also, I have noticed that the link template used for RFCs (https://www.rfc-editor.org/rfc/{spec}.txt#{section}) does not support anchors. E.g. in https://www.rfc-editor.org/rfc/rfc3490.txt there is no way to put an anchor on a certain section. Probably the HTML version should be used instead of TXT (e.g. https://www.rfc-editor.org/rfc/rfc3490.html#section-3.1).
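
To illustrate the suggested change, here is a small sketch that builds a section link against the HTML rendering. The helper name `rfc_section_url` and the template constant are hypothetical, and the `#section-<number>` anchor form is assumed from the rfc3490.html example above rather than from the suite's tooling.

```python
# Assumption: the HTML rendering exposes "#section-<number>" anchors,
# as in the rfc3490.html example from this comment.
RFC_HTML_TEMPLATE = "https://www.rfc-editor.org/rfc/{spec}.html#section-{section}"


def rfc_section_url(spec: str, section: str) -> str:
    """Build a link to one section of an RFC (hypothetical helper)."""
    return RFC_HTML_TEMPLATE.format(spec=spec, section=section)


print(rfc_section_url("rfc3490", "3.1"))
# https://www.rfc-editor.org/rfc/rfc3490.html#section-3.1
```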

And also, it looks like bin/annotate-specification-links needs some adjustments to support RFC links (if I have added it correctly in the first place).

Please, let me know if I can help with something (at least I can create tickets for some points from above if you agree with them).

Member

> I think it would be much easier to understand how to use links if the README had a dedicated section for links with examples.

I definitely agree! I'm not sure if we discussed adding this when we added the capability and just forgot or something -- but yes agree for sure.

And I agree with all of the rest of what you mention as well, so all quite good feedback (and tickets + PRs definitely would be appreciated!)

Author

In this case, I think I could start with fixing the script that adds annotations with spec links (before merging this PR)

Author

I have finally realised why the specification can be defined only at the test case level. It all makes sense now. I have moved the separator-related tests into a separate test case.
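
As a rough picture of that layout, the sketch below models a dedicated test case as a Python dict and prints it as JSON. Only the `description`, `data`, and `valid` keys appear in the diff above; the shape of the `specification` entry, the `schema` value, the test case description, and the expected result for the ideographic full stop are assumptions made for illustration, not copied from test-schema.json or from the merged tests.

```python
import json

# Hypothetical test case grouping the separator tests under one spec link.
separator_test_case = {
    "description": "label separators in IDN hostnames",  # assumed wording
    "specification": [{"rfc3490": "3.1"}],  # assumed shape of a spec link
    "schema": {"format": "idn-hostname"},   # assumed schema under test
    "tests": [
        {"description": "single dot", "data": ".", "valid": False},
        {
            "description": "single ideographic full stop",
            "data": "\u3002",
            # Assumed by analogy with the "single dot" test.
            "valid": False,
        },
    ],
}

print(json.dumps(separator_test_case, indent=4, ensure_ascii=False))
```

Keeping the link on the test case means every test inside it is covered by the same citation, which matches the reason given for moving the separator tests out on their own.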

@OptimumCode OptimumCode force-pushed the rfc3490-label-separator branch from 3e24090 to e68286a Compare February 7, 2025 16:16
@OptimumCode OptimumCode force-pushed the rfc3490-label-separator branch from e68286a to 4fa572d Compare February 7, 2025 16:56
@OptimumCode OptimumCode requested a review from Julian February 7, 2025 20:51
@OptimumCode
Author

Hi, @Julian. Just wanted to check if you had a chance to take a look at the PR after recent changes. Please let me know if something needs to be changed or maybe somebody else should also take a look at the PR

@Julian
Member

Julian commented Feb 16, 2025

I haven't looked yet but I'll get to it in the next couple of days if no one gets to it first!

Member

@Julian Julian left a comment

OK this is perfect, thanks, the citation made this quite easy to double check. LGTM!

@Julian Julian merged commit c5a9703 into json-schema-org:main Feb 19, 2025
3 checks passed
@karenetheridge
Member

I'm applying these tests against my implementation, and I'm wondering -- should the test "dot as label separator" also be applied to hostname.json?

@Julian
Member

Julian commented Mar 8, 2025

Sounds likely, yeah, but it needs double-checking in that RFC

@OptimumCode
Author

I think this case is already covered by this test

@OptimumCode
Author

OptimumCode commented Mar 8, 2025

Maybe instead the existing tests for hostname (one which tests that a single dot is not allowed and one which tests that a dot is allowed as a separator) should be grouped into a separate test case with a link to the corresponding RFC?

@karenetheridge
Member

There's something different about "a.b" specifically, because my implementation passed all previous tests, but format: idn-hostname failed on this data instance... and now I see that "a.b" fails on hostname as well:

$ json-schema-eval --validate_formats
enter data instance, followed by ^D:
"a.b"
^D
enter schema, followed by ^D:
{"type":"string","allOf":[{"format":"hostname"},{"format":"idn-hostname"}]}
^D
{
  "errors": [
    {
      "error": "not a valid hostname",
      "instanceLocation": "",
      "keywordLocation": "/allOf/0/format"
    },
    {
      "error": "not a valid idn-hostname",
      "instanceLocation": "",
      "keywordLocation": "/allOf/1/format"
    },
    {
      "error": "subschemas 0, 1 are not valid",
      "instanceLocation": "",
      "keywordLocation": "/allOf"
    }
  ],
  "valid": false
}

@OptimumCode
Author

OptimumCode commented Mar 9, 2025

Hm... the only reason for the failure that comes to mind is domain label length. But I couldn't find any RFC that requires labels to be longer than one character.
However, I have found RFC 1035, which makes it clear that a label can be a single character:

For example, the following strings identify hosts in the Internet:
A.ISI.EDU XX.LCS.MIT.EDU SRI-NIC.ARPA

@OptimumCode
Author

Just out of curiosity: @karenetheridge could you please check if it still fails once each label has 2 characters? E.g. ab.cd

@karenetheridge
Member

fails: "x.y", "a.xy", "a.b.c"

passes: "x.ca", "a.ca", "a.b.ca", "ab.cd"

The hostname format calls directly into https://metacpan.org/dist/Data-Validate-Domain/source/lib/Data/Validate/Domain.pm#L27 for this check (and idn-hostname does an ASCII conversion first). I see in the code that it actually checks whether the TLD exists, so .ca is accepted but .xy is not. .cd is the TLD for the Democratic Republic of the Congo, so it is accepted.
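
The pattern in those results can be reproduced with a small sketch. This is not the logic of Data::Validate::Domain; `looks_like_domain`, its `check_tld` flag, and the two-entry `KNOWN_TLDS` set are hypothetical stand-ins that only illustrate how a TLD existence check changes the outcome.

```python
# Hypothetical subset of real TLDs; a real validator would consult a full list.
KNOWN_TLDS = {"ca", "cd"}


def looks_like_domain(hostname: str, check_tld: bool = True) -> bool:
    """Accept dotted names, optionally requiring a known final label."""
    labels = hostname.lower().split(".")
    # Require at least two labels and reject empty ones.
    if len(labels) < 2 or not all(labels):
        return False
    return not check_tld or labels[-1] in KNOWN_TLDS


for name in ["x.y", "a.xy", "a.b.c", "x.ca", "a.ca", "a.b.ca", "ab.cd"]:
    print(name, looks_like_domain(name), looks_like_domain(name, check_tld=False))
```

With the TLD check on, only the names ending in .ca or .cd are accepted; with it off, all seven names pass, which is the behaviour the test suite expects for a purely syntactic format check.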

@karenetheridge
Member

karenetheridge commented Mar 9, 2025

Aha!

return $hostname if $opt->{domain_disable_tld_validation};

Setting this option looks like it will fix all my discrepancies.
It helps if I RTFM ;)

(edit: I had to set 'domain_allow_single_label' as well, to remove the remainder of my 'todo' items for this format.)

https://metacpan.org/pod/Data::Validate::Domain#is_domain($domain,-\%options)

karenetheridge added a commit to karenetheridge/JSON-Schema-Modern that referenced this pull request Mar 9, 2025
@OptimumCode
Author

OptimumCode commented Mar 9, 2025

@karenetheridge @Julian what do you think about creating a separate test case for hostname and idn-hostname checking that labels with length 1 (and 2 based on results @karenetheridge got here) are valid? Probably, this test case can also include existing tests that check maximum length of the domain name and label.
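
A sketch of the boundary data such a test case might cover, assuming ASCII-only hostnames and the 63-octet label limit from RFC 1035. The helper `labels_within_limits` and the candidate names are made up for illustration and are not proposed test data.

```python
MAX_LABEL_OCTETS = 63  # RFC 1035 limit for a single label


def labels_within_limits(hostname: str) -> bool:
    """Check label lengths only: each label must be 1 to 63 characters.

    Assumes an ASCII hostname, so characters and octets coincide.
    """
    return all(1 <= len(label) <= MAX_LABEL_OCTETS for label in hostname.split("."))


# Candidate inputs mapped to the result a length-only check should give.
candidates = {
    "a.b": True,               # labels of length 1
    "ab.cd": True,             # labels of length 2
    "a" * 63 + ".com": True,   # longest allowed label
    "a" * 64 + ".com": False,  # one character too long
}

for data, expected in candidates.items():
    assert labels_within_limits(data) == expected
```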

Not sure for now which RFC can be used as a reference for idn-hostname

@karenetheridge
Member

sounds good to me!
