Detect confusing unicode characters and show the alternative... #29837

Merged 1 commit on Nov 17, 2015

Conversation

ghost

@ghost ghost commented Nov 14, 2015

fixes #25957

@rust-highfive
Contributor

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @nikomatsakis (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. The way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@ghost
Author

ghost commented Nov 14, 2015

@Manishearth So, here's some rough artwork for you :)

"rough", because I haven't checked it yet (just wanted to show the progress). I suppose we should also add tests for this?

@ghost
Author

ghost commented Nov 14, 2015

Ah! limited to 100 chars (I thought it shared Servo's 120 chars limit)...

@@ -0,0 +1,156 @@
const ASCII_ARRAY: &'static [(char, &'static str)] = &[('_', "Low Line"), ('-', "Hyphen-Minus"),
Member

This needs the MPL header

@Manishearth
Member

Also, yes, there should be some parse-fail tests for this.

@ghost ghost force-pushed the unicode_chars branch 3 times, most recently from d5a4945 to 25a86fa Compare November 15, 2015 07:31
('}', "Right Curly Brace"),
('*', "Asterisk"),
('/', "Slash"),
('\\', "Back Slash"),
Member

single word

@ghost ghost force-pushed the unicode_chars branch from 25a86fa to 39e6bfa Compare November 15, 2015 12:58
@ghost
Author

ghost commented Nov 15, 2015

@Manishearth I've made the changes, going for a variation of what you'd suggested (I've used StringReader in our checking method, which emits stuff based on the situation). Also, I need some clarification regarding the test. In the test, I'm checking for an error that rustc was already emitting before the change. Since I've only added the help comment alongside it, is the test really necessary? Or should we test this in a different way?


fn main() {
let y = 0;
//~^ ERROR unknown start of token: \u{37e}
Member

put the help message here too

compile-fail tests don't require all helps and notes to be listed, but if you do list a help or note, and the program fails to emit it, the test will fail.

@ghost ghost force-pushed the unicode_chars branch from 39e6bfa to 56647c3 Compare November 16, 2015 05:10
@ghost
Author

ghost commented Nov 16, 2015

@Manishearth r?

.map(|idx| {
let (_, u_name, ascii_char) = UNICODE_ARRAY[idx];
let span = make_span(reader.last_pos, reader.pos);
match ASCII_ARRAY.iter().position(|&(c, _)| c == ascii_char) {
Member

use .find, not .position.

@Manishearth
Member

LGTM, small nits involving style.

For future reference, you should rarely be indexing arrays and similar collections in Rust. Most of the time you should use iterators (.position+indexing doesn't count 😄 ); iterators are safer in that they can't cause additional panics from out-of-bounds indexing.
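To make the review point concrete, here is a minimal sketch contrasting the two styles. The table is a hypothetical three-entry excerpt (the first two names come from the diff above; the `Semicolon` entry and both function names are assumptions for illustration), not the PR's actual code.

```rust
// Hypothetical excerpt of the PR's table: (ASCII char, display name).
const ASCII_ARRAY: &[(char, &str)] = &[
    ('_', "Low Line"),
    ('-', "Hyphen-Minus"),
    (';', "Semicolon"), // assumed entry, for illustration only
];

// Style the review discourages: find an index, then index back into the slice.
fn name_by_position(ascii_char: char) -> Option<&'static str> {
    match ASCII_ARRAY.iter().position(|&(c, _)| c == ascii_char) {
        Some(idx) => Some(ASCII_ARRAY[idx].1),
        None => None,
    }
}

// Style the review suggests: `find` yields the matching element directly,
// so no separate index is ever used to access the slice.
fn name_by_find(ascii_char: char) -> Option<&'static str> {
    ASCII_ARRAY
        .iter()
        .find(|&&(c, _)| c == ascii_char)
        .map(|&(_, name)| name)
}
```

Both functions return the same result; the second simply never constructs an index, so there is nothing that could go out of bounds if the code is later changed.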

@nikomatsakis
Contributor

This patch looks pretty decent. I second @Manishearth's suggestions. Also, note that there is a tidy error because some of the lines in the parse-fail test are more than 100 characters. You can add a comment like // ignore-tidy-linelength on that file.

@ghost ghost force-pushed the unicode_chars branch from 56647c3 to c2c416c Compare November 17, 2015 06:30
@ghost
Author

ghost commented Nov 17, 2015

@nikomatsakis @Manishearth Agreed, thanks! (and done). r?

@@ -174,6 +174,9 @@ impl SpanHandler {
self.handler.emit(Some((&self.cm, sp)), msg, Bug);
panic!(ExplicitBug);
}
pub fn span_bug_no_panic(&self, sp: Span, msg: &str) {
self.handler.emit(Some((&self.cm, sp)), msg, Bug);
}
Member

I forgot to mention, can we add self.handler.bump_err_count(); here too?
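A rough sketch of what this request amounts to, using a toy handler rather than rustc's real `Handler` and `SpanHandler` types. Everything here apart from the method names `span_bug_no_panic` and `bump_err_count` is an assumption, and the span argument is omitted. The likely motivation is that a bug reported without panicking should still be counted as an error.

```rust
use std::cell::Cell;

// Toy stand-in, NOT the compiler's real handler: it only counts errors.
struct Handler {
    err_count: Cell<usize>,
}

impl Handler {
    fn new() -> Self {
        Handler { err_count: Cell::new(0) }
    }

    // Print the diagnostic; the real handler also takes a span and a level.
    fn emit(&self, msg: &str) {
        eprintln!("error: internal compiler error: {}", msg);
    }

    fn bump_err_count(&self) {
        self.err_count.set(self.err_count.get() + 1);
    }

    // Report a bug without panicking, but still record it as an error.
    fn span_bug_no_panic(&self, msg: &str) {
        self.emit(msg);
        self.bump_err_count();
    }

    fn has_errors(&self) -> bool {
        self.err_count.get() > 0
    }
}
```

Without the `bump_err_count` call, the message would be printed but `has_errors` would still report `false` in this sketch.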

@ghost ghost force-pushed the unicode_chars branch from c2c416c to 7f63c7c Compare November 17, 2015 07:05
@Manishearth
Member

@bors r+

thanks!

@bors
Collaborator

bors commented Nov 17, 2015

📌 Commit 7f63c7c has been approved by Manishearth

@ghost
Author

ghost commented Nov 17, 2015

@Manishearth Thank you! :)

@Havvy
Contributor

Havvy commented Nov 17, 2015

❤️ 💓 ❤️

bors added a commit that referenced this pull request Nov 17, 2015
@bors
Collaborator

bors commented Nov 17, 2015

⌛ Testing commit 7f63c7c with merge 1b26148...

@bors bors merged commit 7f63c7c into rust-lang:master Nov 17, 2015
@Manishearth
Member

😀 Congrats on your first PR!

@ghost ghost deleted the unicode_chars branch November 17, 2015 09:52
@brson brson added the relnotes Marks issues that should be documented in the release notes of the next release. label Nov 17, 2015
@brson
Contributor

brson commented Nov 17, 2015

Nice polish.

@huonw
Member

huonw commented Nov 17, 2015

Is there a reason this doesn't include U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK as possible substitutions for "? (If not, I can submit a patch to add them.)

@Manishearth
Member

No reason. It contains the single quotes. I did Ctrl-F for those, but didn't bother to check the double quotes.

Go ahead!

@ghost
Author

ghost commented Nov 18, 2015

It's again worth mentioning that this still doesn't have all the substitutions - only the printable ones from http://www.unicode.org/Public/security/revision-06/confusables.txt. So, feel free to add more :)
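For readers who have not seen the patch, the substitution table works roughly as sketched below. The two tables are tiny hypothetical excerpts (the Greek question mark comes from the linked issue, the quotation marks from the question above), `confusable_help` is an invented name, and the wording of the help text is an assumption rather than the compiler's actual message.

```rust
// Hypothetical excerpt: (confusable char, Unicode name, ASCII look-alike).
const UNICODE_ARRAY: &[(char, &str, char)] = &[
    ('\u{37e}', "Greek Question Mark", ';'),
    ('\u{201c}', "Left Double Quotation Mark", '"'),
    ('\u{201d}', "Right Double Quotation Mark", '"'),
];

// Hypothetical excerpt: (ASCII char, display name).
const ASCII_ARRAY: &[(char, &str)] = &[(';', "Semicolon"), ('"', "Quotation Mark")];

// Returns a help string if `ch` is a known confusable, otherwise `None`.
// The message wording is an assumption, not rustc's real text.
fn confusable_help(ch: char) -> Option<String> {
    let &(_, u_name, ascii_char) = UNICODE_ARRAY.iter().find(|&&(c, _, _)| c == ch)?;
    let &(_, ascii_name) = ASCII_ARRAY.iter().find(|&&(c, _)| c == ascii_char)?;
    Some(format!(
        "unicode character '{}' ({}) looks like '{}' ({}), but it is not",
        ch, u_name, ascii_char, ascii_name
    ))
}
```

Adding a substitution in this sketch is a one-line change to `UNICODE_ARRAY`, plus an `ASCII_ARRAY` entry if the ASCII target is not listed yet.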

@huonw
Member

huonw commented Nov 18, 2015

Oh, I think I see why QUOTATION MARK was missed: things (including QUOTATION MARK itself) are considered confusable with the two-character sequence APOSTROPHE, APOSTROPHE rather than with ".

@Manishearth
Member

The universe is starting to hit and appreciate this feature https://twitter.com/joeranweiler/status/678691374292590593 :D

bors added a commit that referenced this pull request May 3, 2016
Add more aliases for Unicode confusable chars

Building upon #29837, this PR:

* added aliases for space characters,
* distinguished square brackets from parens, and
* added common CJK punctuation characters as aliases.

This will especially help CJK users who may have forgotten to switch off IME when coding.
bors added a commit that referenced this pull request May 5, 2016
Add more aliases for Unicode confusable chars

Building upon #29837, this PR:

* added aliases for space characters,
* distinguished square brackets from parens, and
* added common CJK punctuation characters as aliases.

This will especially help CJK users who may have forgotten to switch off IME when coding.
nnethercote added a commit to nnethercote/rust that referenced this pull request Dec 14, 2023
It's unclear why this is used here. All entries in the third column of
`UNICODE_ARRAY` are covered by `ASCII_ARRAY`, so if the lookup fails
it's a genuine compiler bug. It was added way back in rust-lang#29837, for no
clear reason.

This commit changes it to `span_bug`, which is more typical.
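
The invariant this commit message relies on (every ASCII character in the third column of `UNICODE_ARRAY` also appears in `ASCII_ARRAY`) can be checked mechanically. A minimal sketch, assuming small hypothetical excerpt tables and an invented helper name, not the compiler's real data:

```rust
// Hypothetical excerpts; the real tables are far larger.
const UNICODE_ARRAY: &[(char, &str, char)] = &[
    ('\u{37e}', "Greek Question Mark", ';'),
    ('\u{201c}', "Left Double Quotation Mark", '"'),
];

const ASCII_ARRAY: &[(char, &str)] = &[(';', "Semicolon"), ('"', "Quotation Mark")];

// Collect every ASCII target in UNICODE_ARRAY that has no ASCII_ARRAY entry.
// An empty result means a failed lookup can only be a genuine bug.
fn uncovered_ascii_chars() -> Vec<char> {
    UNICODE_ARRAY
        .iter()
        .map(|&(_, _, ascii_char)| ascii_char)
        .filter(|ascii_char| ASCII_ARRAY.iter().all(|&(c, _)| c != *ascii_char))
        .collect()
}
```

If a check like this always returns an empty list, treating a failed lookup as a hard bug is a reasonable choice.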
Development

Successfully merging this pull request may close these issues.

Better Error Message When Parsing Greek Question Mark (and similar confusing characters)
7 participants