Replace invalid characters with U+FFFD (fixes #96) #162
Conversation
Critic review: https://critic.hoppipolla.co.uk/r/1752 This is an external review system which you may optionally use for the code review of your pull request. In order to help critic track your changes, please do not make in-place history rewrites (e.g. via |
Oops, I must have run the tests wrong—didn't see all those failures. I guess they imply that we want a ParseError with the specific character still intact. |
It's deliberate. Our behaviour is what the spec defines. Go complain at Hixie if you want this changed! |
I won't, but thanks :) |
@gsnedders, would a patch containing a subclass of |
@marciof if you're trying to sort out the |
@gsnedders, I think I understand what you mean. If I understood you correctly before, the original patch was rejected because removing these characters violates the spec. If we change Correct? |
|
Thanks, that clears it up! |
This fix simply follows the encoding-specific replacement with a general one using invalid_unicode_re. It corresponds to section 12.2.2.5. I can't quite tell what the spec says to do when one of these characters is encountered, but the rest of the spec replaces other characters with U+FFFD, so I did the same (despite Simon's preference for the empty string).
I can submit a test for this (AFAICT I have to do that separately).
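For readers following along, the general replacement described above can be sketched roughly as follows. This is only an illustration under stated assumptions: INVALID_CHARS_RE and replace_invalid are hypothetical names, and the character ranges are a simplified stand-in, not the library's actual invalid_unicode_re or the code in this pull request.

```python
import re

# Hypothetical, simplified pattern (an assumption for illustration only):
# C0/C1 control characters other than tab, LF, FF and CR, plus the
# noncharacters U+FDD0..U+FDEF, U+FFFE and U+FFFF.
INVALID_CHARS_RE = re.compile(
    "[\u0001-\u0008\u000B\u000E-\u001F\u007F-\u009F\uFDD0-\uFDEF\uFFFE\uFFFF]"
)


def replace_invalid(text):
    """Replace every character matched by the pattern with U+FFFD."""
    return INVALID_CHARS_RE.sub("\uFFFD", text)


if __name__ == "__main__":
    sample = "a\u0001b\uFDD0c"
    # Show the code points so the replacement is visible in a terminal.
    print([hex(ord(ch)) for ch in replace_invalid(sample)])
```

Substituting U+FFFD rather than the empty string keeps the length of the input unchanged, so character offsets still line up with the original text. Whether that is the behaviour the spec calls for is the question discussed in the comments above.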