Skip to content

Replace invalid characters with U+FFFD (fixes #96) #162

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

lastorset
Copy link

This fix simply repeats the encoding-specific replacement with a general one using invalid_unicode_re. It corresponds to section 12.2.2.5. I can't quite tell what the spec says to do if one of these characters is encountered, but the rest of the spec replaces other characters with U+FFFD, so I did that (despite Simon's preference of the empty string).

I can submit a test for this (AFAICT I have to do that separately).

@hoppipolla-critic-bot
Copy link

Critic review: https://critic.hoppipolla.co.uk/r/1752

This is an external review system which you may optionally use for the code review of your pull request.

In order to help critic track your changes, please do not make in-place history rewrites (e.g. via git rebase -i or git commit --amend) when updating this pull request.

@lastorset
Copy link
Author

Oops, I must have run the tests wrong—didn't see all those failures. I guess they imply that we want a ParseError with the specific character still intact.

@gsnedders
Copy link
Member

It's deliberate. Our behaviour is what the spec defines. Go complain at Hixie if you want this changed!

@gsnedders gsnedders closed this Jun 8, 2014
@lastorset
Copy link
Author

I won't, but thanks :)

@marciof
Copy link

marciof commented Jul 24, 2014

@gsnedders, would a patch containing a subclass of html5lib.tokenizer.HTMLTokenizer that does the mentioned replacement be a better approach? That way it would be optional.

@gsnedders
Copy link
Member

@marciof if you're trying to sort out the lxml stuff, you just want to fix ihatexml and ensure everything for the tree-builder goes through it; if you want it for other reason, say what it is?

@lastorset
Copy link
Author

@gsnedders, I think I understand what you mean. etree_lxml uses ihatexml when building a tree for lxml, and ihatexml.InfosetFilter.coerceCharacters to clean up inserted text (among other things). So if we add code in the latter method to transform control characters, that will fix the problem.

If I understood you correctly before, the original patch was rejected because removing these characters violates the spec. If we change InfosetFilter, it is acceptable because using lxml and etree is optional—dom is still available to follow the spec.

Correct?

@gsnedders
Copy link
Member

InfosetFilter by definition creates trees different to what the spec requires; it should roughly do what the spec says for infoset coercion. It should do all the coercion through finding invalid characters using ihatexml, and the fact that it doesn't is the bug.

@lastorset
Copy link
Author

Thanks, that clears it up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants