Replace invalid characters with U+FFFD (fixes #96) #162

lastorset · 2014-06-06T13:41:49Z

This fix simply repeats the encoding-specific replacement as a general one, using invalid_unicode_re. It corresponds to section 12.2.2.5 of the spec. I can't quite tell what the spec says to do when one of these characters is encountered, but elsewhere it replaces invalid characters with U+FFFD, so I did the same (despite Simon's preference for the empty string).

I can submit a test for this (AFAICT I have to do that separately).
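
For readers following the thread, the kind of general replacement described in this comment can be sketched roughly as below. This is only an illustration: the pattern `CONTROL_CHARS_RE` and the helper `replace_invalid` are hypothetical names covering a small set of control characters, not html5lib's actual `invalid_unicode_re` or the code in this patch.

```python
import re

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Hypothetical, deliberately small pattern: C0 controls other than
# tab, LF, FF and CR, plus DEL and the C1 controls.
CONTROL_CHARS_RE = re.compile(r"[\x01-\x08\x0b\x0e-\x1f\x7f-\x9f]")


def replace_invalid(text):
    """Return (cleaned_text, replacement_count) for the given string."""
    return CONTROL_CHARS_RE.subn(REPLACEMENT, text)


cleaned, count = replace_invalid("a\x01b\x7fc")
print(repr(cleaned), count)
```

Returning the count alongside the cleaned text lets a caller still report one parse error per offending character if that is wanted.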

hoppipolla-critic-bot · 2014-06-06T13:41:52Z

Critic review: https://critic.hoppipolla.co.uk/r/1752

This is an external review system which you may optionally use for the code review of your pull request.

In order to help critic track your changes, please do not make in-place history rewrites (e.g. via git rebase -i or git commit --amend) when updating this pull request.

lastorset · 2014-06-07T20:02:49Z

Oops, I must have run the tests wrong; I didn't see all those failures. I guess they imply that we want a ParseError with the specific character still intact.

gsnedders · 2014-06-08T00:00:57Z

It's deliberate. Our behaviour is what the spec defines. Go complain at Hixie if you want this changed!

lastorset · 2014-06-11T15:31:06Z

I won't, but thanks :)

marciof · 2014-07-24T11:28:10Z

@gsnedders, would a patch containing a subclass of html5lib.tokenizer.HTMLTokenizer that does the mentioned replacement be a better approach? That way it would be optional.

gsnedders · 2014-07-24T15:09:25Z

@marciof if you're trying to sort out the lxml stuff, you just want to fix ihatexml and ensure everything for the tree-builder goes through it; if you want it for some other reason, what is it?

lastorset · 2014-07-25T08:55:00Z

@gsnedders, I think I understand what you mean. etree_lxml uses ihatexml when building a tree for lxml, and it uses ihatexml.InfosetFilter.coerceCharacters to clean up inserted text (among other things). So if we add code to that method to transform control characters, that will fix the problem.
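
A minimal sketch of that idea, assuming the goal is to keep only characters matched by the XML 1.0 `Char` production before text reaches lxml. The function `coerce_text_for_xml` is a hypothetical stand-in, not the real `InfosetFilter.coerceCharacters`, and the choice of U+FFFD as the default replacement is an assumption (the thread also mentions the empty string as an option).

```python
import re

# Complement of the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
NON_XML_CHAR_RE = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def coerce_text_for_xml(data, replacement="\ufffd"):
    """Replace every character that XML 1.0 does not allow in text.

    Hypothetical helper for illustration; pass replacement="" to drop
    the offending characters instead of substituting U+FFFD.
    """
    return NON_XML_CHAR_RE.sub(replacement, data)


print(repr(coerce_text_for_xml("a\x00b\x0cc")))
```

Note that form feed (U+000C) is replaced here even though HTML treats it as whitespace, because XML 1.0 only permits tab, LF and CR below U+0020.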

If I understood you correctly before, the original patch was rejected because removing these characters violates the spec. If we change InfosetFilter instead, that is acceptable because using lxml and etree is optional; dom is still available for following the spec.

Correct?

gsnedders · 2014-07-25T16:11:30Z

InfosetFilter by definition creates trees different to what the spec requires; it should roughly do what the spec says for infoset coercion. It should do all the coercion through finding invalid characters using ihatexml, and the fact that it doesn't is the bug.
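
As a rough illustration of what spec-style infoset coercion looks like for names: the HTML spec allows a tool to replace each unsupported character in a local name with the letter U followed by the six uppercase hex digits of its code point. The sketch below assumes, purely for brevity, that only ASCII letters, digits, `_`, `.` and `-` are supported; `coerce_local_name` is a hypothetical helper and not html5lib's implementation.

```python
import re

# Simplifying assumption: treat only these ASCII characters as valid.
# The real XML Name productions allow many non-ASCII characters too.
NAME_START_RE = re.compile(r"[A-Za-z_]")
NAME_CHAR_RE = re.compile(r"[A-Za-z0-9_.\-]")


def escape_char(ch):
    """Return 'U' plus the six-digit uppercase hex code point of ch."""
    return "U%06X" % ord(ch)


def coerce_local_name(name):
    """Escape every character that is not allowed at its position."""
    out = []
    for index, ch in enumerate(name):
        allowed = NAME_START_RE if index == 0 else NAME_CHAR_RE
        out.append(ch if allowed.match(ch) else escape_char(ch))
    return "".join(out)


print(coerce_local_name("foo<bar"))
```

The same find-and-escape approach, driven by one shared definition of which characters are invalid, is what keeps coercion of names and of text consistent.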

lastorset · 2014-07-25T16:48:59Z

Thanks, that clears it up!

Replace invalid characters with U+FFFD (fixes html5lib#96)

603440e

lastorset mentioned this pull request Jun 6, 2014

lxml doesn’t like control characters #96

Open

gsnedders closed this Jun 8, 2014
