After looking more closely at #31312 and trying some things out, I wanted to open a new issue with a more specific and succinct proposal:
Features
At a high level, the desired features are:
1. Line and column tracking (actually offset tracking, see below)
2. A way to get at the original element and attribute name text before lower-casing.
Motivation
The motivation for 1 is cases where a caller needs to know the context of a token or node so that it can be reported to the user: indicating the position of an error, where in the source HTML a particular element comes from, and so on.
The motivation for 2 is situations where the caller simply needs to know what was originally there before lower-casing. It could be useful in error reporting, for example, to show the original element name. (My specific case is that I'm doing code generation based on HTML, and allowing the user to use mixed case in a tag name means they can specify the exact name of a Go struct in CamelCase - the documents in question are mostly standard HTML, but these elements are of course treated differently.)
Observations
- Tracking the byte offset of each token and node is simple and unambiguous.
- Line and column information can be derived from this offset, but doing so raises other questions: the definition of a line ending, how tabs impact the reported column position, multi-byte characters, etc.
- These concerns thus seem better handled outside the parser - if the parser reports the byte offset into the original input, that's enough (see the sketch after this list).
- Providing an option to keep the original case of an element in the `Data` field seems, in retrospect, dangerous. I can't prove that this won't cause things to break (the parser does many weird and wonderful things internally to handle special cases for different tags - for which I have a new-found respect after diving deeper into the source :) ). A new field seems much more sensible.
- Both tracking offset information and preserving the original text require fields to be added to the appropriate structs (see below). The performance impact of them seems negligible. I thus opted not to implement them as ParserOptions, as it would provide little benefit to disable these features - the fields would still be there, just unpopulated.
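To illustrate the point that position can be derived by the caller, here is a minimal sketch of going from a byte offset to "line:col" outside the parser. The helper name and the column convention (1-based, counted in bytes since the last `\n`, no tab expansion) are just illustrative assumptions:

```go
package main

import (
	"bytes"
	"fmt"
)

// offsetToLineCol is a hypothetical helper: given the original input and a
// byte offset reported by the parser, it returns a 1-based line number and a
// 1-based column measured in bytes since the last '\n'. How tabs and
// multi-byte runes map to a column is deliberately left to the caller.
func offsetToLineCol(src []byte, offset int) (line, col int) {
	if offset > len(src) {
		offset = len(src)
	}
	before := src[:offset]
	line = bytes.Count(before, []byte("\n")) + 1
	lastNL := bytes.LastIndexByte(before, '\n') // -1 when offset is on the first line
	col = offset - lastNL
	return line, col
}

func main() {
	src := []byte("<html>\n<body>\n<p>hi</p>\n</body>\n</html>\n")
	line, col := offsetToLineCol(src, 14) // 14 is the offset of "<p>" in this input
	fmt.Printf("%d:%d\n", line, col)      // prints 3:1
}
```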
Changes
Here's a summary of how this could work (how it works in the prototype):
- Add `OrigData string; Offset int` to `Token`. `OrigData` is the same as `Data` but before lower-casing. (If they are the same they point to the exact same bytes.)
- Add `OrigData string; Offset int` to `Node`, with the same behavior (the fields are just copied from `Token` to `Node`).
- Add `OrigKey` to `Attribute`: the original `Key` before lower-casing. (In retrospect `Attribute` probably could have an `Offset` too; I might add that.)
- And basically just the bookkeeping to fill out those fields.
- Separately, a `LineCounter` struct is added which wraps an `io.Reader`, records the position of every line ending, and provides a call to go from an offset to a line number and line start offset. Callers can use this to reconstruct "line:col" given a byte offset from `Node.Offset`, etc. (see the sketch after this list). Note that the question of how to deal with tabs or multi-byte characters becomes the caller's responsibility; this is intentional. (I didn't bother to implement alternate line endings - it assumes `\n` for now - but this could be done easily.) I'm not sure whether it is appropriate to have a `LineCounter` inside the html package; it could just as well live somewhere else.
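To make the LineCounter part concrete, here is a rough, self-contained sketch of the kind of thing described above. The struct field names in the list come straight from the proposal; the LineCounter method names and signatures below (`NewLineCounter`, `Read`, `Pos`) are illustrative only and may not match the prototype exactly:

```go
package main

import (
	"fmt"
	"io"
	"sort"
	"strings"
)

// LineCounter wraps an io.Reader, records the byte offset at which every line
// starts as data flows through it, and can later map a byte offset back to a
// 1-based line number and that line's start offset. It assumes '\n' line
// endings, as the prototype currently does.
type LineCounter struct {
	r          io.Reader
	offset     int
	lineStarts []int // lineStarts[i] is the offset where line i+1 begins
}

func NewLineCounter(r io.Reader) *LineCounter {
	return &LineCounter{r: r, lineStarts: []int{0}}
}

// Read passes data through while noting where each new line begins.
func (lc *LineCounter) Read(p []byte) (int, error) {
	n, err := lc.r.Read(p)
	for i := 0; i < n; i++ {
		if p[i] == '\n' {
			lc.lineStarts = append(lc.lineStarts, lc.offset+i+1)
		}
	}
	lc.offset += n
	return n, err
}

// Pos maps a byte offset to a 1-based line number and that line's start
// offset. Deriving a column (tabs, multi-byte runes, etc.) is left to the caller.
func (lc *LineCounter) Pos(offset int) (line, lineStart int) {
	i := sort.Search(len(lc.lineStarts), func(i int) bool {
		return lc.lineStarts[i] > offset
	}) - 1
	if i < 0 {
		i = 0
	}
	return i + 1, lc.lineStarts[i]
}

func main() {
	src := "<html>\n<body>\n<p>hi</p>\n</body>\n</html>\n"
	lc := NewLineCounter(strings.NewReader(src))

	// With the proposed changes the parser would consume lc (e.g. html.Parse(lc))
	// and report byte offsets via Token.Offset / Node.Offset. Here we just drain
	// the reader so the line positions get recorded.
	if _, err := io.Copy(io.Discard, lc); err != nil {
		panic(err)
	}

	line, lineStart := lc.Pos(14)                            // 14 is the offset of "<p>" in src
	fmt.Printf("line %d, column %d\n", line, 14-lineStart+1) // line 3, column 1
}
```

A caller would do the same with `Node.Offset` when reporting an error, and could show `Node.OrigData` or `Attribute.OrigKey` alongside the position to preserve the user's original casing.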
Working Prototype
The working code with these modifications is here: https://github.com/vugu/html .
The main commit that touches the internal stuff is: vugu/html@da33d26
Some basic tests are in there; I'm sure more could be done. (I also haven't looked at the code that writes out HTML to see how the originally-cased data could be used there, or whether that should be an option.) I also understand if some of this needs to be split up - this proposal is a few different things lumped together.
I believe these features could be generally useful, so I wanted to see whether this is something that could potentially end up being merged and, if so, discuss what's required to get to that point.