
Preserve attributes on HTML paragraphs #10850

Open: wants to merge 4 commits into base: main

@Valgard commented May 17, 2025

This PR implements the preservation of attributes on HTML paragraphs, addressing issue #10768.

- The HTML reader now wraps attributed `<p>` tags in a `Div` with `wrapper="1"`.
- The HTML writer unwraps these `Div`s back to attributed `<p>` tags.

This approach is similar to the Djot reader/writer as discussed in #10768, ensuring that semantic information in HTML attributes on paragraphs is preserved during conversion.
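
As a rough illustration of the round trip described above (not the code in this PR), the wrapping and unwrapping can be sketched on a cut-down AST. The types below are simplified stand-ins for pandoc's `Attr`, `Inline`, and `Block`, and the names `readPara`, `unwrapPara`, and `wrapperKV` are hypothetical, chosen only for this example. The sketch also assumes that a `<p>` without attributes stays a plain `Para`.

```haskell
-- Simplified stand-ins for pandoc's AST types. Assumption: the real types
-- live in Text.Pandoc.Definition and use Text rather than String.
type Attr = (String, [String], [(String, String)])

data Inline = Str String | Space
  deriving (Eq, Show)

data Block
  = Para [Inline]
  | Div Attr [Block]
  deriving (Eq, Show)

-- The marker named in the PR description: wrapper="1".
wrapperKV :: (String, String)
wrapperKV = ("wrapper", "1")

nullAttr :: Attr
nullAttr = ("", [], [])

-- Reader side (hypothetical helper): a <p> with no attributes stays a plain
-- Para; an attributed <p> becomes a Div that carries the attributes plus the
-- wrapper marker.
readPara :: Attr -> [Inline] -> Block
readPara attr inlines
  | attr == nullAttr = Para inlines
  | otherwise        = Div (ident, classes, wrapperKV : kvs) [Para inlines]
  where
    (ident, classes, kvs) = attr

-- Writer side (hypothetical helper): recognise a wrapper Div around a single
-- Para and recover the attributes for the <p> tag, dropping the marker.
-- Any other block yields Nothing and would be rendered as usual.
unwrapPara :: Block -> Maybe (Attr, [Inline])
unwrapPara (Div (ident, classes, kvs) [Para inlines])
  | wrapperKV `elem` kvs =
      Just ((ident, classes, filter (/= wrapperKV) kvs), inlines)
unwrapPara _ = Nothing
```

In this sketch, `unwrapPara (readPara attr inlines)` gives back `Just (attr, inlines)` for any non-empty `attr`, which is the round-trip property the tests in this PR are meant to cover. An ordinary `Div` without the marker is left alone.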

Valgard added 4 commits May 17, 2025 19:50
- HTML reader wraps attributed `p` tags in `Div` with `wrapper="1"`.
- HTML writer unwraps `Div` with `wrapper="1"` back to attributed `p` tag.
- Add tests for HTML paragraph attribute roundtrip.
- Update EPUB golden files to reflect new AST for attributed paragraphs.
Split pPara into pParaWithWrapper and pParaSimple helpers.
Ensure pParaWithWrapper correctly discards invalid align attributes.
Add specific tests for align attribute in HTML reader and writer.
- Update MANUAL.txt to reflect `native_divs` wrapping of
  attributed `<p>` tags.
- Add test cases for HTML to native, native to HTML, HTML to HTML,
  and HTML to HTML5 conversions
- Verify preservation of id, class, and data attributes on p tags

@jgm (Owner) left a comment

Thanks for this. As noted, I didn't understand the special treatment of "align".

Another question I have is how common it is for paragraphs to have classes or other attributes in HTML in the wild. If it is very common, then I suppose this change will lead to more cluttered HTML -> markdown conversions, and we'd need to weigh that.

pParaWithWrapper :: PandocMonad m => Attr -> TagParser m Blocks
pParaWithWrapper (ident, classes, kvs) = do
  guardEnabled Ext_native_divs -- Ensure native_divs is enabled for this behavior
  pInhalt <- trimInlines <$> pInTags "p" inline
Owner

I usually follow the naming convention of beginning parser names with p, so it would be better to use something like inhalt for this name instead.

Comment on lines +636 to +640
let otherKVs = filter (\(k,_) -> k /= "align") kvs
let validAlignKV = case alignValue of
      Just algn | algn `elem` ["left","right","center","justify"] -> [("align", algn)]
      _ -> []
let finalKVs = wrapperAttr : (validAlignKV ++ otherKVs)
Owner

What is the motivation for treating the "align" attribute specially in this way?

Comment on lines +652 to +654
return (case alignValue of
          Just algn | algn `elem` ["left","right","center","justify"] ->
            B.divWith ("", [], [("align", algn)]) paraBlock
Owner

I don't understand the motivation for this.
