Invalid XML should not break parsing when IsSuppressingErrors = true #14

Closed
cmxl opened this issue Apr 7, 2021 · 3 comments
Labels
bug Something isn't working

cmxl commented Apr 7, 2021

Bug Report

Prerequisites

  • [✓] Can you reproduce the problem in a MWE?
  • [✓] Are you running the latest version of AngleSharp?
  • [✓] Did you check the FAQs to see if that helps you?
  • [✓] Are you reporting to the correct repository? (there are multiple AngleSharp libraries, e.g., AngleSharp.Css for CSS support)
  • [✓] Did you perform a search in the issues?

For more information, see the CONTRIBUTING guide.

Description

With IsSuppressingErrors = true set in XmlParserOptions, an exception is still thrown when parsing invalid XML.

The stack trace:

AngleSharp.Xml.Parser.XmlParseException: Error while parsing the provided XML document.
   at AngleSharp.Xml.Parser.XmlTokenizer.TagSelfClosing(XmlTagToken tag)
   at AngleSharp.Xml.Parser.XmlDomBuilder.ParseAsync(XmlParserOptions options, CancellationToken cancelToken)
   at AngleSharp.Xml.Parser.XmlParser.ParseAsync(XmlDocument document, CancellationToken cancel)

Steps to Reproduce

Given the following XML:

<P>
    <P>
        <FONT FACE="calibri" SIZE="14.666666666666666" COLOR="#000000"></FONT>
    </P>
    <P>
        <FONT FACE="calibri" SIZE="14.666666666666666" COLOR="#000000"></FONT>
    </P>
    <P>
        <FONT FACE="calibri" SIZE="14.666666666666666" COLOR="#0000ff">
            <U>
                <https://some.url.example.com></U>
            </FONT>
            <FONT FACE="calibri" SIZE="14.666666666666666" COLOR="#000000">
                <B></B>
            </FONT>
            <FONT FACE="calibri" SIZE="14.666666666666666" COLOR="#000000"></FONT>
        </P>
        <P>
            <FONT FACE="calibri" SIZE="14.666666666666666" COLOR="#000000"></FONT>
        </P>
        <P>
            <FONT FACE="calibri" SIZE="14.666666666666666" COLOR="#000000"></FONT>
        </P>
    </P>

The problem is the missing closing tag of the first <P>.
Parsing the XML as follows throws the exception shown in the description above:

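// Placeholder: substitute the XML document shown above.
var xml = "xml from above";
var config = Configuration.Default.WithXml();
var context = BrowsingContext.New(config);

// IsSuppressingErrors = true is expected to let the parser tolerate invalid input.
var parser = new XmlParser(new XmlParserOptions { IsSuppressingErrors = true }, context);
// This call throws the XmlParseException from the stack trace above.
var document = await parser.ParseDocumentAsync(xml, cancellationToken);
var html = document.ToHtml();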

I know this sounds quite stupid, but I actually need to parse invalid XML data and convert it to HTML afterwards.
Is there some way to parse and/or fix invalid XML with AngleSharp.Xml?

@cmxl added the "bug" label (Something isn't working) on Apr 7, 2021

cmxl commented Apr 7, 2021

On second thought, maybe it is not the <P> tag but the weird <https://some.url.example.com>. I'll check this and report back here.

@cmxl changed the title from "Missing closing Tag should not break parsing when IsSuppressingErrors = true" to "Invalid XML should not break parsing when IsSuppressingErrors = true" on Apr 7, 2021

cmxl commented Apr 7, 2021

OK, it's actually just the <https://some.url.example.com>!
But how would I parse this?

FlorianRappl (Contributor) commented

Hm, the problem here is that this touches a part of the spec where there is no proper fallback.

Where is the XML file usually consumed? It would not pass through anything that takes proper XML. Maybe it's consumed by some (X)HTML parser?
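
Since the parser has no fallback for a tag like <https://some.url.example.com>, one possible workaround is to escape such angle-bracketed URLs before the text reaches XmlParser. The following is only a sketch under assumptions: XmlPreprocessor and EscapeBareUrls are hypothetical names, not part of AngleSharp.Xml, and the pattern covers only http and https URLs that contain no whitespace or angle brackets.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (an assumption for illustration, not an AngleSharp.Xml API).
public static class XmlPreprocessor
{
    // Matches "<http://...>" or "<https://...>" with no whitespace
    // or further angle brackets between the delimiters.
    private static readonly Regex BareUrl = new Regex(
        @"<(https?://[^\s<>]+)>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Rewrites each angle-bracketed URL as escaped text,
    // so "<https://x>" becomes "&lt;https://x&gt;".
    public static string EscapeBareUrls(string xml)
    {
        if (xml == null) throw new ArgumentNullException(nameof(xml));
        return BareUrl.Replace(xml, "&lt;$1&gt;");
    }
}
```

The cleaned string could then be passed to parser.ParseDocumentAsync in place of the raw input, for example var cleaned = XmlPreprocessor.EscapeBareUrls(xml);. Note that this blind text replacement would also rewrite matches inside comments or CDATA sections, and it does not address any other well-formedness problems in the input.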
