Missing DTD in parsed document model #24

Rouneq opened this issue Oct 16, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@Rouneq

Rouneq commented Oct 16, 2023

Bug Report

Description

I cannot see anything in the document model that seems to match the values defined in the DTD, nor am I seeing the DTD when performing a round-trip on the XML. I was initially investigating self-closing tags and found Issue #11. From there, I took the example code to test with and confirmed it met my need. But I noticed my DTD wasn't getting written out. As far as I can tell, the DTD isn't brought into the parsed document.

Steps to Reproduce

(Tested in LINQPad)

  var xmlData = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<Project Sdk=""Microsoft.NET.Sdk"">
    <ItemGroup>
        <PackageReference Include=""AngleSharp"" Version=""0.12.1""></PackageReference>
        <PackageReference Include=""AngleSharp.Xml"" Version=""0.12.1"" />
        <PackageReference Include=""AngleSharp.XPath"" Version=""1.1.4"" />
    </ItemGroup>   
</Project>";
  var xmlDoc = new XmlParser().ParseDocument(xmlData);

  using (var sw = new StringWriter())
  {
    xmlDoc.ToHtml(sw, xmlFormatter);

    Console.WriteLine(sw.ToString());
  }

Expected behavior:

Output to look similar to

<?xml version="1.0" encoding="UTF-8"?>
<Project Sdk="Microsoft.NET.Sdk">
    <ItemGroup>
        <PackageReference Include="AngleSharp" Version="0.12.1" />
        <PackageReference Include="AngleSharp.Xml" Version="0.12.1" />
        <PackageReference Include="AngleSharp.XPath" Version="1.1.4" />
    </ItemGroup>  
</Project>

Actual behavior:

Output is

<Project Sdk="Microsoft.NET.Sdk">
    <ItemGroup>
        <PackageReference Include="AngleSharp" Version="0.12.1" />
        <PackageReference Include="AngleSharp.Xml" Version="0.12.1" />
        <PackageReference Include="AngleSharp.XPath" Version="1.1.4" />
    </ItemGroup>  
</Project>

Environment details:

Windows 10
LINQPad 7
AngleSharp 1.0.5 via NuGet
AngleSharp.Xml 1.0.0 via NuGet

Possible Solution

Am I missing some options/techniques to force the correct parse?

@Rouneq Rouneq added the bug Something isn't working label Oct 16, 2023
@FlorianRappl
Contributor

What is xmlFormatter? Have you tried the ToXml method?

@Rouneq
Author

Rouneq commented Oct 17, 2023

My apologies. I forgot to include the formatter declaration.

private static readonly IMarkupFormatter xmlFormatter = new XmlMarkupFormatter()
                                                            {
                                                              IsAlwaysSelfClosing = true,
                                                            };

As for whether ToXml helps, no. It's actually a bit worse because I cannot specify the formatter on the method call. Here's the output from the call:

<Project Sdk="Microsoft.NET.Sdk">
    <ItemGroup>
        <PackageReference Include="AngleSharp" Version="0.12.1"></PackageReference>
        <PackageReference Include="AngleSharp.Xml" Version="0.12.1" />
        <PackageReference Include="AngleSharp.XPath" Version="1.1.4" />
    </ItemGroup>  
</Project>

All of this is irrelevant. It isn't the output that's the problem. That's just demonstrating the issue. As I said, it appears the DTD isn't even brought into the document.

image

@FlorianRappl
Contributor

The XML notation is not a doctype - it's a preamble. Doctypes are serialized.

A DTD (what you wrote in the title) would look like

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note SYSTEM "Note.dtd">
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note> 

So the doctype being null is expected. Regarding the preamble, I don't know. I think we could extend the formatter so that it also serializes / outputs this, but in general that's only interesting for XML documents (not fragments) and only if wanted. I'm open to PRs / improvements on that topic.

@Rouneq
Author

Rouneq commented Oct 17, 2023

Thanks for correcting my misunderstanding. I shouldn't refer to it as the DTD. I've gone and familiarized myself with the correct nomenclature.

With regard to the preamble (or declaration, as I'm seeing it called elsewhere), while it is optional, it is important to XML documents. The example XML used may be small, but it is still a full document.

@FlorianRappl
Contributor

FlorianRappl commented Oct 17, 2023

Yes, I fully agree. Yet the job of the serializer is to serialize nodes - i.e., fragments. As mentioned I think we can go around and make an option to output the declaration for IXmlDocument instances. The default can be true, but for me it's important that one can also turn this feature off.

There are multiple reasons for not having the declaration be part of the serialization (and therefore putting it in the hands of the application that uses the serializer). As an example, we don't know how / where / in what encoding the document will be stored. Therefore, going forward and specifying, e.g., UTF-8 is generally a mistake.

Maybe the best way forward is to introduce an overload for IXmlDocument using ToXml with declaration attributes to be specified (by default version (1.0) and encoding (utf-8) are given, but others / different values can be given). This would be fully backwards compatible while still allowing the production of fully qualified documents.
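
Until such an overload exists, the application can write the declaration itself before handing the writer to the serializer. The following is only a sketch of that idea, not part of AngleSharp: the class `XmlDeclarationWriter` and its methods `Build` and `Serialize` are hypothetical names introduced for illustration, and the document body is supplied through a callback so the helper does not depend on any particular serializer.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not an AngleSharp API): the application decides which
// declaration to emit, then lets any serializer write the document body.
public static class XmlDeclarationWriter
{
    // Builds a declaration such as <?xml version="1.0" encoding="UTF-8"?>.
    // Pseudo-attributes are emitted in the order version, encoding, standalone.
    public static string Build(string version = "1.0", string encoding = "UTF-8", bool? standalone = null)
    {
        var sb = new StringBuilder();
        sb.Append("<?xml version=\"").Append(version).Append('"');

        // Encoding is optional; skip it when the caller passes an empty value.
        if (!string.IsNullOrEmpty(encoding))
            sb.Append(" encoding=\"").Append(encoding).Append('"');

        // Standalone is optional; only emit it when explicitly requested.
        if (standalone.HasValue)
            sb.Append(" standalone=\"").Append(standalone.Value ? "yes" : "no").Append('"');

        sb.Append("?>");
        return sb.ToString();
    }

    // Writes the declaration, a newline, and then whatever the callback serializes.
    public static string Serialize(Action<TextWriter> writeBody, string version = "1.0",
                                   string encoding = "UTF-8", bool? standalone = null)
    {
        using var sw = new StringWriter();
        sw.Write(Build(version, encoding, standalone));
        sw.Write('\n');
        writeBody(sw);
        return sw.ToString();
    }
}
```

With the setup from the report, the callback would be something like `w => xmlDoc.ToHtml(w, xmlFormatter)`. The encoding written here is only a label: it is still up to the application to store the resulting text in that encoding.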

@Rouneq
Copy link
Author

Rouneq commented Oct 17, 2023

Perhaps an overloaded version of ToXml could take a formatter (similar to how ToHtml accepts one) where it contains options specific for a declaration (bool OmitDeclaration, string Version, string Encoding, bool? IsStandalone).

2.8 Prolog and Document Type Declaration
2.9 Standalone Document Declaration

@FlorianRappl
Contributor

Well, ToXml already specifies the formatter (it's the XML formatter). If you want to customize that, then use the general methods:

markup.ToHtml(XmlMarkupFormatter.Instance);

I think having both, OmitDeclaration and the attributes is not necessary. Either you want to have the prolog printed then you'd need to provide the options; or not. The given case runs into consistency issues when you specify, e.g., false, "1.0", "UTF-8" - what should be done now?

Note that internally, i.e., when parsing a document, we already gather that information. We check that a provided version satisfies the 1.x constraint and use the standalone value to determine parsing options. The encoding then switches the character interpretation (same as a meta tag in HTML does). Nevertheless, the ingested information does not have to match the output. Therefore you'd always have to specify these values, I guess.

@Rouneq
Author

Rouneq commented Oct 17, 2023

It's your library, so that's up to you. As a user, it was not intuitively obvious that ToHtml would also generate XML output. I see the default ToHtml call behaves similarly, using a default instance of HtmlMarkupFormatter, while the overloads accept specific formatters. Having parity on ToXml makes it, as Rico Mariani puts it, a pit of success.

In regards to having different "conflicting" options, that was just a follow-on to your initial statement

"I think we can go around and make an option to output the declaration for IXmlDocument instances. The default can be true"

With this, you suggested overloads for ToXml to convey this information. I just suggested moving them into a formatter.

All of that aside, the initial drive for this is, if a declaration appears in the original xml document, it should, by default, carry on to the generated output. Override options seem within the domain for a formatter.
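
The thread does not mention a public API that exposes the parsed declaration, so an application that wants the "preserve what came in" default has to capture it from the raw text itself. The sketch below does that with a regular expression. The class `XmlDeclarationReader` and its method `TryRead` are hypothetical names, not AngleSharp APIs, and the pattern only covers the common `version` / `encoding` / `standalone` form at the very start of the input.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not an AngleSharp API): reads the XML declaration
// from the raw source text so the application can re-emit it later.
public static class XmlDeclarationReader
{
    // Matches a declaration at the very start of the text; a leading BOM is tolerated.
    // Each pseudo-attribute value must be closed by the same quote that opened it.
    private static readonly Regex Declaration = new Regex(
        @"\A\uFEFF?<\?xml\s+version\s*=\s*(?<q1>[""'])(?<version>1\.[0-9]+)\k<q1>" +
        @"(?:\s+encoding\s*=\s*(?<q2>[""'])(?<encoding>[A-Za-z][A-Za-z0-9._\-]*)\k<q2>)?" +
        @"(?:\s+standalone\s*=\s*(?<q3>[""'])(?<standalone>yes|no)\k<q3>)?\s*\?>",
        RegexOptions.CultureInvariant);

    // Returns false when the text has no declaration; optional parts that are
    // absent are reported as null.
    public static bool TryRead(string xml, out string version, out string encoding, out string standalone)
    {
        version = encoding = standalone = null;

        var match = Declaration.Match(xml);
        if (!match.Success)
            return false;

        version = match.Groups["version"].Value;
        encoding = match.Groups["encoding"].Success ? match.Groups["encoding"].Value : null;
        standalone = match.Groups["standalone"].Success ? match.Groups["standalone"].Value : null;
        return true;
    }
}
```

The captured values can then be written back in front of the serialized output, or replaced when the application stores the document in a different encoding.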

@FlorianRappl
Contributor

"All of that aside, the initial drive for this is, if a declaration appears in the original xml document, it should, by default, carry on to the generated output. Override options seem within the domain for a formatter."

Yeah, this is where my note is important. The prolog is only important for the consumption, but again - how it should be consumed depends on the creator of the document, i.e., your application. See my remarks regarding standalone or encoding. Even the version we don't know. You could make your own formatter that uses some formats from some future / unknown version of XML. Consequently, the original prolog really has no meaning. We consumed that document and now a completely new document (e.g., using some exotic encoding) might have been created.

I agree that the naming of ToHtml can be confusing, but thinking of AngleSharp as an HTML engine (and AngleSharp.Xml just as an extension to that) you see where this is coming from. ToString or Serialize etc. would be other / maybe more adequate options. In any case ToHtml is the general one and ToXml is a special one / alias just for Xml.

I'll be thinking of an appropriate API for this - but just to be clear: The current behavior will stay the default (to be backwards compatible) and any new API should fit in nicely (i.e., be backwards compatible, consistent with the existing API, and obvious in usage from the name).

Any suggestions appreciated.

@Rouneq
Author

Rouneq commented Oct 17, 2023

Last response to this issue.

"Yeah, this is where my note is important. The prolog is only important for the consumption, but again - how it should be consumed depends on the creator of the document, i.e., your application. See my remarks regarding standalone or encoding. Even the version we don't know. You could make your own formatter that uses some formats from some future / unknown version of XML. Consequently, the original prolog really has no meaning. We consumed that document and now a completely new document (e.g., using some exotic encoding) might have been created."

Sure. Which is the point of any override mechanism. But as I said, by default, preserve what comes in when making output.

"I agree that the naming of ToHtml can be confusing, but thinking of AngleSharp as an HTML engine (and AngleSharp.Xml just as an extension to that) you see where this is coming from. ToString or Serialize etc. would be other / maybe more adequate options. In any case ToHtml is the general one and ToXml is a special one / alias just for Xml."

Unfortunately, as there is zero documentation for AngleSharp.Xml, I had no other basis to go on regarding how to use these libraries other than the exposed API surface. It wasn't until I saw issue #11 that I realized ToHtml could be used to output XML in the first place. And the Basics document doesn't describe the intention of the library.

Please close this issue.
