Skip to content

Allow data URIs as image sources #164

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

lovasoa
Copy link
Contributor

@lovasoa lovasoa commented Feb 20, 2025

Data URIs allow embedding image data directly within HTML documents using the src attribute of <img> tags. This practice is generally safe and does not lead to arbitrary code execution.

I understand this is a scary pull request to merge when one is not familiar with the topic, so here is a detailed explanation:

  1. Image Rendering Context: When a Data URI is used in an <img> tag, browsers interpret and render the content strictly as an image. They do not execute any embedded scripts or code within this context.
  1. SVG Restrictions: SVGs can contain scripts, all browsers impose security restrictions. SVG images embedded via <img> tags have scripting disabled, preventing any embedded JavaScript from executing.

Using Data URIs within <img> tags is a secure method for embedding images directly into HTML documents. Browsers handle these URIs in a way that prevents code execution.

Other markdown rendering libraries, including the one used by github (comrak), do not consider image data URLs as unsafe.

Example:

use comrak::*;

fn main() {

let mut options = Options::default();
let input = "
![](http://example.com)
![](data:image/png,aaa)
![](javascript:dangerous)
";

println!("{}", markdown_to_html(input, &options));
}

Results in

<p><img src="http://example.com" alt="" />
<img src="data:image/png,aaa" alt="" />
<img src="" alt="" /></p>

fixes #163

@lovasoa lovasoa mentioned this pull request Feb 20, 2025
@wooorm
Copy link
Owner

wooorm commented Feb 20, 2025

a) the only sources here are about svg, not about other data: — there must be other sources out there
b) per #163, this is not something GH does; I think defaulting to things that GH does is good
c) there is little reason to disallow ![](javascript:dangerous) but accept ![](data:text/javascript,dangerous), your current argumentation is not about data: but about all srcs on images
d) you can have your own safety mechanism: turn this option off and implement your own sanitizer

@lovasoa
Copy link
Contributor Author

lovasoa commented Feb 21, 2025

Hello!

SVG is indeed the only image format supported in img tags that supports scripting, so I thought it was important to clarify that svg scripts are not executed. Non-image formats are just ignored, and other supported image formats cannot include scripts.

About allowing all protocols on images: I don't see any legitimate use case for javascript: or other protocols in image sources, but this is indeed not a security issue. We could add it, but I'm not sure that would be useful.

Advising users to completely disable the built-in protocol validation to allow the perfectly safe data URLs is not a great security posture.

About GitHub: as I showed, the markdown parsing and rendering library they use does allow data URLs. They indeed have a filter removing data URLs further in their pipeline, but that's still something that could be implemented with this library too. This is a product decision, not a security decision that should be enforced at the markdown renderer level.

The point is, there is absolutely no security reason to disable data URLs, or to put them behind the same flag that for instance disables JavaScript: URLs in links.

@wooorm
Copy link
Owner

wooorm commented Feb 21, 2025

SVG is […] the only image format supported in img tags […]. Non-image formats are just ignored, and other supported image formats cannot include scripts.

I believe that the HTML spec does not specify a particular list of image formats.
I was under the impression that PDFs for example were also shown.
I believe the spec also does not specify whether scripting should be allowed in them.
Or, data:.
The question is thus: could there be a browser that supports some image format that allows scripting (in the future, in the past)?

Your argumentation also begs the question: Is currently maintained browsers the target of this project? Only popular ones? Are browsers the only target or are there other user agents?

I can at least confirm that Opera from 2012 had scripting in src (see also this).

but this is indeed not a security issue. We could add it, but I'm not sure that would be useful.

I understand your argumentation to stretch against all allow-lists for src; you believe src to be always safe.
I do not understand why you would then push for a half solution. Would it not be simpler to push for allowing all values on src?

Advising users to completely disable the built-in protocol validation to allow the perfectly safe data URLs is not a great security posture.

I am not advising that. I am advising an allow list of http and https. I think that that is enough. For more complex needs, users can use their own sanitizing algorithm. Especially, when plugins are there.

Similar argumentation to data: on src of img can be used for ftp:. Or authors writing <i> and </i>. Those are also known to be safe. But, things get complex quickly.

as I showed

Comrak is as far as I am aware still not used by GH. GH uses cmark-gfm:

Though, I do believe they run this in dangerous mode, and then filter later with a tool similar but different to the HTML pipeline.

See also kivikakk/comrak#169. It looks like comrak is also doing things differently than what you propose.

@lovasoa
Copy link
Contributor Author

lovasoa commented Feb 21, 2025

I believe that the HTML spec does not specify a particular list of image formats.
I was under the impression that PDFs for example were also shown.

You are right, the HTML spec does not specify a particular list of image formats, and it allows PDFs !
But it does explicitly state that User agents must not run executable code (e.g. scripts) embedded in the image resource :

User agents must not support non-image resources with the img element (e.g. XML files whose document element is an HTML element). User agents must not run executable code (e.g. scripts) embedded in the image resource. User agents must only display the first page of a multipage resource (e.g. a PDF file). User agents must not allow the resource to act in an interactive fashion, but should honour any animation in the resource. This specification does not specify which image types are to be supported.


The question is thus: could there be a browser that supports some image format that allows scripting (in the future, in the past)?
Your argumentation also begs the question: Is currently maintained browsers the target of this project? Only popular ones? Are browsers the only target or are there other user agents?

This is indeed a good question, and is related to my argument. But let's try to keep focused on the issue at hand.
All image formats are already allowed by this library, as long as they are transmitted through HTTP or HTTPS. Allowing data URLs does not increase the attack surface in this regard. I probably should not have mentioned image formats and SVG to begin with, since this is not directly related to this pull request, which is only about allowing data URLs. The set of content types that can be embedded in a data URL is the same as the set of content types that can be returned by a normal HTTP request.

I understand your argumentation to stretch against all allow-lists for src; you believe src to be always safe.
I do not understand why you would then push for a half solution. Would it not be simpler to push for allowing all values on src?

I would be perfectly fine with that ! I can remove the protocol restrictions on image sources completely if that's what you think is best. Let me know if you want me to update the PR in that direction.

Comrak is as far as I am aware still not used by GH.

I think github uses https://github.com/gjtorikian/commonmarker , which is a wrapper around comrak.
Source: https://github.com/github/markup?tab=readme-ov-file#github-markup

I am not advising that. I am advising an allow list of http and https. I think that that is enough. For more complex needs, users can use their own sanitizing algorithm. Especially, when plugins are there.

I think this is a weaker security posture than allowing all safe protocols and forbidding unsafe ones by default. This encourages people who just need safe data URIs to switch on allow_dangerous_protocol, which will also enable javascript: URLs in links, which is a real problem.


Anyways, let's try to be productive here. I understand that you are not willing to allow data: URLs in images. If I cannot convince you that it's actually perfectly safe, then how can we proceed ? Would you be open to a pull request that makes the list of allowed protocols configurable, by storing SAFE_PROTOCOL_SRC in the configuration object instead of a hardcoded constant ? That would allow security conscious users to avoid having to set allow_dangerous_protocol to true, while still allowing all safe protocols.

@wooorm
Copy link
Owner

wooorm commented Feb 25, 2025

It does explicitly state that User agents must not run executable code (e.g. scripts) embedded in the image resource

Very nice.
Do you know when they added this?
As I have shown with Opera and DOMPurify, there have been used agents who did not follow this.
I can understand you only caring about modern browsers, but that is not necessarily what is best for everyone using this library + micromark by default.

I think github uses https://github.com/gjtorikian/commonmarker , which is a wrapper around comrak.
Source: https://github.com/github/markup?tab=readme-ov-file#github-markup

😅 When GH adds things, they do it in cmark-gfm. Then it is alive here on GH. Such as footnotes a couple years ago and more recently github/cmark-gfm@d0866dd. You can observe the bugs between what is going on here in comments/files and what is in the code there.

I think this is a weaker security posture than allowing all safe protocols and forbidding unsafe ones by default.

The latter is I believe what this project does. What is safe and unsafe, at what time, seems to be the debate.


There are indeed several discussions going on here.
One is about image formats potentially being insecure, scripts hidden in them.
One is about the data: protocol being safe.
One is about the value of src on img always being safe.
Most of your argumentation is about the third: you say everything is safe in images.
The solution you propose, is for the second, either configuring SAFE_PROTOCOL_SRC or allowing data: for everyone.

I think solutions need to match real problems. Be good to use and be hard to abuse.

Would other users allow particular protocols on images? Does it make sense for people to allow tel:, ftp:, mailto:? Which protocols would make sense? Is it easy for users to abuse such features and hurt themselves?

If users want that, would they not also want to change the protocols of href on a?

If users want all this, is that not what plugins are for? Should we not advance plugins?

Or, can we figure out that indeed in modern user agents src on images are all safe now, when that changed.
Add a feature for that: allow_image_src or so.
With docs referencing the HTML spec and links + dates to PRs (whatwg/html, Chromium) for when that changed

@lovasoa
Copy link
Contributor Author

lovasoa commented Feb 25, 2025

Hi ! I think we now both understand each other's arguments here. Let's not dive deeper into the topic of whether embedded images can execute scripts, since it is not related to allowing data: URLs or not. How do you want to proceed ?
I can:

  • keep this PR, add a note about how some rare old browsers were vulnerable to script injection in image sources (both in data: and http: URLs)
  • open a new PR making the list of allowed protocols configurable
  • split allow_dangerous_protocol into allow_dangerous_protocol_in_links and allow_dangerous_protocol_in_images

Just let me know what you think is best; I'll write the code.

@ChristianMurphy
Copy link
Collaborator

One is about image formats potentially being insecure, scripts hidden in them.
One is about the data: protocol being safe.
One is about the value of src on img always being safe.

Another consideration is what standard implementations do.
We don't necessarily want to diverge from CM or GFM, this could also be a discussion that could be taken upstream.

Add a feature for that: allow_image_src or so.

I'd be more open to that approach

@lovasoa
Copy link
Contributor Author

lovasoa commented Feb 26, 2025

I opened a separate pull request adding a crate feature: #165

@wooorm wooorm closed this in e923a3c Feb 28, 2025
@wooorm
Copy link
Owner

wooorm commented Feb 28, 2025

This feature was released in 1.0.0-alpha.23!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data URIs
3 participants