&para gets transformed to ¶ #995

Closed
kaelig opened this issue Sep 19, 2024 · 9 comments

Comments

@kaelig

kaelig commented Sep 19, 2024

Background & Context

I found a bug concerning how HTML entities like &para (without the trailing ;) are being handled during sanitization, especially regarding how the clean output reflects the intended display of entities like the paragraph symbol (¶).

Bug

Input

The input HTML thrown at DOMPurify:

<span>&para</span>

Given output

The output given by DOMPurify:

<span>¶</span>

Expected output

The expected output:

<span>&amp;para</span>

DOMPurify appears to be converting the &para entity into its equivalent Unicode symbol (¶) in the cleaned HTML. However, the expectation is for the original &para text to remain intact, escaped as &amp;para, without being converted to the symbol.
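
For comparison, the expected output above is what escaping &para as literal text would produce. A minimal sketch, using Python's standard html module as a stand-in (an assumption for illustration; it is not part of DOMPurify and does not call it):

```python
import html

# Text that should be shown literally, not interpreted as a character reference.
literal_text = "&para"

# Escaping turns the ampersand into &amp;, so a parser no longer sees "&para".
escaped = html.escape(literal_text)

# Decoding the escaped form gives back the original characters.
round_tripped = html.unescape(escaped)

print(escaped)
print(round_tripped)
```

This only applies when the input is meant as plain text; markup that is meant to be parsed as HTML cannot be escaped wholesale like this.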

@kaelig
Author

kaelig commented Sep 19, 2024

Funnily enough, the transformation is correct inside the href attribute, but not in the link text.

Input

<a href="https://example.com/?foo=bar&para=baz">https://example.com/?foo=bar&para=baz</a>

Given output

<a href="https://example.com/?foo=bar&amp;para=baz">https://example.com/?foo=bar¶=baz</a>

Expected output

<a href="https://example.com/?foo=bar&amp;para=baz">https://example.com/?foo=bar&amp;para=baz</a>
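
The difference between the attribute and the link text matches a special case in the HTML parsing algorithm: inside an attribute value, a named reference without a trailing semicolon is left alone when the next character is = or an ASCII alphanumeric. The sketch below is a simplified, hypothetical re-implementation of that rule in Python (the helper decode_named_refs is made up for illustration; real parsers also handle numeric references and other cases):

```python
import re
from html.entities import html5  # HTML named character reference table

# '&', a run of ASCII alphanumerics, and an optional ';'
_CANDIDATE = re.compile(r"&([A-Za-z0-9]+)(;?)")


def decode_named_refs(text: str, in_attribute: bool) -> str:
    """Hypothetical helper: decode named references in text or attribute context."""

    def replace(match: re.Match) -> str:
        run, semicolon = match.group(1), match.group(2)
        # Properly terminated reference such as "&para;".
        if semicolon and run + ";" in html5:
            return html5[run + ";"]
        # Otherwise look for the longest legacy name (listed without ';').
        for end in range(len(run), 0, -1):
            name = run[:end]
            if name not in html5:
                continue
            rest = run[end:] + semicolon
            next_char = rest[:1] or text[match.end():match.end() + 1]
            blocked = next_char == "=" or (next_char.isascii() and next_char.isalnum())
            if in_attribute and blocked:
                return match.group(0)  # attribute special case: keep as written
            return html5[name] + rest
        return match.group(0)  # not a known name

    return _CANDIDATE.sub(replace, text)


url = "https://example.com/?foo=bar&para=baz"
as_attribute = decode_named_refs(url, in_attribute=True)
as_text = decode_named_refs(url, in_attribute=False)
print(as_attribute)
print(as_text)
```

In the attribute case the ampersand survives parsing, and the serializer then writes it out as &amp;, which is consistent with the href in the given output. In the text case the reference is decoded to ¶ before DOMPurify ever sees it.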

@cure53
Owner

cure53 commented Sep 20, 2024

That is done by the browser or DOM engine DOMPurify uses, not by DOMPurify itself. Sadly, we cannot fix this, as this is fully expected behavior and related to how the browser deals with named HTML entities.
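
The decoding step can be reproduced outside a browser. A rough sketch, assuming Python's html.unescape as a stand-in for the parser's character-reference handling (it follows the HTML rules for named references, but it does not build a DOM and is not what DOMPurify runs on):

```python
import html

# The same markup, once without and once with the trailing semicolon.
without_semicolon = html.unescape("<span>&para</span>")
with_semicolon = html.unescape("<span>&para;</span>")

print(without_semicolon)
print(with_semicolon)

# Both spellings decode to the same character, so nothing downstream
# can tell which one appeared in the source.
print(without_semicolon == with_semicolon)
```

Once the reference has been decoded, the parsed tree holds only the character ¶, so anything that serializes that tree afterwards has no record of the original &para spelling.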

@cure53 cure53 closed this as completed Sep 20, 2024
@kaelig
Author

kaelig commented Sep 20, 2024

The fact that an entity (without the trailing ;) gets transformed is a strange bug.

Can you please point me to where I should open a bug report?

@cure53
Owner

cure53 commented Sep 21, 2024

I think this is not a bug but specified behavior, see HTML spec.

@kaelig
Author

kaelig commented Sep 22, 2024

The HTML spec does not specify to render &para (without a trailing semicolon) into ¶, so that's got to be a bug.

@cure53
Owner

cure53 commented Sep 22, 2024

https://html.spec.whatwg.org/multipage/named-characters.html

para; 	U+000B6 	¶
para 	U+000B6 	¶

Here it does :)
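
The same two rows can be checked programmatically. A small sketch, assuming Python's html.entities.html5 mapping, which mirrors the named character references table; rarr is used here only as an example of a name that is listed solely with a semicolon:

```python
from html.entities import html5

# "para" appears in the table both with and without the trailing semicolon.
para_with = html5.get("para;")
para_without = html5.get("para")

# "rarr" is listed only in its semicolon-terminated form.
rarr_with = html5.get("rarr;")
rarr_without = html5.get("rarr")

for label, value in [
    ("para;", para_with),
    ("para", para_without),
    ("rarr;", rarr_with),
    ("rarr", rarr_without),
]:
    print(f"{label:<6} -> {value!r}")
```

A name that shows up without the semicolon in this table is one of the legacy forms parsers still accept; names missing that form are only recognized when terminated by ;.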

@kaelig
Author

kaelig commented Sep 24, 2024

The table you reference could be somewhat misleading. The spec is very clear that the sequence must be terminated by a semicolon character.

https://html.spec.whatwg.org/multipage/syntax.html#syntax-charref

The ampersand must be followed by one of the names given in the named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).

Therefore I maintain that this is a bug.

@cure53
Owner

cure53 commented Sep 24, 2024

Yes, but for legacy reasons, some entities work without the semicolon, as per the spec, and para is one 🙂 No 🐞

@kaelig
Author

kaelig commented Sep 24, 2024

You're right in pointing out that browsers are required to support them in their rendering engines for legacy reasons. That said, they're non-conforming (all named character references are required to end with a semicolon, and uses of named character references without a semicolon are flagged as errors), and sadly this carries over into DOMPurify's transformed output.

Although this feels like an undesirable side-effect, I now understand better why you said it's intended as per the spec!
