&para gets transformed to ¶ #995

kaelig · 2024-09-19T20:08:25Z

Background & Context

I found a bug in how HTML entities such as &para (without the trailing ;) are handled during sanitization: the cleaned output contains the decoded paragraph symbol (¶) rather than the text that was passed in.

Bug

Input

The input HTML thrown at DOMPurify:

<span>&para</span>

Given output

The output given by DOMPurify:

<span>¶</span>

Expected output

The expected output:

<span>&amp;para</span>

DOMPurify appears to be converting the &para entity into its equivalent Unicode symbol (¶) in the cleaned HTML. However, the expectation is for the original text &para to remain intact without being converted to the symbol.


kaelig · 2024-09-19T20:12:26Z

Funnily enough, the handling inside the href attribute is correct; only the link text is affected.

Input

<a href="https://example.com/?foo=bar&para=baz">https://example.com/?foo=bar&para=baz</a>

Given output

<a href="https://example.com/?foo=bar&amp;para=baz">https://example.com/?foo=bar¶=baz</a>

Expected output

<a href="https://example.com/?foo=bar&amp;para=baz">https://example.com/?foo=bar&amp;para=baz</a>

cure53 · 2024-09-20T07:36:39Z

That is done by the browser or DOM engine DOMPurify uses, not by DOMPurify itself. Sadly, we cannot fix this, as it is fully expected behavior and related to how the browser deals with named HTML entities.

kaelig · 2024-09-20T18:01:34Z

The fact that an entity (without the trailing ;) gets transformed is a strange bug.

Can you please point me to where I should open a bug report?

cure53 · 2024-09-21T10:26:00Z

I think this is not a bug but specified behavior, see HTML spec.

kaelig · 2024-09-22T20:05:57Z

The HTML spec does not say that &para (without a trailing semicolon) should be rendered as ¶, so that's got to be a bug.

cure53 · 2024-09-22T20:12:20Z

https://html.spec.whatwg.org/multipage/named-characters.html

para; 	U+000B6 	¶
para 	U+000B6 	¶

Here it does :)

kaelig · 2024-09-24T18:49:43Z

The table you reference could be somewhat misleading. The spec is very clear that the sequence must be terminated by a semicolon character.

https://html.spec.whatwg.org/multipage/syntax.html#syntax-charref

The ampersand must be followed by one of the names given in the named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).

Therefore I maintain that this is a bug.

cure53 · 2024-09-24T19:52:25Z

Yes, but for legacy reasons, some entities work without the semicolon as per the spec, and para is one 🙂 No 🐞
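
For readers wondering why the href in the earlier example survived while the link text did not: the spec's tokenizer applies one extra check to semicolon-less references inside attribute values. If the next character is `=` or an ASCII alphanumeric, the reference is left undecoded. Below is a minimal sketch of that rule. The three-entry table and the `decodeLegacyRefs` helper are assumptions made for illustration; this is neither DOMPurify code nor a full implementation of the parser.

```javascript
// Illustrative subset of the names the HTML spec lists both with and
// without a trailing semicolon (assumption: only three entries shown).
const LEGACY_NAMES = new Map([
  ["para", "\u00B6"],
  ["amp", "&"],
  ["copy", "\u00A9"],
]);

// Hypothetical helper: decode the references above in `input`.
// `inAttribute` selects the attribute-value rule described in the lead-in.
function decodeLegacyRefs(input, inAttribute) {
  return input.replace(/&([a-zA-Z0-9]+)(;?)/g, (match, run, semi, offset) => {
    // Find the longest known name that is a prefix of the alphanumeric run.
    let name = "";
    for (let len = run.length; len > 0; len--) {
      if (LEGACY_NAMES.has(run.slice(0, len))) {
        name = run.slice(0, len);
        break;
      }
    }
    if (name === "") return match; // not in the table: leave untouched

    const rest = run.slice(name.length);
    if (rest === "" && semi === ";") return LEGACY_NAMES.get(name); // "&para;"

    // Semicolon-less match: inspect the character right after the name.
    const next = rest !== "" ? rest[0] : (input[offset + match.length] ?? "");
    if (inAttribute && /[=a-zA-Z0-9]/.test(next)) return match;
    return LEGACY_NAMES.get(name) + rest + semi;
  });
}

const url = "https://example.com/?foo=bar&para=baz";
console.log(decodeLegacyRefs(url, false)); // https://example.com/?foo=bar¶=baz
console.log(decodeLegacyRefs(url, true));  // https://example.com/?foo=bar&para=baz
```

In text content the `&para` is decoded, so the serialized output contains ¶. In the attribute it stays a literal `&para`, which the serializer then writes back as `&amp;para`.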

kaelig · 2024-09-24T22:15:11Z

You're right to point out that browsers are required to support them in their rendering engines for legacy reasons. That said, they're non-conforming (all named character references are required to end with a semicolon, and uses of named character references without a semicolon are flagged as errors), and sadly this reaches into the transformed output of DOMPurify.

Although this feels like an undesirable side-effect, I now understand better why you said it's intended as per the spec!
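
One possible way to sidestep the side effect, assuming the string is meant as literal text rather than markup, is to escape it before it is placed into the HTML that gets sanitized. The `escapeText` helper below is hypothetical and not part of DOMPurify; it is a sketch of the idea only.

```javascript
// Hypothetical helper (not part of DOMPurify): escape the characters the
// HTML parser treats specially in text content. "&" is replaced first so
// the "&" of the generated "&lt;" and "&gt;" is not escaped a second time.
function escapeText(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const userText = "https://example.com/?foo=bar&para=baz";
const markup = `<span>${escapeText(userText)}</span>`;
console.log(markup);
// <span>https://example.com/?foo=bar&amp;para=baz</span>
```

Because the ampersand reaches the parser as `&amp;`, the parser yields a literal `&` followed by `para`, and the serialized result keeps `&amp;para`. The resulting markup can still be passed to `DOMPurify.sanitize` as usual.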

cure53 closed this as completed Sep 20, 2024
