Conform encoding-label matching to Encoding spec
This change makes the parser’s encoding-name matching conform to the current
Encoding spec at https://encoding.spec.whatwg.org/#concept-encoding-get,
which requires that only leading and trailing ASCII whitespace be removed
from a string before checking whether it matches any valid encoding name.

Without this change, the parser instead implements
https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching,
which requires deleting “all characters except a-z, A-Z, and 0-9” from
a string before checking whether it matches any valid encoding name. That
difference makes us fail two html5-tests cases.
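
The practical difference between the two rules can be illustrated with a minimal sketch. The class and method names here (`LabelMatchingSketch`, `whatwgKey`, `tr22Key`) are hypothetical and are not part of the parser. `whatwgKey` strips leading and trailing ASCII whitespace and lowercases ASCII letters, as the Encoding spec describes; `tr22Key` is a simplified take on the TR22 rule quoted above and implements only the "delete everything except letters and digits" step plus lowercasing.

```java
class LabelMatchingSketch {
    // ASCII whitespace as defined by the WHATWG specs: TAB, LF, FF, CR, SPACE.
    static boolean isAsciiWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
    }

    // Hypothetical helper: trim leading/trailing ASCII whitespace, then
    // lowercase ASCII letters. Interior characters are kept as they are.
    static String whatwgKey(String label) {
        int start = 0;
        int end = label.length();
        while (start < end && isAsciiWhitespace(label.charAt(start))) {
            start++;
        }
        while (end > start && isAsciiWhitespace(label.charAt(end - 1))) {
            end--;
        }
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) {
            char c = label.charAt(i);
            if (c >= 'A' && c <= 'Z') {
                c += 0x20;
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Hypothetical helper: simplified TR22-style key that keeps only
    // a-z, A-Z and 0-9 (lowercased) and drops every other character.
    static String tr22Key(String label) {
        StringBuilder sb = new StringBuilder(label.length());
        for (int i = 0; i < label.length(); i++) {
            char c = label.charAt(i);
            if (c >= 'A' && c <= 'Z') {
                c += 0x20;
            }
            if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(whatwgKey("  UTF-8 "));  // utf-8
        System.out.println(tr22Key("  UTF-8 "));    // utf8
        System.out.println(whatwgKey("u.t.f-8"));   // u.t.f-8
        System.out.println(tr22Key("u.t.f-8"));     // utf8
    }
}
```

Under the TR22-style key, a string such as `u.t.f-8` collapses to the same key as `utf-8`, so it would be accepted as an alias. Under the Encoding spec rule it keeps its punctuation and so would not match.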
sideshowbarker committed Aug 20, 2020
1 parent 3f48926 commit 818e72f
Showing 1 changed file with 1 addition and 3 deletions.
4 changes: 1 addition & 3 deletions src/nu/validator/htmlparser/io/Encoding.java
@@ -254,9 +254,7 @@ public static String toNameKey(String str) {
             if (c >= 'A' && c <= 'Z') {
                 c += 0x20;
             }
-            if (!((c >= '\t' && c <= '\r') || (c >= '\u0020' && c <= '\u002F')
-                    || (c >= '\u003A' && c <= '\u0040')
-                    || (c >= '\u005B' && c <= '\u0060') || (c >= '\u007B' && c <= '\u007E'))) {
+            if (!(c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r')) {
                 buf[j] = c;
                 j++;
             }
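
To try the post-change filter in isolation, the hunk can be wrapped in a small standalone class. This is only a sketch: the hunk shows just the middle of `toNameKey`, so the null check, the loop, the `buf` allocation and the return statement below are assumptions about the surrounding code, and `NameKeySketch` is a hypothetical class name.

```java
class NameKeySketch {
    // Assumed reconstruction of toNameKey after the change. Only the
    // lowercase step and the whitespace test come from the diff; the
    // rest of the method body is a guess at the surrounding code.
    static String toNameKey(String str) {
        if (str == null) {
            return null;
        }
        char[] buf = new char[str.length()];
        int j = 0;
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c >= 'A' && c <= 'Z') {
                c += 0x20;
            }
            // Copy every character that is not ASCII whitespace.
            if (!(c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r')) {
                buf[j] = c;
                j++;
            }
        }
        return new String(buf, 0, j);
    }

    public static void main(String[] args) {
        System.out.println(toNameKey(" UTF-8\n"));  // utf-8
        System.out.println(toNameKey("u.t.f-8"));   // u.t.f-8
        System.out.println(toNameKey("utf 8"));     // utf8
    }
}
```

If the test runs once per character, as assumed here, ASCII whitespace is dropped wherever it appears in the string, not only at its ends, while punctuation such as `-` and `.` is now preserved.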