Concatenating two ISO-2022-JP outputs from a conforming encoder doesn't result in conforming input #115
Comments
It is my understanding that the reason is to prevent XSS attacks. Consider "<\u001b(B\u001b$Bscript" for example. |
Why is that worth protecting against if we can't protect against "<\x1b(Js\x1b(Bcript"? That is, if we can't generate U+FFFD for all of these, is it worth generating it for any of these?
The last one seems the hardest to prevent without potentially breaking some legitimate inputs. |
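To make the ASCII/Roman case concrete, here is a minimal sketch in Python. The helpers `naive_filter_hits` and `drop_single_byte_escapes` are hypothetical names introduced only for illustration, and the filter is an assumed stand-in for any scanner that inspects raw bytes instead of decoded text.

```python
import re

ESC_ASCII = b"\x1b(B"  # escape sequence selecting ASCII
ESC_ROMAN = b"\x1b(J"  # escape sequence selecting JIS X 0201 Roman

# The second example from the comment above: "<\x1b(Js\x1b(Bcript"
payload = b"<" + ESC_ROMAN + b"s" + ESC_ASCII + b"cript"


def naive_filter_hits(data: bytes) -> bool:
    """Hypothetical byte-level filter that looks for a literal '<script'."""
    return b"<script" in data


def drop_single_byte_escapes(data: bytes) -> bytes:
    """Remove ASCII/Roman escapes. Roman letters share their byte values
    with ASCII, so this approximates what a decoder yields for them."""
    return re.sub(rb"\x1b\([BJ]", b"", data)


print(naive_filter_hits(payload))         # False
print(drop_single_byte_escapes(payload))  # b'<script'
```

No pair of adjacent escapes appears in this payload, so a rule that only flags an escape followed by an escape would not catch it.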
Does anyone wish to drive a change proposal here? Or should we just accept this and maybe document the issues a bit further? cc @jungshik |
At this point it does not seem worth it to require further implementation changes and risk compatibility issues, so instead document the quirk. Closes #115.
I decided to simply document this quirk in #155. If there are any concerns with that approach let me know. |
I still have trouble seeing the value of generating an error when an escape is followed by another escape when we don't generate errors for the other two cases I mentioned on July 3 2017. |
FWIW, if you want to reopen this and remove that error, that's fine by me. It'd be good to know what other implementations do at this point, as changing such details is never fun for anyone involved. |
I noticed that Unicode Security Considerations say that what the Encoding Standard requires for ISO-2022-JP should be done. If we reopen this, we should ask Mark Davis and Michel Suignard for their rationale. |
Regarding the previous Gecko bug from a week ago, the reporter provides more info in a Thunderbird bug. |
Edge and IE don't generate a REPLACEMENT CHARACTER. Firefox, Chrome and Safari do. (With the caveat that my Mac is stuck on El Capitan, so I couldn't test the latest Safari.) |
It seems worth resolving the Unicode Security Considerations question relative to the observation in #115 (comment). I wonder if @srl295 or @kenlunde could help us with that. To restate, why is an escape followed by an escape considered problematic, whereas going from ASCII to ASCII, or uselessly going between ASCII and Roman, is not considered problematic? Also taking into account that an escape followed by an escape is typical in email as per https://bugzilla.mozilla.org/show_bug.cgi?id=1374149#c5. |
@annevk By "escape followed by an escape," you are referring to ISO 2022 escape sequences, not individual "escape" (U+001B) characters, right? If so, back in the day when ISO 2022 encoding was common for email clients, I often encountered data that included redundant escape sequences, meaning no-op escape sequences that would shift back to ASCII, then immediately shift back into JIS X 0208 with no intervening characters. I also recall building in support to handle such no-op escape sequences when converting to other encodings, such as EUC-JP or Shift-JIS, or to fix the ISO 2022 data by removing them. I am not sure whether that helps. |
The following paragraph is from the bottom of page 583 of CJKV Information Processing, Second Edition:
|
Right, the no-op escape sequences are the question here. https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input suggests those need to emit a U+FFFD, and this is currently what the Encoding Standard does. However, that is incompatible with scenarios occurring in email, as you note, and it also doesn't consider other potentially problematic scenarios, such as an escape sequence for ASCII, some ASCII code points, and then another escape sequence for ASCII. Or the same, but switching between ASCII and Roman. So what we'd like to know is how much weight we should give to that security consideration (which seems incomplete if the danger is real) relative to the need for email clients to handle such no-op escape sequences. |
(Also, thank you very much for the timely reply!) |
@annevk In my opinion, enough time has passed that anything that is Unicode-encoded should have no realistic or real-world need to convert back to legacy encodings, particularly because almost all legacy encodings cannot handle a large number of characters that can be represented via Unicode. With that said, I think that nuking the no-op escape sequences is the right approach. There's no meaningful reason to leave behind a "trail" of sorts that something that is not really text data was in the data stream. |
I am just one of the Thunderbird users in Japan. Please allow me to comment here. In Japan, many users still send e-mail in the ISO-2022-JP encoding. Thunderbird was recently updated automatically to version 60, and since then there have been reports of U+FFFD being inserted into the displayed subject of some e-mails. Although there is no concrete report, I suspect that the e-mail software producing the encoding that does not conform to the spec is an old version of Outlook. Regular users are not interested in the meaning of U+FFFD. They will simply decide that it is a bug in Thunderbird. And they will downgrade Thunderbird to version 52 or choose Microsoft products. I hope that the discussion will be conducted from the user's point of view. |
@EarlgreyPicard In other words, you agree with my suggestion to simply nuke the no-op escape sequences. |
I do not know what "simply nuke the no-op escape sequences" means, but I hope that Thunderbird will not insert U+FFFD into the result of decoding the following e-mail subject. Subject: =?ISO-2022-JP?B?IBskQiVXJW0bKEIbJEIlMCVpJWAbKEI=?= Bad: |
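For readers who want to see the bytes behind that subject, the following Python sketch unpacks the RFC 2047 encoded-word with the standard library only. The helper name `rfc2047_b_payload` is an assumption made for this illustration, and it handles just a single B-encoded word.

```python
import base64

encoded_word = "=?ISO-2022-JP?B?IBskQiVXJW0bKEIbJEIlMCVpJWAbKEI=?="


def rfc2047_b_payload(word: str) -> bytes:
    """Return the raw bytes of one B-encoded RFC 2047 encoded-word."""
    # "=?charset?B?text?=" splits on "?" into: "=", charset, "B", text, "="
    _, _charset, encoding, text, _ = word.split("?")
    if encoding.upper() != "B":
        raise ValueError("only B (base64) encoded-words are handled here")
    return base64.b64decode(text)


raw = rfc2047_b_payload(encoded_word)
print(raw)
# b' \x1b$B%W%m\x1b(B\x1b$B%0%i%`\x1b(B'

# Offset of the switch to ASCII that is immediately followed by a
# switch back to JIS X 0208, with no bytes in between.
print(raw.find(b"\x1b(B\x1b$B"))  # 8
```

The two-byte pairs `%W %m` and `%0 %i` followed by a backquote pair are JIS X 0208 katakana (プロ and グラム), so the back-to-back escapes sit in the middle of a single word.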
@EarlgreyPicard Sorry. To clarify, I meant 1) to remove no-op escape sequences; and 2) to not emit U+FFFD in their presence. |
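A rough sketch of point 1 in Python, working on raw bytes before decoding. This is not taken from any implementation: `strip_noop_escapes` is a hypothetical helper, and it assumes that an escape sequence immediately followed by another escape sequence can be dropped because the later one overrides it.

```python
import re

# The escape sequences relevant to ISO-2022-JP as discussed in this thread:
# ESC ( B  ASCII, ESC ( J  Roman, ESC ( I  katakana,
# ESC $ @ and ESC $ B  JIS X 0208.
_ESC = rb"\x1b(?:\([BJI]|\$[@B])"

# One or more escapes that are directly followed by yet another escape.
_NOOP_RUN = re.compile(rb"(?:" + _ESC + rb")+(?=" + _ESC + rb")")


def strip_noop_escapes(data: bytes) -> bytes:
    """Drop every escape that is immediately overridden by the next one."""
    return _NOOP_RUN.sub(b"", data)


subject = b" \x1b$B%W%m\x1b(B\x1b$B%0%i%`\x1b(B"
print(strip_noop_escapes(subject))
# b' \x1b$B%W%m\x1b$B%0%i%`\x1b(B'
```

Only the last escape of each run survives, so the final state after the run is unchanged. Whether a decoder should do this silently is exactly the question debated above.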
@kenlunde I understand. Thank you. I agree. |
@hsivonen Is there progress on the mailing list? |
No. I pinged the mailing list again. |
To address Mark Davis' request for formal feedback I started writing something. While doing so, I came up with this: Generate U+FFFD if:
This would actually uphold the security properties that UTR 36 tries, but fails, to uphold, and it would avoid the unwanted U+FFFD generation reported in the email cases. The key question is whether requiring that transitions from ASCII to Roman and vice versa happen only when logically necessary, and then at the last possible moment, is feasible given the behavior of encoders out there. Does anyone want to volunteer to research this by checking the behavior of existing encoders or by searching archives of old Japanese email for the relevant byte patterns? |
I'm thinking of submitting a suggestion to go for non-committal middle ground for now. Thoughts? |
No replies in one and a half years. @annevk, does the document linked to from the previous sentence look OK to you for submission to the Unicode Technical Committee? (This issue was reported as a Thunderbird bug again.) |
Based on off-issue comments, I revised the draft to conclude to go for "End state 1" right away instead of leaving it as non-committal. |
Oops. That edit was internally inconsistent. Tried again. |
I have no objection, although we'd probably consider it low priority and wait for a fix to filter in from ICU. @jungshik has expressed stronger views about ISO-2022-JP implementation in the past and may have an opinion. |
What bug is that? Please tell me the bug number. |
Did no one consider always generating a mode change before the first character, and then never generating a mode change at the end? |
That would be quite a departure from the existing practice of encoders. |
Yet seemingly much more sound, and thus more useful. I guess I'm not clear whether there are any actual compatibility problems with that, whether there's a process for evaluating that, and whether something that was once broken is required to be broken forever. |
It will break the property that if the text contains only ASCII characters, the output is compatible with US-ASCII encoding. |
Section 2 of the spec already defines ISO-2022-JP and UTF-16 as "ASCII-incompatible encodings". |
I'm not talking about the spec definition. Please do not ignore "if the text contains only ASCII characters". Always generating a mode change will cause problems like those the UTF-8 BOM caused. I strongly disagree with that as an ISO-2022-JP user. |
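The ASCII-compatibility concern in this exchange can be illustrated with a small Python sketch. Both functions are hypothetical and only cover ASCII-only input; `encode_ascii_leading_escape` models the proposed "always emit a mode change first" variant, not any existing encoder.

```python
ESC_ASCII = b"\x1b(B"  # escape sequence that selects ASCII


def encode_ascii_current(text: str) -> bytes:
    """ASCII-only input under existing practice: no escapes are needed."""
    return text.encode("ascii")


def encode_ascii_leading_escape(text: str) -> bytes:
    """Hypothetical variant: announce the mode before the first character."""
    return ESC_ASCII + text.encode("ascii")


text = "Subject: hello"
print(encode_ascii_current(text) == text.encode("ascii"))         # True
print(encode_ascii_leading_escape(text) == text.encode("ascii"))  # False
```

Under the variant, pure ASCII text no longer round-trips through tools that expect US-ASCII bytes, which is the BOM-like problem described above.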
Encodings other than ISO-2022-JP have the property that if you concatenate two outputs from a conforming encoder and decode them together, you get the same result as when decoding them separately and then concatenating.
ISO-2022-JP lacks this property, because despite the encoder making an effort in this direction by emitting a transition to the ASCII state at the end, if the next segment being concatenated starts with a transition to a non-ASCII state, the concatenation results in zero ASCII bytes between two escapes.
Is there a strong reason for treating a transition to the ASCII state immediately followed by another escape as non-conforming? Or, put the other way, what purpose does the transition to the ASCII state at the end of encoding serve if not achieving the above-mentioned concatenation property that other encodings have?
This is relevant to RFC 2047 header decoding.
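A minimal Python sketch of the concatenation problem described above. The byte strings are written by hand to match what an encoder that returns to ASCII at the end would emit for "あ" and "い", and `has_adjacent_escapes` is a hypothetical helper for this illustration.

```python
import re

TO_JIS0208 = b"\x1b$B"  # switch to JIS X 0208
TO_ASCII = b"\x1b(B"    # switch back to ASCII at the end of each output

# Hand-written encoder outputs: "あ" is 0x24 0x22, "い" is 0x24 0x24.
segment_a = TO_JIS0208 + b"\x24\x22" + TO_ASCII
segment_b = TO_JIS0208 + b"\x24\x24" + TO_ASCII

_ESC = rb"\x1b(?:\([BJI]|\$[@B])"
_ADJACENT = re.compile(_ESC + _ESC)


def has_adjacent_escapes(data: bytes) -> bool:
    """True if one escape sequence is immediately followed by another."""
    return _ADJACENT.search(data) is not None


joined = segment_a + segment_b
print(has_adjacent_escapes(segment_a))  # False
print(has_adjacent_escapes(segment_b))  # False
print(has_adjacent_escapes(joined))     # True
```

Each segment is fine on its own, but joining them places the trailing switch to ASCII directly before the leading switch to JIS X 0208, which is the pattern the decoder treats as an error.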