Concatenating two ISO-2022-JP outputs from a conforming encoder doesn't result in conforming input #115

hsivonen · 2017-06-19T11:03:26Z

Encodings other than ISO-2022-JP have the property that if you concatenate two outputs from a conforming encoder and decode them together, you get the same result as when decoding them separately and then concatenating.

ISO-2022-JP lacks this property. The encoder makes an effort in this direction by emitting a transition to the ASCII state at the end, but if the next segment being concatenated starts with a transition to a non-ASCII state, the concatenation results in zero ASCII bytes between two escapes.
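The byte pattern described above can be reproduced with a short sketch. It assumes Python's standard-library `iso2022_jp` codec, which is not the Encoding Standard's encoder but likewise returns to the ASCII state at the end of its output; the sample strings and constant names are arbitrary.

```python
# Encode two Japanese strings separately, as two independent encoder runs would.
first = "あ".encode("iso2022_jp")
second = "い".encode("iso2022_jp")

ESC_TO_ASCII = b"\x1b(B"    # escape sequence selecting ASCII
ESC_TO_JIS0208 = b"\x1b$B"  # escape sequence selecting JIS X 0208

joined = first + second

# The first output ends by returning to ASCII ...
print(first.endswith(ESC_TO_ASCII))
# ... the second output starts by leaving ASCII ...
print(second.startswith(ESC_TO_JIS0208))
# ... so the concatenation contains two escapes with nothing in between.
print(ESC_TO_ASCII + ESC_TO_JIS0208 in joined)
```

The empty ASCII segment in `joined` is the input on which a decoder following the Encoding Standard emits U+FFFD.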

Is there a strong reason for treating a transition to the ASCII state immediately followed by another escape as non-conforming? Or put the other way, what purpose does the transition to the ASCII state at the end of encode serve if not achieving the above-mentioned concatenation property that other encodings have?

This is relevant to RFC 2047 header decoding.

vyv03354 · 2017-06-19T11:22:19Z

It is my understanding that the reason is to prevent XSS attacks. Consider "<\u001b(B\u001b$Bscript" for example.

hsivonen · 2017-07-03T13:29:39Z

> It is my understanding that the reason is to prevent XSS attacks. Consider "<\u001b(B\u001b$Bscript" for example.

Why is that worth protecting against if we can't protect against "<\x1b(Js\x1b(Bcript"?

That is, if we can't generate U+FFFD for all of these, is it worth generating it for any of these?

1. Escape immediately followed by another escape.
2. Transition from the ASCII state to the ASCII state.
3. Useless transitions between the ASCII state and the Roman state.

The last one seems the hardest to prevent without potentially breaking some legitimate inputs.

annevk · 2018-04-25T09:48:00Z

Does anyone wish to drive a change proposal here? Or should we just accept this and maybe document the issues a bit further?

cc @jungshik

annevk · 2018-08-30T12:49:18Z

I decided to simply document this quirk in #155. If there are any concerns with that approach let me know.

At this point it does not seem worth it to require further implementation changes and risk compatibility issues, so instead document the quirk. Closes #115.

hsivonen · 2018-11-12T15:12:18Z

Another bug reported about Gecko conforming to the spec.

hsivonen · 2018-11-12T15:26:02Z

I still have trouble seeing the value of generating an error when an escape is followed by another escape when we don't generate errors for the other two cases I mentioned on July 3 2017.

annevk · 2018-11-12T15:49:36Z

FWIW, if you want to reopen this and remove that error, that's fine by me. It'd be good to know what other implementations do at this point, as changing such details is never fun for anyone involved.

hsivonen · 2018-11-13T13:07:34Z

I noticed that Unicode Security Considerations say that what the Encoding Standard requires for ISO-2022-JP should be done. If we reopen this, we should ask Mark Davis and Michel Suignard for their rationale.

hsivonen · 2018-11-19T07:43:39Z

Regarding the previous Gecko bug from a week ago, the reporter provides more info in a Thunderbird bug.

hsivonen · 2018-11-19T10:25:02Z

> It'd be good to know what other implementations do at this point

Edge and IE don't generate a REPLACEMENT CHARACTER. Firefox, Chrome and Safari do. (With the caveat that my Mac is stuck on El Capitan, so I couldn't test the latest Safari.)

annevk · 2018-11-21T08:55:54Z

It seems worth resolving the Unicode Security Considerations question relative to the observation in #115 (comment). I wonder if @srl295 or @kenlunde could help us with that.

To restate, why is an escape followed by an escape considered problematic, whereas going from ASCII to ASCII, or uselessly going between ASCII and Roman, is not considered problematic?

Also taking into account that an escape followed by an escape is typical in email as per https://bugzilla.mozilla.org/show_bug.cgi?id=1374149#c5.

kenlunde · 2018-11-21T13:26:55Z

@annevk By "escape followed by an escape," you are referring to ISO 2022 escape sequences, not individual "escape" (U+001B) characters, right? If so, back in the day when ISO 2022 encoding was common for email clients, I often encountered data that included redundant escape sequences, meaning no-op escape sequences that would shift back to ASCII, then immediately shift back into JIS X 0208 with no intervening characters. I also recall building in support to handle such no-op escape sequences when converting to other encodings, such as EUC-JP or Shift-JIS, or to fix the ISO 2022 data by removing them. I am not sure whether that helps.

kenlunde · 2018-11-21T13:32:23Z

The following paragraph is from the bottom of page 583 of CJKV Information Processing, Second Edition:

Besides simple code conversion, it is also very important to be able to detect the escape
sequences used in ISO-2022-JP encoding. Escape sequences signal the software when to
change modes. Good software should also keep track of the current n-byte-per-character
mode so that redundant escape sequences can be ignored (and absorbed). Remember that
Shift-JIS encoding does not use escape sequences, so you will have to make sure that they
are not written to the resulting output file.
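As a minimal sketch of "absorbing" redundant escape sequences, the following removes any escape sequence that is immediately followed by another one, so only the last of a back-to-back run survives. The function name and regular expression are illustrative assumptions, not code from the book or from any implementation; the five sequences listed are the ones the Encoding Standard's ISO-2022-JP decoder recognizes.

```python
import re

# ESC ( B = ASCII, ESC ( J = Roman, ESC ( I = katakana,
# ESC $ @ and ESC $ B = JIS X 0208.
ESCAPE = rb"\x1b(?:\([BIJ]|\$[@B])"

# An escape sequence directly followed by another escape sequence has no
# content in its state, so it is a no-op and can be dropped.
REDUNDANT = re.compile(ESCAPE + rb"(?=" + ESCAPE + rb")")


def strip_noop_escapes(data: bytes) -> bytes:
    """Remove escape sequences that are immediately superseded (hypothetical helper)."""
    return REDUNDANT.sub(b"", data)


sample = b"\x1b$B%W%m\x1b(B\x1b$B%0%i%`\x1b(B"
print(strip_noop_escapes(sample))
```

This handles only back-to-back sequences. It does not track the current mode, so an escape that re-selects the state already in effect after some content is left in place.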

annevk · 2018-11-21T14:46:59Z

Right, the no-op escape sequences are the question here. https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input suggests those need to emit a U+FFFD and this is currently what the Encoding Standard does. However, that is incompatible with scenarios occurring in email as you note and also doesn't consider other potentially problematic scenarios such as an escape sequence for ASCII, some ASCII code points, and then another escape sequence for ASCII. Or that, but then between ASCII and Roman.

So what we'd like to know is how much weight we should give to that security consideration (which seems incomplete if it is a real danger) relative to the need for email clients to handle such no-op escape sequences.

annevk · 2018-11-21T14:47:16Z

(Also, thank you very much for the timely reply!)

kenlunde · 2018-11-21T18:54:08Z

@annevk In my opinion, enough time has passed that anything that is Unicode-encoded should have no realistic or real-world need to convert back to legacy encodings, particularly because almost all legacy encodings cannot handle a large number of characters that can be represented via Unicode. With that said, I think that nuking the no-op escape sequences is the right approach. There's no meaningful reason to leave behind a "trail" of sorts that something that is not really text data was in the data stream.

hsivonen · 2018-11-22T11:11:37Z

I posted the question to the Unicode mailing list.

EarlgreyPicard · 2018-11-23T06:04:24Z

I am just one of the Thunderbird users in Japan. Please allow me to comment here.

In Japan, many users still send e-mail in the ISO-2022-JP encoding.
Many Japanese Windows users usually use Shift_JIS instead of Unicode when dealing with Japanese text. UTF-8 or UTF-16 is used only when it is necessary to handle characters not included in Shift_JIS.
However, sending 8-bit Shift_JIS text directly via e-mail caused problems.
For that reason, we chose the ISO-2022-JP encoding, which can express characters in 7 bits, when sending Japanese e-mail.
This old method is still in use today.

Thunderbird was recently updated automatically to version 60, and since then there have been reports of U+FFFD being inserted into the displayed subject of some e-mails.
For example, it was posted to MozillaZine.jp, and a bug report was filed at bugzilla.mozilla.org.
However, this bug report was marked RESOLVED because Firefox (and Thunderbird) conforms to the spec. That is why I came here.

Although there is no concrete report, I think that the e-mail software producing output that does not conform to the spec is an old version of Outlook.
It is easy to say, "E-mail software that does not comply with the spec is bad", but users of such software are by no means few. I think that we should not ignore this fact.

Regular users are not interested in the meaning of U+FFFD. They will simply decide that it is a bug in Thunderbird. And they will downgrade Thunderbird to version 52 or choose Microsoft products.

I hope that the discussion will be conducted from the user's point of view.

kenlunde · 2018-11-23T14:17:01Z

@EarlgreyPicard In other words, you agree with my suggestion to simply nuke the no-op escape sequences.

EarlgreyPicard · 2018-11-23T16:51:08Z

I do not know what "simply nuke the no-op escape sequences" means, but I hope that Thunderbird will not insert U+FFFD into the result of decoding the following e-mail subject.

Subject: =?ISO-2022-JP?B?IBskQiVXJW0bKEIbJEIlMCVpJWAbKEI=?=
↓
Good:
プログラム

Bad:
プロ�グラム
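To see where the U+FFFD comes from, the Base64 payload of the encoded word above can be unpacked with a few lines of standard-library Python. This is only an inspection sketch. Python's `iso2022_jp` codec is used here as an example of a lenient decoder and is not the Encoding Standard's decoder; the variable names are arbitrary.

```python
import base64

# Payload of the RFC 2047 encoded word from the Subject header above.
payload = "IBskQiVXJW0bKEIbJEIlMCVpJWAbKEI="
raw = base64.b64decode(payload)

# Show the raw bytes, including the escape sequences.
print(raw)

# A return to ASCII (ESC ( B) is immediately followed by a switch back to
# JIS X 0208 (ESC $ B), with no ASCII bytes in between.
print(b"\x1b(B\x1b$B" in raw)

# Python's codec accepts the empty ASCII segment without emitting U+FFFD.
print(raw.decode("iso2022_jp"))
```

The decoded text starts with a space because the payload does. The back-to-back escape sequences sit between プロ and グラム, which is exactly where the "Bad" rendering shows the replacement character.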

kenlunde · 2018-11-23T19:20:38Z

@EarlgreyPicard Sorry. To clarify, I meant 1) to remove no-op escape sequences; and 2) to not emit U+FFFD in their presence.

EarlgreyPicard · 2018-11-23T21:14:12Z

@kenlunde I understand. Thank you. I agree.

EarlgreyPicard · 2018-12-09T09:32:55Z

@hsivonen Is there progress on the mailing list?
Japanese Thunderbird users are still waiting for the problem to be solved.

hsivonen · 2018-12-10T10:07:36Z

> Is there progress on the mailing list?

No. I pinged the mailing list again.

hsivonen · 2018-12-11T12:38:26Z

To address Mark Davis' request for formal feedback I started writing something. While doing so, I came up with this:

Generate U+FFFD if:

1. A state transition was made such that the previous state had no content and the previous state was not the ASCII state. (I.e. stop generating U+FFFD if the zero-length state is the ASCII state.)
2. A state transition to the ASCII state was preceded by the Roman state and the next byte was not 0x5C, 0x7E or the end of the stream.
3. A state transition to the Roman state was made and the next byte was not 0x5C, 0x7E, 0x1B or the end of the stream. (0x1B is on this list to avoid a case where both this rule and the first rule would apply at the same time resulting in two U+FFFDs.)

This would actually uphold the security properties that UTR 36 tries to uphold but fails to and this would avoid the unwanted U+FFFD generation reported in the email cases.
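The three conditions can be sketched as a byte-level checker that counts how many times a rule fires. This is a rough illustration under stated assumptions, not a decoder: it only looks at well-formed escape sequences, it applies each rule literally and independently (how overlapping rules should combine is not settled above), and all names in it are hypothetical.

```python
import re

# Escape sequences and the state each one selects.
STATES = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "roman",
    b"\x1b(I": "katakana",
    b"\x1b$@": "jis0208",
    b"\x1b$B": "jis0208",
}
ESCAPE = re.compile(rb"\x1b(?:\([BIJ]|\$[@B])")


def count_rule_hits(data: bytes) -> int:
    """Count how often one of the three proposed conditions is met."""
    state = "ascii"      # the decoder starts in the ASCII state
    content_start = 0    # offset where the current state's content begins
    hits = 0
    for match in ESCAPE.finditer(data):
        new_state = STATES[match.group()]
        had_content = match.start() > content_start
        next_byte = data[match.end():match.end() + 1]  # b"" at end of stream
        # Condition 1: an empty state that was not the ASCII state.
        if not had_content and state != "ascii":
            hits += 1
        # Condition 2: Roman to ASCII not forced by 0x5C, 0x7E or end of stream.
        if new_state == "ascii" and state == "roman" and next_byte not in (b"\\", b"~", b""):
            hits += 1
        # Condition 3: entering Roman not forced by 0x5C, 0x7E, 0x1B or end of stream.
        if new_state == "roman" and next_byte not in (b"\\", b"~", b"\x1b", b""):
            hits += 1
        state = new_state
        content_start = match.end()
    return hits


# The e-mail pattern: an empty ASCII segment between two JIS X 0208 runs.
print(count_rule_hits(b" \x1b$B%W%m\x1b(B\x1b$B%0%i%`\x1b(B"))
# A useless round trip through the Roman state.
print(count_rule_hits(b"<\x1b(Js\x1b(Bcript"))
```

Under this reading the e-mail pattern triggers nothing, while the Roman round trip from the July 3 2017 comment triggers conditions 3 and 2 in turn.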

The key question is if imposing the requirement that ASCII to Roman and vice versa transitions can only happen when logically necessary and then at the last possible moment is feasible given the behavior of encoders out there.

Does anyone want to volunteer to research this by checking the behavior of existing encoders or by searching archives of old Japanese email for the relevant byte patterns?

hsivonen · 2018-12-17T09:16:09Z

I'm thinking of submitting a suggestion to go for non-committal middle ground for now. Thoughts?

hsivonen · 2020-08-06T09:16:02Z

No replies in one and a half years. @annevk, does the document linked to from the previous sentence look OK to you for submission to the Unicode Technical Committee?

(This issue was reported as a Thunderbird bug again.)

hsivonen · 2020-08-06T10:40:14Z

Based on off-issue comments, I revised the draft to conclude to go for "End state 1" right away instead of leaving it as non-committal.

hsivonen · 2020-08-06T10:43:22Z

Oops. That edit was internally inconsistent. Tried again.

annevk · 2020-08-06T11:10:58Z

Looks okay. I wonder if @ricea and @cdumez could figure out if Chromium and WebKit would align with any potential resulting Encoding standard changes, but I don't think that should block raising this with Unicode as them giving better security advice is a win either way.

ricea · 2020-08-06T14:29:24Z

I have no objection, although we'd probably consider it low priority and wait for a fix to filter in from ICU.

@jungshik has expressed stronger views about ISO-2022-JP implementation in the past and may have an opinion.

EarlgreyPicard · 2020-08-06T16:23:25Z

> (This issue was reported as a Thunderbird bug again.)

What bug is that? Please tell me the bug number.

hsivonen · 2020-08-06T18:32:34Z

@annevk, @ricea Thanks. I've requested submission of the document to the Unicode Document Register.

@EarlgreyPicard Bug 1652388

j--m · 2021-08-30T23:20:26Z

Did no one consider always generating a mode change before the first character, and then never generating a mode change at the end?

hsivonen · 2021-08-31T05:46:25Z

> Did no one consider always generating a mode change before the first character, and then never generating a mode change at the end?

That would be quite a departure from the existing practice of encoders.

j--m · 2021-09-01T23:01:18Z

> > Did no one consider always generating a mode change before the first character, and then never generating a mode change at the end?

> That would be quite a departure from the existing practice of encoders.

Yet seemingly much more sound, and thus more useful. I guess I'm not clear whether there are any actual compatibility problems with that, whether there's a process for evaluating that, and whether something that was once broken is required to be broken forever.

vyv03354 · 2021-09-01T23:12:54Z

It will break the property that if the text contains only ASCII characters, the output is compatible with US-ASCII encoding.
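The property in question can be illustrated with a small sketch. It assumes Python's standard-library `iso2022_jp` codec as the encoder, and the "always prefix a mode change" variant is simulated by hand, since no such encoder is described here.

```python
ESC_TO_ASCII = b"\x1b(B"  # escape sequence selecting ASCII

text = "plain ASCII subject"

# Existing practice: ASCII-only text encodes to the same bytes as US-ASCII.
current = text.encode("iso2022_jp")
print(current == text.encode("ascii"))

# Hypothetical variant: always emit a mode change before the first character.
prefixed = ESC_TO_ASCII + current

# A consumer that treats the bytes as US-ASCII now sees three extra
# characters (ESC, "(", "B") in front of the text.
print(prefixed.decode("ascii") == text)
print(len(prefixed) - len(current))
```

The leading escape sequence plays the same role as a UTF-8 BOM at the start of otherwise plain ASCII data.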

j--m · 2021-09-04T00:11:41Z

> It will break the property that if the text contains only ASCII characters, the output is compatible with US-ASCII encoding.

Section 2 of the spec already defines ISO-2022-JP and UTF-16 as "ASCII-incompatible encodings".

vyv03354 · 2021-09-04T00:25:50Z

I'm not talking about the spec definition. Please do not ignore "if the text contains only ASCII characters".

Always generating a mode change will cause problems like those the UTF-8 BOM caused. As an ISO-2022-JP user, I strongly disagree with that.

hsivonen changed the title ~~Concatenating two ISO-2022-JP outputs from a conforming encoding doesn't result in conforming input~~ Concatenating two ISO-2022-JP outputs from a conforming encoder doesn't result in conforming input Jun 19, 2017

annevk added the normative label Apr 25, 2018

annevk added a commit that referenced this issue Aug 30, 2018

ISO-2022-JP encoder: document an oddity

7f77e7a

At this point it does not seem worth it to require further implementation changes and risk compatibility issues, so instead document the quirk. Closes #115.

annevk mentioned this issue Aug 30, 2018

ISO-2022-JP encoder: document an oddity #155

Merged

annevk closed this as completed in #155 Sep 2, 2018

annevk added a commit that referenced this issue Sep 2, 2018

ISO-2022-JP encoder: document an oddity

fe4934c

At this point it does not seem worth it to require further implementation changes and risk compatibility issues, so instead document the quirk. Closes #115.

annevk reopened this Nov 21, 2018
