Concatenating two ISO-2022-JP outputs from a conforming encoder doesn't result in conforming input #115

Open
hsivonen opened this issue Jun 19, 2017 · 39 comments

@hsivonen
Member

hsivonen commented Jun 19, 2017

Encodings other than ISO-2022-JP have the property that if you concatenate two outputs from a conforming encoder and decode them together, you get the same result as when decoding them separately and then concatenating.

ISO-2022-JP lacks this property. The encoder makes an effort in this direction by emitting a transition to the ASCII state at the end of its output. However, if the next segment being concatenated starts with a transition to a non-ASCII state, the concatenation contains zero ASCII bytes between two escape sequences.
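
A minimal sketch of the problem, using Python's built-in `iso2022_jp` codec as a stand-in for a conforming encoder. That is an assumption: it is not the Encoding Standard's encoder, although it also returns to the ASCII state at the end of its output. The sample strings are borrowed from the email subject quoted later in this thread.

```python
ESC_TO_ASCII = b"\x1b(B"     # escape sequence selecting ASCII
ESC_TO_JIS0208 = b"\x1b$B"   # escape sequence selecting JIS X 0208-1983

# Encode two segments separately, as an RFC 2047 header encoder might.
first = "プロ".encode("iso2022_jp")
second = "グラム".encode("iso2022_jp")

# Each output starts by leaving the ASCII state and ends by returning to it.
assert first.startswith(ESC_TO_JIS0208) and first.endswith(ESC_TO_ASCII)
assert second.startswith(ESC_TO_JIS0208) and second.endswith(ESC_TO_ASCII)

# Concatenation puts the two escape sequences next to each other with
# zero ASCII bytes in between.
joined = first + second
print(ESC_TO_ASCII + ESC_TO_JIS0208 in joined)
```

Whether a decoder accepts `joined` depends on how it treats that adjacent pair, which is the question raised here.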

Is there a strong reason for treating a transition to the ASCII state immediately followed by another escape as non-conforming? Or put the other way, what purpose does the transition to the ASCII state at the end of encode serve if not achieving the above-mentioned concatenation property that other encodings have?

This is relevant to RFC 2047 header decoding.

@hsivonen hsivonen changed the title Concatenating two ISO-2022-JP outputs from a conforming encoding doesn't result in conforming input Concatenating two ISO-2022-JP outputs from a conforming encoder doesn't result in conforming input Jun 19, 2017
@vyv03354
Collaborator

It is my understanding that the reason is to prevent XSS attacks. Consider "<\u001b(B\u001b$Bscript" for example.

@hsivonen
Member Author

hsivonen commented Jul 3, 2017

It is my understanding that the reason is to prevent XSS attacks. Consider "<\u001b(B\u001b$Bscript" for example.

Why is that worth protecting against if we can't protect against "<\x1b(Js\x1b(Bcript"?

That is, if we can't generate U+FFFD for all of these, is it worth generating it for any of these?

  • Escape immediately followed by another escape.
  • Transition from the ASCII state to the ASCII state.
  • Useless transitions between the ASCII state and the Roman state.

The last one seems the hardest to prevent without potentially breaking some legitimate inputs.
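
To make the three cases concrete, here is a rough sketch of a scanner that labels them in a byte string. The helper names (`split_segments`, `suspicious_transitions`) and the labels are made up for illustration. The definition of "useless" used here (a Roman run containing neither 0x5C nor 0x7E, the two bytes where Roman differs from ASCII) is one possible reading, not something the spec defines.

```python
# Escape sequences recognized by the ISO-2022-JP decoder in the Encoding Standard.
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "roman",
    b"\x1b(I": "katakana",
    b"\x1b$@": "jis0208",
    b"\x1b$B": "jis0208",
}

def split_segments(data: bytes):
    """Split data into (state, payload) pairs; decoding starts in ASCII."""
    segments = [["ascii", bytearray()]]
    i = 0
    while i < len(data):
        state = ESCAPES.get(data[i:i + 3])
        if state is not None:
            segments.append([state, bytearray()])
            i += 3
        else:
            segments[-1][1].append(data[i])
            i += 1
    return [(state, bytes(payload)) for state, payload in segments]

def suspicious_transitions(data: bytes):
    """Label each escape sequence that matches one of the three cases."""
    segments = split_segments(data)
    found = []
    for index in range(1, len(segments)):
        prev_state, prev_payload = segments[index - 1]
        state, payload = segments[index]
        # The previous segment was opened by an escape and holds no bytes.
        if index > 1 and not prev_payload:
            found.append("escape followed by escape")
        if prev_state == "ascii" and state == "ascii":
            found.append("ASCII to ASCII")
        # A Roman run with no byte that differs from ASCII.
        if (prev_state == "ascii" and state == "roman"
                and not set(payload) & {0x5C, 0x7E}):
            found.append("useless ASCII/Roman switch")
    return found

print(suspicious_transitions(b"<\x1b(B\x1b$Bscript"))
print(suspicious_transitions(b"<\x1b(Js\x1b(Bcript"))
```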

@annevk
Member

annevk commented Apr 25, 2018

Does anyone wish to drive a change proposal here? Or should we just accept this and maybe document the issues a bit further?

cc @jungshik

annevk added a commit that referenced this issue Aug 30, 2018
At this point it does not seem worth it to require further implementation changes and risk compatibility issues, so instead document the quirk.

Closes #115.
@annevk
Member

annevk commented Aug 30, 2018

I decided to simply document this quirk in #155. If there are any concerns with that approach let me know.

@hsivonen
Member Author

@hsivonen
Member Author

I still have trouble seeing the value of generating an error when an escape is followed by another escape when we don't generate errors for the other two cases I mentioned on July 3, 2017.

@annevk
Member

annevk commented Nov 12, 2018

FWIW, if you want to reopen this and remove that error, that's fine by me. It'd be good to know what other implementations do at this point, as changing such details is never fun for anyone involved.

@hsivonen
Member Author

I noticed that Unicode Security Considerations say that what the Encoding Standard requires for ISO-2022-JP should be done. If we reopen this, we should ask Mark Davis and Michel Suignard for their rationale.

@hsivonen
Member Author

Regarding the previous Gecko bug from a week ago, the reporter provides more info in a Thunderbird bug.

@hsivonen
Member Author

It'd be good to know what other implementations do at this point

Edge and IE don't generate a REPLACEMENT CHARACTER. Firefox, Chrome and Safari do. (With the caveat that my Mac is stuck on El Capitan, so I couldn't test the latest Safari.)

@annevk
Member

annevk commented Nov 21, 2018

It seems worth resolving the Unicode Security Considerations question relative to the observation in #115 (comment). I wonder if @srl295 or @kenlunde could help us with that.

To restate, why is an escape followed by an escape considered problematic, whereas going from ASCII to ASCII, or uselessly going between ASCII and Roman, is not considered problematic?

Also taking into account that an escape followed by an escape is typical in email as per https://bugzilla.mozilla.org/show_bug.cgi?id=1374149#c5.

@annevk annevk reopened this Nov 21, 2018
@kenlunde

@annevk By "escape followed by an escape," you are referring to ISO 2022 escape sequences, not individual "escape" (U+001B) characters, right? If so, back in the day when ISO 2022 encoding was common for email clients, I often encountered data that included redundant escape sequences, meaning no-op escape sequences that would shift back to ASCII, then immediately shift back into JIS X 0208 with no intervening characters. I also recall building in support to handle such no-op escape sequences when converting to other encodings, such as EUC-JP or Shift-JIS, or to fix the ISO 2022 data by removing them. I am not sure whether that helps.

@kenlunde

The following paragraph is from the bottom of page 583 of CJKV Information Processing, Second Edition:

Besides simple code conversion, it is also very important to be able to detect the escape
sequences used in ISO-2022-JP encoding. Escape sequences signal the software when to
change modes. Good software should also keep track of the current n-byte-per-character
mode so that redundant escape sequences can be ignored (and absorbed). Remember that
Shift-JIS encoding does not use escape sequences, so you will have to make sure that they
are not written to the resulting output file.
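
As a rough illustration of "keeping track of the current mode" so that redundant escape sequences are absorbed, here is a sketch in Python. The function name `absorb_redundant_escapes` is hypothetical and the code is not taken from the book. It only rewrites the ISO-2022-JP byte stream and does no conversion to another encoding.

```python
KNOWN_ESCAPES = (b"\x1b(B", b"\x1b(J", b"\x1b(I", b"\x1b$@", b"\x1b$B")
ASCII_ESCAPE = b"\x1b(B"

def absorb_redundant_escapes(data: bytes) -> bytes:
    """Drop escape sequences that select a mode no byte is ever read in,
    or that re-select the mode already in effect."""
    out = bytearray()
    emitted = ASCII_ESCAPE   # mode in effect in the output so far
    pending = ASCII_ESCAPE   # most recent mode requested by the input
    i = 0
    while i < len(data):
        chunk = data[i:i + 3]
        if chunk in KNOWN_ESCAPES:
            pending = chunk          # remember it, but do not write it yet
            i += 3
            continue
        if pending != emitted:       # a real mode change: write it now
            out += pending
            emitted = pending
        out.append(data[i])
        i += 1
    if pending != emitted:           # keep the stream's final mode intact
        out += pending
    return bytes(out)

# Return to ASCII immediately followed by a switch back to JIS X 0208.
sample = b"\x1b$B%W%m\x1b(B\x1b$B%0%i%`\x1b(B"
print(absorb_redundant_escapes(sample))
```

Deferring each escape until a payload byte actually arrives means a shift to ASCII that is undone at once never reaches the output.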

@annevk
Member

annevk commented Nov 21, 2018

Right, the no-op escape sequences are the question here. https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input suggests those need to emit a U+FFFD and this is currently what the Encoding Standard does. However, that is incompatible with scenarios occurring in email as you note and also doesn't consider other potentially problematic scenarios such as an escape sequence for ASCII, some ASCII code points, and then another escape sequence for ASCII. Or that, but then between ASCII and Roman.

So what we'd like to know is how much consideration we should pay to that security consideration (which seems incomplete if a real danger) relative to the need for email clients to handle such no-op escape sequences.

@annevk
Member

annevk commented Nov 21, 2018

(Also, thank you very much for the timely reply!)

@kenlunde

@annevk In my opinion, enough time has passed that anything that is Unicode-encoded should have no realistic or real-world need to convert back to legacy encodings, particularly because almost all legacy encodings cannot handle a large number of characters that can be represented via Unicode. With that said, I think that nuking the no-op escape sequences is the right approach. There's no meaningful reason to leave behind a "trail" of sorts that something that is not really text data was in the data stream.

@hsivonen
Member Author

@EarlgreyPicard

I am only one of the Thunderbird users in Japan. Please allow me to comment here.

In Japan, many users still send e-mail with ISO-2022-JP encoding.
Many Japanese Windows users usually use Shift_JIS instead of Unicode when dealing with Japanese texts. UTF-8 or UTF-16 is used only when it is necessary to handle characters not included in Shift_JIS.
However, sending 8-bit Shift_JIS text directly via e-mail caused problems.
For that reason, we selected ISO-2022-JP encoding that can express characters with 7 bits when sending Japanese e-mails.
This old method is still in use today.

Thunderbird was recently updated automatically to version 60, and since then there have been reports of U+FFFD being inserted into the displayed subject of some e-mails.
For example, it was posted to MozillaZine.jp, and a bug report was posted to bugzilla.mozilla.org.
However, this bug report became RESOLVED because Firefox(and Thunderbird) conforms to the spec. That is why I came here.

Although there is no concrete report, I suspect that the e-mail software producing output that does not conform to the spec is an old version of Outlook.
It is easy to say that e-mail software not compliant with the spec is bad, but such software has no shortage of users. I think that we should not ignore this fact.

Regular users are not interested in the meaning of U+FFFD. They will simply decide that it is a bug in Thunderbird. And they will downgrade Thunderbird to version 52 or choose Microsoft products.

I hope that discussion will be conducted from the user's point of view.

@kenlunde

@EarlgreyPicard In other words, you agree with my suggestion to simply nuke the no-op escape sequences.

@EarlgreyPicard

I do not know what "simply nuke the no-op escape sequences" means, but I hope that Thunderbird will not insert U+FFFD into the result of decoding the following e-mail subject.

Subject: =?ISO-2022-JP?B?IBskQiVXJW0bKEIbJEIlMCVpJWAbKEI=?=

Good:
プログラム

Bad:
プロ�グラム
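
For reference, the Base64 payload of that subject can be unpacked with Python's standard library to show where the no-op escape pair sits. This is only an inspection sketch; it does not decode the ISO-2022-JP bytes to text.

```python
import base64

# Base64 payload of the encoded-word from the Subject header above.
payload = "IBskQiVXJW0bKEIbJEIlMCVpJWAbKEI="
raw = base64.b64decode(payload)
print(raw)

# The return to ASCII is immediately followed by a switch to JIS X 0208.
noop = b"\x1b(B\x1b$B"
print(raw.find(noop))
```

The pair sits between the bytes for プロ and the bytes for グラム, which is where U+FFFD appears in the "Bad" rendering.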

@kenlunde

@EarlgreyPicard Sorry. To clarify, I meant 1) to remove no-op escape sequences; and 2) to not emit U+FFFD in their presence.

@EarlgreyPicard

@kenlunde I understand. Thank you. I agree.

@EarlgreyPicard

@hsivonen Is there progress on the mailing list?
Japanese Thunderbird users are still waiting for the problem to be solved.

@hsivonen
Member Author

Is there progress on the mailing list?

No. I pinged the mailing list again.

@hsivonen
Member Author

hsivonen commented Dec 11, 2018

To address Mark Davis' request for formal feedback I started writing something. While doing so, I came up with this:


Generate U+FFFD if:

  • A state transition was made such that the previous state had no content and the previous state was not the ASCII state. (I.e. stop generating U+FFFD if the zero-length state is the ASCII state.)
  • A state transition to the ASCII state was preceded by the Roman state and the next byte was not 0x5C, 0x7E or the end of the stream.
  • A state transition to the Roman state was made and the next byte was not 0x5C, 0x7E, 0x1B or the end of the stream. (0x1B is on this list to avoid a case where both this rule and the first rule would apply at the same time resulting in two U+FFFDs.)

This would actually uphold the security properties that UTR 36 tries to uphold but fails to and this would avoid the unwanted U+FFFD generation reported in the email cases.
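
A rough sketch of how the three rules above could be evaluated, assuming a literal reading of each bullet. The names `split_segments` and `count_replacements` are made up for illustration. The sketch only counts where U+FFFD would be generated and does not decode anything.

```python
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "roman",
    b"\x1b(I": "katakana",
    b"\x1b$@": "jis0208",
    b"\x1b$B": "jis0208",
}

def split_segments(data: bytes):
    """Split data into (state, payload) pairs; decoding starts in ASCII."""
    segments = [["ascii", bytearray()]]
    i = 0
    while i < len(data):
        state = ESCAPES.get(data[i:i + 3])
        if state is not None:
            segments.append([state, bytearray()])
            i += 3
        else:
            segments[-1][1].append(data[i])
            i += 1
    return [(state, bytes(payload)) for state, payload in segments]

def count_replacements(data: bytes) -> int:
    """Count the U+FFFDs the three proposed rules would generate."""
    segments = split_segments(data)
    count = 0
    for index in range(1, len(segments)):
        prev_state, prev_payload = segments[index - 1]
        state, payload = segments[index]
        # Byte right after this escape: a payload byte, the ESC of the
        # next escape sequence, or None at the end of the stream.
        if payload:
            next_byte = payload[0]
        elif index + 1 < len(segments):
            next_byte = 0x1B
        else:
            next_byte = None
        # Rule 1: the state being left was empty and was not ASCII.
        if not prev_payload and prev_state != "ascii":
            count += 1
        # Rule 2: Roman to ASCII without a byte that needs ASCII.
        if (state == "ascii" and prev_state == "roman"
                and next_byte not in (0x5C, 0x7E, None)):
            count += 1
        # Rule 3: switch to Roman without a byte that needs Roman.
        if state == "roman" and next_byte not in (0x5C, 0x7E, 0x1B, None):
            count += 1
    return count

email_subject = b" \x1b$B%W%m\x1b(B\x1b$B%0%i%`\x1b(B"
print(count_replacements(email_subject))
print(count_replacements(b"<\x1b(Js\x1b(Bcript"))
```

Under this reading the email pattern (an empty ASCII segment between two JIS X 0208 runs) is not flagged, while the needless Roman round trip in the second input is.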

The key question is if imposing the requirement that ASCII to Roman and vice versa transitions can only happen when logically necessary and then at the last possible moment is feasible given the behavior of encoders out there.

Does anyone want to volunteer to research this by checking the behavior of existing encoders or by searching archives of old Japanese email for the relevant byte patterns?

@hsivonen
Member Author

I'm thinking of submitting a suggestion to go for non-committal middle ground for now. Thoughts?

@hsivonen
Member Author

hsivonen commented Aug 6, 2020

No replies in one and a half years. @annevk, does the document linked to from the previous sentence look OK to you for submission to the Unicode Technical Committee?

(This issue was reported as a Thunderbird bug again.)

@hsivonen
Member Author

hsivonen commented Aug 6, 2020

Based on off-issue comments, I revised the draft so that it concludes by going for "End state 1" right away instead of staying non-committal.

@hsivonen
Member Author

hsivonen commented Aug 6, 2020

Oops. That edit was internally inconsistent. Tried again.

@annevk
Member

annevk commented Aug 6, 2020

Looks okay. I wonder if @ricea and @cdumez could figure out if Chromium and WebKit would align with any potential resulting Encoding standard changes, but I don't think that should block raising this with Unicode as them giving better security advice is a win either way.

@ricea
Collaborator

ricea commented Aug 6, 2020

I have no objection, although we'd probably consider it low priority and wait for a fix to filter in from ICU.

@jungshik has expressed stronger views about ISO-2022-JP implementation in the past and may have an opinion.

@EarlgreyPicard

(This issue was reported as a Thunderbird bug again.)

What bug is that? Please tell me the bug number.

@hsivonen
Member Author

hsivonen commented Aug 6, 2020

@annevk, @ricea Thanks. I've requested submission of the document to the Unicode Document Register.

@EarlgreyPicard Bug 1652388

@j--m

j--m commented Aug 30, 2021

Did no one consider always generating a mode change before the first character, and then never generating a mode change at the end?

@hsivonen
Member Author

Did no one consider always generating a mode change before the first character, and then never generating a mode change at the end?

That would be quite a departure from the existing practice of encoders.

@j--m

j--m commented Sep 1, 2021

Did no one consider always generating a mode change before the first character, and then never generating a mode change at the end?

That would be quite a departure from the existing practice of encoders.

Yet seemingly much more sound, and thus more useful. I guess I'm not clear whether there are any actual compatibility problems with that, whether there's a process for evaluating that, and whether something that was once broken is required to be broken forever.

@vyv03354
Collaborator

vyv03354 commented Sep 1, 2021

It will break the property that if the text contains only ASCII characters, the output is compatible with US-ASCII encoding.
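
A small sketch of that property, using Python's `iso2022_jp` codec as a stand-in for current encoder practice. `encode_with_leading_escape` is a hypothetical variant written only to illustrate the suggestion above, and it handles ASCII-only text for brevity.

```python
ESC_TO_ASCII = b"\x1b(B"

def encode_with_leading_escape(text: str) -> bytes:
    """Hypothetical encoder variant for ASCII-only text: always select the
    mode before the first character, never emit a mode change at the end."""
    return ESC_TO_ASCII + text.encode("ascii")

current = "hello".encode("iso2022_jp")          # existing practice
proposed = encode_with_leading_escape("hello")  # suggested practice

print(current == b"hello")    # byte-identical to US-ASCII
print(proposed == b"hello")   # no longer identical: starts with ESC ( B
```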

@j--m

j--m commented Sep 4, 2021

It will break the property that if the text contains only ASCII characters, the output is compatible with US-ASCII encoding.

Section 2 of the spec already defines ISO-2022-JP and UTF-16 as "ASCII-incompatible encodings".

@vyv03354
Collaborator

vyv03354 commented Sep 4, 2021

I'm not talking about the spec definition. Please do not ignore "if the text contains only ASCII characters".

Always generating a mode change will cause problems like the ones the UTF-8 BOM caused. I strongly disagree with that as an ISO-2022-JP user.


7 participants