gh-113274: fix EUC-JP decoding of FULLWIDTH TILDE #113275

qsantos · 2023-12-19T07:37:49Z

This PR closes #113274. This is done by changing the UCS-2 codepoint 126 (~, TILDE) to 65374 (～, FULLWIDTH TILDE) in the __jisx0212_decmap reference table. Since no character is added or removed, no other changes are needed.

Bug report

Bug description:

Python decodes the bytes 0x8FA2B7 as ~ (TILDE) in EUC-JP.

assert b'\x8f\xa2\xb7'.decode('euc_jp') == '~'

This reference document is ambiguous in that it shows a simple ~ (TILDE), but most other software (iconv, Vim, Firefox, Rust's encoding_rs) interprets this as ～ (FULLWIDTH TILDE). Note that EUC-JP already includes US-ASCII, and so:

assert '~'.encode('euc-jp') == b'~'
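For reference, a minimal sketch (not part of the PR) that makes the difference between the two characters explicit. It only uses the standard unicodedata module; the variable names are chosen here for illustration.

```python
import unicodedata

# The two characters under discussion are distinct code points.
for ch in ("~", "\uff5e"):
    print(f"U+{ord(ch):04X} ({ord(ch)}) {unicodedata.name(ch)}")

# Collected as tuples so the values are easy to compare.
tilde_info = (ord("~"), unicodedata.name("~"))
fullwidth_info = (ord("\uff5e"), unicodedata.name("\uff5e"))
```

The decimal values 126 and 65374 are the ones that appear in the __jisx0212_decmap table.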

CPython versions tested on:

3.11, CPython main branch

Operating systems tested on:

Linux

Issue: Python decodes EUC-JP 8FA2A7 as TILDE instead of FULLWIDTH TILDE #113274

cpython-cla-bot · 2023-12-19T07:37:52Z

All commit authors signed the Contributor License Agreement.

corona10 · 2023-12-19T08:03:23Z

Modules/cjkcodecs/mappings_jp.h

@@ -591,7 +591,7 @@ __jisx0208_decmap+6950,33,38},{0,0,0},{0,0,0},{0,0,0},{0,0,0},{0,0,0},{0,0,0},
 };

 static const ucs2_t __jisx0212_decmap[6179] = {
-728,711,184,729,733,175,731,730,126,900,901,U,U,U,U,U,U,U,U,161,166,191,U,U,U,
+728,711,184,729,733,175,731,730,65374,900,901,U,U,U,U,U,U,U,U,161,166,191,U,U,U,


cpython/Modules/cjkcodecs/mappings_jp.h

Line 1 in 171ebb2

// AUTO-GENERATED FILE FROM genmap_japanese.py: DO NOT EDIT

Please take a look at the comment on the first line.
I haven't looked at the issue deeply yet, but if you want to change the mapping file, you should modify the generator.

I totally missed that. I will do that.

Python generates the data using this as its source of truth: https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT.

encoding_rs: also generated, but from https://encoding.spec.whatwg.org/index-jis0212.txt

iconv: seems to be maintained independently, and we can actually see when they replaced TILDE with FULLWIDTH TILDE in their changelog: https://git.savannah.gnu.org/gitweb/?p=libiconv.git;a=blob;f=ChangeLog;h=0e2f5f00bc6aff84932cd92bd09a7c14f802c44d;hb=refs/heads/master#l4729

I couldn't figure out how Vim and Firefox proceed (although Vim does fall back to iconv for encodings it does not support internally). Since I do not expect Unicode to update a document for an obsolete standard that they merely adapted, I see two options:

1. add a hack in genmap_japanese.py around line 90 after the call to loadmap(jisx0212file) to reassign jisx0212decmap[34][55] = ord('～') (it does fix __jisx0212_decmap as expected and changes nothing else), and comment that properly

2. switch to WhatWG as the source of truth

1 can be done pretty quickly and with minimal risk of breaking anything. 2 has the advantage that it might help discover other issues in mappings, but is obviously much more involved.
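The first option can be sketched as follows. This is only an illustration, not the actual code in genmap_japanese.py: `decmap` is a hypothetical nested dict standing in for whatever loadmap(jisx0212file) returns, and both helper names are made up here. It assumes the usual EUC-JP layout, where 0x8F introduces a JIS X 0212 character and each following byte is the JIS X 0212 byte plus 0x80.

```python
FULLWIDTH_TILDE = ord("\uff5e")  # 65374


def euc_jp_to_jisx0212_index(raw: bytes) -> tuple[int, int]:
    """Map a 3-byte EUC-JP sequence to (row, column) in JIS X 0212."""
    if len(raw) != 3 or raw[0] != 0x8F:
        raise ValueError("expected 0x8F followed by two bytes")
    return raw[1] - 0x80, raw[2] - 0x80


def patch_fullwidth_tilde(decmap: dict) -> dict:
    """Override the single entry discussed in this PR, in place."""
    row, col = euc_jp_to_jisx0212_index(b"\x8f\xa2\xb7")
    decmap[row][col] = FULLWIDTH_TILDE
    return decmap


# Hypothetical one-entry table holding the value from JIS0212.TXT.
decmap = {0x22: {0x37: ord("~")}}
patch_fullwidth_tilde(decmap)
print(euc_jp_to_jisx0212_index(b"\x8f\xa2\xb7"), decmap[34][55])
```

The index computed from the bytes is (34, 55), which matches the jisx0212decmap[34][55] assignment proposed above.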

If 1 seems reasonable to you, I will update this PR accordingly. If 2, I will close it and document this in the ticket for later.

I have updated this PR with the hack in genmap_japanese.py and regenerated mappings_jp.h.

I will take a look at which option is better. If we decide to solve this issue, option 2 will be the better way.
But I need to check the current status of JIS0212 and the possible side effects.
We also need to consider the stability of the WhatWG spec.

bedevere-app · 2023-12-19T08:03:32Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

qsantos · 2023-12-24T07:34:12Z

I have made the requested changes; please review again.

corona10 · 2023-12-24T09:15:19Z

@qsantos I will take a look at the PR by next week. Enjoy your Christmas.

corona10 · 2023-12-24T09:26:34Z

Okay, I like this PR, it will guarantee the round trip

a = b'\x8f\xa2\xb7'
assert a.decode('euc_jp').encode('euc_jp') == a

But as I said, I will take a look at this PR by next week.
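A slightly more general form of that check, as a sketch: `round_trips` is a hypothetical helper, not something in CPython's test suite. The result for the 0x8FA2B7 sequence depends on whether the interpreter includes this change, so the sketch prints it instead of asserting it.

```python
def round_trips(raw: bytes, encoding: str = "euc_jp") -> bool:
    """Return True if decoding then re-encoding reproduces the input bytes."""
    return raw.decode(encoding).encode(encoding) == raw


# Plain ASCII and ordinary JIS X 0208 text are expected to round-trip.
samples = [b"abc", "日本語".encode("euc_jp")]
results = [round_trips(sample) for sample in samples]
print(results)

# The sequence from this PR: outcome depends on the Python build in use.
print(round_trips(b"\x8f\xa2\xb7"))
```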

qsantos · 2023-12-24T12:22:57Z

@qsantos I will take a look at the PR by next week. Enjoy your Christmas.

Thanks, I was not sure if I had properly triggered the bot. Enjoy your Christmas as well!

corona10

Okay, here is my tentative conclusion for this change.
(If we decide to adopt the change!)

Let's keep using the unicode.org data, the same as now. WhatWG is great but not stable. We are not a browser, and I am not sure that we can follow every change without side effects.
Let's vendor the modified JIS0212.TXT into https://github.com/python/cpython/tree/main/Tools/unicode/python-mappings and store the diff file in https://github.com/python/cpython/tree/main/Tools/unicode/python-mappings/diff for tracking changes.
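One way such a diff file could be produced, as a rough sketch using the standard difflib module. The tab-separated line layout (JIS X 0212 code, Unicode code point, name comment) is an assumption about JIS0212.TXT, the file labels passed to unified_diff are hypothetical, and only the single line touched by this PR is shown.

```python
import difflib

# Assumed layout of the relevant JIS0212.TXT line (upstream vs. vendored).
original = ["0x2237\t0x007E\t# TILDE\n"]
modified = ["0x2237\t0xFF5E\t# FULLWIDTH TILDE\n"]

diff_lines = list(
    difflib.unified_diff(
        original,
        modified,
        fromfile="JIS0212.TXT (unicode.org)",  # hypothetical label
        tofile="JIS0212.TXT (vendored)",       # hypothetical label
    )
)
print("".join(diff_lines), end="")
```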

But I need to listen to the opinion of @ezio-melotti, who is a more experienced unicode expert than me.

qsantos requested a review from corona10 as a code owner December 19, 2023 07:37

bedevere-app bot added the awaiting review label Dec 19, 2023

bedevere-app bot mentioned this pull request Dec 19, 2023

Python decodes EUC-JP 8FA2A7 as TILDE instead of FULLWIDTH TILDE #113274

Open

corona10 requested changes Dec 19, 2023

bedevere-app bot added awaiting changes and removed awaiting review labels Dec 19, 2023

pythongh-113274: fix EUC-JP decoding of FULLWIDTH TILDE

1631b56

qsantos force-pushed the fix-issue-113274 branch from 171ebb2 to 1631b56 Compare December 21, 2023 16:19

Pass FULLWIDTH TILDE in euc_jisx0213

87d6103

corona10 self-assigned this Dec 24, 2023

corona10 requested changes Dec 31, 2023

corona10 assigned ezio-melotti Dec 31, 2023

corona10 added the topic-unicode label Dec 31, 2023
