Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-113274: fix EUC-JP decoding of FULLWIDTH TILDE #113275

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

qsantos
Copy link

@qsantos qsantos commented Dec 19, 2023

This PR closes #113274. This is done by changing the UCS-2 codepoint 126 (~, TILDE) to 65374 (~, FULLWIDTH TILDE) in the __jisx0212_decmap reference table. Since no character is added or removed, no other changes are needed.

Bug report

Bug description:

Python decodes the bytes 0x8FA2A7 as ~ (TILDE) in EUC-JP.

assert b'\x8f\xa2\xb7'.decode('euc_jp') == '~'

This reference document is ambiguous in that it shows a simple ~ (TILDE), but most other software (iconv, Vim, Firefox, Rust's encoding_rs) interpret this as ~ (FULLWIDTH TILDE). Note that EUC-JP already includes US-ASCII, and so:

assert '~'.encode('euc-jp') == b'~'

CPython versions tested on:

3.11, CPython main branch

Operating systems tested on:

Linux

Copy link

cpython-cla-bot bot commented Dec 19, 2023

All commit authors signed the Contributor License Agreement.
CLA signed

@@ -591,7 +591,7 @@ __jisx0208_decmap+6950,33,38},{0,0,0},{0,0,0},{0,0,0},{0,0,0},{0,0,0},{0,0,0},
};

static const ucs2_t __jisx0212_decmap[6179] = {
728,711,184,729,733,175,731,730,126,900,901,U,U,U,U,U,U,U,U,161,166,191,U,U,U,
728,711,184,729,733,175,731,730,65374,900,901,U,U,U,U,U,U,U,U,161,166,191,U,U,U,
Copy link
Member

@corona10 corona10 Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// AUTO-GENERATED FILE FROM genmap_japanese.py: DO NOT EDIT

Please take a look at the comment on the first line.
I 've not taken a look at the issue deeply yet, but if you want to change the mapping file, you should modify the generator.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally missed that. I will do that.

Copy link
Author

@qsantos qsantos Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't figure out how Vim and Firefox proceed (although Vim does fallback to iconv for non-internally-supported encodings). Since I do not expect to get Unicode to update the document of an obsolete standard that they only used to adapt, I see two options:

  1. add a hack in genmap_japanese.py around line 90 after the call to loadmap(jisx0212file) to reassign jisx0212decmap[34][55] = ord('~') (it does fix __jisx0212_decmap as expected and changes nothing else), and comment that properly
  2. switch to WhatWG as the source of truth

1 can be done pretty quickly and with minimal risk of breaking anything. 2 has the advantage that it might help discover other issues in mappings, but is obviously much more involved.

If 1 seems reasonable to you, I will update this PR accordingly. If 2, I will close it and document this in the ticket for later.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated this MR with the hack in genmap_japanese.py and regenerated mappings_jp.h.

Copy link
Member

@corona10 corona10 Dec 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a look at which will be the better, If we decide to solve this issue, option 2 will be the better way.
But I need to check the current status related to JIS0212 and side effect.
Also need to consider stability of whatwg spec.

@bedevere-app
Copy link

bedevere-app bot commented Dec 19, 2023

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@qsantos
Copy link
Author

qsantos commented Dec 24, 2023

I have made the requested changes; please review again.

@corona10
Copy link
Member

@qsantos I will take a look the PR by next week. Enjoy your Christmas.

@corona10 corona10 self-assigned this Dec 24, 2023
@corona10
Copy link
Member

corona10 commented Dec 24, 2023

Okay, I like this PR, it will guarantee the round trip

a = b'\x8f\xa2\xb7'
assert a.decode('euc_jp').encode('euc_jp') == a

But as I said, I will take a look at this PR by next week.

@qsantos
Copy link
Author

qsantos commented Dec 24, 2023

@qsantos I will take a look the PR by next week. Enjoy your Christmas.

Thanks, I was not sure if I had properly triggered the bot. Enjoy your Christmas as well!

Copy link
Member

@corona10 corona10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, here is my temporal conclusion for this change.
(If we decide to adopt the change!)

  1. Let's use uncode.org data as same as now, whatwg is great but not stable. We are not a browser, I am not sure that we can follow every change without side effect.
  2. Let's vendor the modified JIS0212.TXT into https://github.com/python/cpython/tree/main/Tools/unicode/python-mappings and store the diff file to https://github.com/python/cpython/tree/main/Tools/unicode/python-mappings/diff for tracking changes.

But I need to listen to the opinion of @ezio-melotti, who is a more experienced unicode expert than me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Python decodes EUC-JP 8FA2A7 as TILDE instead of FULLWIDTH TILDE
3 participants