gh-84508: tool to generate cjk traditional chinese mappings #93272

sorcio · 2022-05-26T20:37:28Z

This adds a script to generate the mapping files for Traditional Chinese Big-5-based codecs, as discussed in the issue.

I initially planned to add support for later versions of HKSCS, but I decided to keep this minimal so to close gh-84508. The topic of refreshing the mappings is split to its own issue in gh-93271.

So this generates mappings_tw.h and mappings_hk.h files identical to the existing versions (with only one new line difference).

Notes about the mapping files:

I did not include the BIG5.TXT and CP950.TXT files that are available on the Unicode website. It looks like these are available for redistribution, so I can add them to the PR if needed.
I also did not include the hkscs-2004-big5-iso.txt file, but that's a different story. The terms of use from the source website include a clause that unilaterally binds to any update to the terms (clause 2), which I believe is incompatible with redistribution as part of CPython.

corona10

Super cool, thanks for your contribution.
This issue was painful to me because we can not re-generate mapping files without this tool.

I left a small suggestion for this.

Tools/unicode/genmap_tchinese.py

bedevere-bot · 2022-05-28T02:55:28Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

corona10

I did not include the BIG5.TXT and CP950.TXT files that are available on the Unicode website. It looks like these are available for redistribution, so I can add them to the PR if needed.

Please addig it.

Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>

sorcio · 2022-05-28T09:46:00Z

I have made the requested changes; please review again

bedevere-bot · 2022-05-28T09:46:03Z

Thanks for making the requested changes!

@corona10: please review the changes made to this pull request.

ezio-melotti

The two files are somewhat big -- @corona10 do you think it's ok to include them in the repo?
If they are added the size of the repo will increase, making downloading/cloning it slower. I'm not sure if it might affect other things too.

ezio-melotti · 2022-05-28T10:12:43Z

Tools/unicode/genmap_tchinese.py

+#
+# genmap_tchinese.py: Traditional Chinese Codecs Map Generator
+#
+# Original Author:  Hye-Shik Chang <perky@FreeBSD.org>


FWIW @hyeshik doesn't seem to be active on this repo, but he signed the CLA in 2005 (https://bugs.python.org/user1298), so it should be ok to include it here.

If this script is based on his script, you might want to clarify that in the comment.

@ezio-melotti do you have a suggestion on how to amend the comment to clarify that? The mention of "Original Author" is what it currently looks like in the other genmap_* scripts, so I didn't take the liberty to sway from that style.

The original source (as for the other equivalent scripts) is https://github.com/BackupTheBerlios/cjkpython/blob/master/cjkcodecs/tools/genmap_tchinese.py which was originally under 2-clause BSD. I might have made a wrong assumption that it was ok to include in CPython, since the other scripts followed the same process?

(In case it matters, the script in this PR is heavily modified, so while it probably still counts as derivative, it would be easy to do a full rewrite)

corona10 · 2022-05-28T14:43:26Z

@ezio-melotti
Yes, it's worth adding it. There are two reasons for this or we can separate the repo in the future as the submodule.

Mapping files have existed already in the CPython repo.
https://github.com/python/cpython/tree/main/Tools/unicode/python-mappings
And I noticed that mapping files were not maintained well even if they should contain the diff file like https://github.com/python/cpython/tree/main/Tools/unicode/python-mappings/diff
I am focusing on maintaining a well reproducible environment even if the external environment is changed.
No one was interested in regenerating the mapping header files until I was concerned about the situation. This is an issue that can recur over time. Freezing the environment is important to prevent the same situation.

corona10

Please regenerate Modules/cjkcodecs/mappings_hk.h and Modules/cjkcodecs/mappings_tw.h through the new script.

bedevere-bot · 2022-05-28T14:46:21Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

corona10

LGTM, but I will send a mail to @hyeshik to discuss this issue.
And also waiting for @ezio-melotti opinion about storing mapping files.
(I prefer to store at the own mapping repo under python org as a submodule, but let's do it at a separate issue.)

ezio-melotti · 2022-05-28T17:05:44Z

The files that were there before were just a few kB (about 40kB in 9d5c071). GH-19602 introduced more files (~1.3MB, 2 years ago). GH-93309 recently added another ~1.4MB. The files included in this PR add yet another ~800kB.

The CPython source (without the Git history) is ~27MB zipped and ~92MB once extracted. With the Git history it's about 585MB (quite a bit more than I thought).

Compared to the rest, an extra 800kB is not particularly significant so it's probably ok to include them.

If you are planning to include more files, maybe it would be better to think about a different approach, depending on the goal (e.g. host copies somewhere on python.org and add a script to download them from there). Note that since the other PRs have already been merged, even if we (re)move the files from the repo we might be able to reduce the repo size by ~3MB but they will be included in the history forever.

corona10

The files that were there before were just a few kB (about 40kB in [9d5c071]>>(https://github.com/python/cpython/tree/9d5c0710609320e51631750d1cf60c90cc618172/Tools/unicode/python-mappings)). #19602 introduced more files (~1.3MB, 2 years ago). #93309 recently added another ~1.4MB. The files included in this PR add yet another ~800kB.

Okay, I never thought that it will be a burden.

@sorcio
Can you please remove Tools/unicode/python-mappings/BIG5.TXT and Tools/unicode/python-mappings/CP950.TXT as a middle ground? (Sorry for duplicated works)

@ezio-melotti
I will revert #93309 too.
But can you open the discussion about storing mapping files that can be managed by CPython developers?
I think that you can understand what I was concerned about and what I intended to do.
It will cover all files under Tools/unicode/python-mappings/.

(e.g. host copies somewhere on python.org and add a script to download them from there).

That looks like a good suggestion as much as maintaining a submodule repository through GitHub.

Thank you for understanding.

bedevere-bot · 2022-05-28T17:56:00Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

This reverts commit bec0692.

sorcio · 2022-05-29T10:07:49Z

I have made the requested changes; please review again

Sorry for duplicated works

No worries! It's good that we clarified that.

bedevere-bot · 2022-05-29T10:07:53Z

Thanks for making the requested changes!

@corona10: please review the changes made to this pull request.

Tools/unicode/genmap_tchinese.py

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

ezio-melotti

The patch looks good to me. The script could be improved a bit (e.g. more docstring), but I'm assuming those parts were already in the original script and you didn't touch them.

I was also looking at e.g. open_mapping_file (defined in genmap_support) and noticed that it could be improved to report a clearer message. This however is out of the scope of this PR. @corona10, are you planning to do more work related to this? Do you think it would be worth to refactor these scripts?

I also noticed that the PR doesn't include any tests. For the tool itself is probably ok, however the resulting mapping should be tested. Are there already tests for that? It might also be useful to add some tests for the characters that diverge from one codec variant to the other, in order to avoid regressions.

sorcio · 2022-05-30T13:11:22Z

It might also be useful to add some tests for the characters that diverge from one codec variant to the other, in order to avoid regressions.

That's a valid point. Keeping in mind that the codecs already have tests, and this PR doesn't change the behavior of the codecs in any way. But the tests rely on test data such as https://github.com/python/pythontestdotnet/blob/master/www/unicode/BIG5HKSCS-2004.TXT and the sample text at https://github.com/python/cpython/tree/main/Lib/test/cjkencodings.

I found no way to generate the test data files and I believe the sample text is hand-picked. In particular, looking at 5a15508:

the BIG5.TXT and CP950.TXT files originally were downloaded from unicode.org (sorry I missed this before, there is already a Python-hosted copy of this data!) and they are used as-is in the tests, so they should be okay
the BIG5HKSCS-2004.TXT file, on the other hand, used to refer to http://people.freebsd.org/~perky/i18n/BIG5HKSCS-2004.TXT as a source; I don't believe @hyeshik is still maintaining that repository, and it's not documented how the file was built
the sample texts similarly have no documented source, and are probably hand-crafted to cover some tricky cases, and require domain knowledge to maintain

Is it within the scope of this issue/PR to add a script to rebuild a file like BIG5HKSCS-2004.TXT?

corona10

I was also looking at e.g. open_mapping_file (defined in genmap_support) and noticed that it could be improved to report a clearer message. This however is out of the scope of this PR. @corona10, are you planning to do more work related to this? Do you think it would be worth to refactor these scripts?

Please go ahead.

lgtm

corona10 · 2022-05-31T08:52:26Z

@ezio-melotti I got a mail from @hyeshik, He is going to review this PR soon. :)

hyeshik · 2022-06-11T06:01:16Z

Thank you for your efforts in resolving this old problem, @sorcio, @ezio-melotti, and @corona10. All the changes look good to me. Also, I would like to put all my work included in the CJKCodecs releases into the public domain to allow more flexible adoptions in any derivative work. Please feel free to change the copyright conditions as needed.

corona10 · 2022-06-11T14:20:28Z

Thank you @hyeshik!

sorcio added 2 commits May 26, 2022

pythongh-84508: tool to generate cjk traditional chinese mappings

1fb94f7

strip trailing whitespace

0eae907

bedevere-bot added the awaiting review label May 26, 2022

AA-Turner requested a review from corona10 May 26, 2022

AA-Turner added the expert-unicode label May 26, 2022

corona10 self-assigned this May 27, 2022

corona10 added the skip news label May 27, 2022

corona10 requested changes May 28, 2022

View changes

Tools/unicode/genmap_tchinese.py Show resolved Hide resolved

Tools/unicode/genmap_tchinese.py Outdated Show resolved Hide resolved

Tools/unicode/genmap_tchinese.py Show resolved Hide resolved

bedevere-bot removed the awaiting review label May 28, 2022

bedevere-bot added the awaiting changes label May 28, 2022

corona10 reviewed May 28, 2022

View changes

sorcio and others added 2 commits May 28, 2022

Apply suggestions from code review

9a008bd

Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>

add unicode mapping files for big5 and cp950

bec0692

bedevere-bot added awaiting change review and removed awaiting changes labels May 28, 2022

bedevere-bot requested a review from corona10 May 28, 2022

ezio-melotti reviewed May 28, 2022

View changes

corona10 requested changes May 28, 2022

View changes

bedevere-bot added awaiting changes and removed awaiting change review labels May 28, 2022

generate mappings_tw.h and mappings_hk.h

39d77d9

corona10 approved these changes May 28, 2022

View changes

bedevere-bot added awaiting merge and removed awaiting changes labels May 28, 2022

corona10 requested changes May 28, 2022

View changes

bedevere-bot removed the awaiting merge label May 28, 2022

bedevere-bot added the awaiting changes label May 28, 2022

corona10 mentioned this pull request May 28, 2022

Revert "gh-84508: Add mapping files for Korean and Japanese." #93320

Merged

Revert "add unicode mapping files for big5 and cp950"

5d0ec20

This reverts commit bec0692.

bedevere-bot added awaiting change review and removed awaiting changes labels May 29, 2022

bedevere-bot requested a review from corona10 May 29, 2022

ezio-melotti reviewed May 30, 2022

View changes

Tools/unicode/genmap_tchinese.py Outdated Show resolved Hide resolved

Use https url for CP950.TXT

d831eac

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

ezio-melotti approved these changes May 30, 2022

View changes

bedevere-bot added awaiting merge and removed awaiting change review labels May 30, 2022

corona10 approved these changes May 30, 2022

View changes

corona10 assigned ezio-melotti and unassigned corona10 May 30, 2022

corona10 self-assigned this May 31, 2022

corona10 merged commit 733e15f into python:main Jun 11, 2022
12 checks passed

bedevere-bot removed the awaiting merge label Jun 11, 2022

gh-84508: tool to generate cjk traditional chinese mappings #93272

gh-84508: tool to generate cjk traditional chinese mappings #93272

sorcio commented May 26, 2022

corona10 left a comment

bedevere-bot commented May 28, 2022

corona10 left a comment

sorcio commented May 28, 2022

bedevere-bot commented May 28, 2022

ezio-melotti left a comment

ezio-melotti May 28, 2022

sorcio May 28, 2022

sorcio May 28, 2022

corona10 commented May 28, 2022 •

edited

corona10 left a comment •

edited

bedevere-bot commented May 28, 2022

corona10 left a comment •

edited

ezio-melotti commented May 28, 2022

corona10 left a comment •

edited

bedevere-bot commented May 28, 2022

sorcio commented May 29, 2022

bedevere-bot commented May 29, 2022

ezio-melotti left a comment

sorcio commented May 30, 2022

corona10 left a comment

corona10 commented May 31, 2022

hyeshik commented Jun 11, 2022

corona10 commented Jun 11, 2022

gh-84508: tool to generate cjk traditional chinese mappings #93272

gh-84508: tool to generate cjk traditional chinese mappings #93272

Conversation

sorcio commented May 26, 2022

corona10 left a comment

bedevere-bot commented May 28, 2022

corona10 left a comment

sorcio commented May 28, 2022

bedevere-bot commented May 28, 2022

ezio-melotti left a comment

ezio-melotti May 28, 2022

Choose a reason for hiding this comment

sorcio May 28, 2022

Choose a reason for hiding this comment

sorcio May 28, 2022

Choose a reason for hiding this comment

corona10 commented May 28, 2022 • edited

corona10 left a comment • edited

bedevere-bot commented May 28, 2022

corona10 left a comment • edited

ezio-melotti commented May 28, 2022

corona10 left a comment • edited

bedevere-bot commented May 28, 2022

sorcio commented May 29, 2022

bedevere-bot commented May 29, 2022

ezio-melotti left a comment

sorcio commented May 30, 2022

corona10 left a comment

corona10 commented May 31, 2022

hyeshik commented Jun 11, 2022

corona10 commented Jun 11, 2022

corona10 commented May 28, 2022 •

edited

corona10 left a comment •

edited

corona10 left a comment •

edited

corona10 left a comment •

edited