Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-84508: tool to generate cjk traditional chinese mappings #93272

Merged
merged 7 commits into from Jun 11, 2022

Conversation

sorcio
Copy link
Contributor

@sorcio sorcio commented May 26, 2022

This adds a script to generate the mapping files for Traditional Chinese Big-5-based codecs, as discussed in the issue.

I initially planned to add support for later versions of HKSCS, but I decided to keep this minimal so to close gh-84508. The topic of refreshing the mappings is split to its own issue in gh-93271.

So this generates mappings_tw.h and mappings_hk.h files identical to the existing versions (with only one new line difference).

Notes about the mapping files:

  • I did not include the BIG5.TXT and CP950.TXT files that are available on the Unicode website. It looks like these are available for redistribution, so I can add them to the PR if needed.
  • I also did not include the hkscs-2004-big5-iso.txt file, but that's a different story. The terms of use from the source website include a clause that unilaterally binds to any update to the terms (clause 2), which I believe is incompatible with redistribution as part of CPython.

Copy link
Member

@corona10 corona10 left a comment

Super cool, thanks for your contribution.
This issue was painful to me because we can not re-generate mapping files without this tool.

I left a small suggestion for this.

Tools/unicode/genmap_tchinese.py Show resolved Hide resolved
Tools/unicode/genmap_tchinese.py Outdated Show resolved Hide resolved
Tools/unicode/genmap_tchinese.py Show resolved Hide resolved
@bedevere-bot
Copy link

@bedevere-bot bedevere-bot commented May 28, 2022

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Copy link
Member

@corona10 corona10 left a comment

I did not include the BIG5.TXT and CP950.TXT files that are available on the Unicode website. It looks like these are available for redistribution, so I can add them to the PR if needed.

Please addig it.

sorcio and others added 2 commits May 28, 2022
@sorcio
Copy link
Contributor Author

@sorcio sorcio commented May 28, 2022

I have made the requested changes; please review again

@bedevere-bot
Copy link

@bedevere-bot bedevere-bot commented May 28, 2022

Thanks for making the requested changes!

@corona10: please review the changes made to this pull request.

@bedevere-bot bedevere-bot requested a review from corona10 May 28, 2022
Copy link
Member

@ezio-melotti ezio-melotti left a comment

The two files are somewhat big -- @corona10 do you think it's ok to include them in the repo?
If they are added the size of the repo will increase, making downloading/cloning it slower. I'm not sure if it might affect other things too.

#
# genmap_tchinese.py: Traditional Chinese Codecs Map Generator
#
# Original Author: Hye-Shik Chang <perky@FreeBSD.org>
Copy link
Member

@ezio-melotti ezio-melotti May 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW @hyeshik doesn't seem to be active on this repo, but he signed the CLA in 2005 (https://bugs.python.org/user1298), so it should be ok to include it here.

If this script is based on his script, you might want to clarify that in the comment.

Copy link
Contributor Author

@sorcio sorcio May 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezio-melotti do you have a suggestion on how to amend the comment to clarify that? The mention of "Original Author" is what it currently looks like in the other genmap_* scripts, so I didn't take the liberty to sway from that style.

The original source (as for the other equivalent scripts) is https://github.com/BackupTheBerlios/cjkpython/blob/master/cjkcodecs/tools/genmap_tchinese.py which was originally under 2-clause BSD. I might have made a wrong assumption that it was ok to include in CPython, since the other scripts followed the same process?

Copy link
Contributor Author

@sorcio sorcio May 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(In case it matters, the script in this PR is heavily modified, so while it probably still counts as derivative, it would be easy to do a full rewrite)

@corona10
Copy link
Member

@corona10 corona10 commented May 28, 2022

@ezio-melotti
Yes, it's worth adding it. There are two reasons for this or we can separate the repo in the future as the submodule.

  1. Mapping files have existed already in the CPython repo.
    https://github.com/python/cpython/tree/main/Tools/unicode/python-mappings

  2. And I noticed that mapping files were not maintained well even if they should contain the diff file like https://github.com/python/cpython/tree/main/Tools/unicode/python-mappings/diff
    I am focusing on maintaining a well reproducible environment even if the external environment is changed.
    No one was interested in regenerating the mapping header files until I was concerned about the situation. This is an issue that can recur over time. Freezing the environment is important to prevent the same situation.

Copy link
Member

@corona10 corona10 left a comment

Please regenerate Modules/cjkcodecs/mappings_hk.h and Modules/cjkcodecs/mappings_tw.h through the new script.

@bedevere-bot
Copy link

@bedevere-bot bedevere-bot commented May 28, 2022

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Copy link
Member

@corona10 corona10 left a comment

LGTM, but I will send a mail to @hyeshik to discuss this issue.
And also waiting for @ezio-melotti opinion about storing mapping files.
(I prefer to store at the own mapping repo under python org as a submodule, but let's do it at a separate issue.)

@ezio-melotti
Copy link
Member

@ezio-melotti ezio-melotti commented May 28, 2022

The files that were there before were just a few kB (about 40kB in 9d5c071). GH-19602 introduced more files (~1.3MB, 2 years ago). GH-93309 recently added another ~1.4MB. The files included in this PR add yet another ~800kB.

The CPython source (without the Git history) is ~27MB zipped and ~92MB once extracted. With the Git history it's about 585MB (quite a bit more than I thought).

Compared to the rest, an extra 800kB is not particularly significant so it's probably ok to include them.

If you are planning to include more files, maybe it would be better to think about a different approach, depending on the goal (e.g. host copies somewhere on python.org and add a script to download them from there). Note that since the other PRs have already been merged, even if we (re)move the files from the repo we might be able to reduce the repo size by ~3MB but they will be included in the history forever.

Copy link
Member

@corona10 corona10 left a comment

The files that were there before were just a few kB (about 40kB in [9d5c071]>>(https://github.com/python/cpython/tree/9d5c0710609320e51631750d1cf60c90cc618172/Tools/unicode/python-mappings)). #19602 introduced more files (~1.3MB, 2 years ago). #93309 recently added another ~1.4MB. The files included in this PR add yet another ~800kB.

Okay, I never thought that it will be a burden.

@sorcio
Can you please remove Tools/unicode/python-mappings/BIG5.TXT and Tools/unicode/python-mappings/CP950.TXT as a middle ground? (Sorry for duplicated works)

@ezio-melotti
I will revert #93309 too.
But can you open the discussion about storing mapping files that can be managed by CPython developers?
I think that you can understand what I was concerned about and what I intended to do.
It will cover all files under Tools/unicode/python-mappings/.

(e.g. host copies somewhere on python.org and add a script to download them from there).

That looks like a good suggestion as much as maintaining a submodule repository through GitHub.

Thank you for understanding.

@bedevere-bot
Copy link

@bedevere-bot bedevere-bot commented May 28, 2022

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@sorcio
Copy link
Contributor Author

@sorcio sorcio commented May 29, 2022

I have made the requested changes; please review again

Sorry for duplicated works

No worries! It's good that we clarified that.

@bedevere-bot
Copy link

@bedevere-bot bedevere-bot commented May 29, 2022

Thanks for making the requested changes!

@corona10: please review the changes made to this pull request.

@bedevere-bot bedevere-bot requested a review from corona10 May 29, 2022
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Copy link
Member

@ezio-melotti ezio-melotti left a comment

The patch looks good to me. The script could be improved a bit (e.g. more docstring), but I'm assuming those parts were already in the original script and you didn't touch them.

I was also looking at e.g. open_mapping_file (defined in genmap_support) and noticed that it could be improved to report a clearer message. This however is out of the scope of this PR. @corona10, are you planning to do more work related to this? Do you think it would be worth to refactor these scripts?

I also noticed that the PR doesn't include any tests. For the tool itself is probably ok, however the resulting mapping should be tested. Are there already tests for that? It might also be useful to add some tests for the characters that diverge from one codec variant to the other, in order to avoid regressions.

@sorcio
Copy link
Contributor Author

@sorcio sorcio commented May 30, 2022

It might also be useful to add some tests for the characters that diverge from one codec variant to the other, in order to avoid regressions.

That's a valid point. Keeping in mind that the codecs already have tests, and this PR doesn't change the behavior of the codecs in any way. But the tests rely on test data such as https://github.com/python/pythontestdotnet/blob/master/www/unicode/BIG5HKSCS-2004.TXT and the sample text at https://github.com/python/cpython/tree/main/Lib/test/cjkencodings.

I found no way to generate the test data files and I believe the sample text is hand-picked. In particular, looking at 5a15508:

  • the BIG5.TXT and CP950.TXT files originally were downloaded from unicode.org (sorry I missed this before, there is already a Python-hosted copy of this data!) and they are used as-is in the tests, so they should be okay
  • the BIG5HKSCS-2004.TXT file, on the other hand, used to refer to http://people.freebsd.org/~perky/i18n/BIG5HKSCS-2004.TXT as a source; I don't believe @hyeshik is still maintaining that repository, and it's not documented how the file was built
  • the sample texts similarly have no documented source, and are probably hand-crafted to cover some tricky cases, and require domain knowledge to maintain

Is it within the scope of this issue/PR to add a script to rebuild a file like BIG5HKSCS-2004.TXT?

Copy link
Member

@corona10 corona10 left a comment

I was also looking at e.g. open_mapping_file (defined in genmap_support) and noticed that it could be improved to report a clearer message. This however is out of the scope of this PR. @corona10, are you planning to do more work related to this? Do you think it would be worth to refactor these scripts?

Please go ahead.

lgtm

@corona10 corona10 assigned ezio-melotti and unassigned corona10 May 30, 2022
@corona10
Copy link
Member

@corona10 corona10 commented May 31, 2022

@ezio-melotti I got a mail from @hyeshik, He is going to review this PR soon. :)

@corona10 corona10 self-assigned this May 31, 2022
@hyeshik
Copy link
Contributor

@hyeshik hyeshik commented Jun 11, 2022

Thank you for your efforts in resolving this old problem, @sorcio, @ezio-melotti, and @corona10. All the changes look good to me. Also, I would like to put all my work included in the CJKCodecs releases into the public domain to allow more flexible adoptions in any derivative work. Please feel free to change the copyright conditions as needed.

@corona10 corona10 merged commit 733e15f into python:main Jun 11, 2022
12 checks passed
@corona10
Copy link
Member

@corona10 corona10 commented Jun 11, 2022

Thank you @hyeshik!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants