New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-84508: tool to generate cjk traditional chinese mappings #93272
Conversation
Super cool, thanks for your contribution.
This issue was painful to me because we can not re-generate mapping files without this tool.
I left a small suggestion for this.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
I did not include the BIG5.TXT and CP950.TXT files that are available on the Unicode website. It looks like these are available for redistribution, so I can add them to the PR if needed.
Please addig it.
Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>
I have made the requested changes; please review again |
Thanks for making the requested changes! @corona10: please review the changes made to this pull request. |
The two files are somewhat big -- @corona10 do you think it's ok to include them in the repo?
If they are added the size of the repo will increase, making downloading/cloning it slower. I'm not sure if it might affect other things too.
# | ||
# genmap_tchinese.py: Traditional Chinese Codecs Map Generator | ||
# | ||
# Original Author: Hye-Shik Chang <perky@FreeBSD.org> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW @hyeshik doesn't seem to be active on this repo, but he signed the CLA in 2005 (https://bugs.python.org/user1298), so it should be ok to include it here.
If this script is based on his script, you might want to clarify that in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ezio-melotti do you have a suggestion on how to amend the comment to clarify that? The mention of "Original Author" is what it currently looks like in the other genmap_* scripts, so I didn't take the liberty to sway from that style.
The original source (as for the other equivalent scripts) is https://github.com/BackupTheBerlios/cjkpython/blob/master/cjkcodecs/tools/genmap_tchinese.py which was originally under 2-clause BSD. I might have made a wrong assumption that it was ok to include in CPython, since the other scripts followed the same process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(In case it matters, the script in this PR is heavily modified, so while it probably still counts as derivative, it would be easy to do a full rewrite)
@ezio-melotti
|
Please regenerate Modules/cjkcodecs/mappings_hk.h and Modules/cjkcodecs/mappings_tw.h through the new script.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
LGTM, but I will send a mail to @hyeshik to discuss this issue.
And also waiting for @ezio-melotti opinion about storing mapping files.
(I prefer to store at the own mapping repo under python org as a submodule, but let's do it at a separate issue.)
The files that were there before were just a few kB (about 40kB in 9d5c071). GH-19602 introduced more files (~1.3MB, 2 years ago). GH-93309 recently added another ~1.4MB. The files included in this PR add yet another ~800kB. The CPython source (without the Git history) is ~27MB zipped and ~92MB once extracted. With the Git history it's about 585MB (quite a bit more than I thought). Compared to the rest, an extra 800kB is not particularly significant so it's probably ok to include them. If you are planning to include more files, maybe it would be better to think about a different approach, depending on the goal (e.g. host copies somewhere on python.org and add a script to download them from there). Note that since the other PRs have already been merged, even if we (re)move the files from the repo we might be able to reduce the repo size by ~3MB but they will be included in the history forever. |
The files that were there before were just a few kB (about 40kB in [9d5c071]>>(https://github.com/python/cpython/tree/9d5c0710609320e51631750d1cf60c90cc618172/Tools/unicode/python-mappings)). #19602 introduced more files (~1.3MB, 2 years ago). #93309 recently added another ~1.4MB. The files included in this PR add yet another ~800kB.
Okay, I never thought that it will be a burden.
@sorcio
Can you please remove Tools/unicode/python-mappings/BIG5.TXT and Tools/unicode/python-mappings/CP950.TXT as a middle ground? (Sorry for duplicated works)
@ezio-melotti
I will revert #93309 too.
But can you open the discussion about storing mapping files that can be managed by CPython developers?
I think that you can understand what I was concerned about and what I intended to do.
It will cover all files under Tools/unicode/python-mappings/.
(e.g. host copies somewhere on python.org and add a script to download them from there).
That looks like a good suggestion as much as maintaining a submodule repository through GitHub.
Thank you for understanding.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
This reverts commit bec0692.
I have made the requested changes; please review again
No worries! It's good that we clarified that. |
Thanks for making the requested changes! @corona10: please review the changes made to this pull request. |
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
The patch looks good to me. The script could be improved a bit (e.g. more docstring), but I'm assuming those parts were already in the original script and you didn't touch them.
I was also looking at e.g. open_mapping_file
(defined in genmap_support
) and noticed that it could be improved to report a clearer message. This however is out of the scope of this PR. @corona10, are you planning to do more work related to this? Do you think it would be worth to refactor these scripts?
I also noticed that the PR doesn't include any tests. For the tool itself is probably ok, however the resulting mapping should be tested. Are there already tests for that? It might also be useful to add some tests for the characters that diverge from one codec variant to the other, in order to avoid regressions.
That's a valid point. Keeping in mind that the codecs already have tests, and this PR doesn't change the behavior of the codecs in any way. But the tests rely on test data such as https://github.com/python/pythontestdotnet/blob/master/www/unicode/BIG5HKSCS-2004.TXT and the sample text at https://github.com/python/cpython/tree/main/Lib/test/cjkencodings. I found no way to generate the test data files and I believe the sample text is hand-picked. In particular, looking at 5a15508:
Is it within the scope of this issue/PR to add a script to rebuild a file like BIG5HKSCS-2004.TXT? |
I was also looking at e.g. open_mapping_file (defined in genmap_support) and noticed that it could be improved to report a clearer message. This however is out of the scope of this PR. @corona10, are you planning to do more work related to this? Do you think it would be worth to refactor these scripts?
Please go ahead.
lgtm
@ezio-melotti I got a mail from @hyeshik, He is going to review this PR soon. :) |
Thank you for your efforts in resolving this old problem, @sorcio, @ezio-melotti, and @corona10. All the changes look good to me. Also, I would like to put all my work included in the CJKCodecs releases into the public domain to allow more flexible adoptions in any derivative work. Please feel free to change the copyright conditions as needed. |
Thank you @hyeshik! |
This adds a script to generate the mapping files for Traditional Chinese Big-5-based codecs, as discussed in the issue.
I initially planned to add support for later versions of HKSCS, but I decided to keep this minimal so to close gh-84508. The topic of refreshing the mappings is split to its own issue in gh-93271.
So this generates
mappings_tw.h
andmappings_hk.h
files identical to the existing versions (with only one new line difference).Notes about the mapping files: