unicodedata: is_normalized claims nothing is normalized in any form when using the 3.2.0 database #101372

Open
zahlman opened this issue Jan 27, 2023 · 1 comment
Labels
expert-unicode type-bug An unexpected behavior, bug, or error

Comments

zahlman commented Jan 27, 2023

Bug report

Python 3.8 added the is_normalized function to the unicodedata module; it is also available as a method on the legacy unicodedata.ucd_3_2_0 database object. It is supposed to check whether a string is equal to its normalization in a given form, without having to normalize and compare.

However, the legacy version does not maintain the expected invariant. In fact, it reports that every single-character string is not normalized, regardless of the normalization form chosen. Presumably, the result is the same for every non-empty string. (The empty string appears to work only because it is special-cased at lines 871-874.)

Example:

>>> import unicodedata
>>> unicodedata.ucd_3_2_0.normalize('NFC', '!') == '!'
True
>>> unicodedata.ucd_3_2_0.is_normalized('NFC', '!')
False
>>> any(unicodedata.ucd_3_2_0.is_normalized(form, chr(x)) for form in ('NFC', 'NFD', 'NFKC', 'NFKD') for x in range(0x110000))
False
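
The invariant described above can also be checked generically. The following is a minimal sketch, not code from CPython: `invariant_violations` and `FORMS` are hypothetical names introduced here, and the helper accepts any object that exposes `normalize` and `is_normalized` (the `unicodedata` module itself or `unicodedata.ucd_3_2_0`).

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def invariant_violations(db, strings):
    """Return (form, string) pairs where is_normalized disagrees with normalize.

    `db` is anything exposing normalize() and is_normalized(), such as the
    unicodedata module or unicodedata.ucd_3_2_0. Hypothetical helper.
    """
    return [
        (form, s)
        for form in FORMS
        for s in strings
        if db.is_normalized(form, s) != (db.normalize(form, s) == s)
    ]

# A few small samples: empty, ASCII, precomposed and decomposed "e with acute".
samples = ["", "!", "abc", "\u00e9", "e\u0301"]
print(invariant_violations(unicodedata, samples))
print(invariant_violations(unicodedata.ucd_3_2_0, samples))
```

On an interpreter affected by this issue, the second call would be expected to list violations such as `('NFC', '!')`; on a build where the legacy database behaves correctly, it should print an empty list.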

The bug appears to be at lines 801-804 of unicodedata.c:

    /* UCD 3.2.0 is requested, quickchecks must be disabled. */
    if (UCD_Check(self)) {
        return NO;
    }

I believe the NO should be MAYBE instead. NO appears to indicate that the quickcheck has determined that the string is not normalized, which contradicts both the comment and the expected behaviour. Returning MAYBE would presumably make the caller fall back to a full normalize-and-compare check.
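
To illustrate why the choice between NO and MAYBE matters, here is a toy Python model of the three-valued result. It is a sketch under stated assumptions, not the C implementation: `QuickCheck`, `quickcheck`, `is_normalized_model`, and the `legacy_db` and `proposed_fix` parameters are all hypothetical, the real per-character scan is omitted, and it assumes the caller answers MAYBE by normalizing and comparing.

```python
import unicodedata
from enum import Enum

class QuickCheck(Enum):
    YES = "yes"      # definitely normalized
    NO = "no"        # definitely not normalized
    MAYBE = "maybe"  # inconclusive; a full check is needed

def quickcheck(legacy_db, proposed_fix):
    """Toy stand-in for the early return only; no character data is examined."""
    if legacy_db:
        # Reported behaviour returns NO; the proposed change returns MAYBE.
        return QuickCheck.MAYBE if proposed_fix else QuickCheck.NO
    return QuickCheck.MAYBE  # simplification: always inconclusive

def is_normalized_model(db, form, s, *, legacy_db, proposed_fix):
    """Assumed caller logic: trust YES/NO, fall back to a full check on MAYBE."""
    if not s:
        return True  # empty string is special-cased
    result = quickcheck(legacy_db, proposed_fix)
    if result is QuickCheck.YES:
        return True
    if result is QuickCheck.NO:
        return False
    return db.normalize(form, s) == s

db = unicodedata.ucd_3_2_0
print(is_normalized_model(db, "NFC", "!", legacy_db=True, proposed_fix=False))  # False
print(is_normalized_model(db, "NFC", "!", legacy_db=True, proposed_fix=True))   # True
```

In this model, NO ends the check immediately with a negative answer for every non-empty string, which matches the reported symptom, while MAYBE defers to the slower but correct comparison.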

Your environment

$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

zahlman added the type-bug label Jan 27, 2023
AlexWaygood changed the title from "is_normalized claims nothing is normalized in any form when using the 3.2.0 database" to "unicodedata: is_normalized claims nothing is normalized in any form when using the 3.2.0 database" Jan 27, 2023
corona10 added a commit to corona10/cpython that referenced this issue Jan 28, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Feb 6, 2023
… UCD 3… (pythongh-101388)

(cherry picked from commit 9ef7e75)

Co-authored-by: Dong-hee Na <donghee.na@python.org>
miss-islington added a commit that referenced this issue Feb 6, 2023
gh-101388)

(cherry picked from commit 9ef7e75)

Co-authored-by: Dong-hee Na <donghee.na@python.org>

hauntsaninja commented Feb 25, 2023

Thanks for the report and the fix!

Serhiy mentioned in #101388 (comment) that he wanted to write tests, so I'm leaving this issue open.
