unicodedata: is_normalized claims nothing is normalized in any form when using the 3.2.0 database #101372

zahlman · 2023-01-27T13:39:55Z

Bug report

Python 3.8 adds the is_normalized function to the unicodedata module; it is also available as a method on the legacy unicodedata.ucd_3_2_0 database object. It is supposed to report whether a string is equal to its normalization in a given form, without having to normalize and compare.
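The expected invariant can be written down as a small check. This is only a sketch: holds_invariant, FORMS and samples are names made up for illustration, and the check runs against the current database (the unicodedata module itself) rather than ucd_3_2_0.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def holds_invariant(db, form, text):
    """Return True if db.is_normalized agrees with normalize-and-compare.

    `db` is anything exposing normalize() and is_normalized(), such as the
    unicodedata module or the unicodedata.ucd_3_2_0 object.
    """
    return db.is_normalized(form, text) == (db.normalize(form, text) == text)


# Illustrative samples: empty string, ASCII, decomposed and precomposed
# e-acute, and a compatibility ligature.
samples = ["", "!", "e\u0301", "\u00e9", "\ufb01"]
print(all(holds_invariant(unicodedata, form, s) for form in FORMS for s in samples))
```

Passing unicodedata.ucd_3_2_0 as db runs the same check against the legacy database.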

However, the legacy version does not maintain the expected invariant. In fact, it reports every single-character string as not normalized, regardless of the normalization form chosen. Presumably, the result is the same for every non-empty string. (It appears that the empty string works because it is special-cased at lines 871-874.)

Example:

>>> import unicodedata
>>> unicodedata.ucd_3_2_0.normalize('NFC', '!') == '!'
True
>>> unicodedata.ucd_3_2_0.is_normalized('NFC', '!')
False
>>> any(unicodedata.ucd_3_2_0.is_normalized(form, chr(x)) for form in ('NFC', 'NFD', 'NFKC', 'NFKD') for x in range(0x110000))
False

The bug appears to be at lines 801-804 of unicodedata.c:

    /* UCD 3.2.0 is requested, quickchecks must be disabled. */
    if (UCD_Check(self)) {
        return NO;
    }

I believe the NO should be MAYBE instead. The NO value appears to indicate that the quickcheck has determined that the string is not normalized, which contradicts both the comment and the expected behaviour.
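To illustrate why the returned value matters, here is a rough Python model of how a tri-state quickcheck result appears to be consumed. It assumes that YES and NO are treated as final answers and that only MAYBE falls back to normalizing and comparing. QuickCheck and is_normalized_model are hypothetical names for this sketch, not part of unicodedata or its C source.

```python
import enum
import unicodedata


class QuickCheck(enum.Enum):
    """Hypothetical stand-in for the tri-state quickcheck result."""
    YES = "yes"      # definitely normalized
    MAYBE = "maybe"  # undecided: a full normalization is needed
    NO = "no"        # definitely not normalized


def is_normalized_model(form, text, quickcheck_result):
    """Model of is_normalized given a quickcheck result (an assumption)."""
    if not text:
        # The empty string is special-cased before any quickcheck.
        return True
    if quickcheck_result is QuickCheck.YES:
        return True
    if quickcheck_result is QuickCheck.NO:
        return False
    # MAYBE: fall back to the slow path, normalize and compare.
    return unicodedata.normalize(form, text) == text


# With the quickcheck "disabled" by returning NO, every non-empty string
# is reported as not normalized:
print(is_normalized_model("NFC", "!", QuickCheck.NO))
# Returning MAYBE instead defers to the full normalization:
print(is_normalized_model("NFC", "!", QuickCheck.MAYBE))
```

Under this model, returning MAYBE is the only way to disable the quickcheck without also fixing the answer.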

Your environment

$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

Linked PRs


hauntsaninja · 2023-02-25T23:41:15Z

Thanks for the report and the fix!

Serhiy mentioned in #101388 (comment) that he wanted to write tests, so I'm leaving this issue open.

zahlman added the type-bug An unexpected behavior, bug, or error label Jan 27, 2023

AlexWaygood changed the title ~~is_normalized claims nothing is normalized in any form when using the 3.2.0 database~~ unicodedata: is_normalized claims nothing is normalized in any form when using the 3.2.0 database Jan 27, 2023

AlexWaygood added the expert-unicode label Jan 27, 2023

corona10 added a commit to corona10/cpython that referenced this issue Jan 28, 2023

pythongh-101372: Fix unicodedata.is_normalized to properly handle the UCD 3.2.0 (3cc6f06)

bedevere-bot mentioned this issue Jan 28, 2023

gh-101372: Fix unicodedata.is_normalized to properly handle the UCD 3… #101388

Merged

corona10 added a commit that referenced this issue Feb 6, 2023

gh-101372: Fix unicodedata.is_normalized to properly handle the UCD 3… (gh-101388) (9ef7e75)

This was referenced Feb 6, 2023

[3.11] gh-101372: Fix unicodedata.is_normalized to properly handle the UCD 3… (gh-101388) #101597

Merged

[3.10] gh-101372: Fix unicodedata.is_normalized to properly handle the UCD 3… (gh-101388) #101598

Merged

miss-islington added a commit that referenced this issue Feb 6, 2023

gh-101372: Fix unicodedata.is_normalized to properly handle the UCD 3… (gh-101388) (9bd000c) (cherry picked from commit 9ef7e75) Co-authored-by: Dong-hee Na <donghee.na@python.org>

miss-islington added a commit that referenced this issue Feb 6, 2023

gh-101372: Fix unicodedata.is_normalized to properly handle the UCD 3… (gh-101388) (3325029) (cherry picked from commit 9ef7e75) Co-authored-by: Dong-hee Na <donghee.na@python.org>
