bpo-44987: Speed up unicode normalization of ASCII strings #28283

Merged
merged 3 commits into from Sep 11, 2021

corona10 commented Sep 11, 2021

>>> from timeit import Timer
>>> setup="from unicodedata import normalize; s = 'reverse'"
>>> t1 = Timer('normalize("NFKC", s)', setup=setup)
>>> setup="from unicodedata import normalize; s = 'reverse'*1000"
>>> t2 = Timer('normalize("NFKC", s)', setup=setup)
>>> min(t1.repeat(repeat=7))
0.038022067994461395
>>> min(t2.repeat(repeat=7))
0.038196470006369054
0:00:00 load avg: 18.63 Run tests sequentially
0:00:00 load avg: 18.63 [1/1] test_unicodedata
beginning 6 repetitions
123456
......
test_unicodedata passed in 1 min 3 sec

== Tests result: SUCCESS ==

1 test OK.
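
The two timings above are nearly identical even though the second string is 1000 times longer, which is what a constant-time path for ASCII input would look like. A minimal Python sketch of that idea follows. It is not the C change from this PR: `normalize_with_ascii_fast_path` and `_FORMS` are hypothetical names introduced only for illustration, and the sketch assumes that `str.isascii()` in CPython reads a flag stored on the string object rather than scanning it.

```python
import unicodedata

# Hypothetical helper for illustration only; the real change lives in C.
_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


def normalize_with_ascii_fast_path(form, s):
    """Return s normalized to form, short-circuiting pure ASCII input."""
    if form not in _FORMS:
        raise ValueError("invalid normalization form")
    # ASCII characters have no decompositions and no combining marks,
    # so a pure ASCII string is already normalized in every form.
    if s.isascii():
        return s
    # Anything else goes through the regular normalization code.
    return unicodedata.normalize(form, s)


s = "reverse" * 1000
# The ASCII input comes back as the very same object, with no copy made.
assert normalize_with_ascii_fast_path("NFKC", s) is s
# Non-ASCII input still gets normalized (U+FB01 is the "fi" ligature).
assert normalize_with_ascii_fast_path("NFKC", "\ufb01") == "fi"
```

Returning the input object itself is a design choice of this sketch; it works because Python strings are immutable.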

@corona10 corona10 requested review from vstinner and removed request for vstinner Sep 11, 2021
@corona10 corona10 changed the title bpo-44987: Speed up unicode normalization of ASCII strings [WIP] bpo-44987: Speed up unicode normalization of ASCII strings Sep 11, 2021
@corona10 corona10 changed the title [WIP] bpo-44987: Speed up unicode normalization of ASCII strings bpo-44987: Speed up unicode normalization of ASCII strings Sep 11, 2021
@corona10 corona10 requested a review from vstinner Sep 11, 2021
unicodedata
-----------
* If the given string is pure ASCII string, :func:`unicode.normalize` now
handles this as an already normalized to process it in constant time.
(Contributed by Dong-hee Na in :issue:`bpo-44987`.)

@serhiy-storchaka serhiy-storchaka Sep 11, 2021

It is not worth a new subsection in the "Improved Modules" section. An entry in the "Optimizations" section should be enough. And make it shorter. E.g. "Pure ASCII strings are now normalized in constant time."

@corona10 corona10 Sep 11, 2021

Thanks :) Nice suggestion
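
The shortened wording relies on pure ASCII text being unchanged by every normalization form. A quick way to check that premise from the interpreter is sketched below; `FORMS`, `ascii_chars`, and `changed` are names chosen here for illustration and do not come from the PR.

```python
from unicodedata import normalize

FORMS = ("NFC", "NFD", "NFKC", "NFKD")
ascii_chars = [chr(cp) for cp in range(128)]

# Collect every (form, character) pair where normalization alters the input.
changed = [
    (form, ch)
    for form in FORMS
    for ch in ascii_chars
    if normalize(form, ch) != ch
]

# No ASCII character is affected, alone or as part of a longer string.
assert changed == []
all_ascii = "".join(ascii_chars)
assert all(normalize(form, all_ascii) == all_ascii for form in FORMS)
```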

@@ -0,0 +1,3 @@
If the given string is pure ASCII string, :func:`unicode.normalize` now

@serhiy-storchaka serhiy-storchaka Sep 11, 2021

unicodedata

@corona10 corona10 requested a review from serhiy-storchaka Sep 11, 2021
@serhiy-storchaka serhiy-storchaka merged commit 9abd07e into python:main Sep 11, 2021
12 checks passed
@corona10 corona10 deleted the bpo-44987 branch Sep 11, 2021

vstinner commented Sep 13, 2021

Nice optimization, thanks @corona10!

By the way, I'm not sure why ".pdbrc is now read with utf-8 encoding." is mentioned in the Optimization section.

vstinner commented Sep 22, 2021

> By the way, I'm not sure why ".pdbrc is now read with utf-8 encoding." is mentioned in the Optimization section.

I created PR #28518 for that.
