bpo-44987: Speed up unicode normalization of ASCII strings #28283

corona10 · 2021-09-11T05:37:03Z

https://bugs.python.org/issue44987

corona10 · 2021-09-11T05:41:57Z

>>> from timeit import Timer
>>> setup="from unicodedata import normalize; s = 'reverse'"
>>> t1 = Timer('normalize("NFKC", s)', setup=setup)
>>> setup="from unicodedata import normalize; s = 'reverse'*1000"
>>> t2 = Timer('normalize("NFKC", s)', setup=setup)
>>> min(t1.repeat(repeat=7))
0.038022067994461395
>>> min(t2.repeat(repeat=7))
0.038196470006369054
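
For comparison, a self-contained version of the measurement above might look like the following. This is only a sketch: the helper `best_time`, the non-ASCII sample string, and the iteration counts are illustrative choices, not part of this PR. With the change applied, the two ASCII timings would be expected to be roughly equal regardless of length, while the non-ASCII input should still scale with the string size.

```python
from timeit import Timer
import unicodedata


def best_time(text, number=1000, repeat=5):
    """Return the best wall-clock time for `number` NFKC normalizations of `text`.

    Hypothetical helper for illustration; the counts are arbitrary.
    """
    timer = Timer(lambda: unicodedata.normalize("NFKC", text))
    return min(timer.repeat(repeat=repeat, number=number))


short_ascii = "reverse"             # 7 characters, pure ASCII
long_ascii = "reverse" * 1000       # 7000 characters, pure ASCII
long_non_ascii = "révérse" * 1000   # 7000 characters, not pure ASCII (assumed sample)

for label, text in [
    ("short ASCII", short_ascii),
    ("long ASCII", long_ascii),
    ("long non-ASCII", long_non_ascii),
]:
    print(f"{label:>15}: {best_time(text):.6f} s")
```

Absolute numbers depend on the machine and the Python build, so only the relative difference between the rows is meaningful.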

0:00:00 load avg: 18.63 Run tests sequentially
0:00:00 load avg: 18.63 [1/1] test_unicodedata
beginning 6 repetitions
123456
......
test_unicodedata passed in 1 min 3 sec

== Tests result: SUCCESS ==

1 test OK.

serhiy-storchaka · 2021-09-11T08:23:40Z

Doc/whatsnew/3.11.rst

+unicodedata
+-----------
+* If the given string is pure ASCII string, :func:`unicode.normalize` now
+  handles this as an already normalized to process it in constant time.
+  (Contributed by Dong-hee Na in :issue:`bpo-44987`.)


It is not worth a new subsection in the "Improved Modules" section. An entry in the "Optimizations" section should be enough. And make it shorter. E.g. "Pure ASCII strings are now normalized in constant time."

Thanks :) Nice suggestion
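
The idea behind the suggested wording can be sketched in pure Python. This is an illustration only, not the C code in this PR: `normalize_with_ascii_fast_path` is a hypothetical wrapper, and it assumes that `str.isascii()` is a constant-time flag check in CPython. Every ASCII code point is unchanged under NFC, NFD, NFKC, and NFKD, so an ASCII-only string can be returned as-is.

```python
import unicodedata


def normalize_with_ascii_fast_path(form, text):
    """Normalize `text`, skipping the full algorithm for pure ASCII input.

    Hypothetical wrapper for illustration. Note that this sketch does not
    validate `form` when the input is ASCII.
    """
    # ASCII characters have no decompositions or combining marks, so the
    # string is already in every normalization form.
    if text.isascii():
        return text
    # Anything else goes through the regular normalization routine.
    return unicodedata.normalize(form, text)


print(normalize_with_ascii_fast_path("NFKC", "reverse"))    # ASCII: returned unchanged
print(normalize_with_ascii_fast_path("NFKC", "\ufb01"))     # U+FB01 ligature: decomposed
print(normalize_with_ascii_fast_path("NFC", "e\u0301"))     # e + combining acute: composed
```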

serhiy-storchaka · 2021-09-11T08:23:40Z

Misc/NEWS.d/next/Library/2021-09-11-14-41-02.bpo-44987.Mt8DiX.rst

@@ -0,0 +1,3 @@
+If the given string is pure ASCII string, :func:`unicode.normalize` now


unicodedata

vstinner · 2021-09-13T09:35:33Z

Nice optimization, thanks @corona10!

By the way, I'm not sure why ".pdbrc is now read with utf-8 encoding." is mentioned in the Optimization section.

vstinner · 2021-09-22T13:39:42Z

> By the way, I'm not sure why ".pdbrc is now read with utf-8 encoding." is mentioned in the Optimization section.

I created PR #28518 for that.

the-knights-who-say-ni added the CLA signed label Sep 11, 2021

bedevere-bot added the awaiting core review label Sep 11, 2021

corona10 force-pushed the bpo-44987 branch from 67091d5 to 9608517 Compare Sep 11, 2021

corona10 requested review from vstinner and removed request for vstinner Sep 11, 2021

corona10 changed the title ~~bpo-44987: Speed up unicode normalization of ASCII strings~~ [WIP] bpo-44987: Speed up unicode normalization of ASCII strings Sep 11, 2021

bpo-44987: Speed up unicode normalization of ASCII strings

ddf7106

corona10 force-pushed the bpo-44987 branch from 9608517 to ddf7106 Compare Sep 11, 2021

corona10 changed the title ~~[WIP] bpo-44987: Speed up unicode normalization of ASCII strings~~ bpo-44987: Speed up unicode normalization of ASCII strings Sep 11, 2021

corona10 requested a review from vstinner Sep 11, 2021

bpo-44987: Update whatsnews

c53c3a9

serhiy-storchaka reviewed Sep 11, 2021

bpo-44987: Address code review

8ff6b3c

corona10 requested a review from serhiy-storchaka Sep 11, 2021

serhiy-storchaka approved these changes Sep 11, 2021

bedevere-bot added awaiting merge and removed awaiting core review labels Sep 11, 2021

serhiy-storchaka merged commit 9abd07e into python:main Sep 11, 2021
12 checks passed

bedevere-bot removed the awaiting merge label Sep 11, 2021

corona10 deleted the bpo-44987 branch Sep 11, 2021
