gh-93033: Use wmemchr in find_char and replace_1char_inplace #93034

goldsteinn · 2022-05-20T23:29:07Z

This was brought up a bit in #69009 but the larger issue is mostly
different.

Performance is generally comparable in the "good" case, where memchr
doesn't return any collisions (false matches on the low byte), and
clearly faster when there are collisions.
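To make the "collision" case concrete, here is a minimal Python sketch. It is not the stringlib code; it only models how many candidates a byte-wise scan for the needle's low byte would have to re-check with a full code point comparison. The names `low_byte_candidates`, `false_matches` and `haystack` are hypothetical, and the haystack mirrors the "Long" benchmark strings used in this PR.

```python
def low_byte_candidates(text: str, needle: str) -> int:
    """Count characters whose low byte equals the needle's low byte."""
    low = ord(needle) & 0xFF
    return sum(1 for ch in text if (ord(ch) & 0xFF) == low)


def false_matches(text: str, needle: str) -> int:
    """Candidates that match on the low byte but are not the needle itself."""
    return low_byte_candidates(text, needle) - text.count(needle)


# Same shape as the "Long" benchmarks: 16 code points repeated 200 times,
# followed by one trailing character.
haystack = "".join(chr(0x10200 + i) for i in range(16)) * 200 + "\U00018200"

print(false_matches(haystack, "\U00018210"))  # 0: low byte 0x10 never occurs
print(false_matches(haystack, "\U00018208"))  # 200: U+10208 shares low byte 0x08
```

A search that compares whole 4-byte units, as wmemchr does when sizeof(wchar_t) == 4, never sees these candidates, which would be consistent with the gap in the "High collision" timings.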

Some notes on correctness:

Whether wchar_t is signed or unsigned shouldn't matter here, BUT
wmemchr (along with just about all the other wide-char string
functions) can, and often does (on x86_64, for example), assume that
the input is aligned to sizeof(wchar_t). If that is not guaranteed for
Py_UCS{2|4}, then this patch is broken.

Also, I think the way I implemented #define STRINGLIB_FAST_MEMCHR for
ucs{2|4}lib breaks strict aliasing. If this is an issue but the patch
is otherwise fine, any suggestions for how to fix it?

Test results:

$> ./python -m test -j4
...
== Tests result: SUCCESS ==

406 tests OK.

30 tests skipped:
    test_bz2 test_curses test_dbm_gnu test_dbm_ndbm test_devpoll
    test_idle test_ioctl test_kqueue test_launcher test_msilib
    test_nis test_ossaudiodev test_readline test_smtpnet
    test_socketserver test_sqlite3 test_startfile test_tcl test_tix
    test_tk test_ttk_guionly test_ttk_textonly test_turtle
    test_urllib2net test_urllibnet test_winconsoleio test_winreg
    test_winsound test_xmlrpc_net test_zipfile64

Benchmarked on:
model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

sizeof(wchar_t) == 4

GLIBC 2.35
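The wchar_t width matters because, per the linked issue title, wmemchr is only used when sizeof(STRINGLIB_CHAR) == sizeof(wchar_t). A rough way to check which string representation that would cover on a given machine is sketched below. `STRING_KIND_WIDTHS` and `kinds_matching_wchar` are hypothetical helper names; only `ctypes.sizeof` and `ctypes.c_wchar` come from the standard library.

```python
import ctypes

# Width in bytes of the platform's wchar_t, as reported by ctypes.
WCHAR_SIZE = ctypes.sizeof(ctypes.c_wchar)

# Character widths of CPython's internal string representations.
STRING_KIND_WIDTHS = {"Py_UCS1": 1, "Py_UCS2": 2, "Py_UCS4": 4}


def kinds_matching_wchar(wchar_size: int = WCHAR_SIZE) -> list[str]:
    """Return the string kinds whose character width equals wchar_size."""
    return [name for name, width in STRING_KIND_WIDTHS.items() if width == wchar_size]


print(WCHAR_SIZE, kinds_matching_wchar())
```

On the benchmark machine above this should report 4 and the Py_UCS4 kind. On a platform with a 2-byte wchar_t, such as Windows, it would presumably point at the Py_UCS2 path instead.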

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018210")' ## Long, No match, No collision
No wmemchr  : 1000 loops, best of 100: 127 nsec per loop
With wmemchr: 1000 loops, best of 100: 123 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018208")' ## Long, No match, High collision
No wmemchr  : 1000 loops, best of 100: 1.29 usec per loop
With wmemchr: 1000 loops, best of 100: 123 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018210"' -- 's.find("\U00018210")' ## Long, match, No collision
No wmemchr  : 1000 loops, best of 100: 136 nsec per loop
With wmemchr: 1000 loops, best of 100: 130 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018208"' -- 's.find("\U00018208")' ## Long, match, High collision
No wmemchr  : 1000 loops, best of 100: 1.35 usec per loop
With wmemchr: 1000 loops, best of 100: 131 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018210")' ## Short, No match, No collision
No wmemchr  : 1000 loops, best of 100: 50.2 nsec per loop
With wmemchr: 1000 loops, best of 100: 52.9 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018208")' ## Short, No match, High collision
No wmemchr  : 1000 loops, best of 100: 69.1 nsec per loop
With wmemchr: 1000 loops, best of 100: 53.7 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018210"' -- 's.find("\U00018210")' ## Short, match, No collision
No wmemchr  : 1000 loops, best of 100: 53.6 nsec per loop
With wmemchr: 1000 loops, best of 100: 53.6 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018208"' -- 's.find("\U00018208")' ## Short, match, High collision
No wmemchr  : 1000 loops, best of 100: 69 nsec per loop
With wmemchr: 1000 loops, best of 100: 50.9 nsec per loop
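For anyone who wants to rerun the eight cases above without copying each command line, the following is a sketch of an equivalent driver using the `timeit` module. `BASE`, `CASES` and `run_case` are hypothetical names. The lambda adds call overhead that the command-line form does not have, so absolute numbers will not match the figures above; only the relative difference between builds is meaningful.

```python
import timeit

# The 16 code points U+10200..U+1020F used as the repeated block.
BASE = "".join(chr(0x10200 + i) for i in range(16))

# (label, repetitions of BASE, trailing character, needle)
CASES = [
    ("Long, no match, no collision", 200, "\U00018200", "\U00018210"),
    ("Long, no match, high collision", 200, "\U00018200", "\U00018208"),
    ("Long, match, no collision", 200, "\U00018210", "\U00018210"),
    ("Long, match, high collision", 200, "\U00018208", "\U00018208"),
    ("Short, no match, no collision", 3, "\U00018200", "\U00018210"),
    ("Short, no match, high collision", 3, "\U00018200", "\U00018208"),
    ("Short, match, no collision", 3, "\U00018210", "\U00018210"),
    ("Short, match, high collision", 3, "\U00018208", "\U00018208"),
]


def run_case(reps, tail, needle, number=1000, repeat=100):
    """Return (index found by str.find, best time per call in nanoseconds)."""
    s = BASE * reps + tail
    best = min(timeit.repeat(lambda: s.find(needle), number=number, repeat=repeat))
    return s.find(needle), best / number * 1e9


for label, reps, tail, needle in CASES:
    index, nsec = run_case(reps, tail, needle)
    print(f"{label}: index={index}, {nsec:.1f} nsec per loop")
```

Run it once with a build that has the patch and once with a build that does not, then compare the per-case numbers.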

cpython-cla-bot · 2022-05-20T23:29:09Z

All commit authors signed the Contributor License Agreement.

bedevere-bot · 2022-05-20T23:29:10Z

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.


bedevere-bot · 2022-05-21T20:18:35Z

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.


serhiy-storchaka · 2022-05-22T17:38:08Z

Could you please repeat microbenchmarks from #69009?

goldsteinn · 2022-05-22T17:59:27Z

> Could you please repeat microbenchmarks from #69009?

On my machine, at least, none of those reach ucs4lib_find, so they are unaffected by the patch. That's in fact why I used the benchmarks in the commit message.
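One plausible reason a benchmark would never reach ucs4lib_find is that its strings are not stored with 4-byte characters. Under PEP 393, CPython picks the narrowest representation that fits the highest code point in the string. The sketch below models that rule; `storage_width` is a hypothetical helper, and I have not checked which representation the #69009 benchmark strings actually use.

```python
def storage_width(text: str) -> int:
    """Bytes per character CPython would use for text under PEP 393."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1  # Latin-1 range, 1-byte representation
    if highest <= 0xFFFF:
        return 2  # Basic Multilingual Plane, 2-byte representation
    return 4  # anything beyond the BMP, 4-byte representation


print(storage_width("abc"))           # 1
print(storage_width("\u0420\u0444"))  # 2
print(storage_width("\U00010200"))    # 4
```

With sizeof(wchar_t) == 4, only strings in the last category would be candidates for the wmemchr path, which is why the benchmarks in the PR description use code points above U+FFFF.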

serhiy-storchaka · 2022-05-22T17:59:55Z

Misc/NEWS.d/next/Library/2022-05-21-20-25-04.gh-issue-93033.QuMGyh.rst

@@ -0,0 +1 @@
+Use wmemchr in stringlib where applicable when the size of STRINGLIB_CHAR equals the size of wchar_t. All the places wmemchr is added where places memchr was previously used when STRINGLIB_CHAR equaled size of char.


It can go in the commit message, but the common Python user does not have any idea what stringlib and STRINGLIB_CHAR are. Please rewrite the NEWS entry in Python terms. What is the effect of the change for the Python programmer, and how large is it?

goldsteinn · 2022-05-22

How does this sound?
"Wide-character string operations, including 'find' with a needle of length one and 'replace' with a needle and replacement of length one, may be sped up."

Edit: remove bad MD

Edit2: Fix typo.
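In Python terms, the operations that wording refers to look like the sketch below. The variable names are illustrative only, and whether a given call takes the faster path depends on the build and on how the string is stored internally.

```python
# A string with code points above U+FFFF, so it uses the widest representation.
s = "\U00010200\U00010201\U00010202" * 4

# 'find' with a needle of length one.
first = s.find("\U00010202")

# 'replace' with a needle and a replacement of length one each.
t = s.replace("\U00010201", "\U00010300")

print(first, len(t), t.count("\U00010300"))  # 2 12 4
```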

serhiy-storchaka · 2022-05-22T18:06:00Z

Can you test on Windows?

goldsteinn · 2022-05-22T18:07:32Z

> Can you test on Windows?

Unfortunately, I do not have access to a Windows machine.

goldsteinn · 2022-05-22T18:13:02Z

I can run tests with different collision rates and lengths if you want; just let me know.

bedevere-bot added the awaiting review label May 20, 2022

goldsteinn mentioned this pull request May 21, 2022

Use wmemchr in stringlib if sizeof(STRINGLIB_CHAR) == sizeof(wchar_t) #93033

Open

corona10 requested a review from methane May 21, 2022

sweeneyde reviewed May 21, 2022

Objects/stringlib/asciilib.h (resolved)

goldsteinn force-pushed the use-wmemchr-in-stringlib branch from 2b7fc75 to 80bfc80 Compare May 21, 2022

blurb-it bot and others added 3 commits May 21, 2022

cd985e5 📜🤖 Added by blurb_it.
56f6b42 Add missing STRINGLIB_FAST_MEMCHR in bytearrayobject.c and bytes_methods.c
928d5df Merge remote-tracking branch 'refs/remotes/origin/use-wmemchr-in-stringlib' into use-wmemchr-in-stringlib

serhiy-storchaka self-requested a review May 22, 2022

goldsteinn closed this May 22, 2022

goldsteinn reopened this May 22, 2022

serhiy-storchaka reviewed May 22, 2022
