
bpo-43014: Improve performance of tokenize by 20-30% #24311

Merged (1 commit, Jan 24, 2021)

Conversation

@asottile (Contributor) commented Jan 24, 2021

@isidentical (Member) left a comment

Great optimization, though I have two concerns:

  • People who are not using the tokenize module to generate tokens (detect_encoding/open are the most common entry points) would still have to pay the compilation cost.
  • Even though breaking them is somewhat OK, there are usages in the wild that monkeypatch PseudoToken to change the behavior of the tokenize module (e.g. to add new tokens).

Maybe there is a solution that would both optimize this and avoid any new regressions for normal users (something like @lru_cache on _compile, maybe?)

@asottile (Contributor, Author) commented Jan 24, 2021

I initially approached this with lru_cache; however, the function call alone accounts for about 6% of the execution time, so the performance gains aren't as significant.

@isidentical (Member) commented Jan 24, 2021

> I initially approached this with lru_cache; however, the function call alone accounts for about 6% of the execution time, so the performance gains aren't as significant.

Maybe we could set it to a global (`_PSEUDO_TOKEN_RE = None`; `if ... is None: _PSEUDO_TOKEN_RE = compile()`)?

@asottile (Contributor, Author) commented Jan 24, 2021

> I initially approached this with lru_cache; however, the function call alone accounts for about 6% of the execution time, so the performance gains aren't as significant.
>
> Maybe we could set it to a global (`_PSEUDO_TOKEN_RE = None`; `if ... is None: _PSEUDO_TOKEN_RE = compile()`)?

From my tests this performs the same as the lru_cache approach (within a few milliseconds, i.e. measurement noise). The lru_cache approach seems a reasonable middle ground (and also avoids recompiling the triple-quoted-string regexes over and over).

@asottile asottile changed the title bpo-43014: Improve performance of tokenize by 25-35% bpo-43014: Improve performance of tokenize by 20-30% Jan 24, 2021
@asottile asottile force-pushed the asottile:faster_tokenize_bpo-43014 branch from bc2dc35 to 2025476 Jan 24, 2021
@isidentical isidentical merged commit 15bd9ef into python:master Jan 24, 2021
11 checks passed:
  • Docs
  • Check for source changes
  • Check if generated files are up to date
  • Windows (x86)
  • Windows (x64)
  • macOS
  • Ubuntu
  • Azure Pipelines PR #20210124.3 succeeded
  • Travis CI - Pull Request Build Passed
  • bedevere/issue-number: Issue number 43014 found
  • bedevere/news: News entry found in Misc/NEWS.d
@isidentical (Member) commented Jan 24, 2021

Thanks a lot @asottile!

@asottile asottile deleted the asottile:faster_tokenize_bpo-43014 branch Jan 24, 2021