
bpo-43014: Improve performance of tokenize by 20-30% #24311

Merged (1 commit, Jan 24, 2021)

Conversation

@asottile (Contributor) commented Jan 24, 2021

@isidentical (Member) left a comment

Great optimization, though I have two concerns:

  • People who are not using the tokenize module to generate tokens (detect_encoding/open are the most common entry points) would still have to pay the compilation cost.
  • Even though breaking them is somewhat OK, there are usages in the wild that monkeypatch PseudoToken to change the behavior of the tokenize module (e.g. to add new tokens).

Maybe there is a solution that would both optimize this and avoid any new regressions for normal users (something like @lru_cache on _compile, maybe?)

@asottile (Contributor, Author) commented Jan 24, 2021

I initially approached this with lru_cache; however, the function call alone accounts for about 6% of the execution time, so the performance gains aren't as significant.

@isidentical (Member) commented Jan 24, 2021

> I initially approached this with lru_cache; however, the function call alone accounts for about 6% of the execution time, so the performance gains aren't as significant.

Maybe we could set it to a global (`_PSEUDO_TOKEN_RE = None`; `if ... is None: _PSEUDO_TOKEN_RE = compile()`)?

@asottile (Contributor, Author) commented Jan 24, 2021

> I initially approached this with lru_cache; however, the function call alone accounts for about 6% of the execution time, so the performance gains aren't as significant.
>
> Maybe we could set it to a global (`_PSEUDO_TOKEN_RE = None`; `if ... is None: _PSEUDO_TOKEN_RE = compile()`)?

From my tests this performs the same as the lru_cache approach (within a few milliseconds, i.e. measurement noise). The lru_cache approach seems a reasonable middle ground (and also avoids recompiling the triple-quoted-string regexes over and over).

@asottile asottile changed the title bpo-43014: Improve performance of tokenize by 25-35% bpo-43014: Improve performance of tokenize by 20-30% Jan 24, 2021
@asottile asottile force-pushed the asottile:faster_tokenize_bpo-43014 branch from bc2dc35 to 2025476 Jan 24, 2021
@isidentical isidentical merged commit 15bd9ef into python:master Jan 24, 2021
11 checks passed:
  • Docs
  • Check for source changes
  • Check if generated files are up to date
  • Windows (x86)
  • Windows (x64)
  • macOS
  • Ubuntu
  • Azure Pipelines PR #20210124.3 succeeded
  • Travis CI - Pull Request Build Passed
  • bedevere/issue-number: Issue number 43014 found
  • bedevere/news: News entry found in Misc/NEWS.d
@isidentical (Member) commented Jan 24, 2021

Thanks a lot @asottile!

@asottile asottile deleted the asottile:faster_tokenize_bpo-43014 branch Jan 24, 2021