ParserBase could be optimized #92088

Open
be-thomas opened this issue Apr 30, 2022 · 1 comment
Labels
3.13 new features, bugs and security fixes performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments


be-thomas commented Apr 30, 2022

There is a lot of wasteful CPU work in patterns like this (seen multiple times):

if ")" in rawdata[j:]:
    j = rawdata.find(")", j) + 1

Here the string is searched twice, even though a single search would suffice. Since strings are immutable, every slice creates a new string on the fly, so the slice `rawdata[j:]` can be quite expensive depending on the size of the string. We could eliminate the slicing altogether and use a single find operation, in this style:

RPAREN_pos = rawdata.find(")", j)
if RPAREN_pos != -1:
    j = RPAREN_pos + 1
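As a quick sanity check (not part of the original report), here is a minimal `timeit` sketch comparing the two patterns. The sample `rawdata` string and the function names are made up for illustration; real gains in `_markupbase` would need to be measured on actual parser input:

```python
import timeit

# Illustrative sample data: a long string with a ')' near the end,
# mimicking rawdata inside the parser.
rawdata = "x" * 100_000 + ")"
j = 0

def double_scan():
    # Original pattern: slices rawdata, then scans it twice
    # (once for the membership test, once for find).
    if ")" in rawdata[j:]:
        return rawdata.find(")", j) + 1
    return j

def single_find():
    # Proposed pattern: a single find call, no slice created.
    pos = rawdata.find(")", j)
    return pos + 1 if pos != -1 else j

# Both variants must agree before comparing their speed.
assert double_scan() == single_find()

for fn in (double_scan, single_find):
    t = timeit.timeit(fn, number=1_000)
    print(f"{fn.__name__}: {t:.4f}s")
```

On a string this size the single-`find` variant avoids both the slice allocation and the second scan, so it should time noticeably lower, though the exact numbers depend on the interpreter and input.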

I'm new to open source contributions. I would love to learn from others' coding styles and hear other points of view.

Originally posted by @be-thomas in #92084 (comment)

@ezio-melotti
Member

As mentioned in the PR, it would be good to:

  1. check that the optimizations indeed make the code faster
  2. ensure that the changes are covered by tests

I published https://github.com/ezio-melotti/htmlparser-bench, which contains a few scripts to download several web pages (from a list of the 100k most popular sites, compiled about 10 years ago), parse them, and summarize the results. This is not a particularly accurate benchmark: the main goal was making sure that the parser could parse all the pages without errors, while checking that parsing maintained a reasonable speed.

I'm looking into modernizing the code a bit (removing code related to Python 2 and the deprecated and now removed strict mode, parallelizing the download of the webpages, etc.), and also adding the benchmark to the python/pyperformance repo.
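This is not the actual htmlparser-bench harness, but a minimal self-contained sketch of the same idea: feed a page through `html.parser.HTMLParser` and time the parse while checking it completes without errors. The `CountingParser` class and the synthetic `page` string are illustrative assumptions:

```python
import time
from html.parser import HTMLParser

class CountingParser(HTMLParser):
    """Minimal subclass that counts start tags, just to exercise the parser."""
    def __init__(self):
        super().__init__()
        self.starttags = 0

    def handle_starttag(self, tag, attrs):
        self.starttags += 1

# Illustrative stand-in for a downloaded web page.
page = "<html><body>" + "<p>hello</p>" * 10_000 + "</body></html>"

parser = CountingParser()
start = time.perf_counter()
parser.feed(page)
parser.close()
elapsed = time.perf_counter() - start

print(f"parsed {parser.starttags} start tags in {elapsed:.4f}s")
```

A real benchmark would run this over many downloaded pages and repeat the timing to reduce noise, which is roughly what the htmlparser-bench scripts automate.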

@ezio-melotti ezio-melotti added type-feature A feature request or enhancement performance Performance or resource usage stdlib Python modules in the Lib dir 3.11 bug and security fixes labels May 2, 2022
@iritkatriel iritkatriel added 3.12 bugs and security fixes and removed 3.11 bug and security fixes labels Sep 7, 2022
@erlend-aasland erlend-aasland added 3.13 new features, bugs and security fixes and removed 3.12 bugs and security fixes labels Jan 5, 2024
Projects
Status: Todo
Development

No branches or pull requests

4 participants