ParserBase could be optimized #92088

Open
be-thomas opened this issue Apr 30, 2022 · 1 comment
Labels
3.13 new features, bugs and security fixes performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments


be-thomas commented Apr 30, 2022

There is a lot of wasteful CPU work in patterns like this (seen multiple times):

if ")" in rawdata[j:]:
    j = rawdata.find(")", j) + 1

Here the string is searched twice, even though a single search would suffice. Since strings are immutable, every slice creates a new string on the fly, so the slice `rawdata[j:]` can be quite expensive depending on the size of the string. We could eliminate the slicing altogether and use a single find operation, in this style:

RPAREN_pos = rawdata.find(")", j)
if RPAREN_pos != -1:
    j = RPAREN_pos + 1
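As a quick sanity check (not part of the original report), here is a minimal `timeit` sketch comparing the two patterns. The sample `rawdata` string and the function names are made up for illustration; real gains in `_markupbase` would need to be measured on actual parser input:

```python
import timeit

# Illustrative sample data: a long string with a ')' near the end,
# mimicking rawdata inside the parser.
rawdata = "x" * 100_000 + ")"
j = 0

def double_scan():
    # Original pattern: slices rawdata, then scans it twice
    # (once for the membership test, once for find).
    if ")" in rawdata[j:]:
        return rawdata.find(")", j) + 1
    return j

def single_find():
    # Proposed pattern: a single find call, no slice created.
    pos = rawdata.find(")", j)
    return pos + 1 if pos != -1 else j

# Both variants must agree before comparing their speed.
assert double_scan() == single_find()

for fn in (double_scan, single_find):
    t = timeit.timeit(fn, number=1_000)
    print(f"{fn.__name__}: {t:.4f}s")
```

On a string this size the single-`find` variant avoids both the slice allocation and the second scan, so it should time noticeably lower, though the exact numbers depend on the interpreter and input.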

I'm new to open source contributions. I would love to learn from others' coding styles and hear other points of view.

Originally posted by @be-thomas in #92084 (comment)

@ezio-melotti
Member

As mentioned in the PR, it would be good to:

  1. check that the optimizations indeed make the code faster
  2. ensure that the changes are covered by tests

I published https://github.com/ezio-melotti/htmlparser-bench, which contains a few scripts to download several web pages (from a list of the 100k most popular sites, compiled about 10 years ago), parse them, and summarize the results. This is not a particularly accurate benchmark: the main goal was making sure that the parser could parse all the pages without errors, while checking that parsing maintained a reasonable speed.

I'm looking into modernizing the code a bit (removing code related to Python 2 and the deprecated and now removed strict mode, parallelizing the download of the webpages, etc.), and also adding the benchmark to the python/pyperformance repo.
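This is not the actual htmlparser-bench harness, but a minimal self-contained sketch of the same idea: feed a page through `html.parser.HTMLParser` and time the parse while checking it completes without errors. The `CountingParser` class and the synthetic `page` string are illustrative assumptions:

```python
import time
from html.parser import HTMLParser

class CountingParser(HTMLParser):
    """Minimal subclass that counts start tags, just to exercise the parser."""
    def __init__(self):
        super().__init__()
        self.starttags = 0

    def handle_starttag(self, tag, attrs):
        self.starttags += 1

# Illustrative stand-in for a downloaded web page.
page = "<html><body>" + "<p>hello</p>" * 10_000 + "</body></html>"

parser = CountingParser()
start = time.perf_counter()
parser.feed(page)
parser.close()
elapsed = time.perf_counter() - start

print(f"parsed {parser.starttags} start tags in {elapsed:.4f}s")
```

A real benchmark would run this over many downloaded pages and repeat the timing to reduce noise, which is roughly what the htmlparser-bench scripts automate.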

@ezio-melotti ezio-melotti added type-feature A feature request or enhancement performance Performance or resource usage stdlib Python modules in the Lib dir 3.11 bug and security fixes labels May 2, 2022
@iritkatriel iritkatriel added 3.12 bugs and security fixes and removed 3.11 bug and security fixes labels Sep 7, 2022
@erlend-aasland erlend-aasland added 3.13 new features, bugs and security fixes and removed 3.12 bugs and security fixes labels Jan 5, 2024
Projects
Status: Todo
Development

No branches or pull requests

4 participants