Regex performance improvement. #114586

Closed as not planned
@amir-h-rassafi

Feature or enhancement

Proposal:

I ran a small benchmark; the script is below.

import re
import time


rng = 1000000

# COMPILING IT BEFOREHAND
start = time.time()
pre_comp_pat = re.compile(r'[a-zA-Z]+')
for _ in range(rng):
    pre_comp_pat.match("gaz")
end = time.time()
print("Time (pre-compiled) ", end - start)

# COMPILING IT INSIDE THE FOR LOOP
start = time.time()
for _ in range(rng):
    pat = re.compile(r'[a-zA-Z]+')
    pat.match("gaz")
end = time.time()
print("Time (compiling each it)", end - start)

# NEVER EXPLICITLY COMPILING IT
non_comp_pat = r'[a-zA-Z]+'
start = time.time()
for _ in range(rng):
    re.match(non_comp_pat, "gaz")
end = time.time()
print("Time (not compiling)", end - start)

The result is:

Time (pre-compiled)  0.19112229347229004
Time (compiling each it) 0.5717971324920654
Time (not compiling) 0.5644176006317139
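
To put these totals in per-call terms, here is a small sketch that divides the figures reported above by the iteration count. The dictionary and variable names are only illustrative; the numbers are copied from the output above, not re-measured.

```python
# Totals (in seconds) copied from the benchmark output above.
rng = 1_000_000
totals = {
    "pre-compiled": 0.19112229347229004,
    "compiling each it": 0.5717971324920654,
    "not compiling": 0.5644176006317139,
}

baseline = totals["pre-compiled"]

# Convert each total into nanoseconds per call.
per_call_ns = {label: secs / rng * 1e9 for label, secs in totals.items()}

for label, ns in per_call_ns.items():
    extra = ns - per_call_ns["pre-compiled"]   # overhead relative to pre-compiled
    ratio = totals[label] / baseline           # slowdown factor
    print(f"{label:<18} {ns:6.0f} ns/call  (+{extra:.0f} ns, {ratio:.2f}x)")
```

Seen this way, the gap is roughly 0.19 µs versus 0.56 µs per call, so the "3 times slower" figure corresponds to well under half a microsecond of extra work per iteration.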

I then looked at the implementation:

cpython/Lib/re/__init__.py

Lines 339 to 341 in 456e274

key = (type(pattern), pattern, flags)
# Item in _cache should be moved to the end if found.
p = _cache.pop(key, None)
I then found that compiled patterns are cached, so the result is really unexpected: the two versions that do not pre-compile are about 3 times slower.
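
A minimal sketch for checking the cache behaviour directly, assuming CPython (the cache is an implementation detail, and the names `PATTERN`, `TEXT`, and `N` are only illustrative). It checks whether compiling the same pattern twice returns the same object, then times the method call against the module-level call with `timeit`. No timings are asserted here since they vary by machine and Python version.

```python
import re
import timeit

PATTERN = r'[a-zA-Z]+'   # same pattern as in the benchmark above
TEXT = "gaz"
N = 200_000              # illustrative iteration count

# In CPython, re.compile() consults the module-level cache, so compiling the
# same pattern string with the same flags returns the same object.
first = re.compile(PATTERN)
second = re.compile(PATTERN)
same_object = first is second
print("same cached object:", same_object)

# Both lambdas add identical call overhead, so the difference between the two
# timings approximates the per-call cost of going through the cache lookup.
t_method = timeit.timeit(lambda: first.match(TEXT), number=N)
t_module = timeit.timeit(lambda: re.match(PATTERN, TEXT), number=N)

print(f"compiled.match: {t_method / N * 1e9:.0f} ns/call")
print(f"re.match:       {t_module / N * 1e9:.0f} ns/call")
```

If the first check prints `True`, the pattern is not being recompiled on each iteration. The remaining difference is then likely the cost of the extra function call and cache lookup on every call, which is small in absolute terms but comparable to the cost of matching a three-character string.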

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Metadata

    Labels

    pending (The issue will be closed if no feedback is provided)
    performance (Performance or resource usage)
    topic-regex
    type-feature (A feature request or enhancement)
