-
-
Notifications
You must be signed in to change notification settings - Fork 30.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-119702: New dynamic algorithm selection for string search (+ rfind
alignment)
#120025
base: main
Are you sure you want to change the base?
Conversation
rfind
upgrade)
-1 on any changes to rfind(). Maybe someone can consider it later, let's leave it off of this PR. Also, how should I read your tables? I'm confused by |
It looks a big mess now, but I am half-way through making functions that handle both directions with the same code. Initially it slowed down code and looked complex, but it is slowly recovering in both aspects. I should have a presentable version in a day or so. You can take a look then and if you're not happy I can always revert to the first commit of this PR.
The formula is as follows: current_time = `time of current python version`
new_time = `time of this PR`
result = (new_time - current_time) / min(new_time, current_time) I did minimum to make it fair. E.g.: new_time = 2
current_time = 1
(new_time - current_time) / current_time = 100%
(new_time - current_time) / min(new_time, current_time) = 100%
new_time = 1
current_time = 2
(new_time - current_time) / current_time = -50%
(new_time - current_time) / min(new_time, current_time) = -100% This way it is symmetric to both positive and negative. So |
This comment was marked as resolved.
This comment was marked as resolved.
Changing variable names is easy. However, there were several functions defining same variables with different names: I took liberty to synchronize them and picked a) for the following reasons: However, this is my take. If there are strong objections, it is no problem to revert them back. Nevertheless, I think they should be synchronized across all the functions in the module. Naming discrepancy has cost me a fair bit of time due to having to readjust myself every time I changed focus. UPDATE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't read everything but here are some general comments:
Also, please document your code (especially the structures) and explain what is what when you use short variables. Even though it is standard in such algorithms, it does not hurt to remind ourselves that p is the pattern, m is its length, etc. But there are other variables with short names that have no explanation.
I've suggested some improvements on blank lines and if-else blocks but feel free to ignore them.
I eventually changed back argument names to However, inside algorithm logic |
This comment was marked as resolved.
This comment was marked as resolved.
It wasn't necessary, but preferable, given the situation I was in. |
rfind
upgrade)rfind
upgrade)
rfind
upgrade)rfind
)
rfind
)rfind
alignment)
cc @tim-one |
@sweeneyde, are you still around? You're obviously the best person to look at this, and it appears to be substantial work with real value. I can't make time to climb this learning curve. |
1. Work
I managed to combine all good tricks that I found in the library into 1 dynamic solution, which seems to perform well and eliminate hard-coded boundaries for algorithm selection.
horspool_find
) is now called (for both forward and reverse search). It dynamically defaults to linear-complexity-assured solution (two_way_find
) if it predicts it will perform better.rfind
to use exact same code asfind
.n == m
to usememcmp
added.2. Results
Aggregate impact of this change seems to be net positive. It results in non-trivial average performance increase, adapts more advanced search algorithms for reverse search, smooths out performance
surface
and improves on general code structure and documentation.Benefits:
surface
is much smoother now. There is only 1 logic that can cause a step change now and it is dynamic as opposed to many hard-coded step changes of the current logic.75%
performance increase offind
for artificial benchmark of shuffled alphabet.34%
performance increase offind
for real file search of different slice lengths.247%
performance increase ofrfind
for artificial benchmark of shuffled alphabet.Worth noting:
3. Benchmarks:
Benchmark result value:
3.1. Artificial dataset via randomized alphabet.
Case Generation Gode
1.a. Results. Current vs new `str.find/str.count`.
Comparison for
len(haystack) == 1000
forstr.find
. x-axis is"{needle_len}:{seed}"
. Upper chart is run time, lower chart is percentage difference. It depicts the issue this PR is addressing. I.e. Big sub-optimal step-changes in performance for small input changes.1.b. Results. Current vs new `str.rfind/str.rcount`..
3.2. Search for arbitrary chunks in real files.
Case Generation Gode. `str.rfind/str.rcount`..
2.a. Results. Current vs new `str.find/str.count`.