Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-119702: New dynamic algorithm selection for string search (+ rfind alignment) #120025

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

dg-pb
Copy link
Contributor

@dg-pb dg-pb commented Jun 4, 2024

1. Work

I managed to combine all good tricks that I found in the library into 1 dynamic solution, which seems to perform well and eliminate hard-coded boundaries for algorithm selection.

  1. Instead of 3 different implementations only one (horspool_find) is now called (for both forward and reverse search). It dynamically defaults to linear-complexity-assured solution (two_way_find) if it predicts it will perform better.
  2. Direction agnostic logic allowed rfind to use exact same code as find.
  3. Special case n == m to use memcmp added.

2. Results

Aggregate impact of this change seems to be net positive. It results in non-trivial average performance increase, adapts more advanced search algorithms for reverse search, smooths out performance surface and improves on general code structure and documentation.

Benefits:

  1. Performance surface is much smoother now. There is only 1 logic that can cause a step change now and it is dynamic as opposed to many hard-coded step changes of the current logic.
  2. Direction agnostic logic works well and eases the strain of the alternative of having to keep 2 implementations in sync.
  3. Benchmarks:
  • Average 75% performance increase of find for artificial benchmark of shuffled alphabet.
  • Average 34% performance increase of find for real file search of different slice lengths.
  • Average 247% performance increase of rfind for artificial benchmark of shuffled alphabet.

Worth noting:

  1. Splitting 2 directions (forward and reversed) into 2 implementations would result in 10-30% better performance based on tested benchmarks. However, I think it is a good trade-off, given the advantages of unified approach.
  2. There are areas and cases where new algorithm performs worse (see benchmark). However, they are either not clustered or where they are, the performance decrease is non-substantial.

3. Benchmarks:

Benchmark result value:

current_runtime = `run time of current python version`
new_runtime = `run time of this PR`

result = (new_runtime - current_runtime) / min(new_runtime, current_runtime)

3.1. Artificial dataset via randomized alphabet.

Case Generation Gode

# shuffled alphabet
alphabet = 'DHUXYEZQCLFKISBVRGNAMWPTOJ'
zipf = [1/x for x in range(1, 1+len(alphabet))]

def zipf_string(length, seed):
    letters = random.Random(seed).choices(alphabet, weights=zipf, k=length)
    return ''.join(letters)

NLS = [
    2, 3, 4, 6,
    8, 12, 16, 24,
    32, 48, 64, 96,
    128, 192, 256, 384,
    500, 1000, 10000, 100_000
]

HSS = [
    500, 750, 1000, 1500,
    2_000, 3_000, 4_000, 6_000,
    8_000, 12_000, 16_000, 24_000,
    32_000, 48_000, 64_000, 96_000,
    1_000_000
]

def generate_benchmarks():
    output = []
    for m in NLS:
        for n in HSS:
            if n < m:
                continue
            for s in (1, 2, 3):
                seed = (s*n + m) % 1_000_003
                needle = zipf_string(m, seed)
                haystack = zipf_string(n, seed ** 2)
                name = f"needle={m}, haystack={n}, seed={s}"
                output.append((name, needle, haystack))
    with open(f"{PATH}/_generated.py", 'w') as f:
        print("benches = [", file=f)
        for name, needle, haystack in output:
            print(f"    {(name, needle, haystack)!r},", file=f)
        print("]", file=f)

1.a. Results. Current vs new `str.find/str.count`.

Screenshot 2024-06-27 at 15 06 37

Comparison for len(haystack) == 1000 for str.find. x-axis is "{needle_len}:{seed}". Upper chart is run time, lower chart is percentage difference. It depicts the issue this PR is addressing. I.e. Big sub-optimal step-changes in performance for small input changes.

Screenshot 2024-06-27 at 14 12 55
1.b. Results. Current vs new `str.rfind/str.rcount`..

Screenshot 2024-06-08 at 23 23 17

3.2. Search for arbitrary chunks in real files.

Case Generation Gode. `str.rfind/str.rcount`..

/
FILES = {
    "c": (CPYTHON_PATH / "Objects" / "unicodeobject.c").read_text(),
    "py": (CPYTHON_PATH / "Lib" / "_pydecimal.py").read_text(),
    "en": (CPYTHON_PATH / "Doc" / "library" / "stdtypes.rst").read_text(),
    "bin": (CPYTHON_PATH / "python.exe").read_bytes(),
}

MS = [10, 15, 20, 30, 40, 60, 80, 120, 160, 240, 320, 640, 1280]
MR = range(12)

def generate_benchmarks():
    results = dict()
    for file_label, haystack in FILES.items():
        n = len(haystack)
        for m in MS:
            for i in MR:
                stt = (1_000_003 * i) % (n - m)
                needle = haystack[stt:stt + m]
                results[(m, file_label, i)] = haystack, needle
    return results

2.a. Results. Current vs new `str.find/str.count`.

Screenshot 2024-06-27 at 15 06 54

Objects/stringlib/fastsearch.h Show resolved Hide resolved
Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved
@dg-pb dg-pb changed the title gh-119702: New dynamic algorithm for string search gh-119702: New dynamic algorithm for string search (+ rfind upgrade) Jun 4, 2024
@dg-pb
Copy link
Contributor Author

dg-pb commented Jun 4, 2024

Performance comparison of `count` vs `rcount` (both of this PR).

Screenshot 2024-06-05 at 02 05 25

@dg-pb dg-pb marked this pull request as draft June 5, 2024 11:49
@sweeneyde
Copy link
Member

-1 on any changes to rfind(). Maybe someone can consider it later, let's leave it off of this PR.

Also, how should I read your tables? I'm confused by -131.2%, since that implies negative time to me.

@dg-pb
Copy link
Contributor Author

dg-pb commented Jun 5, 2024

-1 on any changes to rfind(). Maybe someone can consider it later, let's leave it off of this PR.

It looks a big mess now, but I am half-way through making functions that handle both directions with the same code. Initially it slowed down code and looked complex, but it is slowly recovering in both aspects. I should have a presentable version in a day or so. You can take a look then and if you're not happy I can always revert to the first commit of this PR.

Also, how should I read your tables? I'm confused by -131.2%, since that implies negative time to me.

The formula is as follows:

current_time = `time of current python version`
new_time = `time of this PR`

result = (new_time - current_time) / min(new_time, current_time)

I did minimum to make it fair. E.g.:

new_time = 2
current_time = 1
(new_time - current_time) / current_time = 100%
(new_time - current_time) / min(new_time, current_time) = 100%
new_time = 1
current_time = 2
(new_time - current_time) / current_time = -50%
(new_time - current_time) / min(new_time, current_time) = -100%

This way it is symmetric to both positive and negative.

So -131.2% means old version is 131.2% slower than the new.

@nineteendo

This comment was marked as resolved.

@dg-pb
Copy link
Contributor Author

dg-pb commented Jun 6, 2024

Could you keep the variable names clear? I don't like 1 letter abbreviations. (i & j for loop counters are generally fine though.)

Changing variable names is easy. However, there were several functions defining same variables with different names:
a) default_find and friends was using s/n/p/m, which is a default naming for string search problems. See https://en.wikipedia.org/wiki/Boyer–Moore_string-search_algorithm. Generally, 90% of search algorithms use this notation.
b) two_way_find was using full names of haystack & needle.

I took liberty to synchronize them and picked a) for the following reasons:
a) To me, it is easier to ingest the code visually when variable names are short. High level code, where there are no complex loops, probably long names are better. But when code gets complex, such as algorithms, then bottleneck becomes the logic, not readability, because one needs to look a certain amount of time before starting to grasp something. 2% into the work variable names are learnt by heart (whatever they might be) and visual inspection of the code is priority for the rest 98% of the time. I think there is a reason why libraries of algorithms use short names.
b) haystack and needle are definitions that are created for analogy (and probably fun). I prefer more formal names for variables or the ones that match equations. Fun variable names can be mentioned in documentation.

However, this is my take. If there are strong objections, it is no problem to revert them back.

Nevertheless, I think they should be synchronized across all the functions in the module. Naming discrepancy has cost me a fair bit of time due to having to readjust myself every time I changed focus.

UPDATE:
Also Objects/stringlib/find.h uses str/str_len/sub/sub_len and it contains higher level functions. Generally, variable names get shorter as it gets to lower level, so I would at least like to keep the same length. But given complexity of the code would still try to insist on 1-letter names that match most of resources on such problems (especially code).

Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved
@dg-pb dg-pb marked this pull request as ready for review June 6, 2024 09:58
Copy link
Contributor

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't read everything but here are some general comments:

Also, please document your code (especially the structures) and explain what is what when you use short variables. Even though it is standard in such algorithms, it does not hurt to remind ourselves that p is the pattern, m is its length, etc. But there are other variables with short names that have no explanation.

I've suggested some improvements on blank lines and if-else blocks but feel free to ignore them.

Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved
Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved
Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved
Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved
Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved
Objects/stringlib/fastsearch.h Show resolved Hide resolved
Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved
Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved
Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved
Objects/stringlib/fastsearch.h Show resolved Hide resolved
@dg-pb dg-pb marked this pull request as draft June 7, 2024 22:48
@dg-pb
Copy link
Contributor Author

dg-pb commented Jun 9, 2024

Also, please document your code (especially the structures) and explain what is what when you use short variables. Even though it is standard in such algorithms, it does not hurt to remind ourselves that p is the pattern, m is its length, etc. But there are other variables with short names that have no explanation.

I eventually changed back argument names to needle and haystack. Also did the same with pre-work structure. It is easier to figure things out this way, when there is no need to work with low details of algorithm.

However, inside algorithm logic s/n/p/m are used and it is documented. But happy to change them to full names if needed.

@dg-pb dg-pb marked this pull request as draft June 11, 2024 04:07
@dg-pb dg-pb marked this pull request as ready for review June 11, 2024 08:03
@nineteendo

This comment was marked as resolved.

@dg-pb
Copy link
Contributor Author

dg-pb commented Jun 20, 2024

Why was the force push necessary?

It wasn't necessary, but preferable, given the situation I was in.

@dg-pb dg-pb changed the title gh-119702: New dynamic algorithm for string search (+ rfind upgrade) gh-119702: New dynamic algorithm selection for string search (+ rfind upgrade) Jul 13, 2024
@dg-pb dg-pb changed the title gh-119702: New dynamic algorithm selection for string search (+ rfind upgrade) gh-119702: New dynamic algorithm selection for string search (+ alignment with rfind) Jul 13, 2024
@dg-pb dg-pb changed the title gh-119702: New dynamic algorithm selection for string search (+ alignment with rfind) gh-119702: New dynamic algorithm selection for string search (+ rfind alignment) Jul 13, 2024
@pitrou
Copy link
Member

pitrou commented Sep 15, 2024

cc @tim-one

@tim-one
Copy link
Member

tim-one commented Oct 21, 2024

@sweeneyde, are you still around? You're obviously the best person to look at this, and it appears to be substantial work with real value. I can't make time to climb this learning curve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants