gh-119702: New dynamic algorithm selection for string search (+ `rfind` alignment) #120025

dg-pb · 2024-06-04T10:30:05Z

1. Work

I managed to combine all good tricks that I found in the library into 1 dynamic solution, which seems to perform well and eliminate hard-coded boundaries for algorithm selection.

Instead of 3 different implementations only one (horspool_find) is now called (for both forward and reverse search). It dynamically defaults to linear-complexity-assured solution (two_way_find) if it predicts it will perform better.
Direction agnostic logic allowed rfind to use exact same code as find.
Special case n == m to use memcmp added.

2. Results

Aggregate impact of this change seems to be net positive. It results in non-trivial average performance increase, adapts more advanced search algorithms for reverse search, smooths out performance surface and improves on general code structure and documentation.

Benefits:

Performance surface is much smoother now. There is only 1 logic that can cause a step change now and it is dynamic as opposed to many hard-coded step changes of the current logic.
Direction agnostic logic works well and eases the strain of the alternative of having to keep 2 implementations in sync.
Benchmarks:

Average 75% performance increase of find for artificial benchmark of shuffled alphabet.
Average 34% performance increase of find for real file search of different slice lengths.
Average 247% performance increase of rfind for artificial benchmark of shuffled alphabet.

Worth noting:

Splitting 2 directions (forward and reversed) into 2 implementations would result in 10-30% better performance based on tested benchmarks. However, I think it is a good trade-off, given the advantages of unified approach.
There are areas and cases where new algorithm performs worse (see benchmark). However, they are either not clustered or where they are, the performance decrease is non-substantial.

3. Benchmarks:

Benchmark result value:

current_runtime = `run time of current python version`
new_runtime = `run time of this PR`

result = (new_runtime - current_runtime) / min(new_runtime, current_runtime)

3.1. Artificial dataset via randomized alphabet.

Case Generation Gode

# shuffled alphabet
alphabet = 'DHUXYEZQCLFKISBVRGNAMWPTOJ'
zipf = [1/x for x in range(1, 1+len(alphabet))]

def zipf_string(length, seed):
    letters = random.Random(seed).choices(alphabet, weights=zipf, k=length)
    return ''.join(letters)

NLS = [
    2, 3, 4, 6,
    8, 12, 16, 24,
    32, 48, 64, 96,
    128, 192, 256, 384,
    500, 1000, 10000, 100_000
]

HSS = [
    500, 750, 1000, 1500,
    2_000, 3_000, 4_000, 6_000,
    8_000, 12_000, 16_000, 24_000,
    32_000, 48_000, 64_000, 96_000,
    1_000_000
]

def generate_benchmarks():
    output = []
    for m in NLS:
        for n in HSS:
            if n < m:
                continue
            for s in (1, 2, 3):
                seed = (s*n + m) % 1_000_003
                needle = zipf_string(m, seed)
                haystack = zipf_string(n, seed ** 2)
                name = f"needle={m}, haystack={n}, seed={s}"
                output.append((name, needle, haystack))
    with open(f"{PATH}/_generated.py", 'w') as f:
        print("benches = [", file=f)
        for name, needle, haystack in output:
            print(f"    {(name, needle, haystack)!r},", file=f)
        print("]", file=f)

1.a. Results. Current vs new `str.find/str.count`.

Comparison for len(haystack) == 1000 for str.find. x-axis is "{needle_len}:{seed}". Upper chart is run time, lower chart is percentage difference. It depicts the issue this PR is addressing. I.e. Big sub-optimal step-changes in performance for small input changes.

1.b. Results. Current vs new `str.rfind/str.rcount`..

3.2. Search for arbitrary chunks in real files.

Case Generation Gode. `str.rfind/str.rcount`..

/
FILES = {
    "c": (CPYTHON_PATH / "Objects" / "unicodeobject.c").read_text(),
    "py": (CPYTHON_PATH / "Lib" / "_pydecimal.py").read_text(),
    "en": (CPYTHON_PATH / "Doc" / "library" / "stdtypes.rst").read_text(),
    "bin": (CPYTHON_PATH / "python.exe").read_bytes(),
}

MS = [10, 15, 20, 30, 40, 60, 80, 120, 160, 240, 320, 640, 1280]
MR = range(12)

def generate_benchmarks():
    results = dict()
    for file_label, haystack in FILES.items():
        n = len(haystack)
        for m in MS:
            for i in MR:
                stt = (1_000_003 * i) % (n - m)
                needle = haystack[stt:stt + m]
                results[(m, file_label, i)] = haystack, needle
    return results

2.a. Results. Current vs new `str.find/str.count`.

Issue: Better calibration of str.find aka FASTSEARCH #119702

Objects/stringlib/fastsearch.h

dg-pb · 2024-06-04T23:06:49Z

Performance comparison of `count` vs `rcount` (both of this PR).

sweeneyde · 2024-06-05T21:18:00Z

-1 on any changes to rfind(). Maybe someone can consider it later, let's leave it off of this PR.

Also, how should I read your tables? I'm confused by -131.2%, since that implies negative time to me.

dg-pb · 2024-06-05T21:28:14Z

-1 on any changes to rfind(). Maybe someone can consider it later, let's leave it off of this PR.

It looks a big mess now, but I am half-way through making functions that handle both directions with the same code. Initially it slowed down code and looked complex, but it is slowly recovering in both aspects. I should have a presentable version in a day or so. You can take a look then and if you're not happy I can always revert to the first commit of this PR.

Also, how should I read your tables? I'm confused by -131.2%, since that implies negative time to me.

The formula is as follows:

current_time = `time of current python version`
new_time = `time of this PR`

result = (new_time - current_time) / min(new_time, current_time)

I did minimum to make it fair. E.g.:

new_time = 2
current_time = 1
(new_time - current_time) / current_time = 100%
(new_time - current_time) / min(new_time, current_time) = 100%
new_time = 1
current_time = 2
(new_time - current_time) / current_time = -50%
(new_time - current_time) / min(new_time, current_time) = -100%

This way it is symmetric to both positive and negative.

So -131.2% means old version is 131.2% slower than the new.

Objects/stringlib/fastsearch.h

dg-pb · 2024-06-06T09:06:20Z

Could you keep the variable names clear? I don't like 1 letter abbreviations. (i & j for loop counters are generally fine though.)

Changing variable names is easy. However, there were several functions defining same variables with different names:
a) default_find and friends was using s/n/p/m, which is a default naming for string search problems. See https://en.wikipedia.org/wiki/Boyer–Moore_string-search_algorithm. Generally, 90% of search algorithms use this notation.
b) two_way_find was using full names of haystack & needle.

I took liberty to synchronize them and picked a) for the following reasons:
a) To me, it is easier to ingest the code visually when variable names are short. High level code, where there are no complex loops, probably long names are better. But when code gets complex, such as algorithms, then bottleneck becomes the logic, not readability, because one needs to look a certain amount of time before starting to grasp something. 2% into the work variable names are learnt by heart (whatever they might be) and visual inspection of the code is priority for the rest 98% of the time. I think there is a reason why libraries of algorithms use short names.
b) haystack and needle are definitions that are created for analogy (and probably fun). I prefer more formal names for variables or the ones that match equations. Fun variable names can be mentioned in documentation.

However, this is my take. If there are strong objections, it is no problem to revert them back.

Nevertheless, I think they should be synchronized across all the functions in the module. Naming discrepancy has cost me a fair bit of time due to having to readjust myself every time I changed focus.

UPDATE:
Also Objects/stringlib/find.h uses str/str_len/sub/sub_len and it contains higher level functions. Generally, variable names get shorter as it gets to lower level, so I would at least like to keep the same length. But given complexity of the code would still try to insist on 1-letter names that match most of resources on such problems (especially code).

Objects/stringlib/fastsearch.h

picnixz

I haven't read everything but here are some general comments:

Also, please document your code (especially the structures) and explain what is what when you use short variables. Even though it is standard in such algorithms, it does not hurt to remind ourselves that p is the pattern, m is its length, etc. But there are other variables with short names that have no explanation.

I've suggested some improvements on blank lines and if-else blocks but feel free to ignore them.

Objects/stringlib/fastsearch.h

dg-pb · 2024-06-09T15:52:53Z

Also, please document your code (especially the structures) and explain what is what when you use short variables. Even though it is standard in such algorithms, it does not hurt to remind ourselves that p is the pattern, m is its length, etc. But there are other variables with short names that have no explanation.

I eventually changed back argument names to needle and haystack. Also did the same with pre-work structure. It is easier to figure things out this way, when there is no need to work with low details of algorithm.

However, inside algorithm logic s/n/p/m are used and it is documented. But happy to change them to full names if needed.

dg-pb · 2024-06-20T15:25:55Z

Why was the force push necessary?

It wasn't necessary, but preferable, given the situation I was in.

pitrou · 2024-09-15T10:38:32Z

cc @tim-one

tim-one · 2024-10-21T04:50:09Z

@sweeneyde, are you still around? You're obviously the best person to look at this, and it appears to be substantial work with real value. I can't make time to climb this learning curve.

alpha version ready for review

9d0bb1c

bedevere-app bot added the awaiting review label Jun 4, 2024

bedevere-app bot mentioned this pull request Jun 4, 2024

Better calibration of str.find aka FASTSEARCH #119702

Open

nineteendo reviewed Jun 4, 2024

View reviewed changes

Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved

nineteendo reviewed Jun 4, 2024

View reviewed changes

Objects/stringlib/fastsearch.h Show resolved Hide resolved

Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved

rfind implemented

3bb688a

dg-pb changed the title ~~gh-119702: New dynamic algorithm for string search~~ gh-119702: New dynamic algorithm for string search (+ rfind upgrade) Jun 4, 2024

bug

48e7dd5

dg-pb marked this pull request as draft June 5, 2024 11:49

bedevere-app bot removed the awaiting review label Jun 5, 2024

bi-directional horspool_find

d11f9b9

dg-pb added 5 commits June 6, 2024 03:26

optimized horspool

6c9dbc3

more conservative bloom

1b9bdc9

fix assertions

982b510

seamless reverse integration

4e9d278

ready for review

c8e1cc5

This comment was marked as resolved.

Sign in to view

nineteendo reviewed Jun 6, 2024

View reviewed changes

Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved

dg-pb commented Jun 6, 2024

View reviewed changes

Objects/stringlib/fastsearch.h Outdated Show resolved Hide resolved

dg-pb marked this pull request as ready for review June 6, 2024 09:58

bedevere-app bot added the awaiting review label Jun 6, 2024

rm variable & add comment

b5bd4c5

picnixz reviewed Jun 6, 2024

View reviewed changes

dg-pb added 2 commits June 6, 2024 19:10

styling, comments and PR feedback

8c87a57

style mostly

4667880

dg-pb marked this pull request as draft June 7, 2024 22:48

dg-pb added 3 commits June 9, 2024 05:07

refactoring variable name changes and and cleanup

dad1c9e

bug

1a13f3a

lint fix and comment

5aca5d0

news edit

6fcf44e

dg-pb marked this pull request as draft June 11, 2024 04:07

bedevere-app bot removed the awaiting review label Jun 11, 2024

dg-pb force-pushed the implement-119702 branch from 465f43e to 6fcf44e Compare June 11, 2024 08:00

dg-pb marked this pull request as ready for review June 11, 2024 08:03

bedevere-app bot added the awaiting review label Jun 11, 2024

This comment was marked as resolved.

Sign in to view

dg-pb added 2 commits June 14, 2024 09:17

docstring fixes and updates

4e52472

clearer blurp

ea0fee6

dg-pb added 4 commits June 20, 2024 21:44

minus one bloom op

7c20ca7

table size 64 -> 128

c2859c0

moved bloom setup closer to its logic

f341989

micro simplifications and optimizations

2437983

dg-pb mentioned this pull request Jun 25, 2024

Support tuples for [r]find() & [r]index() nineteendo/cpython#2

Draft

dg-pb added 4 commits June 25, 2024 18:38

remove unused variable

ea0f362

microopt

d6ac158

30% faster critical factorization

c26dfe9

remove unrelated change

f75f4d1

dg-pb changed the title ~~gh-119702: New dynamic algorithm for string search (+ rfind upgrade)~~ gh-119702: New dynamic algorithm selection for string search (+ rfind upgrade) Jul 13, 2024

dg-pb changed the title ~~gh-119702: New dynamic algorithm selection for string search (+ rfind upgrade)~~ gh-119702: New dynamic algorithm selection for string search (+ alignment with rfind) Jul 13, 2024

dg-pb changed the title ~~gh-119702: New dynamic algorithm selection for string search (+ alignment with rfind)~~ gh-119702: New dynamic algorithm selection for string search (+ rfind alignment) Jul 13, 2024

dg-pb mentioned this pull request Jul 26, 2024

String search improvements faster-cpython/ideas#691

Open

Merge branch 'main' into implement-119702

e6b8414

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gh-119702: New dynamic algorithm selection for string search (+ `rfind` alignment) #120025

gh-119702: New dynamic algorithm selection for string search (+ `rfind` alignment) #120025

dg-pb commented Jun 4, 2024 •

edited

Loading

dg-pb commented Jun 4, 2024 •

edited

Loading

sweeneyde commented Jun 5, 2024

dg-pb commented Jun 5, 2024 •

edited

Loading

This comment was marked as resolved.

dg-pb commented Jun 6, 2024 •

edited

Loading

picnixz left a comment

dg-pb commented Jun 9, 2024 •

edited

Loading

This comment was marked as resolved.

dg-pb commented Jun 20, 2024

pitrou commented Sep 15, 2024

tim-one commented Oct 21, 2024

gh-119702: New dynamic algorithm selection for string search (+ rfind alignment) #120025

Are you sure you want to change the base?

gh-119702: New dynamic algorithm selection for string search (+ rfind alignment) #120025

Conversation

dg-pb commented Jun 4, 2024 • edited Loading

1. Work

2. Results

3. Benchmarks:

3.1. Artificial dataset via randomized alphabet.

3.2. Search for arbitrary chunks in real files.

dg-pb commented Jun 4, 2024 • edited Loading

sweeneyde commented Jun 5, 2024

dg-pb commented Jun 5, 2024 • edited Loading

This comment was marked as resolved.

dg-pb commented Jun 6, 2024 • edited Loading

picnixz left a comment

Choose a reason for hiding this comment

dg-pb commented Jun 9, 2024 • edited Loading

This comment was marked as resolved.

dg-pb commented Jun 20, 2024

pitrou commented Sep 15, 2024

tim-one commented Oct 21, 2024

gh-119702: New dynamic algorithm selection for string search (+ `rfind` alignment) #120025

gh-119702: New dynamic algorithm selection for string search (+ `rfind` alignment) #120025

dg-pb commented Jun 4, 2024 •

edited

Loading

dg-pb commented Jun 4, 2024 •

edited

Loading

dg-pb commented Jun 5, 2024 •

edited

Loading

dg-pb commented Jun 6, 2024 •

edited

Loading

dg-pb commented Jun 9, 2024 •

edited

Loading