Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-91576: Speed up iteration of strings #91574

Merged
merged 19 commits into from Apr 18, 2022

Conversation

kumaraditya303
Copy link
Contributor

@kumaraditya303 kumaraditya303 commented Apr 15, 2022

Benchmark Script:

from pyperf import Runner, perf_counter

def bench_str(loops, src, length):
    src = src * length
    t0 = perf_counter()
    for _ in range(loops):
        for i in src:
            pass
    return perf_counter() - t0

runner = Runner()
for src in ['a', 'é']:
    for n in [1_000, 10_000]:
        runner.bench_time_func(f"str {src} {n}", bench_str, src, n)

Results:

str a 1000: Mean +- std dev: [base] 9.72 us +- 0.62 us -> [patch] 6.54 us +- 1.05 us: 1.49x faster
str a 10000: Mean +- std dev: [base] 95.8 us +- 5.1 us -> [patch] 63.0 us +- 4.1 us: 1.52x faster
str é 1000: Mean +- std dev: [base] 10.3 us +- 1.6 us -> [patch] 8.60 us +- 0.56 us: 1.19x faster
str é 10000: Mean +- std dev: [base] 103 us +- 9 us -> [patch] 86.1 us +- 4.7 us: 1.19x faster

Geometric mean: 1.34x faster

Closes #91576

@kumaraditya303 kumaraditya303 changed the title Fast string iterator gh-91576: Speed up iteration of strings Apr 15, 2022
@JelleZijlstra JelleZijlstra self-requested a review Apr 15, 2022
Objects/object.c Outdated Show resolved Hide resolved
@JelleZijlstra
Copy link
Member

JelleZijlstra commented Apr 15, 2022

Happy to help review this, let me know when you're ready

@kumaraditya303 kumaraditya303 marked this pull request as ready for review Apr 15, 2022
@kumaraditya303
Copy link
Contributor Author

kumaraditya303 commented Apr 15, 2022

Happy to help review this, let me know when you're ready

@JelleZijlstra Finished.

@kumaraditya303 kumaraditya303 requested a review from gvanrossum Apr 15, 2022
Copy link
Member

@gvanrossum gvanrossum left a comment

Why not use the specialized iteratie for all Latin-1 strings?

@kumaraditya303
Copy link
Contributor Author

kumaraditya303 commented Apr 15, 2022

Why not use the specialized iteratie for all Latin-1 strings?

That would add one more branch instruction and I was trying to avoid it and LATIN1 is rare compared to ASCII.

Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Copy link
Member

@gvanrossum gvanrossum left a comment

You should just be able to test

(PyUnicode_KIND((unicode)) == PyUnicode_1BYTE_KIND

to decide which iterator to create, right? Or can kind be changed (once the object is "ready")?

Objects/unicodeobject.c Outdated Show resolved Hide resolved
@gvanrossum
Copy link
Member

gvanrossum commented Apr 15, 2022

That would add one more branch instruction and I was trying to avoid it and LATIN1 is rare compared to ASCII.

Given that this is a fixed cost (once per iterator construction) I think the extra branch won't be noticeable. Latin-1 may be rare compared to ASCII but it's still got some common characters and it would be essentially free.

@kumaraditya303
Copy link
Contributor Author

kumaraditya303 commented Apr 15, 2022

Given that this is a fixed cost (once per iterator construction) I think the extra branch won't be noticeable. Latin-1 may be rare compared to ASCII but it's still got some common characters and it would be essentially free.

No, the cost is a branch instruction on each iteration as ascii and latin1 uses different structures.

@gvanrossum
Copy link
Member

gvanrossum commented Apr 15, 2022

No, the cost is a branch instruction on each iteration as ascii and latin1 uses different structures.

Hm, couldn't you just store a pointer to the array of bytes (and another to the end) rather than an index? Or is it possible that the bytes move around somehow?

@kumaraditya303
Copy link
Contributor Author

kumaraditya303 commented Apr 15, 2022

See the LATIN1 macro in unicodeobject.c.

@kumaraditya303
Copy link
Contributor Author

kumaraditya303 commented Apr 15, 2022

It requires a check if ch is less than 128 then it uses a different array to index depending on the comparison.

@sweeneyde
Copy link
Member

sweeneyde commented Apr 15, 2022

How does this affect performance when ascii and non-ascii are mixed together in the same string?

@gvanrossum
Copy link
Member

gvanrossum commented Apr 16, 2022

It requires a check if ch is less than 128 then it uses a different array to index depending on the comparison.

Oh, I see. That's a bit unfortunate but I see your point and I guess ASCII strings are somewhat special anyways.

How does this affect performance when ascii and non-ascii are mixed together in the same string?

In that case the representation of the whole string will not use the "compact ASCII" format and we'll be using the regular (slow) iterator.

@kumaraditya303 Please address the other review comments.

@kumaraditya303 kumaraditya303 requested a review from gvanrossum Apr 17, 2022
@kumaraditya303
Copy link
Contributor Author

kumaraditya303 commented Apr 17, 2022

Added some tests and addressed comments.

@kumaraditya303 kumaraditya303 self-assigned this Apr 17, 2022
@kumaraditya303 kumaraditya303 added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Apr 17, 2022
@bedevere-bot
Copy link

bedevere-bot commented Apr 17, 2022

🤖 New build scheduled with the buildbot fleet by @kumaraditya303 for commit ad2d676 🤖

If you want to schedule another build, you need to add the "🔨 test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Apr 17, 2022
@kumaraditya303 kumaraditya303 added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Apr 17, 2022
@bedevere-bot
Copy link

bedevere-bot commented Apr 17, 2022

🤖 New build scheduled with the buildbot fleet by @kumaraditya303 for commit 56d110c 🤖

If you want to schedule another build, you need to add the "🔨 test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Apr 17, 2022
Lib/test/test_unicode.py Outdated Show resolved Hide resolved
Copy link
Contributor

@erlend-aasland erlend-aasland left a comment

Looks good!

Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Show resolved Hide resolved
@JelleZijlstra JelleZijlstra removed their request for review Apr 17, 2022
Copy link
Member

@gvanrossum gvanrossum left a comment

I'm happy now!

@gvanrossum gvanrossum merged commit 8c54c3d into python:main Apr 18, 2022
13 checks passed
@kumaraditya303 kumaraditya303 deleted the fast-iter-str branch Apr 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up iteration of strings
7 participants