Quadratic time internal base conversions #90716
Comments
Our internal base conversion algorithms between power-of-2 and non-power-of-2 bases are quadratic time, and that's been annoying forever ;-) This applies to int<->str and int<->decimal.Decimal conversions. Sometimes the conversion is implicit, like when comparing an int to a Decimal. On one such conversion I gave up after waiting for over 8 hours, and the computation apparently can't be interrupted.

In contrast, using the function in the attached todecstr.py gets the result in under a minute:

```
>>> a = 1 << 1000000000
>>> s = todecstr(a)
>>> len(s)
301029996
```

That builds an equal decimal.Decimal in a "clever" recursive way, and then just applies str to _that_.

That's actually a best case for the function, which gets major benefit from the mountains of trailing 0 bits. A worst case is all 1-bits, but that still finishes in under 5 minutes:

```
>>> a = 1 << 1000000000
>>> s2 = todecstr(a - 1)
>>> len(s2)
301029996
>>> s[-10:], s2[-10:]
('1787109376', '1787109375')
```

A similar kind of function could certainly be written to convert from Decimal to int much faster, but it would probably be less effective. These things avoid explicit division entirely, but fat multiplies are key, and Decimal implements a fancier * algorithm than Karatsuba.

Not for the faint of heart ;-) |
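A minimal sketch of that idea (illustrative names, not the attached todecstr.py): recursively split the int's bits in half and assemble an equal decimal.Decimal, so all the heavy lifting is fat Decimal multiplies and adds, with no division anywhere.

```python
import decimal

def todecstr_sketch(n):
    # Illustrative only; the real todecstr.py precomputes the powers of 2 it needs.
    def pow2(e):
        # Exact Decimal value of 2**e, built by Decimal's own fat multiplies.
        return decimal.Decimal(2) ** e

    def inner(n, bits):
        # n is known to fit in `bits` bits.
        if bits <= 512:
            return decimal.Decimal(n)          # small enough: direct conversion
        half = bits >> 1
        hi, lo = n >> half, n & ((1 << half) - 1)
        return inner(hi, bits - half) * pow2(half) + inner(lo, half)

    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC            # make Decimal arithmetic effectively exact
        ctx.Emax = decimal.MAX_EMAX
        return str(inner(n, n.bit_length()))

assert todecstr_sketch(1 << 100) == str(1 << 100)
```

With the context's precision and Emax raised, Decimal arithmetic on integers stays exact, which is what makes the trick work.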
Is this similar to https://bugs.python.org/issue3451 ? |
Dennis, partly, although that was more aimed at speeding division, while the approach here doesn't use division at all. However, thinking about it, the implementation I attached doesn't actually work for many cases (it doesn't build as much of the power tree in advance as may be needed). Which I missed because all the test cases I tried had mountains of trailing 0 or 1 bits, not mixtures. So I'm closing this anyway, at least until I can dream up an approach that always works. Thanks! |
Changed the code so that inner() only references one of the O(log log n) powers of 2 we actually precomputed (it could get lost before if a needed power hadn't been built). |
The test case here is a = (1 << 100000000) - 1, a solid string of 100 million 1 bits. The goal is to convert to a decimal string.

Methods:

- native: str(a)
- numeral: the Python numeral() function from bpo-3451's div.py, after adapting it to use the Python divmod_fast() from the same report's fast_div.py.
- todecstr: from the Python file attached to this report.
- gmp: str() applied to gmpy2.mpz(a).

Timings:

- native: don't know; gave up after waiting over 2 1/2 hours.

So there's room for improvement ;-)

But here's the thing: I've lost count of how many times someone has whipped up a pure-Python implementation of a bigint algorithm that leaves CPython in the dust. And they're generally pretty easy in Python. But then they die there, because converting to C is soul-crushing, losing the beauty, elegance, and compactness to mountains of low-level details of memory management, refcounting, and checking for errors after every tiny operation.

So a new question in this endless dilemma: _why_ do we need to convert to C? Why not leave the extreme cases to far easier to write and maintain Python code? When we're cutting runtime from hours down to minutes, we're focusing on entirely the wrong end if we refuse to settle for 2 minutes just because it may be theoretically possible to cut that to 1 minute by resorting to C.

(*) I hope this algorithm tickles you by defying expectations ;-) It essentially stands the usual approach on its head, treating int->str as an "input" conversion instead. |
Addendum: the "native" time (for the built-in str(a)) in the msg above turned out to be over 3 hours and 50 minutes. |
Somebody pointed me to V8's implementation of str(bigint) today: https://github.com/v8/v8/blob/main/src/bigint/tostring.cc They say that they can compute str(factorial(1_000_000)) (which is 5.5 million decimal digits) in 1.5s: https://twitter.com/JakobKummerow/status/1487872478076620800 As far as I understand the code (I suck at C++) they recursively split the bigint into halves using % 10^n at each recursion step, but pre-compute and cache the divisors' inverses. |
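That "output" strategy, minus V8's cached inverses, looks roughly like the sketch below in plain Python; the divmod calls are the expensive part, which is why the recursive splitting alone isn't a dramatic win in CPython.

```python
def itoa(n):
    # Decimal string of a nonnegative int via recursive halving at powers of 10.
    powers = [10]                      # powers[k] == 10**(2**k)
    while powers[-1] ** 2 <= n:
        powers.append(powers[-1] ** 2)

    def rec(n, k, pad):
        # Digits of n (< powers[k]**2), zero-padded to 2**(k+1) digits if pad.
        if k == 0:
            return "%02d" % n if pad else str(n)
        hi, lo = divmod(n, powers[k])  # the expensive step
        if pad or hi:
            return rec(hi, k - 1, pad) + rec(lo, k - 1, True)
        return rec(lo, k - 1, False)

    return rec(n, len(powers) - 1, False)

assert itoa(10 ** 1000 + 7) == str(10 ** 1000 + 7)
```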
The factorial of a million is much smaller than the case I was looking at. Here are rough timings on my box, for computing the decimal string from the bigint (and, yes, they all return the same string):

- native: 475 seconds (about 8 minutes)

"They recursively split the bigint into halves using % 10^n at each recursion step". That's the standard trick for "output" conversions. Beyond that, there are different ways to try to use "fat" multiplications instead of division. The recursive splitting all on its own can help, but dramatic speedups need dramatically faster multiplication.

todecstr treats it as an "input" conversion instead, using the decimal module's fat multiplies to build an equal Decimal first. |
Worth pointing this out since it doesn't seem widely known: "input" base conversions are _generally_ faster than "output" ones. Working in the destination base (or a power of it) is generally simpler. In the math.factorial(1000000) example, it takes CPython more than 3x longer for str() to convert it to base 10 than for int() to reconstruct the bigint from that string. Not an O() thing (they're both quadratic time in CPython today). |
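That asymmetry is easy to measure with a harness along these lines (expect the int->str step to take several minutes on a CPython without the faster algorithms):

```python
import math, sys, time

if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)      # lift the conversion-length limit on newer CPythons

n = math.factorial(1_000_000)          # about 5.5 million decimal digits

t0 = time.perf_counter()
s = str(n)                             # "output" conversion: binary -> base 10
t1 = time.perf_counter()
assert int(s) == n                     # "input" conversion: base 10 -> binary
t2 = time.perf_counter()

print(f"int->str: {t1 - t0:.1f}s   str->int: {t2 - t1:.1f}s")
```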
Hi, why is this issue closed? What needs to be done to make it through? |
It's closed because nobody appears to be both willing and able to pursue it. But it's on the edge regardless. In general, any number of huge-int algorithms could be greatly sped up, but that's a massive undertaking. It's why GMP exists: to push such things as far as possible, regardless of implementation complexity, effort, or bulk. Very capable GMP bindings for Python are already available (gmpy2). As the timings here suggest, GMP will generally be faster than anything we may do anyway, because GMP never reaches a point where its authors say "good enough already".

That said, I expect the single most valuable bigint speedup CPython could implement would be to bigint division, along the lines of gh-47701. That could indirectly give major speed boosts to bigint->str and bigint modular pow(). |
Why do you need fast division? Why not just implement str-to-int conversion using this style of d&c? It would be simple to implement and would only require multiplication. If multiplication is done using Karatsuba (which runs in O(n^1.58)), the conversion inherits that bound. Since Python currently uses Karatsuba for its big-int multiplication, this simple str-to-int conversion algorithm would run in O(n^1.58) time. |
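To spell out that bound: splitting the digit string in half at each level and doing one Karatsuba-sized multiplication per node gives

```latex
\begin{aligned}
T(n) &= 2\,T(n/2) + O(M(n)), \qquad M(n) = \Theta\bigl(n^{\log_2 3}\bigr) \approx \Theta\bigl(n^{1.585}\bigr),\\
T(n) &= O\Bigl(\sum_{k \ge 0} 2^{k}\,M\bigl(n/2^{k}\bigr)\Bigr)
      = O\Bigl(M(n)\sum_{k \ge 0} 2^{-(\log_2 3 - 1)k}\Bigr)
      = O\bigl(M(n)\bigr),
\end{aligned}
```

since the geometric sum converges: the top-level multiplication dominates, so the whole conversion costs only a constant factor more than one full-size multiply.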
Yep, even the simplest divide and conquer with native multiplication would already provide a significant speed-up. |
Also, if you want something slightly smarter: since 10 is 2 * 5, you can handle all the powers of 2 with shifts and only multiply by precomputed powers of 5 (that's what the pow5 table in the code below does). |
I made an implementation of the basic d&c algorithm to test it out:

```python
pow5 = [5]
while len(pow5) <= 22:
    pow5.append(pow5[-1] * pow5[-1])

def str_to_int(s):
    def _str_to_int(l, r):
        if r - l <= 3000:
            return int(s[l:r])
        lg_split = (r - l - 1).bit_length() - 1
        split = 1 << lg_split
        return ((_str_to_int(l, r - split) * pow5[lg_split]) << split) + _str_to_int(r - split, r)
    return _str_to_int(0, len(s))
```

Running this locally, it is clearly much faster than the built-in int() once the inputs get large. |
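A quick way to sanity-check the function above (the exponent is arbitrary; lifting the int/str length limit is only needed on newer CPython versions):

```python
import sys
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)     # allow huge int<->str conversions

s = str(7 ** 200_000)                 # about 169,000 decimal digits
assert str_to_int(s) == int(s)
```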
I also tried out the same algorithm with GMP doing the multiplications. For 40000 digits, str_to_int with GMP takes 0.000432 s. So what Python actually needs is faster big-int multiplication. With that you'd get a really fast string-to-int converter for free. |
All the earlier timing examples here were of int->str, not of str->int. For the latter case, which you're looking at, splitting a decimal string by a power of 10 is indeed trivial (conceptually - but by the time you finish coding it all in C, with all the memory-management, refcounting, and error-checking cruft debugged, not so much).

As already noted, the report's attached todecstr.py gets int->str without any division at all.

But faster native bigint division could nevertheless buy a major speedup for int->str, as already demonstrated near the start here by showing timings for the division-based numeral() function.

And it could also buy major speedups for modular pow().

And it would, tautologically, speed bigint division too, which is a bottleneck all on its own in some apps. |
It would be useful to have practical examples of actual Python applications that would benefit from any of the speedups discussed here. Including if they've adopted a third-party library for this math and would no longer have any need for that if improved. Where "high performance" means more than a constant factor faster than what CPython already does.

Bigints really feel like a neat computer-science toy most of the time. We've got them, but what actually uses integers larger than 64 or 128 bits? Similarly, once numbers get huge, what is the point of converting them to and from decimal? Decimal is for humans, and humans don't readily comprehend precise numbers that large, let alone type them correctly. |
"Practical applications" isn't really the name of this game Most complaints come from more-or-less newbies, who, e.g., play around in an interactive shell, and are surprised to find that sometimes the shell seems to freeze (but is really just waiting for a quadratic-time, or worse, bigint operation to finish). People who know what they're doing find many uses for big bigints, but more of a theoretical, research, educational, or recreational bent. Like, algorithm design, combinatorics, number theory, or contributing to the wonderful On-Line Encyclopedia of Integer Sequences. They also come as incidental side effects of using power tools, like in a symbolic math package computing symbolic power series or (for related reasons) symbolic high derivatives (integer coefficients and/or denominators often grow at an exponential rate across terms). Or they can be almost the whole banana, such as scaling an ill-conditioned float matrix to use exact bigint arithmetic instead to compute an exact matrix inverse or determinant (where sticking to floats the result could be pure noise due to compounding rounding errors). Not all that long ago, a number of people worked on adding "gonzo" optimizations to |
One good example of big-integer use is math people using Python for some quick calculations. When you use functions like factorial, it is easy to get pretty big integers. I also think that math people generally prefer decimal numbers over hex.

Another example would be working with fractions. Just adding fractional numbers makes their denominators and numerators grow really quickly. I also don't think fractions support hex; both reading and printing require decimal as far as I can tell (see the example below).

The examples I've given above are the type of scripts that won't appear in things like Google's codebase, even if they are relatively common use cases. Another important thing to note is that programmers trust the built-in functions in Python to just work. That is the fundamental reason why |
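To illustrate the fractions point (arbitrary numbers): the fractions module parses and prints decimal numerator/denominator strings, and there is no hex counterpart to float.hex()/float.fromhex().

```python
from fractions import Fraction

f = Fraction("355/113")          # parsing goes through a decimal string
print(f + Fraction(1, 7))        # printing does too: 2598/791
print(hasattr(Fraction, "hex"))  # False -- no hex form, unlike float.hex()
```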
I think Tim's suggestion of switching to a Python implementation of a more efficient algorithm for large inputs would be a good approach. It is reasonable to not want to maintain a complicated high-performance algorithm written in C. This would be an interesting project for a student or someone wanting a challenge. Calling into Python from C is not too hard to do. Use a C-API to import a Python module and call a function inside of it. |
Here's a
|
@pochmann, that's encouraging - but please post code people can run as-is. For example, this code uses |
Sorry. Ok, updated it to be a complete script. |
I btw measured on replit.com, as I'm only on a phone. Would like to see times with 3.10 on a normal PC. |
From a very quick run on a Windows 10, 64-bit, quad-core desktop box, under Python 3.10.8, using precisely (no changes) the code you posted:
But that's all I can make time for now. Boosting it to 10 million digits, I'm still waiting for the first line of output ... OK, first two lines of output at 10 million digits tell enough:
So |
Thanks. Smaller ratio than I got (for 1e7 digits I had 575 vs 18 seconds), a bit disappointing. |
Oh, I don't know - x-platform optimization is a PITA. So many things can differ. May, e.g., look quite different under gcc with various optimization gimmicks. More interesting was tossing Neil's PR's
So your code gets off to a worse start, but appears to have much better asymptotics (but not yet better enough to overtake |
But the new code wins convincingly at 100 million digits (and skipping
|
News to me: I knew |
Yes, I checked before I implemented my idea. I'm not familiar with Newton's reciprocal way, though. Does it actually compute |
gh-47701 starts with a discussion of (and Python code for) "Newton division". "Newton's method" for computing a reciprocal of x starts from a rough approximation r and repeatedly refines it via

    r = r * (2 - x * r)

and so on. The number of correct digits approximately doubles on each iteration. Code isn't that simple, of course, because no floats are actually used in this context, but rather scaled integers. Getting the last retained bit exactly right is delicate. I also doubt setting |
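A toy version of that scaled-integer iteration (illustrative only - not the gh-47701 code, and without the delicate last-bit handling):

```python
def recip_fixed(x, k, guess_bits=50):
    """Approximate floor(4**k / x) for a positive int x with x.bit_length() <= k,
    using Newton's iteration on scaled integers: after a tiny seed division,
    only multiplications and shifts are used."""
    assert 0 < x < (1 << k) and k >= 2 * guess_bits
    s = max(x.bit_length() - guess_bits, 0)
    t = x >> s                                     # small leading chunk of x
    # Seed r ~= 4**k / x, good to roughly guess_bits bits.
    r = ((1 << (2 * guess_bits)) // t) << (2 * k - s - 2 * guess_bits)
    # Each pass roughly doubles the number of correct bits.
    for _ in range((2 * k // guess_bits).bit_length() + 1):
        r = 2 * r - ((x * r * r) >> (2 * k))
    return r

x = 3 ** 100_000                                      # a ~158,000-bit integer
k = x.bit_length()
print(abs(recip_fixed(x, k) - (1 << (2 * k)) // x))   # off by at most a few units
```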
I made a very bare-bones implementation of Schönhage-Strassen's algorithm in Python as a proof of concept. It probably shouldn't even be called Schönhage-Strassen, but I don't know what else to call it. Feel free to try it out! It should have the same time complexity as Schönhage-Strassen.

strassen.py:

```python
def shift(x, m, b):
    # Calculate (x << b) % (2**m + 1)
    b %= 2 * m
    if b >= m:
        x = -x
        b -= m
    upper = x >> (m - b)
    x -= upper << (m - b)
    return (x << b) - upper

# NTT using Cooley-Tukey, a divide and conquer algorithm
# Note that Cooley-Tukey requires n to be a power of two
def NTT(A, m, root = 1):
    n = len(A)
    if n == 1:
        return A
    root2 = root * 2 % (2 * m)
    even = NTT(A[::2], m, root2)
    odd = NTT(A[1::2], m, root2)
    # Apply the twiddle factor
    odd = [shift(odd[k], m, root * k) for k in range(n//2)]
    return [even[k] + odd[k] for k in range(n//2)] +\
           [even[k] - odd[k] for k in range(n//2)]

def INTT(A, m, root = 1):
    A2 = NTT(A, m, -root)
    inverse_shift = -m.bit_length()
    return [shift(a, m, inverse_shift) for a in A2]

def int_to_limbs(x, limb, n):
    # Convert big integer x to base 'limb' (a list of length n)
    if n == 1:
        return [x]
    n2 = n//2
    upper = x >> n2 * limb
    lower = x - (upper << n2 * limb)
    return int_to_limbs(lower, limb, n2) + int_to_limbs(upper, limb, n - n2)

def limbs_to_int(A, limb):
    # Convert a number given in base 'limb' into a big integer
    n = len(A)
    if n == 1:
        return A[0]
    n2 = n//2
    return limbs_to_int(A[:n2], limb) + (limbs_to_int(A[n2:], limb) << (n2 * limb))

def ssa(x1, x2):
    """ Calculate x1 * x2 using Schonhagen-Strassen's algorithm """
    n1 = x1.bit_length()
    n2 = x2.bit_length()

    # Find smallest mod 2**m + 1 that will work
    def check(m):
        # if m is too small, return 0
        # Otherwise return smallest working digit size
        dig = 1
        while True:
            n_dig1 = (n1 + dig - 1) // dig
            n_dig2 = (n2 + dig - 1) // dig
            n = n_dig1 + n_dig2 - 1
            if n > 2 * m: # dig too small
                dig += 1
                continue
            if max(n_dig1, n_dig2) * 2**(2*dig) <= 2**m:
                return dig
            else:
                return 0 # impossible m

    m = 1
    while not check(m):
        m *= 2 # Only powers of 2 will work because we use Cooley-Tukey for the NTT
    limb = check(m)

    print('Multiplying a', n1, 'bit number with a', n2, 'bit number')
    print('Using mod 2**m + 1, with m =', m)
    print('Using limb of', limb, 'bits')
    print('Reduction to', 2*m, 'multiplications of numbers with', m, 'bits')

    import time
    L = time.perf_counter()
    A = int_to_limbs(x1, limb, 2 * m)
    B = int_to_limbs(x2, limb, 2 * m)

    A = NTT(A, m)
    B = NTT(B, m)

    # The big x1 * x2 calculation is split into 2*m, m bit multiplications
    C = [shift(a * b, m, 0) for a,b in zip(A,B)]

    C = INTT(C, m)
    C = [shift(c, m, 0) for c in C]

    y = limbs_to_int(C, limb)
    R = time.perf_counter()
    print('Time taken by Schonhagen-Strassen', R - L)

    L = time.perf_counter()
    x1 * x2
    R = time.perf_counter()
    print('Built in time taken', R - L)

    assert y == x1 * x2
    return y

ssa(141**1000000, 99**1000000)
```
So it breaks even at around 1e6 bits, and becomes significantly faster at 1e7 bits. Since it speeds up multiplication, it would also speed up algorithms dependent on multiplication (like the int<->str conversions discussed here).

I have some ideas for making the implementation faster. For example, I think I might be able to use a trick called the "square root of 2 trick" to speed it up by roughly a factor of 2. But I have not yet tried implementing that. If anyone has any ideas of how to improve this implementation in general, then I'm all ears. |
Cool! I never implemented a Schönhage-Strassen-style multiplication myself. There are many other algorithms that could benefit from this. Most critically to my eyes is that Mark Dickinson's "fast division" code in Neil's PR also essentially inherits its asymptotic behavior from the speed of the underlying multiplication. |
@bjorn-martinsson I ran your code as-is and got:
The
Is that I was trying to use |
As I said, this is just a proof of concept, and I have not optimized it. I have been working on an actual proper implementation of Schönhage-Strassen based on these notes. I will post it later today. It seems a lot faster, and it doesn't have the cheeky |
Here comes the updated version. This is a proper implementation of Schönhage-Strassen following these notes. If you are new to FFT algorithms and want to learn more about how FFT algorithms work, then I recommend reading this pdf.

Schönhage-Strassen implementation:

```python
def shift(x, m, b):
    # Calculate (x << b) % (2**m + 1)
    b %= 2 * m
    if b >= m:
        x = -x
        b -= m
    upper = x >> (m - b)
    x -= upper << (m - b)
    return (x << b) - upper

# FFT using Cooley-Tukey, a divide and conquer algorithm
# NOTE that Cooley-Tukey requires n to be a power of two
def NTT(A, m, root = 1):
    n = len(A)
    if n == 1:
        return A
    root2 = root * 2 % (2 * m)
    even = NTT(A[::2], m, root2)
    odd = NTT(A[1::2], m, root2)
    # Multiply by roots of unity
    odd = [shift(odd[k], m, root * k) for k in range(n//2)]
    return [even[k] + odd[k] for k in range(n//2)] +\
           [even[k] - odd[k] for k in range(n//2)]

def INTT(A, m, root = 1):
    A2 = NTT(A, m, -root)
    inverse_shift = -(len(A).bit_length() - 1)
    return [shift(a, m, inverse_shift) for a in A2]

def int_to_limbs(x, limb, n):
    # Convert big integer x to base 'limb' (a list of length n)
    if n == 1:
        return [x]
    n2 = n//2
    upper = x >> n2 * limb
    lower = x - (upper << n2 * limb)
    return int_to_limbs(lower, limb, n2) + int_to_limbs(upper, limb, n - n2)

def limbs_to_int(A, limb):
    # Convert a number given in base 'limb' into a big integer
    n = len(A)
    if n == 1:
        return A[0]
    n2 = n//2
    return limbs_to_int(A[:n2], limb) + (limbs_to_int(A[n2:], limb) << (n2 * limb))

def negacyclic_convolution(A, B, m):
    n = len(A)
    assert len(A) == len(B) <= m
    assert m % n == 0
    root = m // n
    # Weight A and B
    A = [shift(a, m, root * i) for i,a in enumerate(A)]
    B = [shift(b, m, root * i) for i,b in enumerate(B)]
    A = NTT(A, m, 2 * root)
    B = NTT(B, m, 2 * root)
    A = [shift(a, m, 0) for a in A]
    B = [shift(b, m, 0) for b in B]
    if m >= 1e5: # Use ssa if num of bits in a and b >= 1e5
        C = [SSA(a, b, m) for a,b in zip(A,B)]
    else:
        C = [shift(a * b, m, 0) for a,b in zip(A,B)]
    C = INTT(C, m, 2 * root)
    C = [shift(c, m, -root * i) for i,c in enumerate(C)]
    C = [shift(c, m, 0) for c in C]
    return C

def SSA(x1, x2, M = None):
    """
    Calculates x1 * x2 if M is None
    Otherwise calculates x1 * x2 % (2**M + 1)
    """
    n = 256 # Split factor, configurable, needs to be a power of 2
    if M is None:
        n1 = x1.bit_length()
        n2 = x2.bit_length()
        limb = (n1 + n2 + n - 1) // n
        M = limb * n
    else:
        assert M % n == 0
        limb = M // n
    lower_lim_m = 2 * limb + n.bit_length() + 1
    m = ((lower_lim_m + n - 1)//n) * n
    A = int_to_limbs(x1, limb, n)
    B = int_to_limbs(x2, limb, n)
    C = negacyclic_convolution(A, B, m)
    # Detect underflow in output of the negacyclic convolution
    C = [c - ((1 << m) + 1) if c >> (m - 1) else c for c in C]
    return limbs_to_int(C, limb)

#a = 12345**20000; b = 13245**20000
a = 12345**200000; b = 13245**200000
#a = 12345**2000000; b = 13245**2000000
#a = 12345**4000000; b = 13245**4000000

print('Multiplying', a.bit_length(), 'bit number with', b.bit_length(), 'bit number', flush=True)

import time

L = time.perf_counter()
y1 = SSA(a, b)
R = time.perf_counter()
print('SSA took', R - L, flush=True)

L = time.perf_counter()
y2 = a * b
R = time.perf_counter()
print('Built in took', R - L)

assert y1 == y2
```

Here are some benchmarks:
Looking at these numbers, it seems like |
@bjorn-martinsson Great, thanks! I tried benchmarking it against a Decimal-based multiplication.

Results:

Code: same as yours, except for the different benchmark at the end:

|
Very interesting! Here's a replacement for the tail end, which adds GMP to the mix (gmpy2 needs to be installed). The real purpose of adding GMP is to make it possible to check the SSA and Decimal results for equality in reasonable time.

Sample output (not on a quiet box, and note that I added another exponent at the end):

and the code:

|
Perhaps a silly question: couldn't we 'just' make GMP a dependency and use their algorithms? (I appreciate that GMP probably has an incompatible license. But I don't think it's the only game in town?) |
I'm too old to spend another second of my life hassling with license issues, so someone else will have to wrestle that bear. We definitely can't ship GMP with Python.

A missing bit of the puzzle is how many Python programmers will actually care. No way to guess. But, e.g., CPython could presumably also check for the presence of gmpy2 and exploit it when it's installed.

So, by all means, pursue this if you like - just don't expect it to be easy. |
Not a silly question. IIRC @vstinner did an experiment of using GMP as the PyLong implementation in the past and unsurprisingly found it to slow things down: GMP's focus is primarily large numbers, while Python programs normally work with tiny ints of 1-3 internal digits. If you wanted to use it, you'd end up aiming for a multi-implementation PyLong type that switches at math-operation time based on the size of the inputs (including conversion costs). That's a lot of complexity.

The other big can of worms that Tim alludes to is licensing. GMP is LGPL v3 / GPL v2. We do have other L?GPL-linking batteries in the standard library (readline & gdbm at least), but they're optional and not in the core itself. LGPL/GPL is unusable in many CPython environments. So it'd always need to be an optional dependency.

IMNSHO it is easier to let big-math libraries that are focused on big-number performance do the "dynamically choose to use it" dance, as several already appear to do. I don't think we should focus on GMP for this issue. |
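The "dynamically choose to use it" dance those libraries do amounts to something like this (a rough illustration):

```python
try:
    from gmpy2 import mpz              # GMP-backed ints, if gmpy2 is installed
except ImportError:
    def mpz(x):                        # otherwise fall back to CPython's int
        return x

def big_product(nums):
    # Hot bigint path: wrap operands so GMP is used when it's available.
    acc = mpz(1)
    for v in nums:
        acc *= mpz(v)
    return int(acc)

print(big_product(range(1, 2001)).bit_length())   # bit length of 2000!
```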
Neil has an open PR (gh-96673) to add some asymptotically better int->str, str->int, and int division implementations, but we're in agreement that "ambition is the fatal enemy of major improvements" in this area. Any number of proposals to achieve major improvements, with relatively simple code, have died over the decades, due to the peanut gallery.

So Neil (with my enthusiastic support) is aiming to limit the scope of his PR, and even to leave the code in Python. Check it in: it's big wins. Other people can build on it later, if they're willing to devote their time to it. One of the things we discussed off-line was whether to try to auto-detect

I've been annoyed at times by CPython's int<->str sloth for 30 years now. I don't care here about DoS vulnerabilities, or competing with GMP; I just want to see some major asymptotic improvements checked in before I die.

BTW, this isn't a "not invented here" thing. I'd love to use GMP, but licensing hassles kill that idea in general. If licensing wasn't a fatal hassle, it would probably be best to re-introduce Python 2's distinction between int and long. |
tim-one commented Jan 28, 2022