
Quadratic time internal base conversions #90716

Open
tim-one opened this issue Jan 28, 2022 · 73 comments
Labels
3.12 interpreter-core Interpreter core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage type-feature A feature request or enhancement

Comments

@tim-one
Member

tim-one commented Jan 28, 2022

BPO 46558
Nosy @tim-one, @cfbolz, @sweeneyde
Files
  • todecstr.py
  • todecstr.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2022-01-28.03:12:39.839>
    created_at = <Date 2022-01-28.02:31:44.271>
    labels = ['interpreter-core', 'performance']
    title = 'Quadratic time internal base conversions'
    updated_at = <Date 2022-01-31.06:11:48.110>
    user = 'https://github.com/tim-one'

    bugs.python.org fields:

    activity = <Date 2022-01-31.06:11:48.110>
    actor = 'tim.peters'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-01-28.03:12:39.839>
    closer = 'tim.peters'
    components = ['Interpreter Core']
    creation = <Date 2022-01-28.02:31:44.271>
    creator = 'tim.peters'
    dependencies = []
    files = ['50593', '50595']
    hgrepos = []
    issue_num = 46558
    keywords = []
    message_count = 9.0
    messages = ['411962', '411966', '411969', '411971', '412120', '412122', '412172', '412191', '412192']
    nosy_count = 3.0
    nosy_names = ['tim.peters', 'Carl.Friedrich.Bolz', 'Dennis Sweeney']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue46558'
    versions = []

    @tim-one
    Member Author

    tim-one commented Jan 28, 2022

    Our internal base conversion algorithms between power-of-2 and non-power-of-2 bases are quadratic time, and that's been annoying forever ;-) This applies to int<->str and int<->decimal.Decimal conversions. Sometimes the conversion is implicit, like when comparing an int to a Decimal.

    For example:

    >>> a = 1 << 1000000000 # yup! a billion and one bits
    >>> s = str(a)

    I gave up after waiting for over 8 hours, and the computation apparently can't be interrupted.

    In contrast, using the function in the attached todecstr.py gets the result in under a minute:

    >>> a = 1 << 1000000000
    >>> s = todecstr(a)
    >>> len(s)
    301029996

    That builds an equal decimal.Decimal in a "clever" recursive way, and then just applies str to _that_.

    That's actually a best case for the function, which gets major benefit from the mountains of trailing 0 bits. A worst case is all 1-bits, but that still finishes in under 5 minutes:

    >>> a = 1 << 1000000000
    >>> s2 = todecstr(a - 1)
    >>> len(s2)
    301029996
    >>> s[-10:], s2[-10:]
    ('1787109376', '1787109375')

    A similar kind of function could certainly be written to convert from Decimal to int much faster, but it would probably be less effective. These things avoid explicit division entirely, but fat multiplies are key, and Decimal implements a fancier * algorithm than Karatsuba.

    Not for the faint of heart ;-)

    @tim-one tim-one added interpreter-core Interpreter core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Jan 28, 2022
    @sweeneyde
    Member

    sweeneyde commented Jan 28, 2022

    Is this similar to https://bugs.python.org/issue3451 ?

    @tim-one
    Member Author

    tim-one commented Jan 28, 2022

    Dennis, partly, although that was more aimed at speeding division, while the approach here doesn't use division at all.

    However, thinking about it, the implementation I attached doesn't actually work for many cases (it doesn't build as much of the power tree in advance as may be needed). Which I missed because all the test cases I tried had mountains of trailing 0 or 1 bits, not mixtures.

    So I'm closing this anyway, at least until I can dream up an approach that always works. Thanks!

    @tim-one
    Member Author

    tim-one commented Jan 28, 2022

    Changed the code so that inner() only references one of the O(log log n) powers of 2 we actually precomputed (it could get lost before if lo was non-zero but the value, within its conceptual width n, had at least one leading zero bit - now we pass the conceptual width instead of computing it on the fly).

    @tim-one
    Member Author

    tim-one commented Jan 30, 2022

    The test case here is a = (1 << 100000000) - 1, a solid string of 100 million 1 bits. The goal is to convert to a decimal string.

    Methods:

    native: str(a)

    numeral: the Python numeral() function from bpo-3451's div.py after adapting to use the Python divmod_fast() from the same report's fast_div.py.

    todecstr: from the Python file attached to this report.

    gmp: str() applied to gmpy2.mpz(a).

    Timings:

    native: don't know; gave up after waiting over 2 1/2 hours.
    numeral: about 5 1/2 minutes.
    todecstr: under 30 seconds. (*)
    gmp: under 6 seconds.

    So there's room for improvement ;-)

    But here's the thing: I've lost count of how many times someone has whipped up a pure-Python implementation of a bigint algorithm that leaves CPython in the dust. And they're generally pretty easy in Python.

    But then they die there, because converting to C is soul-crushing, losing the beauty and elegance and compactness to mountains of low-level details of memory-management, refcounting, and checking for errors after every tiny operation.

    So a new question in this endless dilemma: _why_ do we need to convert to C? Why not leave the extreme cases to far-easier to write and maintain Python code? When we're cutting runtime from hours down to minutes, we're focusing on entirely the wrong end to not settle for 2 minutes because it may be theoretically possible to cut that to 1 minute by resorting to C.

    (*) I hope this algorithm tickles you by defying expectations ;-) It essentially stands numeral() on its head by splitting the input by a power of 2 instead of by a power of 10. As a result no divisions are used. But instead of shifting decimal digits into place, it has to multiply the high-end pieces by powers of 2. That seems insane on the face of it, but hard to argue with the clock ;-) The "tricks" here are that the O(log log n) powers of 2 needed can be computed efficiently in advance of any splitting, and that all the heavy arithmetic is done in the decimal module, which implements fancier-than-Karatsuba multiplication and whose values can be converted to decimal strings very quickly.
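
    For readers without the attachment, here is a minimal sketch of the splitting idea (the attached todecstr.py is more careful; the 1024- and 512-bit cutoffs below are arbitrary guesses, not tuned values): split the int on a power of 2, convert the halves recursively, and recombine with Decimal arithmetic so the fat multiplies all happen inside the decimal module.

    import decimal

    def int_to_decimal_sketch(n):
        # assumes n >= 0
        D = decimal.Decimal
        with decimal.localcontext() as ctx:
            ctx.prec = decimal.MAX_PREC
            ctx.Emax = decimal.MAX_EMAX
            ctx.Emin = decimal.MIN_EMIN

            pow2 = {}  # shift -> Decimal(2) ** shift, via repeated squaring

            def d2(shift):
                if shift not in pow2:
                    if shift <= 512:
                        pow2[shift] = D(2) ** shift
                    else:
                        half = shift >> 1
                        pow2[shift] = d2(half) * d2(shift - half)
                return pow2[shift]

            def inner(n, w):
                # n < 2**w; w is the conceptual bit width
                if w <= 1024:
                    return D(n)  # small enough for the quadratic path
                lo_w = w >> 1
                hi, lo = n >> lo_w, n & ((1 << lo_w) - 1)
                return inner(hi, w - lo_w) * d2(lo_w) + inner(lo, lo_w)

            return inner(n, n.bit_length())

    def todecstr_sketch(n):
        return str(int_to_decimal_sketch(n))  # Decimal -> str is fast
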

    @tim-one
    Member Author

    tim-one commented Jan 30, 2022

    Addendum: the "native" time (for built in str(a)) in the msg above turned out to be over 3 hours and 50 minutes.

    @cfbolz
    Mannequin

    cfbolz mannequin commented Jan 30, 2022

    Somebody pointed me to V8's implementation of str(bigint) today:

    https://github.com/v8/v8/blob/main/src/bigint/tostring.cc

    They say that they can compute str(factorial(1_000_000)) (which is 5.5 million decimal digits) in 1.5s:

    https://twitter.com/JakobKummerow/status/1487872478076620800

    As far as I understand the code (I suck at C++) they recursively split the bigint into halves using % 10^n at each recursion step, but pre-compute and cache the divisors' inverses.

    @tim-one
    Member Author

    tim-one commented Jan 31, 2022

    The factorial of a million is much smaller than the case I was looking at. Here are rough timings on my box, for computing the decimal string from the bigint (and, yes, they all return the same string):

    native: 475 seconds (about 8 minutes)
    numeral: 22.3 seconds
    todecstr: 4.10 seconds
    gmp: 0.74 seconds

    "They recursively split the bigint into halves using % 10^n at each recursion step". That's the standard trick for "output" conversions. Beyond that, there are different ways to try to use "fat" multiplications instead of division. The recursive splitting all on its own can help, but dramatic speedups need dramatically faster multiplication.

    todecstr treats it as an "input" conversion instead, using the decimal module to work mostly in base 10. That, by itself, reduces the role of division (to none at all in the Python code), and decimal has a more advanced multiplication algorithm than CPython's bigints have.
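
    A minimal sketch of that standard "output" trick, using plain divmod (with CPython's quadratic division this alone doesn't pay off; fast division is what makes it fly):

    def int_to_str(a):
        # each inner(a, k) call returns exactly 2**(k+1) digits, so no
        # explicit zero-padding is needed below the top level
        if a < 0:
            return '-' + int_to_str(-a)
        pow10 = [10]  # pow10[k] == 10 ** (2 ** k)
        while pow10[-1] * pow10[-1] <= a:
            pow10.append(pow10[-1] * pow10[-1])

        def inner(a, k):
            if k < 0:
                return str(a)  # a single digit
            hi, lo = divmod(a, pow10[k])
            return inner(hi, k - 1) + inner(lo, k - 1)

        return inner(a, len(pow10) - 1).lstrip('0') or '0'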

    @tim-one
    Member Author

    tim-one commented Jan 31, 2022

    todecstr treats it as an "input" conversion instead, ...

    Worth pointing this out since it doesn't seem widely known: "input" base conversions are _generally_ faster than "output" ones. Working in the destination base (or a power of it) is generally simpler.

    In the math.factorial(1000000) example, it takes CPython more than 3x longer for str() to convert it to base 10 than for int() to reconstruct the bigint from that string. Not an O() thing (they're both quadratic time in CPython today).
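
    A quick way to reproduce that comparison (times are machine-dependent):

    import math, time

    a = math.factorial(1000000)  # about 5.5 million decimal digits

    t0 = time.perf_counter()
    s = str(a)  # int -> str: the "output" conversion
    t1 = time.perf_counter()
    b = int(s)  # str -> int: the "input" conversion
    t2 = time.perf_counter()

    assert a == b
    print(f'str: {t1 - t0:.1f}s  int: {t2 - t1:.1f}s  ratio: {(t1 - t0) / (t2 - t1):.1f}')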

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @adamant-pwn

    adamant-pwn commented Sep 5, 2022

    Hi, why is this issue closed? What needs to be done to make it through?

    @tim-one
    Member Author

    tim-one commented Sep 5, 2022

    It's closed because nobody appears to be both willing and able to pursue it.

    But it's on the edge regardless. In general, any number of huge-int algorithms could be greatly speeded, but that's a massive undertaking. It's why GMP exists, to push such things as far as possible, regardless of implementation complexity, effort, or bulk. Very capable GMP bindings for Python are already available (gmpy2).

    As the timings here suggest, GMP will generally be faster than anything we may do anyway, because GMP never reaches a point where its authors say "good enough already".

    That said, I expect the single most valuable bigint speedup CPython could implement would be to bigint division, along the lines of gh-47701. That could indirectly give major speed boosts to bigint->str and bigint modular pow() too.

    @bjorn-martinsson

    bjorn-martinsson commented Sep 6, 2022

    Why do you need fast division? Why not just implement str to int conversion using this

    12345678 = 1234 * 10^4 + 5678 = (12 * 10^2 + 34) * 10^4 + (56 * 10^2 + 78) = ...
    

    style of d&c?

    This would be simple to implement and would only require multiplication.

    If $M(n)$ is the time it takes to multiply two $n$ bit numbers, then the cost of string to int conversion using this d&c is
    $T(n) = 2 T(n/2) + M(n)$.

    If multiplication is done using Karatsuba ( $M(n) = O(n^{1.58})$ ) then $T(n) = 3 M(n)$.
    If multiplication is done in $M(n) = O(n \log n)$ time, then $T(n) = O(n \log^2 n)$.
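
    Unrolling the recurrence shows where the $3 M(n)$ comes from: with $M(n) = c \, n^{\log_2 3} \approx c \, n^{1.585}$, each level of the recursion costs $2 / 2^{1.585} = 2^{-0.585} \approx 0.667$ times the level above it, so the levels form a geometric series:

    $$T(n) = \sum_{i \ge 0} 2^i \, M\!\left(\frac{n}{2^i}\right) = M(n) \sum_{i \ge 0} 2^{-0.585\,i} = \frac{M(n)}{1 - 2^{-0.585}} \approx 3\,M(n)$$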

    Since Python currently uses Karatsuba for its big int multiplication, this simple str to int conversion algorithm would run in $O(n^{1.58})$.

    @adamant-pwn

    adamant-pwn commented Sep 6, 2022

    Yep, even the simplest divide and conquer with native multiplication would already provide a significant speed-up.

    @bjorn-martinsson

    bjorn-martinsson commented Sep 6, 2022

    Also if you want something slightly smarter, since 10 is $2 * 5$ you can use bitshifts.

    12345678 = 1234 * 10^4 + 5678 = ((1234 * 5^4) << 4) + 5678
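
    A quick check of the identity ( $10^n = 5^n \cdot 2^n$ ):

    assert ((1234 * 5**4) << 4) + 5678 == 1234 * 10**4 + 5678 == 12345678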
    

    @bjorn-martinsson

    bjorn-martinsson commented Sep 6, 2022

    I made an implementation of the basic d&q algorithm to test it out

    pow5 = [5]  # pow5[k] == 5 ** (2 ** k), built by repeated squaring
    while len(pow5) <= 22:
        pow5.append(pow5[-1] * pow5[-1])
    
    def str_to_int(s):
        def _str_to_int(l, r):
            if r - l <= 3000:
                return int(s[l:r])  # small enough for the quadratic built-in
            # split off the largest power-of-2 count of trailing digits
            lg_split = (r - l - 1).bit_length() - 1
            split = 1 << lg_split
            # 10**split == pow5[lg_split] << split, so no division is needed
            return ((_str_to_int(l, r - split) * pow5[lg_split]) << split) + _str_to_int(r - split, r)
        return _str_to_int(0, len(s))

    Running this locally, str_to_int is as fast as int at about 3000 digits.
    For 40000 digits str_to_int takes 0.00722 s and int takes 0.0125 s.
    For 400000 digits str_to_int takes 0.272 s and int takes 1.27 s.
    For 4000000 digits str_to_int takes 10.3 s and int takes 127 s.

    Clearly str_to_int is subquadratic, with the same time complexity as Karatsuba, while int is quadratic. Also worth noting: with faster big int multiplication, str_to_int would run a lot faster still.
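
    Sanity-checking those timings against the claimed complexities (each step is 10x more digits, so the growth exponent is the log10 of the time ratio):

    from math import log10
    print(log10(0.272 / 0.00722))  # ~1.58: str_to_int matches Karatsuba
    print(log10(10.3 / 0.272))     # ~1.58 again
    print(log10(127 / 1.27))       # 2.0: int() is quadratic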

    @bjorn-martinsson

    bjorn-martinsson commented Sep 6, 2022

    I also tried out str_to_int with GMP (gmpy2.mpz) integers instead of Python's big integers to see what faster big int multiplication could lead to.

    For 40000 digits str_to_int with GMP takes 0.000432 s.
    For 400000 digits str_to_int with GMP takes 0.00817 s.
    For 4000000 digits str_to_int with GMP takes 0.132 s.

    So what Python actually needs is faster big int multiplication. With that you'd get a really fast string to int converter for free.

    @tim-one
    Member Author

    tim-one commented Sep 6, 2022

    All the earlier timing examples here were of int->str, not of str->int. For the latter case, which you're looking at, splitting a decimal string by a power of 10 is indeed trivial (conceptually - but by the time you finish coding it all in C, with all the memory-management, refcounting, and error-checking cruft debugged, not so much 😉).

    As already noted, the report's todecstr() does the harder int->str direction without division too, relying instead on faster-than-Karatsuba multiplication, by doing most of the arithmetic in a power-of-10 base to begin with (via the decimal module, which does implement one faster-than-Karatsuba multiplication method). Python's bigints are represented internally in a power-of-2 base, and there is no linear-time way to split one by a power of 10. So todecstr() doesn't even try to; instead it splits by various powers of 2.

    But faster native bigint division could nevertheless buy a major speedup for int->str, as already demonstrated near the start here by showing timings for the numeral() function. Much faster than str(), although still much slower than todecstr().

    And it could also buy major speedups for modular pow(), which is increasingly visible as people playing with crypto schemes move to fatter keys.

    And it would, tautologically, speed bigint division too, which is a bottleneck all on its own in some apps.

    @gpshead
    Member

    gpshead commented Sep 6, 2022

    And it would, tautologically, speed bigint division too, which is a bottleneck all on its own in some apps.

    It would be useful to have practical examples of actual Python applications that would benefit from any of:
    a) high performance huge int MUL or DIV
    b) high performance huge int to decimal or other non-binary-power base conversion.
    c) high performance huge int from decimal string conversion.

    Including whether they've adopted a third-party library for this math and would no longer have any need for it if CPython improved.

    Where "high performance" means more than constant-time faster than what CPython int offers today. I'll leave the definition of "huge" up to you, but if the need is less than 5 of CPython's current implementation detail 30-bit value digits I'm skeptical. :)

    Bigints really feel like a neat computer science toy most of the time. We've got them, but what actually uses integers larger than 64 or 128 bits?

    Similarly, once numbers get huge, what is the point of converting them to and from decimal? Decimal is for humans and humans don't readily comprehend precise numbers that large let alone type them correctly.

    @tim-one
    Member Author

    tim-one commented Sep 6, 2022

    "Practical applications" isn't really the name of this game 😉.

    Most complaints come from more-or-less newbies, who, e.g., play around in an interactive shell, and are surprised to find that sometimes the shell seems to freeze (but is really just waiting for a quadratic-time, or worse, bigint operation to finish).

    People who know what they're doing find many uses for big bigints, but more of a theoretical, research, educational, or recreational bent. Like, algorithm design, combinatorics, number theory, or contributing to the wonderful On-Line Encyclopedia of Integer Sequences. They also come as incidental side effects of using power tools, like in a symbolic math package computing symbolic power series or (for related reasons) symbolic high derivatives (integer coefficients and/or denominators often grow at an exponential rate across terms). Or they can be almost the whole banana, such as scaling an ill-conditioned float matrix to use exact bigint arithmetic instead to compute an exact matrix inverse or determinant (where sticking to floats the result could be pure noise due to compounding rounding errors).

    Not all that long ago, a number of people worked on adding "gonzo" optimizations to math.comb(). I played along, but was vocally disinclined to do ever more. Of course GMP still runs circles around CPython for many comb() bigint arguments. Such efforts would, IMO, have been better spent on speeding bigint // (which, as a matter of course, would also give another speed boost to comb() - and to any number of other bigint algorithms in users' own Python code - // is a fundamental building block, not "an application").

    @bjorn-martinsson

    bjorn-martinsson commented Sep 6, 2022

    One good example of big integer use is math people using Python for some quick calculations. When you use functions like factorial, it is easy to get pretty big integers. I also think that math people generally prefer decimal numbers over hex.

    Another example would be working with fractions. Just adding fractional numbers makes their denominators and numerators grow really quickly. I also don't think fractions support hex. Both reading and printing require decimal as far as I can tell. For example you can do fractions.Fraction('1/2') but not fractions.Fraction('0x1/0x2').

    The examples I've given above are the kinds of scripts that won't appear in things like Google's codebase, even though they are relatively common use cases.

    Another important thing to note is that programmers trust the built in functions in Python to just work. That is the fundamental reason why int <=> str became an issue in the first place. This is also a great argument for why it is important to fix the time complexity issues with big ints.

    @nascheme
    Member

    nascheme commented Sep 8, 2022

    I think Tim's suggestion of switching to a Python implementation of a more efficient algorithm for large inputs would be a good approach. It is reasonable to not want to maintain a complicated high-performance algorithm written in C. This would be an interesting project for a student or someone wanting a challenge. Calling into Python from C is not too hard to do. Use a C-API to import a Python module and call a function inside of it.

    @tim-one
    Member Author

    tim-one commented Sep 24, 2022

    Reminds me of this loop in Mark's divmod.

    That's pretty shallow 😉. Each loop in a call to _divmod_pos(a, b) goes around about a.bit_length() / b.bit_length() times. It's viewing a as a sequence of digits in base 2**b.bit_length(). If, e.g., a <= b**2, never more than twice. Pass a large power of b for a, and watch them go around as often as you care to force it.

    @pochmann
    Contributor

    pochmann commented Sep 24, 2022

    You're not talking about the loop that I linked to / highlighted, are you? The while r < 0: one.

    @tim-one
    Member Author

    tim-one commented Sep 24, 2022

    You're not talking about the loop that I linked to / highlighted, are you?

    I thought I was, but on my screen I see now that the highlighting is barely visible and my eye was drawn to the middle of the screen. The loop it appears you're really talking about:

    The while r < 0: one.

    shows up as the very topmost 3 lines of the screen.

    Yup, that's not obvious (to me) at all. In Burnikel and Ziegler's paper, this is no loop, but instead what you'd get if you unrolled Mark's loop twice. I assume (but don't know) Mark just wanted to cut the code bulk, and didn't care about making a third
    useless but dirt-cheap r < 0 test in the unlikely case q had to be decremented twice.

    But you'll have to read their paper for proof that twice is enough. I never fought my way to an understanding of it, and found their exposition difficult to follow.

    @tim-one
    Member Author

    tim-one commented Sep 24, 2022

    Settling for computing reciprocal approximations does buy real speedup, saving about a third of the original code's cycles. This is somewhat aggressive, rounding the approximation of 1/T to just one more decimal digit of precision than T has. Nevertheless, the correction loops following don't find much in need of correcting. When they do, corrections in both directions can be needed, but so far I haven't seen an adjustment (to the quotient) of more than one needed.

    Alas, this still takes over twice the time of the PR code at 1 million bits, although the gap is shrinking.

    # at 10 million bits
    29.88990306854248 str_to_int_using_decimal
    20.070303916931152 dec2 # the code below, now the clear winner
    24.412405014038086 str_to_int # in Neil's PR
    

    And the code:

    def dec2inner(s, ctx, bits):
        from decimal import Decimal, ROUND_DOWN, MAX_PREC
        asint = Decimal(1)
        d = Decimal(s)
        div = Decimal(2) ** bits
        rdiv = 1 / div
        divs = []
        while div <= d:
            ctx.prec = div.adjusted() + 2
            divs.append((div, +rdiv)) # `+rdiv` rounds back rdiv
            ctx.prec = MAX_PREC
            div *= div
            rdiv *= rdiv
        digits = [d]
        for div, rdiv in reversed(divs):
            newdigits = []
            for digit in digits:
                q = (digit * rdiv).quantize(asint, rounding=ROUND_DOWN)
                r = digit - q * div
                if r < 0:
                    while True:
                        q -= 1
                        r += div
                        if r >= 0:
                            break
                else:
                    while r >= div:
                        q += 1
                        r -= div
                newdigits.append(q)
                newdigits.append(r)
            digits = newdigits
            if not digits[0]:
                del digits[0]
        b8 = bits // 8
        b = b''.join(int(digit).to_bytes(b8, 'big')
                     for digit in digits)
        return int.from_bytes(b, 'big')
    
    def dec2(s, nbits=1024):
        import decimal
        with decimal.localcontext() as ctx:
            ctx.prec = decimal.MAX_PREC
            ctx.Emax = decimal.MAX_EMAX
            ctx.Emin = decimal.MIN_EMIN
            return dec2inner(s, ctx, nbits)

    @tim-one
    Member Author

    tim-one commented Sep 25, 2022

    Surprising observation: in that code, for the 10-million digit case, 5% of the total time is spent on the final execution of rdiv *= rdiv, whose result is never used. That's easy to avoid, albeit obscure.

    The new version here does that, and sets the context to ROUND_DOWN so that the first q guess is never too large. That simplifies the correction code, but doesn't speed it. FYI, a correction appears to be needed in about a third of a percent of cases.

    def dec2inner(s, ctx, bits):
        from decimal import Decimal, MAX_PREC
        d = Decimal(s)
        div = Decimal(2) ** bits
        rdiv = 1 / div
        divs = []
        needrdiv = False
        while div <= d:
            if needrdiv:
                rdiv *= rdiv
            ctx.prec = div.adjusted() + 2
            # `+rdiv` rounds back rdiv; context is ROUND_DOWN
            divs.append((div, +rdiv))
            ctx.prec = MAX_PREC
            div *= div
            needrdiv = True
        digits = [d]
        for div, rdiv in reversed(divs):
            newdigits = []
            for digit in digits:
                q = (digit * rdiv).to_integral_value() # ctx truncates
                r = digit - q * div
                assert r >= 0
                while r >= div:
                    q += 1
                    r -= div
                newdigits.append(q)
                newdigits.append(r)
            digits = newdigits
            if not digits[0]:
                del digits[0]
        b8 = bits // 8
        b = b''.join(int(digit).to_bytes(b8, 'big')
                     for digit in digits)
        return int.from_bytes(b, 'big')
    
    def dec2(s, nbits=1024):
        import decimal
        with decimal.localcontext() as ctx:
            ctx.prec = decimal.MAX_PREC
            ctx.Emax = decimal.MAX_EMAX
            ctx.Emin = decimal.MIN_EMIN
            ctx.rounding = decimal.ROUND_DOWN
            return dec2inner(s, ctx, nbits)

    mdickinson pushed a commit that referenced this issue Sep 25, 2022
    This is a preliminary PR to refactor `PyLong_FromString`, which is currently quite messy and has spaghetti-like code that mixes up different concerns as well as duplicating logic.
    
    In particular:
    
    - `PyLong_FromString` now only handles sign, base and prefix detection and calls a new function `long_from_string_base` to parse the main body of the string.
    - The `long_from_string_base` function handles all string validation and then calls `long_from_binary_base` or a new function `long_from_non_binary_base` to construct the actual `PyLong`.
    - The existing `long_from_binary_base` function is simplified by factoring duplicated logic to `long_from_string_base`.
    - The new function `long_from_non_binary_base` factors out much of the code from `PyLong_FromString` including in particular the quadratic algorithm referred to in gh-95778 so that this can be seen separately from unrelated concerns such as string validation.
    @tim-one
    Member Author

    tim-one commented Sep 27, 2022

    EDIT: changed int(Decimal) to int(str(Decimal)) because the latter is faster(!); removed the tail-recursion "optimization" because it added some obscurity (and an indentation level) for no measurable gain.

    I've run out of ideas for speeding the decimal version of string->int, so will just leave the current state here for posterity. It's about 15% faster than the last version at 10 million digits, quite a bit better still at some other digit lengths, and takes less than twice the time of the code in Neil's PR at 1 million digits now.

    Major changes:

    • Like the other converters in the PR, rewrote to use a recursive cached "compute power" routine. This enables the work to be balanced as evenly as possible regardless of digit-string length, and never requires computing a trial divisor larger than about the square root of the input.

    • digit // div is approximated by truncating digit * rdiv to an integer, where rdiv is 1/div truncated to a number of significant digits two larger than div has. That's basically a 2N x N -> 3N bit multiplication (for some N), but we only want no more than the leading N bits of the result. The lowermost N bits of digit have very little effect on the retained part of the result, so now, before the multiplication, we also truncate digit to the same number of significant digits as rdiv has. That makes it a skinnier (faster! although decimal still computes 2N bits internally) N x N -> N bit multiplication. In return, the correction loop gets involved more often in some cases. Doesn't really matter, as it's relatively cheap, is still invoked rarely, and I haven't seen the loop go around more than once.

    CAUTION: this needs rigorous error analysis. "Haven't seen" doesn't mean "can't happen" ☹️.

    At 100 million digits, it's well over twice as fast as the original now:

    # at 100 million digits
    928.7468802928925  str_to_int in Neil's PR
    458.11627554893494 str_to_int_using_decimal
    289.3329448699951  dec2 # last version posted
    194.87010216712952 dec3 # the code below
    

    Decimal str->int code:

    def dec3(s, nbytes=128):
        from collections import defaultdict
        import decimal
        from decimal import Decimal, MAX_PREC
    
        cache = {}
        rcache = {}
        def cpow(n):
            if (result := cache.get(n)) is None:
                if n - 1 in cache:
                    div = cache[n-1][0] * D256
                    rdiv = rcache[n-1] * rD256
                elif n <= nbytes:
                    div = D256 ** n
                    rdiv = 1 / div
                else:
                    div = cpow(n1 := n >> 1)[0] * cpow(n2 := n - n1)[0]
                    rdiv = rcache[n1] * rcache[n2]
                rcache[n] = rdiv
                ctx.prec = div.adjusted() + 3
                cache[n] = result = div, (+ rdiv)
                ctx.prec = MAX_PREC
            return result
    
        # `spread` maps a correction loop count to the number of times
        # that count was needed; all code involving it should be removed
        # when this is all debugged
        spread = defaultdict(int)
        result = bytearray()
        add = result.extend
        def inner(x, n):
            # assert 0 <= x < D256 ** n
            if n <= nbytes:
                # XXX Stefan Pochmann discovered that, for 1024-bit
                # ints, `int(Decimal)` took 2.5x longer than
                # `int(str(Decimal))`. So simplify this code to the
                # former if/when that gets repaired.
                add(int(str(x)).to_bytes(n, 'big'))
                return
            # If `n` is odd it's vital that we split on n//2 + 1 for the
            # approximations to be justified. If we split on the smaller
            # n//2, the correction loop can go around a lot more times,
            # unless we also keep more digits in `rdiv`, `x`, and the
            # multiplication, than `div.adjusted() + 3`. About 3 more
            # digits than that appear to suffice, but the more digits we
            # use the more expensive arithmetic.
            n2 = (n + 1) >> 1
            div, rdiv = cpow(n2)
            ctx.prec = div.adjusted() + 3
            q = (+x * rdiv).to_integral_value() # ctx truncates
            ctx.prec = MAX_PREC
            x -= q * div
            assert x >= 0
            count = 0
            while x >= div:
                count += 1
                q += 1
                x -= div
            spread[count] += 1
            inner(q, n - n2)
            inner(x, n2)
    
        with decimal.localcontext() as ctx:
            ctx.prec = decimal.MAX_PREC
            ctx.Emax = decimal.MAX_EMAX
            ctx.Emin = decimal.MIN_EMIN
            ctx.rounding = decimal.ROUND_DOWN
    
            D256 = Decimal(256)
            rD256 = 1 / D256
    
            x = Decimal(s)
            ctx.prec = 20
            n = int(x.ln() / D256.ln() + 2)
            ctx.prec = MAX_PREC
            inner(x, n)
    
        print(sorted(spread.items()))
        del cache, rcache
        return int.from_bytes(result, 'big')

    @oscarbenjamin
    Contributor

    oscarbenjamin commented Oct 2, 2022

    I've just taken the time to read through all of the above. I like the idea of implementing some of these operations in Python (especially if it means introducing FFT multiplication).

    That being said, though, the main complaint of the CVE was str -> int, and I have a PR (gh-97550) which implements that in C using the obvious algorithm that reduces the problem to integer multiplication. I haven't timed all the various alternatives offered above, but I did compare with gh-96673, and it looks like runtime is basically the same, though I think the C implementation has lower memory overhead.

    @oscarbenjamin
    Contributor

    oscarbenjamin commented Oct 2, 2022

    Here comes the updated version. This is a proper implementation of Schönhage-Strassen following these notes.

    Very nice. A note to anyone wanting to run timings with this is that the parameter n (aka K = 2^k in other texts) should be chosen based on the bit size of the integers being multiplied rather than always set to 256. Larger integers should usually use larger values (although look at the "Fine-Grained Tuning" slide in the linked notes). GMP tunes thresholds for this empirically on a per architecture basis but broadly n should scale like sqrt(N) if N is the bit length of the inputs. In other words you approximately want to split a 1000000 bit input into 1000 digits each of which has 1000 bits. It is this square rooting of the problem size that leads to the log(log(N)) factor in the complexity of SSA.

    For example if multiplying a = 12345**500000; b = 13245**500000 then these are the timings:

    $ python ssa.py 
    Multiplying 6795820 numbers with n= 256
    SSA took 2.7558290379993196
    Multiplying 6795820 numbers with n= 512
    SSA took 0.7624775220001538
    Multiplying 6795820 numbers with n= 1024
    SSA took 0.6304274069998428
    Multiplying 6795820 numbers with n= 2048
    SSA took 0.5857483529998717
    Multiplying 6795820 numbers with n= 4096
    SSA took 0.709619131999716
    

    Note here that 2048**2 == 4194304 which is almost the closest power of 2 to the input size.
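
    A hypothetical tuning helper along those lines (the name and the bare square-root rule are illustrative, not part of the posted SSA code): pick the power of two nearest to sqrt(N).

    from math import isqrt

    def choose_n(N):
        # N is the input bit length; return the power of 2 closest to sqrt(N)
        r = isqrt(N)
        lo = 1 << (r.bit_length() - 1)  # largest power of 2 <= r
        return lo << 1 if (lo << 1) - r < r - lo else lo

    print(choose_n(6795820))  # -> 2048, matching the best timing above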

    You can see an example of the GMP thresholds here:
    https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/skylake/gmp-mparam.h#l83

    @pochmann
    Contributor

    pochmann commented Oct 2, 2022

    @tim-one You could maybe save a little more time by converting small Decimals to int not directly but via str. For 1024-bit values, that seems about 2.5 times as fast.

    Test code, results, and a note:

    Code:

    from timeit import timeit
    from random import getrandbits
    import decimal
    nbits = 1024
        
    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
            
        for _ in range(3):
            digit = decimal.Decimal(getrandbits(nbits))
            direct = timeit(lambda: int(digit), number=10000)
            viastr = timeit(lambda: int(str(digit)), number=10000)
            print(f'{direct / viastr = }')
    

    Results:

    direct / viastr = 2.5690248907712365
    direct / viastr = 2.5946376697481535
    direct / viastr = 2.6202966323441648
    

    I noticed this shortly after posting my original version, but didn't bother with it because that version was only relevant for very large numbers, where this speedup of the linear time part of the complexity was insignificant in the bigger picture. Maybe now with your improvements it would be worth doing.

    @oscarbenjamin
    Contributor

    oscarbenjamin commented Oct 2, 2022

    I don't understand why the emphasis here is on using the decimal module rather than making use of SSA (i.e. FFT multiplication) as shown above. Every other bigint library solves all of these problems by reducing every operation to multiplication and then focusing all optimisations on multiplication which means using SSA for large integers. If SSA was implemented then the other algorithms would all be trivial and integer multiplication would be faster (and then I don't think anyone would contemplate using decimal for any of this).

    @tim-one
    Member Author

    tim-one commented Oct 2, 2022

    I don't understand why the emphasis here is on using the decimal module

    Because decimal already implements NTT for multiplication of large values, and also division building on that with the same O() behavior. The code for that is already written in C, and is in a mature, widely used, and extensively tested external library (libmpdec) we don't maintain. It's here, right now, and has been for years. A competitive, production-quality SSA for CPython's bigints is still just a wish.

    I won't repeat yet again that ambition has forever been the fatal enemy of making real progress visible to CPython users here.

    @tim-one
    Member Author

    tim-one commented Oct 3, 2022

    @pochmann discovers that int(Decimal) takes about 2.5x longer than int(str(Decimal)) for 1024-bit ints

    Thanks - but sheesh 😉. The original entry in this report noted that int <-> decimal.Decimal conversions are also quadratic time now, but, alas, nothing we've done so far appears to have addressed that (although Neil's PR contains a fast int -> decimal.Decimal function, CPython's decimal implementation doesn't use it).

    Making the change does help a little, but, as you noted, tends to get lost in the noise on "really big" inputs. It cuts some tens of thousands of digits off the smallest value at which the str -> int here ties the one in Neil's PR - but that's still over 3 million digits on my desktop box.

    @tim-one
    Member Author

    tim-one commented Oct 4, 2022

    Notes on SSA and base conversion:

    • Bjorn's SSA code is so lovely it hurts 😄.

    • int->str is naturally done using decimal, with the division-free algorithm already in Neil's PR.

    • str->int is strained in decimal, because it requires division (or, as in dec3() here, a pair of multiplies getting the same effect via reciprocal approximations followed by correction). Using SSA with CPython's bigints should give Neil's PR's division-free str_to_int() better speed (apart from computing powers of 5, half the bigint multiplies as dec3() needs).

    • But SSA appears to be hurting in at least these two respects: (1) as Oscar showed, the precise power-of-2 splitting factor used can have dramatic effects on speed, the hard-coded 256 is rarely "best", and it's unclear how to make a good dynamic choice; and, (2) SSA appears to need to be smarter about when to "turn itself off".

    For an example of the last, I got much better str_to_int-adapted-to-use-SSA behavior after adding this to the start of SSA():

        if x1.bit_length() <= 10000 or x2.bit_length() <= 10000:
            p = x1 * x2
            return shift(p, M, 0) if M is not None else p

    By pursuing small things like that by hand for each different digit-string-length tried, it was easy enough to contrive a str_to_int_using_ssa (Neil's PR's str_to_int() with the two bigint multiplies replaced by SSA() calls) that beats dec3(). However, on digit strings of length one million (the target Stefan initially aimed at), the current str_to_int() is still a little faster. At 2 million digits, SSA beats str_to_int() (while dec3() doesn't win until reaching about 3.4 million digits).

    So:

    • SSA is promising, but needs fiddly tuning to be production quality.

    • For str->int the inputs it requires to beat the current PR's approach are so large that it would be astonishing if "DoS vulnerability" types cared one whit.

    @oscarbenjamin
    Contributor

    oscarbenjamin commented Oct 4, 2022

    • But SSA appears to be hurting in at least these two respects: (1) as Oscar showed, the precise power-of-2 splitting factor used can have dramatic effects on speed, the hard-coded 256 is rarely "best", and it's unclear how to make a good dynamic choice

    I think this problem can be solved in the same way as for GMP although perhaps done more simply. Just run a bunch of timings and choose thresholds like:

    if N < 500:
        n = 64
    elif N < 1000:
        n = 128
    # etc.

    That's how GMP used to work. The way GMP does it now is more complicated but that change in GMP is only credited with a smallish factor improvement. The paper describing that and other improvements is here:
    https://web.archive.org/web/20110720104333/http://www.loria.fr/~gaudry/publis/issac07.pdf

    The appropriate comparison when considering if it is worth using something like the SSA implementation above is not really with GMP but rather with the existing CPython implementation. In other words the bar to meet is that the code should be correct and then its speed should be not worse in any case and should be significantly better in at least some cases (to justify making any change at all).

    • For str->int the inputs it requires to beat the current PR's approach are so large that it would be astonishing if "DoS vulnerability" types cared one whit.

    Fair enough. Well, if Neil's PR currently gives the fastest str->int, then let's merge it. That would also make it easier to move other operations to pure Python, e.g. for SSA etc. later.

    nascheme added a commit that referenced this issue Oct 26, 2022
    Add Python implementations of certain longobject.c functions. These use
    asymptotically faster algorithms that can be used for operations on
    integers with many digits. In those cases, the performance overhead of
    the Python implementation is not significant since the asymptotic
    behavior is what dominates runtime. Functions provided by this module
    should be considered private and not part of any public API.
    
    Co-author: Tim Peters <tim.peters@gmail.com>
    Co-author: Mark Dickinson <dickinsm@gmail.com>
    Co-author: Bjorn Martinsson
    @vstinner
    Member

    vstinner commented Oct 31, 2022

    I like the Lib/_pylong.py feature: it was a long-awaited feature. Cool.

    I'm just curious about _pylong._DEBUG private attribute: is it really useful to ship it in a released Python version? Or is it only useful to people who designed and hacked Lib/_pylong.py? Can it be removed now?

    @nascheme
    Member

    nascheme commented Oct 31, 2022

    Can [the _DEBUG flag] be removed now?

    Yeah, I think it could be removed now since it is not very useful.

    vstinner added a commit to vstinner/cpython that referenced this issue Nov 3, 2022
    To debug the _pylong module, it's trivial to add this code again
    locally. There is no need to keep it in Python releases.
    @vstinner
    Member

    vstinner commented Nov 3, 2022

    Yeah, I think it could be removed now since it is not very useful.

    Ok, I created PR #99063 to remove _pylong._DEBUG flag.

    vstinner added a commit that referenced this issue Nov 3, 2022
    To debug the _pylong module, it's trivial to add this code again
    locally. There is no need to keep it in Python releases.
    gpshead added a commit to gpshead/cpython that referenced this issue Nov 3, 2022
    * Properly decref on _pylong import error.
    * Improve the error message on _pylong TypeError.
    * Tie the return value comments together.
    
    These are minor followups to issues not caught among the reviewers on
    python#96673.
    gpshead added a commit that referenced this issue Nov 3, 2022
    * Properly decref on _pylong import error.
    * Improve the error message on _pylong TypeError.
    * Fix the assertion error in pydebug builds to be a TypeError.
    * Tie the return value comments together.
    
    These are minor followups to issues not caught among the reviewers on
    #96673.
    vstinner added a commit to vstinner/cpython that referenced this issue Nov 4, 2022
    Fix validated by:
    
        $ ./python -m test -R 3:3 test_int
        Tests result: SUCCESS
    vstinner added a commit that referenced this issue Nov 4, 2022
    Fix validated by:
    
        $ ./python -m test -R 3:3 test_int
        Tests result: SUCCESS