Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C API: Consider adding public PyLong_AsByteArray() and PyLong_FromByteArray() functions #111140

Closed
vstinner opened this issue Oct 20, 2023 · 15 comments
Labels
topic-C-API type-feature A feature request or enhancement

Comments

@vstinner
Copy link
Member

vstinner commented Oct 20, 2023

Feature or enhancement

The private _PyLong_AsByteArray() and _PyLong_FromByteArray() functions were removed in Python 3.13: see PR #108429.

@scoder asked what is the intended replacement for _PyLong_FromByteArray().

The replacement for _PyLong_FromByteArray() is PyObject_CallMethod((PyObject*)&PyList_Type, "from_bytes", "s#s", str, len, "big") but I'm not sure what is the easy way to set the signed parameter to True (default: signed=False).

The replacement for _PyLong_AsByteArray() is PyObject_CallMethod(my_int, "to_bytes", "ns", length, "big"). Same, I'm not sure how to easy set the signed parameter to True (default: signed=False).

I propose to add public PyLong_AsByteArray() and PyLong_FromByteArray() functions to the C API.

Python 3.12 modified PyLongObject: it's no longer a simple array of digits, but it's now a more less straightforward _PyLongValue structure which requires using unstable functions to access small "compact" values:

  • PyUnstable_Long_IsCompact()
  • PyUnstable_Long_CompactValue()

So having a reliable and simple way to import/export a Python int object as bytes became even more important.


A code search for _PyLong_AsByteArray in PyPI top 5,000 projects found 12 projects using it:

  • Cython (0.29.36)
  • blspy (2.0.2)
  • catboost (1.2)
  • fastobo (0.12.2)
  • gevent (22.10.2)
  • guppy3 (3.1.3)
  • line_profiler (4.0.3)
  • msgspec (0.16.0)
  • orjson (3.9.1)
  • pickle5 (0.0.12)
  • pyodbc (4.0.39)
  • rlp (3.0.0)
@vstinner vstinner added type-feature A feature request or enhancement topic-C-API labels Oct 20, 2023
@vstinner vstinner changed the title C API: Consider adding a public PyLong_AsByteArray() and PyLong_FromByteArray() functions C API: Consider adding public PyLong_AsByteArray() and PyLong_FromByteArray() functions Oct 20, 2023
@scoder
Copy link
Contributor

scoder commented Oct 21, 2023

Thanks for creating the issue. I agree that the functions should be added. The current replacements seem awful for this kind of basic functionality. Going through an expensive Python call like this for converting between PyLong and large C integer types (int128_t) seems excessive.

Note that at least a couple of projects that you list use Cython implemented parts and thus probably just mention the function in there. I'm sure something like line_profiler would never end up calling it.

@serhiy-storchaka
Copy link
Member

It was already discussed several times. This API lacks some features which would be needed for general use. You need to know the size of the resulting bytes array. Calculating it is not trivial, especially for negative integers. Also, it would be core convenient to support "native" ending, not just "big"/"littel".

I have been thinking about implementing a similar API for mpz_import/mpz_export in GMP (https://gmplib.org/manual/Integer-Import-and-Export) or mp_unpack/mp_unpack in libtommath (https://github.com/libtom/libtommath). It is powerful and a kind of standard. It is general enough to support internal PyLong represenation and the marshal format used for long integers (15 bits packed in 16-bit words). But it is only for unsigned integers, you are supposed to store the sign separately. It should be extended to support negative integers in several formats, for convenience and for performance.

@vstinner
Copy link
Member Author

I'm not sure that passing the endian as a string is efficient if this function is part of hot code.

@serhiy-storchaka
Copy link
Member

Not as a string. Just 3-variant value native/little/big instead of boolean little/big.

@vstinner
Copy link
Member Author

Not as a string. Just 3-variant value native/little/big instead of boolean little/big.

Sorry, I was confused between C API (int for the endian) and the Python API (string for the endian).

@scoder
Copy link
Contributor

scoder commented Oct 21, 2023

It was already discussed several times. This API lacks some features which would be needed for general use.

Ok, then please put the existing function back in until there is a public replacement.

@scoder
Copy link
Contributor

scoder commented Oct 21, 2023

I created PR #111162

@vstinner
Copy link
Member Author

This API lacks some features which would be needed for general use.

Which features are missing?

Do you have links to previous discussions if it was discussed previously?

@vstinner
Copy link
Member Author

You need to know the size of the resulting bytes array.

wcstombs() can be called with NULL buffer to compute the buffer size. It avoids to have to provide a second API "calculate the buffet size".

I suppose that a common use case is also to convert a Python int object to C int type for which there is no C API. Like int128_t

@scoder
Copy link
Contributor

scoder commented Oct 23, 2023 via email

@encukou
Copy link
Member

encukou commented Oct 23, 2023

I'm partial to API like mpz_export that accepts a buffer and length, and can:

  • Successfully fill the buffer, and output the length
  • Not fill the buffer, but output the necessary length

This allows handling the common cases with a minimum of function calls:

  • If the size is already known, you only need one function call
  • If the caller can pre-allocate a buffer, and the value happens to fit, this is again one function call
  • If the size is unknown, and a pre-allocated buffer can't be used or turns out to be too small, it's two function calls

But, IMO we also need general API for exporting/importing Python objects to/from buffers (see e.g. #15 in capi-workgroup/problems), and it would be good to make this consistent.

I'd prefer adding the original functions back until we design a proper replacement.

@scoder
Copy link
Contributor

scoder commented Oct 24, 2023

mpz_export() looks like a good blue print. It's based on "word data", though, not bytes. I'm not sure if that design is something important to copy. There are certainly use cases for this (it resembles SIMD-like operations, for example). However, I'm not convinced that reordering bytes in native/little/big-endian words is an important feature for a PyLong C-API. Whoever needs this kind of special functionality can probably implement it in a separate pass on their side. A simple byte array export in (overall) native/little/big ordering seems sufficient to get the data in and out.

Note that it goes together with an mpz_sgn() function to query the sign since that is not part of the export. That's reasonable since the sign is unlikely to become part of the internal PyLong digits representation even in the future. Given that Py_SIZE() cannot be used for the sign detection any more, a stable function and a fast inline function would both be helpful for this.

So, basically, the proposal is to add

  • a function PyLong_AsByteArray() for the unsigned export, modelled after mpz_export
  • a function PyLong_FromByteArray() for the unsigned import, modelled after mpz_import
  • a function PyLong_Sign() to detect the sign as -1, 0, 1
  • an inline function PyUnstable_Long_Sign() to read the sign without checks and ABI compatibility guarantees
  • a function PyLong_BitLength() to count the number of bits used by the PyLong value

IMO we also need general API for exporting/importing Python objects to/from buffers, and it would be good to make this consistent.

It's not strictly related, though. I think a PyLong number is sufficiently different from an arbitrary Python object array to not require a highly similar interface. If it can be made similar, fine. I wouldn't hold up one for the other, though.

Regarding Serhiy's concerns about missing ABI-stability of enum flags and arguments: we've used C macro names for this for ages, and they turn into simple integers that can be passed as C int arguments. Just #define some names and people will use them.

encukou pushed a commit that referenced this issue Oct 25, 2023
* gh-106320: Re-add _PyLong_FromByteArray(), _PyLong_AsByteArray() and _PyLong_GCD() to the public header files since they are used by third-party packages and there is no efficient replacement.

See #111140
See #111139

* gh-111262: Re-add _PyDict_Pop() to have a C-API until a new public one is designed.
@vstinner
Copy link
Member Author

See also comments about removed _PyLong_New(): #108604 (comment)

@vstinner
Copy link
Member Author

_PyLong_FromByteArray(), _PyLong_AsByteArray() and _PyLong_GCD() functions were restored by commit a8a89fc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-C-API type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

4 participants