gh-89635: Add PyUnicode_DecodeUnicodeEscapeStateful() and PyUnicode_DecodeRawUnicodeEscapeStateful() #28955
@@ -611,6 +611,15 @@ PyAPI_FUNC(PyObject*) PyUnicode_DecodeUnicodeEscape(
    const char *errors /* error handling */
    );

#if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x030b0000
If you add something to the limited C API, you must update Misc/stable_abi.txt and then run make regen-limited-abi.
I'm fine with adding a stateful variant to the Python C API, but I don't think that it's worth adding a function to the limited C API for that. There are already other ways to access the feature, starting with the str.encode() and bytes.decode() methods, which can be called in C with the generic CallMethod functions.
I would prefer to keep the limited C API as small as possible.
cc @encukou
It makes sense. I do not have a strong opinion about this. On the one hand, these codecs are the only multibyte codecs implemented in C which do not provide public stateful variants of the API. On the other hand, these codecs are Python 2 legacy and not very useful as is.
For consistency, please do add those APIs. The Unicode API was designed to be a rich API, making it easy to use from C extensions. The unicode escape codecs are not Python 2 legacy, they are still useful when it comes to encoding Unicode in an ASCII compatible format.
Why do you want to add these 2 functions to the stable ABI? If someone cares about the C API, there are other existing C API functions to use this codec.
These are new APIs, providing new useful functionality. The existing APIs cannot be used for stateful decoding. That's the story behind a rich C API - you expose all available functionality usable in C extensions, so that C extensions don't have to resort to slower high level APIs.
Looks good. Thanks, Serhiy.
The only issue with the "unicode-escape" codec is that the decoder accepts non-ASCII bytes and decodes them as Latin-1. It looks like a Python 2 legacy. Using UTF-8 or treating them as errors would be more useful in Python 3, but it is difficult to change now. It is used in the Python parser (it requires pre-processing), but a str-to-str codec would be more convenient in that case. However, the "raw-unicode-escape" codec is more problematic. It is not usable at all without some pre- or post-processing (you need at least to escape/unescape the backslash to make it unambiguous). It is only used in the text pickle protocol 0, although the C implementation uses a modified implementation for the encoder. Maybe we should deprecate it. I am thinking about adding more useful codecs.
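The Latin-1 behaviour described above can be seen with a short Python sketch. The variable names are illustrative only.

```python
# UTF-8 encoding of "é" is two non-ASCII bytes: 0xC3 0xA9.
utf8_bytes = "\u00e9".encode("utf-8")

# The "unicode-escape" decoder does not reject these bytes; it maps each
# one to the code point with the same value, exactly like Latin-1.
decoded = utf8_bytes.decode("unicode-escape")
same_as_latin1 = decoded == utf8_bytes.decode("latin-1")

# Escape sequences themselves are still interpreted as expected.
from_escape = b"\\u00e9".decode("unicode-escape")
```

Here `decoded` is the two-character string U+00C3 U+00A9 rather than "é", which is why UTF-8 input has to be pre-processed before it is passed to this codec.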
On 15.10.2021 14:35, Serhiy Storchaka wrote:
> The only issue with the "unicode-escape" codec is that the decoder accepts
> non-ASCII bytes and decodes them as Latin-1. It looks like a Python 2 legacy.
> Using UTF-8 or treating them as errors would be more useful in Python 3, but it
> is difficult to change now. It is used in the Python parser (it requires
> pre-processing), but a str-to-str codec would be more convenient in that case.
> However, the "raw-unicode-escape" codec is more problematic. It is not usable
> at all without some pre- or post-processing (you need at least to escape/unescape
> the backslash to make it unambiguous). It is only used in the text pickle
> protocol 0, although the C implementation uses a modified implementation for
> the encoder. Maybe we should deprecate it.
I'm not sure I understand what you mean with that last paragraph. The difference
between the regular and raw variants is the same as for Python regular and raw
string literals.
Those two codecs were used in Python 2 to parse Unicode string literals
(regular and raw literals).
In Python 3, only the regular codec is still used for parsing literals.
The raw codec is used in pickle.c for Python 2 and 3. In Python 3, pickle uses a
variant which also escapes backslash and newlines when encoding Unicode strings.
…--
Marc-Andre Lemburg
http://www.malemburg.com/
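The pickle behaviour mentioned in the message above (protocol 0 escaping backslashes and newlines on top of "raw-unicode-escape") can be observed with a small Python sketch. The exact byte layout of the pickle is an implementation detail, so only the escaped fragments are checked here.

```python
import pickle

# A string with a backslash, a newline and a character outside Latin-1.
text = "a\\b\n\u20ac"

# Protocol 0 is the text protocol discussed in this thread.
data = pickle.dumps(text, protocol=0)
restored = pickle.loads(data)

# The backslash and the newline are written as \uXXXX escapes, so the
# pickled string is unambiguous and stays on a single line.
has_escaped_backslash = b"\\u005c" in data
has_escaped_newline = b"\\u000a" in data
```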
It was so in Python 2. In Python 2 escape sequences
To make the "raw-unicode-escape" encoding reversible, you first need to replace backslashes:
encoded = text.replace('\\', r'\u005c').encode('raw-unicode-escape')
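A short Python sketch of the ambiguity and of the backslash replacement suggested above. `encode_reversible` is a hypothetical helper name wrapping the one-liner from the comment.

```python
def encode_reversible(text):
    """Escape backslashes first so the encoding can be decoded unambiguously."""
    return text.replace("\\", r"\u005c").encode("raw-unicode-escape")


literal = "\\u20ac"  # six characters: backslash, 'u', '2', '0', 'a', 'c'
euro = "\u20ac"      # one character: the euro sign

# Without pre-processing, both strings produce the same bytes, so the
# decoder cannot tell them apart.
plain_literal = literal.encode("raw-unicode-escape")
plain_euro = euro.encode("raw-unicode-escape")

# With the backslash replaced, each string round-trips to itself.
roundtrip_literal = encode_reversible(literal).decode("raw-unicode-escape")
roundtrip_euro = encode_reversible(euro).decode("raw-unicode-escape")
```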
This PR is stale because it has been open for 30 days with no activity.
Hum, the PR diff is hard to read. It contains unrelated changes likely caused by merges. Can you try to rebase the PR instead?
@pablogsal, could you please merge this PR? It adds a new C API, so it would be simpler to do this before beta1.
I don't see why adding a public C API is required to fix Python codecs. Can't you add them to the internal C API and use them in Modules/_codecsmodule.c?
They are already in the internal C API. I want them to be available for users. It fixes a flaw in the Unicode C API.
On 06.05.2022 15:17, Victor Stinner wrote:
> @pablogsal <https://github.com/pablogsal>, could you please merge this PR?
> It adds a new C API, so it would be simpler to do this before beta1.
> I don't see why adding a public C API is *required* to fix Python codecs. Can't
> you add them to the internal C API and use them in Modules/_codecsmodule.c?
As we've discussed before, these APIs expose missing stateful parts to
the raw-unicode-escape as public C APIs.
This is not a fix to codecs. The APIs provide a new feature to make the
codec usable as an incremental codec from C.
…--
Marc-Andre Lemburg
http://www.malemburg.com/
Which project requires a public C API for these codecs? Do you have project names? You can already use an incremental decoder or encoder in C: just use the codecs API in C. IMO using these C APIs for the unicode-escape and raw-unicode-escape encodings is overkill and we should do the opposite: deprecate most of the C API related to codecs and only keep the bare minimum, like the C API for the ASCII and UTF-8 encodings. I'm not convinced that a C API is required for performance. I don't see any benchmark in this PR to justify adding more functions to the C API.
I don't understand that. @malemburg wrote "This is not a fix to codecs."
On 06.05.2022 16:03, Victor Stinner wrote:
> IMO using these C APIs for the unicode-escape and raw-unicode-escape encodings is
> overkill and we should do the opposite: deprecate most of the C API related to codecs
> and only keep the bare minimum, like the C API for the ASCII and UTF-8 encodings.
Your comment points to the general difference we have:
I designed the Unicode and codecs API to be a rich C API, where you
don't have to resort to abstract entry points to make use of the
functionality. That's why we have C APIs for most of the codec
features.
You want the opposite, namely create a Python C API which only exposes
such high level APIs and leaves internals inaccessible to C extension
writers, limiting what they can do and reducing flexibility and
performance.
We're not going to resolve this difference on this ticket and
it's unlikely to go away without the Python Steering Council deciding
on which approach is the one to strive for :-)
IMO, Python's success is mostly built on the fact that we do have
a rich C API, since this makes it possible to efficiently interface
from slow Python code to faster algorithms and storage mechanisms
implemented in C or other languages.
Moving to a higher level API will make this interfacing less efficient
and thus Python less attractive for people looking to use Python as
an easy-to-use interface to fast and efficient subsystems.
…--
Marc-Andre Lemburg
http://www.malemburg.com/
https://bugs.python.org/issue45472