gh-89635: Add PyUnicode_DecodeUnicodeEscapeStateful() and PyUnicode_DecodeRawUnicodeEscapeStateful() #28955

serhiy-storchaka · 2021-10-14T19:00:27Z

https://bugs.python.org/issue45472

Issue: Add public C API for partial "unicode-escape" and "raw-unicode-escape" decoding #89635


vstinner · 2021-10-14T19:26:15Z

Include/unicodeobject.h

@@ -611,6 +611,15 @@ PyAPI_FUNC(PyObject*) PyUnicode_DecodeUnicodeEscape(
    const char *errors          /* error handling */
    );

+#if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x030b0000


If you add something to the limited C API, you must update Misc/stable_abi.txt and then run make regen-limited-abi.

vstinner

I'm fine with adding a stateful variant to the Python C API, but I don't think it's worth adding these functions to the limited C API. There are already other ways to access the feature, starting with the str.encode() and bytes.decode() methods, which can be called in C with the generic CallMethod functions.

I would prefer to keep the limited C API as small as possible.

cc @encukou

serhiy-storchaka · 2021-10-15T08:43:11Z

It makes sense. I do not have a strong opinion about this. On the one hand, these codecs are the only multibyte codecs implemented in C which do not provide public stateful variants of the API. On the other hand, these codecs are a Python 2 legacy and not very useful as is.

malemburg · 2021-10-15T08:50:38Z

For consistency, please do add those APIs. The Unicode API was designed to be a rich API, making it easy to use from C extensions. The unicode escape codecs are not Python 2 legacy, they are still useful when it comes to encoding Unicode in an ASCII compatible format.

vstinner · 2021-10-15T09:24:22Z

Why do you want to add these 2 functions to the stable ABI? If someone cares about the C API, there are other existing C API functions to use this codec.

malemburg · 2021-10-15T09:33:15Z

These are new APIs, providing new useful functionality. The existing APIs cannot be used for stateful decoding. That's the story behind a rich C API - you expose all available functionality usable in C extensions, so that C extensions don't have to resort to slower high level APIs.

malemburg

Looks good. Thanks, Serhiy.

serhiy-storchaka · 2021-10-15T12:35:07Z

The only issue with the "unicode-escape" codec is that the decoder accepts non-ASCII bytes and decodes them as Latin1. This looks like a Python 2 legacy. Using UTF-8 or treating them as errors would be more useful in Python 3, but it is difficult to change now. The codec is used in the Python parser (where it requires pre-processing), but a str-to-str codec would be more convenient in that case.
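The Latin1 behaviour described above can be observed from Python. This is a small sketch with an illustrative input string; it only uses the standard bytes.decode() and str.encode() methods.

```python
# "é" encoded as UTF-8 is two non-ASCII bytes: 0xC3 0xA9.
utf8_bytes = "é".encode("utf-8")

# The unicode-escape decoder maps each non-ASCII byte to the code point
# with the same value, which is exactly what Latin-1 decoding does.
decoded = utf8_bytes.decode("unicode-escape")
as_latin1 = utf8_bytes.decode("latin-1")

print(repr(decoded))  # two characters, not the original "é"
```

So UTF-8 input that contains non-ASCII characters is silently decoded into mojibake rather than rejected.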

However, the "raw-unicode-escape" codec is more problematic. It is not usable at all without some pre- or post-processing (you need to at least escape/unescape the backslash to make it unambiguous). It is only used in the text pickle protocol 0, although the C implementation uses a modified implementation for the encoder. Maybe we should deprecate it.

I am thinking about adding more useful codecs.

malemburg · 2021-10-15T16:00:00Z

On 15.10.2021 14:35, Serhiy Storchaka wrote: The only issue with the "unicode-escape" codec is that the decoder accepts non-ASCII bytes and decodes them as Latin1. This looks like a Python 2 legacy. Using UTF-8 or treating them as errors would be more useful in Python 3, but it is difficult to change now. The codec is used in the Python parser (where it requires pre-processing), but a str-to-str codec would be more convenient in that case. However, the "raw-unicode-escape" codec is more problematic. It is not usable at all without some pre- or post-processing (you need to at least escape/unescape the backslash to make it unambiguous). It is only used in the text pickle protocol 0, although the C implementation uses a modified implementation for the encoder. Maybe we should deprecate it.

I'm not sure I understand what you mean by that last paragraph. The difference between the regular and raw variants is the same as for Python regular and raw string literals. Those two codecs were used in Python 2 to parse Unicode string literals (regular and raw literals). In Python 3, only the regular codec is still used for parsing literals. The raw codec is used in pickle.c for Python 2 and 3. In Python 3, pickle uses a variant which also escapes backslashes and newlines when encoding Unicode strings.
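As a rough illustration of the difference between the two codecs, here is a short Python sketch with made-up input. The regular codec interprets all backslash escapes, while the raw codec only interprets \uXXXX and \UXXXXXXXX.

```python
data = b"line\\n\\u0041"  # backslash-n followed by backslash-u0041

# unicode-escape handles \n as well as \u0041.
regular = data.decode("unicode-escape")

# raw-unicode-escape leaves \n as two characters and only decodes \u0041.
raw = data.decode("raw-unicode-escape")

# On the encoding side, the raw codec passes a real newline through
# unchanged, while the regular codec escapes it.
raw_encoded = "a\nb".encode("raw-unicode-escape")
regular_encoded = "a\nb".encode("unicode-escape")

print(repr(regular), repr(raw), raw_encoded, regular_encoded)
```

The unescaped newline on the encoding side is one reason a line-oriented format needs extra escaping on top of the raw codec.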

…

-- Marc-Andre Lemburg http://www.malemburg.com/

serhiy-storchaka · 2021-10-15T17:03:34Z

The difference between the regular and raw variants is the same as for Python regular and raw string literals.

It was so in Python 2. In Python 2, the escape sequences \u and \U were active in raw Unicode literals. In Python 3 there is no such relation. Also, both the "unicode-escape" and "raw-unicode-escape" decoders use Latin1 for decoding non-ASCII bytes. Latin1 is no longer the default encoding of Python sources.
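The missing relation in Python 3 can be checked directly. The sketch below compares a Python 3 raw string literal with the output of the "raw-unicode-escape" decoder for the same source characters; the literal is illustrative.

```python
# In Python 3, a raw literal keeps \u0041 as six separate characters.
literal = r"\u0041"

# The raw-unicode-escape decoder still turns the same bytes into "A".
via_codec = literal.encode("ascii").decode("raw-unicode-escape")

print(len(literal), repr(via_codec))
```

The codec therefore no longer mirrors how the Python 3 parser treats raw string literals.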

To make the "raw-unicode-escape" encoding reversible, you first need to replace backslashes:

encoded = text.replace('\\', r'\u005c').encode('raw-unicode-escape')
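A quick round-trip check of the one-liner above, as a sketch. The helper name encode_reversible and the sample text are hypothetical, introduced only for this illustration.

```python
def encode_reversible(text: str) -> bytes:
    """Hypothetical helper: escape backslashes before raw-unicode-escape."""
    return text.replace('\\', r'\u005c').encode('raw-unicode-escape')


# Sample text containing a literal backslash followed by "u0041".
text = "C:\\u0041 \u20ac"

# Without the replacement, the literal backslash-u0041 comes back as "A".
naive = text.encode('raw-unicode-escape').decode('raw-unicode-escape')

# With the replacement, decoding restores the original text.
safe = encode_reversible(text).decode('raw-unicode-escape')

print(repr(naive), repr(safe))
```

No matching post-processing is needed on the decoding side, because the decoder turns \u005c back into a backslash and does not rescan its output.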

github-actions · 2021-11-15T00:05:57Z

This PR is stale because it has been open for 30 days with no activity.

vstinner · 2022-05-06T12:19:32Z

Hum, the PR diff is hard to read. It contains unrelated changes likely caused by merges. Can you try to rebase the PR instead?

serhiy-storchaka · 2022-05-06T13:12:01Z

@pablogsal, could you please merge this PR? It adds a new C API, so it would be simpler to do this before beta1.

vstinner · 2022-05-06T13:17:22Z

@pablogsal, could you please merge this PR? It adds a new C API, so it would be simpler to do this before beta1.

I don't see why adding public C API is required to fix Python codecs. Can't you add them to the internal C API and use them in Modules/_codecsmodule.c?

serhiy-storchaka · 2022-05-06T13:37:19Z

They are already in the internal C API. I want them to be available for users. It fixes a flaw in the Unicode C API.

malemburg · 2022-05-06T13:40:11Z

On 06.05.2022 15:17, Victor Stinner wrote: @pablogsal <https://github.com/pablogsal>, could you please merge this PR? It adds a new C API, so it would be simpler to do this before beta1. I don't see why adding public C API is *required* to fix Python codecs. Can't you add them to the internal C API and use them in Modules/_codecsmodule.c?

As we've discussed before, these APIs expose the missing stateful parts of the raw-unicode-escape codec as public C APIs. This is not a fix to codecs. The APIs provide a new feature to make the codec usable as an incremental codec from C.

…

-- Marc-Andre Lemburg http://www.malemburg.com/

vstinner · 2022-05-06T14:03:17Z

Which project requires a public C API for these codecs? Do you have project names? You can already use an incremental decoder or encoder in C: just use the codecs API in C.

IMO using these C APIs for the unicode-escape and raw-unicode-escape encodings is overkill and we should do the opposite: deprecate most C APIs related to codecs and only keep the bare minimum, like the C API for the ASCII and UTF-8 encodings. I'm not convinced that a C API is required for performance. I don't see any benchmark in this PR to justify adding more functions to the C API.

@serhiy-storchaka wrote:

They are necessary for correct implementation of incremental decoders and stream readers (see bpo-45461 and bpo-45467).

I don't understand that. @malemburg wrote "This is not a fix to codecs."

malemburg · 2022-05-06T14:17:36Z

On 06.05.2022 16:03, Victor Stinner wrote: IMO using these C APIs for the unicode-escape and raw-unicode-escape encodings is overkill and we should do the opposite: deprecate most C APIs related to codecs and only keep the bare minimum, like the C API for the ASCII and UTF-8 encodings.

Your comment points to the general difference we have: I designed the Unicode and codecs API to be a rich C API, where you don't have to resort to abstract entry points to make use of the functionality. That's why we have C APIs for most of the codec features. You want the opposite, namely to create a Python C API which only exposes such high-level APIs and leaves internals inaccessible to C extension writers, limiting what they can do and reducing flexibility and performance.

We're not going to resolve this difference on this ticket, and it's unlikely to go away without the Python Steering Council deciding which approach is the one to strive for :-)

IMO, Python's success is mostly built on the fact that we do have a rich C API, since this makes it possible to efficiently interface from slow Python code to faster algorithms and storage mechanisms implemented in C or other languages. Moving to a higher-level API will make this interfacing less efficient and thus make Python less attractive for people looking to use Python as an easy-to-use interface to fast and efficient subsystems.

…

-- Marc-Andre Lemburg http://www.malemburg.com/

bpo-45472: Add PyUnicode_DecodeUnicodeEscapeStateful() and PyUnicode_DecodeRawUnicodeEscapeStateful()

27d494f

serhiy-storchaka added the type-feature label Oct 14, 2021

serhiy-storchaka requested review from vstinner and malemburg October 14, 2021 19:00

bedevere-bot added the awaiting core review label Oct 14, 2021

the-knights-who-say-ni added the CLA signed label Oct 14, 2021

vstinner reviewed Oct 14, 2021


Add them to the stable ABI.

8305604

serhiy-storchaka requested a review from a team as a code owner October 15, 2021 08:38

malemburg approved these changes Oct 15, 2021


bedevere-bot added awaiting merge and removed awaiting core review labels Oct 15, 2021

Merge branch 'main' into stateful-decoder-unicode-escape

8677438

github-actions bot added the stale label Nov 15, 2021

Merge branch 'main' into stateful-decoder-unicode-escape

5605077

serhiy-storchaka added 3 commits May 6, 2022 15:32

Merge branch 'main' into stateful-decoder-unicode-escape

7c97338

Fix test_stable_abi_ctypes.

591e1a3

Merge branch 'main' into stateful-decoder-unicode-escape

23352ac

serhiy-storchaka mentioned this pull request Apr 10, 2022

Add public C API for partial "unicode-escape" and "raw-unicode-escape" decoding #89635

Open

ezio-melotti removed the CLA signed label Jul 13, 2022

erlend-aasland changed the title ~~bpo-45472: Add PyUnicode_DecodeUnicodeEscapeStateful() and PyUnicode_DecodeRawUnicodeEscapeStateful()~~ gh-89635: Add PyUnicode_DecodeUnicodeEscapeStateful() and PyUnicode_DecodeRawUnicodeEscapeStateful() Jan 5, 2024
