Skip to content

gh-89635: Add PyUnicode_DecodeUnicodeEscapeStateful() and PyUnicode_DecodeRawUnicodeEscapeStateful() #28955

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Oct 14, 2021

@@ -611,6 +611,15 @@ PyAPI_FUNC(PyObject*) PyUnicode_DecodeUnicodeEscape(
const char *errors /* error handling */
);

#if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x030b0000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you add something to the limited C AP, you must update Misc/stable_abi.txt and then run make regen-limited-abi.

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with adding a stateful variant to the Python C API, but I don't think that it's worth it to add a C API to the limited C API for that. There are already other functions to access the feature, start from the str.encode() and bytes.decode() methods, which can be called in C with generic CallMethod functions.

I would prefer to keep the limited C API as small as possible.

cc @encukou

@serhiy-storchaka serhiy-storchaka requested a review from a team as a code owner October 15, 2021 08:38
@serhiy-storchaka
Copy link
Member Author

It makes sense. I do not have strong opinion about this. On one hand, these codecs are the only multibyte codecs implemented in C which do not provide public stateful variants of API. On other hand, these codecs are Python 2 legacy and not very useful as is.

@malemburg
Copy link
Member

For consistency, please do add those APIs. The Unicode API was designed to be a rich API, making it easy to use from C extensions. The unicode escape codecs are not Python 2 legacy, they are still useful when it comes to encoding Unicode in an ASCII compatible format.

@vstinner
Copy link
Member

Why do you want to add these 2 functions to the stable ABI? If someone cares about the C API, there are other existing C API functions to use this codec.

@malemburg
Copy link
Member

These are new APIs, providing new useful functionality. The existing APIs cannot be used for stateful decoding. That's the story behind a rich C API - you expose all available functionality usable in C extensions, so that C extensions don't have to resort to slower high level APIs.

Copy link
Member

@malemburg malemburg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks, Serhiy.

@serhiy-storchaka
Copy link
Member Author

The only issue with the "unicode-escape" codec is that the decoder accepts non-ASCII bytes and decodes them with Latin1. It looks like Python 2 legacy. Using UTF-8 or treating them as error would be more useful in Python 3, but it is difficult to change now. It is used in the Python parser (it requires pre-processing), but an str-to-str codec would be more convenient in that case.

However the "raw-unicode-escape" codec is more problematic. It is not usable without some pre- or post- processing at all (you need at least escape/unescape the backslash to make it non-ambiguous). It is only used in the text pickle protocol 0, although the C implementation uses modified implementation for encoder. Maybe we should deprecate it.

I think about adding more useful codecs.

@malemburg
Copy link
Member

malemburg commented Oct 15, 2021 via email

@serhiy-storchaka
Copy link
Member Author

The difference between the regular and raw variants is the same as for Python regular and raw string literals.

It was so in Python 2. In Python 2 escape sequences \u and \U were active. In Python 3 there is no such relation. Also, both "unicode-escape" and "raw-unicode-escape" decoders use Latin1 for decoding non-ASCII bytes. Latin1 no longer the default encoding of Python sources.

To make the "raw-unicode-escape" encoding reversible you need first replace backslashes:

encoded = text.replace('\\', r'\u005c').encode('raw-unicode-escape')

@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Nov 15, 2021
@vstinner
Copy link
Member

vstinner commented May 6, 2022

Hum, the PR diff is hard to hard. It contains unrelated changes likely caused by merges. Can you try to rebase the PR instead?

@serhiy-storchaka
Copy link
Member Author

@pablogsal, could you please merge this PR? It add a new C API, so it would be simpler to do this before beta1.

@vstinner
Copy link
Member

vstinner commented May 6, 2022

@pablogsal, could you please merge this PR? It add a new C API, so it would be simpler to do this before beta1.

I don't see why adding public C API is required to fix Python codecs. Can't you add them to the internal C API and use them in Modules/_codecsmodule.c?

@serhiy-storchaka
Copy link
Member Author

They are already in the internal C API. I want them to be available for users. It fixes a flaw in the Unicode C API.

@malemburg
Copy link
Member

malemburg commented May 6, 2022 via email

@vstinner
Copy link
Member

vstinner commented May 6, 2022

Which project requires a public C API for these codecs? Do you have project names? You can already use an incremental decoder or encoding in C: just use the codecs API in C.

IMO using these C API for unicode-escape and unicode-escape-raw encodings is overkill and we should do the opposite: deprecate most C API related to codecs to only keep the bare minimum like C API for ASCII and UTF-8 encodings. I'm not convinced that a C API is required for performance. I don't see any benchmark in this PR to justify adding more functions to the C API.

@serhiy-storchaka wrote:

They are neccessary for correct implementation of increment decoders and stream readers (see bpo-45461 and bpo-45467).

I don't understand that. @malemburg wrote "This is not a fix to codecs."

@malemburg
Copy link
Member

malemburg commented May 6, 2022 via email

@erlend-aasland erlend-aasland changed the title bpo-45472: Add PyUnicode_DecodeUnicodeEscapeStateful() and PyUnicode_DecodeRawUnicodeEscapeStateful() gh-89635: Add PyUnicode_DecodeUnicodeEscapeStateful() and PyUnicode_DecodeRawUnicodeEscapeStateful() Jan 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting merge stale Stale PR or inactive for long period of time. type-feature A feature request or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants