PEP: 624 Title: Remove Py_UNICODE encoder APIs Author: Inada Naoki <songofacandy@gmail.com> Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 06-Jul-2020 Python-Version: 3.11 Post-History: 08-Jul-2020
Abstract
This PEP proposes to remove deprecated Py_UNICODE
encoder APIs in Python 3.11:
PyUnicode_Encode()
PyUnicode_EncodeASCII()
PyUnicode_EncodeLatin1()
PyUnicode_EncodeUTF7()
PyUnicode_EncodeUTF8()
PyUnicode_EncodeUTF16()
PyUnicode_EncodeUTF32()
PyUnicode_EncodeUnicodeEscape()
PyUnicode_EncodeRawUnicodeEscape()
PyUnicode_EncodeCharmap()
PyUnicode_TranslateCharmap()
PyUnicode_EncodeDecimal()
PyUnicode_TransformDecimalToASCII()
Note
PEP 623 propose to remove
Unicode object APIs relating to Py_UNICODE
. On the other hand, this PEP
is not relating to Unicode object. These PEPs are split because they have
different motivation and need different discussion.
Motivation
In general, reducing the number of APIs that have been deprecated for a long time and have few users is a good idea for not only it improves the maintainability of CPython, but it also helps API users and other Python implementations.
Rationale
Deprecated since Python 3.3
Py_UNICODE
and APIs using it have been deprecated since Python 3.3.
Inefficient
All of these APIs are implemented using PyUnicode_FromWideChar
.
So these APIs are inefficient when user want to encode Unicode
object.
Not used widely
When searching from top 4000 PyPI packages [1], only pyodbc use these APIs.
PyUnicode_EncodeUTF8()
PyUnicode_EncodeUTF16()
pyodbc uses these APIs to encode Unicode object into bytes object. So it is easy to fix it. [2]
Alternative APIs
There are alternative APIs to accept PyObject *unicode
instead of
Py_UNICODE *
. Users can migrate to them.
Deprecated API | Alternative APIs |
---|---|
PyUnicode_Encode() |
PyUnicode_AsEncodedString() |
PyUnicode_EncodeASCII() |
PyUnicode_AsASCIIString() (1) |
PyUnicode_EncodeLatin1() |
PyUnicode_AsLatin1String() (1) |
PyUnicode_EncodeUTF7() |
(2) |
PyUnicode_EncodeUTF8() |
PyUnicode_AsUTF8String() (1) |
PyUnicode_EncodeUTF16() |
PyUnicode_AsUTF16String() (3) |
PyUnicode_EncodeUTF32() |
PyUnicode_AsUTF32String() (3) |
PyUnicode_EncodeUnicodeEscape() |
PyUnicode_AsUnicodeEscapeString() |
PyUnicode_EncodeRawUnicodeEscape() |
PyUnicode_AsRawUnicodeEscapeString() |
PyUnicode_EncodeCharmap() |
PyUnicode_AsCharmapString() (1) |
PyUnicode_TranslateCharmap() |
PyUnicode_Translate() |
PyUnicode_EncodeDecimal() |
(4) |
PyUnicode_TransformDecimalToASCII() |
(4) |
Notes:
const char *errors
parameter is missing.- There is no public alternative API. But user can use generic
PyUnicode_AsEncodedString()
instead. const char *errors, int byteorder
parameters are missing.- There is no direct replacement. But
Py_UNICODE_TODECIMAL
can be used instead. CPython uses_PyUnicode_TransformDecimalAndSpaceToASCII
for converting from Unicode to numbers instead.
Plan
Python 3.9
Add Py_DEPRECATED(3.3)
to following APIs. This change is committed
already [3]. All other APIs have been marked Py_DEPRECATED(3.3)
already.
PyUnicode_EncodeDecimal()
PyUnicode_TransformDecimalToASCII()
.
Document all APIs as "will be removed in version 3.11".
Python 3.11
These APIs are removed.
PyUnicode_Encode()
PyUnicode_EncodeASCII()
PyUnicode_EncodeLatin1()
PyUnicode_EncodeUTF7()
PyUnicode_EncodeUTF8()
PyUnicode_EncodeUTF16()
PyUnicode_EncodeUTF32()
PyUnicode_EncodeUnicodeEscape()
PyUnicode_EncodeRawUnicodeEscape()
PyUnicode_EncodeCharmap()
PyUnicode_TranslateCharmap()
PyUnicode_EncodeDecimal()
PyUnicode_TransformDecimalToASCII()
Alternative ideas
Instead of just removing deprecated APIs, we may be able to use thier names with different signature.
Make some private APIs public
PyUnicode_EncodeUTF7()
doesn't have public alternative APIs.
Some APIs have alternative public APIs. But they are missing
const char *errors
or int byteorder
parameters.
We can rename some private APIs and make them public to cover missing APIs and parameters.
Rename to | Rename from |
---|---|
PyUnicode_EncodeASCII() |
_PyUnicode_AsASCIIString() |
PyUnicode_EncodeLatin1() |
_PyUnicode_AsLatin1String() |
PyUnicode_EncodeUTF7() |
_PyUnicode_EncodeUTF7() |
PyUnicode_EncodeUTF8() |
_PyUnicode_AsUTF8String() |
PyUnicode_EncodeUTF16() |
_PyUnicode_EncodeUTF16() |
PyUnicode_EncodeUTF32() |
_PyUnicode_EncodeUTF32() |
Pros:
- We have more consistent API set.
Cons:
- We have more public APIs to maintain.
- Existing public APIs are enough for most use cases, and
PyUnicode_AsEncodedString()
can be used in other cases.
Py_UNICODE*
with Py_UCS4*
Replace We can replace Py_UNICODE
(typedef of wchar_t
) with
Py_UCS4
. Since builtin codecs support UCS-4, we don't need to
convert Py_UCS4*
string to Unicode object.
Pros:
- We have more consistent API set.
- User can encode UCS-4 string in C without creating Unicode object.
Cons:
- We have more public APIs to maintain.
- Applications which uses UTF-8 or UTF-16 can not use these APIs anyway.
- Other Python implementations may not have builtin codec for UCS-4.
- If we change the Unicode internal representation to UTF-8, we need to keep UCS-4 support only for these APIs.
Py_UNICODE*
with wchar_t*
Replace We can replace Py_UNICODE
to wchar_t
.
Pros:
- We have more consistent API set.
- Backward compatible.
Cons:
- We have more public APIs to maintain.
- They are inefficient on platforms
wchar_t*
is UTF-16. It is because built-in codecs supports only UCS-1, UCS-2, and UCS-4 input.
Rejected ideas
Using runtime warning
These APIs doesn't release GIL for now. Emitting a warning from such APIs is not safe. See this example.
PyObject *u = PyList_GET_ITEM(list, i); // u is borrowed reference. PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u), PyUnicode_GET_SIZE(u), NULL); // Assumes u is still living reference. PyObject *t = PyTuple_Pack(2, u, b); Py_DECREF(b); return t;
If we emit Python warning from PyUnicode_EncodeUTF8()
, warning
filters and other threads may change the list
and u
can be
a dangling reference after PyUnicode_EncodeUTF8()
returned.
Additionally, since we are not changing behavior but removing C APIs,
runtime DeprecationWarning
might not helpful for Python
developers. We should warn to extension developers instead.
PyUnicode_Decode*
APIs too
Deprecate Not only remove PyUnicode_Encode*()
APIs, but also deprecate
following APIs too for symmetry and reducing number of APIs.
PyUnicode_DecodeASCII()
PyUnicode_DecodeLatin1()
PyUnicode_DecodeUTF7()
PyUnicode_DecodeUTF8()
PyUnicode_DecodeUTF16()
PyUnicode_DecodeUTF32()
PyUnicode_DecodeUnicodeEscape()
PyUnicode_DecodeRawUnicodeEscape()
PyUnicode_DecodeCharmap()
PyUnicode_DecodeMBCS()
This idea is excluded from this PEP because of several reasons:
- We can not remove them anytime soon because they are part of stable ABI.
PyUnicode_DecodeASCII()
andPyUnicode_DecodeUTF8()
are used very widely. Deprecating them is not worth enough.- Decoding from
const char*
is independent from Unicode representation.PyUnicode_Decode*()
APIs are useful for applications and extensions using UTF-8 or Python Unicode objects to store Unicode string. ButPyUnicode_Encode*()
APIs are not useful for them.- Python implementations using UTF-8 for Unicode internal
representation (e.g. PyPy and micropython) may not have encoder
with
wchar_t*
or UCS-4 input. But decoding fromchar*
is very natural for them too.
Discussions
References
[1] | Source package list chosen from top 4000 PyPI packages. (https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.txt) |
[2] | pyodbc -- Don't use PyUnicode_Encode API #792 (https://github.com/mkleehammer/pyodbc/pull/792) |
[3] | Uncomment Py_DEPRECATED for Py_UNICODE APIs (GH-21318) (https://github.com/python/cpython/commit/9c3840870814493fed62e140cfa43c2883e12181) |
Copyright
This document has been placed in the public domain.