bpo-39287: Doc: Add UTF-8 mode section in using/windows. #17935
.. versionadded:: 3.7

Windows doesn't use UTF-8 for the system encoding (the ANSI Code Page).
eryksun
Jan 10, 2020
Contributor
Windows 10 supports setting the system locale's ANSI and OEM codepages to UTF-8 (65001), but it's not enabled by default.
There are still problems to be resolved with using UTF-8 at the system level. In particular, the console host (conhost.exe) doesn't support using UTF-8 as the input codepage for use with ReadFile and ReadConsoleA. It encodes the UTF-16 input buffer with an internal WideCharToMultiByte call that assumes one byte per encoded character (at least in a Western locale, for which a single-byte encoding is assumed). This fails for non-ASCII characters, which in turn end up as null bytes in the result of a ReadFile or ReadConsoleA call. Python is immune to this problem for the most part. The I/O stack detects a console file and uses wide-character ReadConsoleW instead, via io._WindowsConsoleIO. The problem affects low-level os.write and os.read, however, because they're not integrated with _WindowsConsoleIO.
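As a rough illustration of the console detection described above, the following sketch inspects the raw layer beneath a text stream. The helper name `describe_stream` is hypothetical, and `io._WindowsConsoleIO` is a private class that exists only on Windows, so the code falls back gracefully elsewhere.

```python
import io
import sys


def describe_stream(stream):
    """Return a short description of the raw layer beneath a text stream."""
    # A text stream normally wraps a buffered layer, which wraps a raw layer.
    buffered = getattr(stream, "buffer", None)
    raw = getattr(buffered, "raw", None)
    # io._WindowsConsoleIO is private and only present on Windows builds.
    console_cls = getattr(io, "_WindowsConsoleIO", None)
    if console_cls is not None and isinstance(raw, console_cls):
        return "console (wide-character API)"
    return "not a Windows console: " + type(raw).__name__


if __name__ == "__main__":
    print(describe_stream(sys.stdin))
```

When stdin is redirected from a file or pipe, or when the script runs on another platform, the helper reports the ordinary raw type instead.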
methane
Jan 17, 2020
Author
Member
Windows 10 supports setting the system locale's ANSI and OEM codepages to UTF-8 (65001), but it's not enabled by default.
It is still in beta and breaks many applications. I don't think we should refer to it here until it is out of beta.
There are still problems to be resolved with using UTF-8 at the system level. In particular, the console host (conhost.exe) doesn't support using UTF-8 as the input codepage for use with ReadFile and ReadConsoleA. It encodes the UTF-16 input buffer with an internal WideCharToMultiByte call that assumes one byte per encoded character (at least in a Western locale, for which a single-byte encoding is assumed). This fails for non-ASCII characters, which in turn end up as null bytes in the result of a ReadFile or ReadConsoleA call. Python is immune to this problem for the most part. The I/O stack detects a console file and uses wide-character ReadConsoleW instead, via io._WindowsConsoleIO.
Yes, that's why I don't mention the stdio encoding here.
The problem affects low-level os.write and os.read, however, because they're not integrated with _WindowsConsoleIO.
os.write and os.read use bytes, not Unicode, so UTF-8 mode doesn't affect them.
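To make the bytes-only point concrete, here is a minimal sketch that sends text through an OS pipe with os.write and os.read. The helper name `roundtrip_through_pipe` is hypothetical; the caller picks the encoding explicitly, which is why the interpreter's text-encoding defaults play no part.

```python
import os


def roundtrip_through_pipe(text, encoding="utf-8"):
    """Encode text, push it through a pipe with os.write/os.read, decode it."""
    read_fd, write_fd = os.pipe()
    try:
        data = text.encode(encoding)      # os.write only accepts bytes-like objects
        os.write(write_fd, data)
        received = os.read(read_fd, len(data))
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return received.decode(encoding)


if __name__ == "__main__":
    print(roundtrip_through_pipe("héllo"))
```

Passing a str directly to os.write raises TypeError, because the function never performs any encoding itself.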
eryksun
Jan 17, 2020
Contributor
os.write and os.read use bytes, not Unicode, so UTF-8 mode doesn't affect them.
What I meant was the problem I discussed above with calling ReadFile (e.g. via os.read -> read -> ReadFile) on a console input file when the input codepage is set to UTF-8. The latter could be set as the default by changing the system OEM codepage to UTF-8, by setting the "CodePage" value in a subkey of "HKCU\Console" (subkey named for the initial window title), or by manually calling SetConsoleCP(65001), such as via chcp.com 65001. As I discussed above, in this case input read from the console is limited to 7-bit ASCII, and all ordinals above 127 are replaced by null bytes.
Setting the output codepage to UTF-8 (e.g. SetConsoleOutputCP(65001)) works reasonably well in Windows 8+, but it still has a problem with C buffered FILE streams. The console doesn't expect multibyte data to span multiple writes (with the exception of legacy DBCS codepages), so a code sequence that gets split across two writes becomes two replacement characters (U+FFFD) in the screen buffer. Still, this is a huge improvement over Windows 7 and older, which mistakenly return the number of UTF-16 code units written (1-2) instead of the number of UTF-8 bytes (1-4). That confuses buffered writers, and leads to multiple writes of garbage on the screen. They fixed that years ago when Windows 8 switched to using the ConDrv device for console I/O instead of an LPC port.
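The split-sequence problem can be modelled in pure Python. This is only an analogy for the behaviour described above, not the console's actual code: decoding each write independently turns a split two-byte sequence into two replacement characters, while a stateful incremental decoder handles it. All helper names are hypothetical.

```python
import codecs


def decode_each_write(chunks):
    """Decode every chunk on its own, as a consumer with no cross-write state would."""
    return "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)


def decode_as_stream(chunks):
    """Decode with an incremental decoder that remembers partial sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    text = "".join(decoder.decode(chunk) for chunk in chunks)
    return text + decoder.decode(b"", final=True)


def unit_counts(text):
    """Return (UTF-16 code units, UTF-8 bytes) needed to encode text."""
    return len(text.encode("utf-16-le")) // 2, len(text.encode("utf-8"))


if __name__ == "__main__":
    writes = [b"caf\xc3", b"\xa9"]  # "café" with the final character split in two
    print(ascii(decode_each_write(writes)))
    print(ascii(decode_as_stream(writes)))
    print(unit_counts("\U0001F600"))
```

The `unit_counts` helper shows why a writer that expects a byte count is confused when it is handed a count of UTF-16 code units: the two numbers differ for every non-ASCII character.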
eryksun
Jan 17, 2020
Contributor
In the first post I said it's also possible to enable UTF-8 at the system level, which concerns the direction I see Windows headed toward in the relatively near future (4-6 years -- a couple years after Windows 8.1 is retired). Then in the following paragraph I provided an example of the current problems with it. My followup was just to clarify the relevance of the example regarding os.read and os.write, in case there was some misunderstanding. It wasn't specifically about UTF-8 mode.
Back on topic, if legacy console I/O is enabled via PYTHONLEGACYWINDOWSSTDIO, currently UTF-8 mode leads to mojibake. It disregards the console codepage to use UTF-8. I think this is nearly right, but it should force using _WindowsConsoleIO instead of nonsensically using UTF-8 with console files when the console codepage is not UTF-8.
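The mojibake mentioned here comes from an encoding mismatch: bytes written as UTF-8 are interpreted in a different console codepage. The following sketch reproduces that mismatch in isolation; the helper name `show_mojibake` is hypothetical, and cp437 is only an assumed example of a non-UTF-8 console codepage.

```python
def show_mojibake(text, console_encoding="cp437"):
    """Encode text as UTF-8, then decode the bytes as a console codepage would."""
    return text.encode("utf-8").decode(console_encoding, errors="replace")


if __name__ == "__main__":
    original = "é"
    garbled = show_mojibake(original)
    # One character becomes two unrelated characters on screen.
    print(ascii(original), "->", ascii(garbled))
```

Pure ASCII text survives the mismatch unchanged, which is why the problem often goes unnoticed until non-ASCII output appears.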
Also, though not directly related, legacy stdio mode is broken in general because the new configuration code doesn't use the result from _Py_device_encoding like it's supposed to, but instead always uses the preferred encoding (UTF-8 or ANSI). If UTF-8 mode is not enabled, it should not force console files to use ANSI. All console files in that case (not just fds 0-2) should use the console codepage. So the configuration code needs to be fixed, and we also need to fix _Py_device_encoding / os.device_encoding, which were always broken in Windows.
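To compare the encodings discussed in this comment, a small report like the one below can be run under different settings. The helper name `encoding_report` is hypothetical; it relies only on `sys.flags.utf8_mode`, `locale.getpreferredencoding` and `os.device_encoding`, and the last of these returns None when the descriptor is not connected to a terminal.

```python
import locale
import os
import sys


def encoding_report(fd):
    """Collect the encodings that are relevant for a given file descriptor."""
    return {
        "utf8_mode": bool(sys.flags.utf8_mode),            # set by -X utf8 or PYTHONUTF8=1
        "preferred": locale.getpreferredencoding(False),   # UTF-8 or the ANSI codepage
        "device": os.device_encoding(fd),                  # None unless fd is a terminal
    }


if __name__ == "__main__":
    for name, fd in (("stdin", 0), ("stdout", 1)):
        try:
            print(name, encoding_report(fd))
        except OSError:
            print(name, "descriptor not available")
```

Running the script once normally and once with `-X utf8` shows how the preferred encoding changes while the device encoding is reported separately.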
@aeros Would you review this, please?
Page). Python uses it for the default encoding of text files (e.g.
:func:`locale.getpreferredencoding`).

It may cause trouble because the UTF-8 is widely used on the internet
eryksun
Jan 24, 2020
Contributor
It would read better as "This may" instead of "It may". Using "this" is more clearly referring to the previous statement.
Co-Authored-By: Kyle Stanley <aeros167@gmail.com>
As far as grammar goes, LGTM. I'll leave the remaining "legacy encoding" vs "system codepage" discussion to @methane and @eryksun.
Instead of "legacy encoding" and "legacy system encoding", it can just use "system encoding", as is used in the first paragraph. The only place I'd retain "legacy" is in the first sentence of the intro, "Windows still uses legacy encodings for the system encoding". That's referring to pre-Unicode encodings such as codepages 850 and 1252.
miss-islington
commented
Jan 28, 2020
Thanks @methane for the PR
bedevere-bot
commented
Jan 28, 2020
GH-18235 is a backport of this pull request to the 3.8 branch.
bedevere-bot
commented
Jan 28, 2020
GH-18236 is a backport of this pull request to the 3.7 branch.
methane commented Jan 10, 2020 (edited by bedevere-bot)
https://bugs.python.org/issue39287