bpo-39287: Doc: Add UTF-8 mode section in using/windows. #17935

Merged
merged 9 commits into from Jan 28, 2020

@methane
Member

methane commented Jan 10, 2020

@methane methane force-pushed the methane:win-utf8mode branch from 6cb32d7 to 812c0be Jan 10, 2020

.. versionadded:: 3.7

Windows doesn't use UTF-8 for the system encoding (the ANSI Code Page).

@eryksun

eryksun Jan 10, 2020

Contributor

Windows 10 supports setting the system locale's ANSI and OEM codepages to UTF-8 (65001), but it's not enabled by default.

There are still problems to be resolved with using UTF-8 at the system level. In particular, the console host (conhost.exe) doesn't support using UTF-8 as the input codepage for use with ReadFile and ReadConsoleA. It encodes the UTF-16 input buffer with an internal WideCharToMultiByte call that assumes one byte per encoded character (at least in a Western locale, for which a single-byte encoding is assumed). This fails for non-ASCII characters, which in turn end up as null bytes in the result of a ReadFile or ReadConsoleA call. Python is immune to this problem for the most part. The I/O stack detects a console file and uses wide-character ReadConsoleW instead, via io._WindowsConsoleIO. The problem affects low-level os.write and os.read, however, because they're not integrated with _WindowsConsoleIO.
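To see which of these settings apply to a given interpreter, something like the following sketch may help. The helper name `describe_text_io` is hypothetical and not part of the PR; it only reads `sys.flags.utf8_mode`, `locale.getpreferredencoding` and the attributes of `sys.stdin`. On a Windows console the raw stdin object is expected to be `_WindowsConsoleIO`, though that depends on how the process was started.

```python
import locale
import sys


def describe_text_io():
    """Collect the encoding-related settings discussed above (illustrative helper)."""
    stdin = sys.stdin
    # sys.stdin may be replaced or None (e.g. under pythonw.exe), so probe defensively.
    raw = getattr(getattr(stdin, "buffer", None), "raw", None)
    return {
        # 1 when UTF-8 mode is on (-X utf8 or PYTHONUTF8=1); available since 3.7.
        "utf8_mode": sys.flags.utf8_mode,
        # Default encoding used by open() when no encoding argument is given.
        "preferred_encoding": locale.getpreferredencoding(False),
        "stdin_encoding": getattr(stdin, "encoding", None),
        # On a Windows console this is expected to be "_WindowsConsoleIO".
        "stdin_raw_type": type(raw).__name__ if raw is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_text_io().items():
        print(f"{key}: {value}")
```

Running it once normally and once with `-X utf8` should show `utf8_mode` and `preferred_encoding` change while the raw console type stays the same.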

@methane

methane Jan 17, 2020

Author Member

> Windows 10 supports setting the system locale's ANSI and OEM codepages to UTF-8 (65001), but it's not enabled by default.

It is still in beta and breaks many applications. I don't think we should refer to it here until it is out of beta.

> There are still problems to be resolved with using UTF-8 at the system level. In particular, the console host (conhost.exe) doesn't support using UTF-8 as the input codepage for use with ReadFile and ReadConsoleA. It encodes the UTF-16 input buffer with an internal WideCharToMultiByte call that assumes one byte per encoded character (at least in a Western locale, for which a single-byte encoding is assumed). This fails for non-ASCII characters, which in turn end up as null bytes in the result of a ReadFile or ReadConsoleA call. Python is immune to this problem for the most part. The I/O stack detects a console file and uses wide-character ReadConsoleW instead, via io._WindowsConsoleIO.

Yes, that's why I don't mention the stdio encoding here.

> The problem affects low-level os.write and os.read, however, because they're not integrated with _WindowsConsoleIO.

os.write and os.read use bytes, not Unicode, so UTF-8 mode doesn't affect them.

@eryksun

eryksun Jan 17, 2020

Contributor

> os.write and os.read uses bytes, not unicode. So UTF-mode doesn't affect to it.

What I meant was the problem I discussed above with calling ReadFile (e.g. via os.read -> read -> ReadFile) on a console input file when the input codepage is set to UTF-8. The latter could be set as the default from changing the system OEM codepage to UTF-8, or by setting the "CodePage" value in a subkey of "HKCU\Console" (subkey named for the initial window title), or by manually calling SetConsoleCP(65001), such as via chcp.com 65001. As I discussed above, in this case input read from the console is limited to 7-bit ASCII, and all ordinals above 127 are replaced by null bytes.

Setting the output codepage to UTF-8 (e.g. SetConsoleOutputCP(65001)) works reasonably well in Windows 8+, but it still has a problem with C buffered FILE streams. The console doesn't expect multibyte data to span multiple writes (with the exception of legacy DBCS codepages), so a code sequence that gets split across two writes becomes two replacement characters (U+FFFD) in the screen buffer. Still, this is a huge improvement over Windows 7 and older, which mistakenly returns the number of UTF-16 code units written (1-2) instead of the number of UTF-8 bytes (1-4). That confuses buffered writers, and leads to multiple writes of garbage on the screen. They fixed that years ago when Windows 8 switched to using the ConDrv device for console I/O instead of an LPC port.
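The split-write problem can be imitated in pure Python. This is only a sketch of the effect, not the console host's actual code: it assumes a consumer that decodes each write independently, and the two helper names are made up for illustration. An incremental decoder, which carries partial sequences over to the next write, is shown for contrast.

```python
import codecs


def decode_each_write(chunks):
    """Decode every write on its own, keeping no state between writes."""
    return "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)


def decode_as_stream(chunks):
    """Decode with an incremental decoder that buffers partial sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    text = "".join(decoder.decode(chunk) for chunk in chunks)
    return text + decoder.decode(b"", final=True)


data = "h\u00e9llo".encode("utf-8")  # b'h\xc3\xa9llo'; the e-acute takes two bytes
chunks = [data[:2], data[2:]]        # the split falls inside the two-byte sequence

print(ascii(decode_each_write(chunks)))  # 'h\ufffd\ufffdllo'
print(ascii(decode_as_stream(chunks)))   # 'h\xe9llo'
```

Each half of the split sequence is invalid on its own, so the stateless decoder emits one U+FFFD per half, which matches the two replacement characters described above.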

@methane

methane Jan 17, 2020

Author Member

How does it affect UTF-8 mode?

@eryksun

eryksun Jan 17, 2020

Contributor

In the first post I said it's also possible to enable UTF-8 at the system level, which concerns the direction I see Windows headed toward in the relatively near future (4-6 years -- a couple years after Windows 8.1 is retired). Then in the following paragraph I provided an example of the current problems with it. My followup was just to clarify the relevance of the example regarding os.read and os.write, in case there was some misunderstanding. It wasn't specifically about UTF-8 mode.

Back on topic, if legacy console I/O is enabled via PYTHONLEGACYWINDOWSSTDIO, currently UTF-8 mode leads to mojibake. It disregards the console codepage to use UTF-8. I think this is nearly right, but it should force using _WindowsConsoleIO instead of nonsensically using UTF-8 with console files when the console codepage is not UTF-8.
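The kind of mojibake meant here can be reproduced without a console. The sketch below assumes the output direction: Python writes UTF-8 bytes and the reader interprets them with some other code page. The helper `show_mojibake` is hypothetical, and cp1252 is used only because its mapping is easy to check; a real console would more likely use an OEM code page such as 437 or 850.

```python
def show_mojibake(text, wrong_codepage="cp1252"):
    """Encode text as UTF-8, then decode the bytes with a different code page."""
    return text.encode("utf-8").decode(wrong_codepage, errors="replace")


# The two UTF-8 bytes of e-acute (0xC3 0xA9) are read as two separate characters.
print(ascii(show_mojibake("caf\u00e9")))  # 'caf\xc3\xa9'
```

ASCII text survives the round trip unchanged, which is why the problem often goes unnoticed until non-ASCII output appears.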

Also -- not directly related -- but legacy stdio mode is broken in general because the new configuration code doesn't use the result from _Py_device_encoding like it's supposed to, but instead always uses the preferred encoding (UTF-8 or ANSI). If UTF-8 mode is not enabled, it should not force console files to use ANSI. All console files in that case (not just fds 0-2) should use the console codepage. So the configuration code needs to be fixed, and we also need to fix _Py_device_encoding / os.device_encoding, which were always broken in Windows.
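To compare the two encodings mentioned here on a given machine, a small sketch along these lines may be useful. The function name `console_vs_preferred` is made up for illustration. `os.device_encoding` returns `None` when the descriptor is not attached to a terminal, and given the issues described above, its result on Windows should be treated with caution.

```python
import locale
import os


def console_vs_preferred(fds=(0, 1, 2)):
    """Map each fd to (device encoding, preferred encoding) for comparison."""
    preferred = locale.getpreferredencoding(False)
    report = {}
    for fd in fds:
        try:
            # None when fd is not connected to a terminal or console.
            device = os.device_encoding(fd)
        except OSError:
            device = None  # treat an unusable descriptor like a non-terminal
        report[fd] = (device, preferred)
    return report


if __name__ == "__main__":
    for fd, (device, preferred) in console_vs_preferred().items():
        print(f"fd {fd}: device={device!r} preferred={preferred!r}")
```

When the standard streams are redirected to files or pipes, the device encoding is `None` and only the preferred encoding applies.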

methane added 2 commits Jan 17, 2020
@methane methane requested a review from python/windows-team Jan 22, 2020
methane added 2 commits Jan 24, 2020
@methane methane requested a review from aeros Jan 24, 2020
@methane

Member Author

methane commented Jan 24, 2020

@aeros Would you review this, please?

Page). Python uses it for the default encoding of text files (e.g.
:func:`locale.getpreferredencoding`).

It may cause trouble because the UTF-8 is widely used on the internet

@eryksun

eryksun Jan 24, 2020

Contributor

It would read better as "This may" instead of "It may". Using "this" is more clearly referring to the previous statement.

Co-Authored-By: Kyle Stanley <aeros167@gmail.com>
@aeros
aeros approved these changes Jan 27, 2020
Member

aeros left a comment

As far as grammar goes, LGTM. I'll leave the remaining "legacy encoding" vs "system codepage" discussion to @methane and @eryksun.

@eryksun

Contributor

eryksun commented Jan 28, 2020

> As far as grammar goes, LGTM. I'll leave the remaining "legacy encoding" vs "system codepage" discussion to @methane and @eryksun.

Instead of "legacy encoding" and "legacy system encoding", it can just use "system encoding", as is used in the first paragraph. The only place I'd retain "legacy" is in the first sentence of the intro, "Windows still uses legacy encodings for the system encoding". That's referring to pre-Unicode encodings such as codepages 850 and 1252.

@methane methane merged commit 148610d into python:master Jan 28, 2020
5 checks passed
Docs
Azure Pipelines PR #20200128.16 succeeded
bedevere/issue-number: Issue number 39287 found
bedevere/news: "skip news" label found
continuous-integration/travis-ci/pr: The Travis CI build passed
@methane methane deleted the methane:win-utf8mode branch Jan 28, 2020
@miss-islington


miss-islington commented Jan 28, 2020

Thanks @methane for the PR 🌮🎉.. I'm working now to backport this PR to: 3.7, 3.8.
🐍🍒🤖

miss-islington added a commit to miss-islington/cpython that referenced this pull request Jan 28, 2020
)

Co-Authored-By: Kyle Stanley <aeros167@gmail.com>
(cherry picked from commit 148610d)

Co-authored-by: Inada Naoki <songofacandy@gmail.com>
@bedevere-bot


bedevere-bot commented Jan 28, 2020

GH-18235 is a backport of this pull request to the 3.8 branch.

miss-islington added a commit to miss-islington/cpython that referenced this pull request Jan 28, 2020
)

Co-Authored-By: Kyle Stanley <aeros167@gmail.com>
(cherry picked from commit 148610d)

Co-authored-by: Inada Naoki <songofacandy@gmail.com>
@bedevere-bot


bedevere-bot commented Jan 28, 2020

GH-18236 is a backport of this pull request to the 3.7 branch.

@methane

Member Author

methane commented Jan 28, 2020

@aeros @eryksun Thank you for your reviews.

miss-islington added a commit that referenced this pull request Jan 28, 2020
Co-Authored-By: Kyle Stanley <aeros167@gmail.com>
(cherry picked from commit 148610d)

Co-authored-by: Inada Naoki <songofacandy@gmail.com>
miss-islington added a commit that referenced this pull request Jan 28, 2020
Co-Authored-By: Kyle Stanley <aeros167@gmail.com>
(cherry picked from commit 148610d)

Co-authored-by: Inada Naoki <songofacandy@gmail.com>
shihai1991 added a commit to shihai1991/cpython that referenced this pull request Jan 31, 2020
)

Co-Authored-By: Kyle Stanley <aeros167@gmail.com>