Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB #82052

Open
andy-ms mannequin opened this issue Aug 16, 2019 · 5 comments
Open

Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB #82052

andy-ms mannequin opened this issue Aug 16, 2019 · 5 comments
Labels
3.8 3.9 3.10 expert-IO OS-windows type-bug An unexpected behavior, bug, or error

Comments

@andy-ms
Copy link
Mannequin

andy-ms mannequin commented Aug 16, 2019

BPO 37871
Nosy @gpshead, @pfmoore, @tjguk, @zware, @eryksun, @zooba, @andy-ms

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2019-08-16.00:36:04.457>
labels = ['type-bug', '3.8', '3.9', 'expert-IO', '3.10', 'OS-windows']
title = 'Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB'
updated_at = <Date 2021-03-20.23:35:16.331>
user = 'https://github.com/andy-ms'

bugs.python.org fields:

activity = <Date 2021-03-20.23:35:16.331>
actor = 'eryksun'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Windows', 'IO']
creation = <Date 2019-08-16.00:36:04.457>
creator = 'anhans'
dependencies = []
files = []
hgrepos = []
issue_num = 37871
keywords = []
message_count = 5.0
messages = ['349837', '349844', '349872', '389191', '389199']
nosy_count = 7.0
nosy_names = ['gregory.p.smith', 'paul.moore', 'tim.golden', 'zach.ware', 'eryksun', 'steve.dower', 'anhans']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue37871'
versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

@andy-ms
Copy link
Mannequin Author

andy-ms mannequin commented Aug 16, 2019

To reproduce:

Put this text in a file a.py and run py a.py.

Or just run: py -c "print(('é' * 40 + '\n') * 473)"

Scroll up for a while. One of the lines will be:

éééééééééééééééééééé��ééééééééééééééééééé

(You can spot this because it's slightly longer than the other lines.)

The error is consistently on line 237, column 21 (1-indexed).

# The error reproduces on Windows but not Linux. Tested in both powershell and CMD.
# (Failed to reproduce on either a real Linux machine or on Ubuntu with WSL.)
# On Windows, the error reproduces every time consistently.

# There is no error if N = 472 or 474.
N = 473
# There is no error if W = 39 or 41.
# (I tested with console windows of varying sizes, all well over 40 characters.)
W = 40
# There is no error if ch = "e" with no accent.
# There is still an error for other unicode characters like "Ö" or "ü".
ch = "é"
# There is no error without newlines.
s = (ch * W + "\n") * N
# Assert the string itself is correct.
assert all(c in (ch, "\n") for c in s)
print(s)

# There is no error if we use N separate print statements
# instead of printing a single string with N newlines.

# Similar scripts written in Groovy, JS and Ruby have no error.
# Groovy: System.out.println(("é" * 40 + "\n") * 473)
# JS: console.log(("é".repeat(40) + "\n").repeat(473))
# Ruby: puts(("é" * 40 + "\n") * 473)

@andy-ms andy-ms mannequin added 3.7 OS-windows type-bug An unexpected behavior, bug, or error labels Aug 16, 2019
@eryksun
Copy link
Contributor

eryksun commented Aug 16, 2019

To be compatible with Windows 7, _io__WindowsConsoleIO_write_impl in Modules/_io/winconsoleio.c is forced to write to the console in chunks that do not exceed 32 KiB. It does so by repeatedly dividing the length to decode by 2 until the decoded buffer size is small enough.

    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    while (wlen > 32766 / sizeof(wchar_t)) {
        len /= 2;
        wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    }

With ('é' * 40 + '\n') * 473, encoded as UTF-8, we have 473 82-byte lines (note that "\n" has been translated to "\r\n"). This is 38,786 bytes, which is too much for a single write, so it splits it in two.

    >>> 38786 // 2
    19393
    >>> 19393 // 82
    236
    >>> 19393 % 82
    41

This means line 237 ends up with 20 'é' characters (UTF-8 b'\xc3\xa9') and one partial character sequjence, b'\xc3'. When this buffer is passed to MultiByteToWideChar to decode from UTF-8 to UTF-16, the partial sequence gets decoded as the replacement character U+FFFD. For the next write, the remaining b'\xa9' byte also gets decoded as U+FFFD.

To avoid this, _io__WindowsConsoleIO_write_impl could decode the whole buffer in one pass, and slice that up into writes that are less than 32 KiB. Or it could ensure that its UTF-8 slices are always at character boundaries.

@zooba
Copy link
Member

zooba commented Aug 16, 2019

I'd rather keep encoding incrementally, and reduce the length of each attempt until the last UTF-8 character does not have its top bit set (i.e. is the final character in a multi-byte sequence).

Otherwise the people who like to print >2GB worth of data to the console will complain about the memory error :)

@vstinner vstinner changed the title 40 * 473 grid of "é" has a single wrong character on Windows Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB Aug 21, 2019
@gpshead
Copy link
Member

gpshead commented Mar 20, 2021

Steve's approach makes sense and should be robust.

side note: do we need to care about Windows 7 anymore in 3.10 given that microsoft no longer supports it?

@eryksun
Copy link
Contributor

eryksun commented Mar 20, 2021

side note: do we need to care about Windows 7 anymore in
3.10 given that microsoft no longer supports it?

If the fix comes in time for Python 3.8, then it needs to support Windows 7. For Python 3.9+, the 32 KiB limit can be removed.

The console documentation still includes the misleading disclaimer about "available heap". This refers to a relatively small block of shared memory (64 KiB IIRC) that's overlayed by a heap, not the default process heap. Shared memory is used by system LPC ports to efficiently pass large messages between a system server (e.g. csrss.exe, conhost.exe) and a client process. The console API used to use an LPC port, but in Windows 8.1+ it uses a driver instead, so none of the "available heap" warnings apply anymore. Microsoft should clarify the docs to stress that the warning is for Windows 7 and earlier.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.8 3.9 3.10 expert-IO OS-windows type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

3 participants