Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB #82052
Comments
To reproduce: Put this text in a file
|
To be compatible with Windows 7, _io__WindowsConsoleIO_write_impl in Modules/_io/winconsoleio.c is forced to write to the console in chunks that do not exceed 32 KiB. It does so by repeatedly dividing the length to decode by 2 until the decoded buffer size is small enough.

    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    while (wlen > 32766 / sizeof(wchar_t)) {
        len /= 2;
        wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    }

With

    >>> 38786 // 2
    19393
    >>> 19393 // 82
    236
    >>> 19393 % 82
    41

This means line 237 ends up with 20 'é' characters (UTF-8 b'\xc3\xa9') and one partial character sequence, b'\xc3'. When this buffer is passed to MultiByteToWideChar to decode from UTF-8 to UTF-16, the partial sequence gets decoded as the replacement character U+FFFD. For the next write, the remaining b'\xa9' byte also gets decoded as U+FFFD.

To avoid this, _io__WindowsConsoleIO_write_impl could decode the whole buffer in one pass, and slice that up into writes that are less than 32 KiB. Or it could ensure that its UTF-8 slices are always at character boundaries. |
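The arithmetic above can be checked with a small Python model of the halving loop. This is only a sketch, not the C implementation. It assumes that the test file consists of 473 lines of 40 'é' characters plus CRLF (82 bytes each, 38786 bytes in total), that wchar_t is 2 bytes, and that `bytes.decode(..., errors="replace")` is a fair stand-in for MultiByteToWideChar with flags set to 0. The helper names are made up for illustration.

```python
MAX_WCHARS = 32766 // 2  # 32766 / sizeof(wchar_t), assuming 2-byte wchar_t


def utf16_len(data):
    """Stand-in for MultiByteToWideChar(CP_UTF8, 0, buf, len, NULL, 0)."""
    text = data.decode("utf-8", errors="replace")
    return len(text.encode("utf-16-le")) // 2


def halving_chunk_len(buf):
    """Mirror the C loop: halve the byte length until it decodes small enough."""
    n = len(buf)
    wlen = utf16_len(buf)
    while wlen > MAX_WCHARS:
        n //= 2
        wlen = utf16_len(buf[:n])
    return n


def simulate_writes(buf):
    """Decode the buffer chunk by chunk, as successive write() calls would."""
    chunks = []
    while buf:
        n = halving_chunk_len(buf)
        chunks.append(buf[:n].decode("utf-8", errors="replace"))
        buf = buf[n:]
    return chunks


# Assumed layout of the test file: 40 'é' + CRLF per line, 473 lines.
line = ("é" * 40 + "\r\n").encode("utf-8")
data = line * 473
chunks = simulate_writes(data)
```

Under these assumptions the first chunk is cut at byte 19393, in the middle of an 'é', so the first chunk ends with U+FFFD and the second chunk starts with one.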
I'd rather keep encoding incrementally, and reduce the length of each attempt until the last UTF-8 character does not have its top bit set (i.e. is the final character in a multi-byte sequence). Otherwise the people who like to print >2GB worth of data to the console will complain about the memory error :) |
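A rough Python sketch of the incremental idea follows. It is a model, not the eventual C patch. It swaps in a slightly different boundary test from the one worded above: after each halving it backs the cut off by up to three bytes while the first excluded byte is a UTF-8 continuation byte (0b10xxxxxx). The helper names, the 2-byte wchar_t, and the use of `errors="replace"` as a stand-in for MultiByteToWideChar are all assumptions.

```python
MAX_WCHARS = 32766 // 2  # same limit as the C loop, assuming 2-byte wchar_t


def utf16_len(data):
    """Stand-in for MultiByteToWideChar(CP_UTF8, 0, buf, len, NULL, 0)."""
    text = data.decode("utf-8", errors="replace")
    return len(text.encode("utf-16-le")) // 2


def back_to_boundary(buf, n):
    """Move the cut left (at most 3 bytes) so buf[:n] does not end mid-sequence."""
    for _ in range(3):
        if 0 < n < len(buf) and (buf[n] & 0xC0) == 0x80:
            n -= 1  # buf[n] is a continuation byte, so the cut splits a character
        else:
            break
    return n


def boundary_chunk_len(buf):
    """Halve the length as before, but snap each attempt to a character boundary."""
    n = len(buf)
    while utf16_len(buf[:n]) > MAX_WCHARS:
        n = back_to_boundary(buf, n // 2)
    return n


def split_for_console(buf):
    """Decode the buffer one bounded chunk at a time."""
    chunks = []
    while buf:
        n = boundary_chunk_len(buf)
        chunks.append(buf[:n].decode("utf-8", errors="replace"))
        buf = buf[n:]
    return chunks


# Same assumed test data as in the report: 473 lines of 40 'é' + CRLF.
data = ("é" * 40 + "\r\n").encode("utf-8") * 473
chunks = split_for_console(data)
```

With this data the first cut moves from byte 19393 back to 19392, so no character is split. The back-off is capped at three bytes because a valid UTF-8 sequence has at most three continuation bytes; input that is already invalid would still produce replacement characters.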
Steve's approach makes sense and should be robust. Side note: do we need to care about Windows 7 anymore in 3.10, given that Microsoft no longer supports it? |
If the fix comes in time for Python 3.8, then it needs to support Windows 7. For Python 3.9+, the 32 KiB limit can be removed. The console documentation still includes the misleading disclaimer about "available heap". This refers to a relatively small block of shared memory (64 KiB IIRC) that's overlaid by a heap, not the default process heap. Shared memory is used by system LPC ports to efficiently pass large messages between a system server (e.g. csrss.exe, conhost.exe) and a client process. The console API used to use an LPC port, but in Windows 8.1+ it uses a driver instead, so none of the "available heap" warnings apply anymore. Microsoft should clarify the docs to stress that the warning is for Windows 7 and earlier. |
andy-ms mannequin commented Aug 16, 2019