Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-43667: Fix broken Unicode encoding in non-UTF locales on Solaris #25096

Merged
merged 12 commits into from Apr 30, 2021

Conversation

@kulikjak
Copy link
Contributor

@kulikjak kulikjak commented Mar 30, 2021

This PR fixes wchar_t issues on Oracle Solaris when non-UTF locales are in effect (see the issue for more info).

This is a work in progress.

https://bugs.python.org/issue43667

Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
@kulikjak
Copy link
Contributor Author

@kulikjak kulikjak commented Apr 1, 2021

Since this doesn't affect other Solaris derivatives, I also added a configure detection for Oracle Solaris and changed all #ifdef to reflect that.

configure.ac Outdated Show resolved Hide resolved
Include/unicodeobject.h Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Include/cpython/fileutils.h Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
@kulikjak
Copy link
Contributor Author

@kulikjak kulikjak commented Apr 6, 2021

I fixed some of your suggestions, but there are still some to fix, like wcstombs truncating the input string. And I am also missing unicode to wchar_t handling.

I am considering using iconv_open() and iconv(); that should make the conversion clearer (no need for wcstombs and mbrtoc32 functions) and better supported because it will always work no matter what encoding wchar_t uses.

It doesn't solve the truncating issue though. I will have to think this through.

Copy link
Member

@vstinner vstinner left a comment

Thanks. The design now looks better. New review.

I would prefer to include <uchar.h> in Python.h (move the new functions to the internal C API, see my comment), and the issue with NULL characters is not solved yet.

Include/unicodeobject.h Outdated Show resolved Hide resolved
Include/cpython/fileutils.h Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Python/fileutils.c Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
@kulikjak
Copy link
Contributor Author

@kulikjak kulikjak commented Apr 29, 2021

Sorry for such a delay - I had to dive deeper into locale stuff and tend to other more urgent things.

I just pushed my latest changes that incorporated your suggestions, changed conversion a little, and added conversion back for corresponding functions. It's tested quite extensively in 3.7 (which is the default on Solaris now), but not much in master (which is a little bit different).

Copy link
Member

@vstinner vstinner left a comment

Oh thanks, the updated PR looks simpler (thanks to iconv?).

Objects/unicodeobject.c Show resolved Hide resolved
Python/fileutils.c Show resolved Hide resolved
Python/fileutils.c Outdated Show resolved Hide resolved
wchar_t* result = _Py_ConvertWCharForm(unicode, size, "wchar_t", "UCS-4-INTERNAL");
if (!result) {
return NULL;
}

This comment has been minimized.

@vstinner

vstinner Apr 29, 2021
Member

Maybe add an assertion: assert((wcslen(result) + 1) == size);. I understand that result cannot be shorter or longer.

This comment has been minimized.

@kulikjak

kulikjak Apr 30, 2021
Author Contributor

Is this always the case? You told me that _Py_ConvertWCharForm cannot truncate at first zero and then wcslen+1 doesn't have to correspond to size. Although I am not sure if that can happen when converting from UCS-4 to wchar_t.

The conversion is done in-place. Return a pointer to the wchar_t buffer
given as the first argument. Return NULL and raise exception on conversion
or memory allocation error. */
wchar_t *

This comment has been minimized.

@vstinner

vstinner Apr 29, 2021
Member

I suggest to return -1 on error and return 0 on success, so the caller uses _Py_ConvertWCharFormToNative_InPlace() < 0 to check for error which is common in Python.

This comment has been minimized.

@kulikjak

kulikjak Apr 30, 2021
Author Contributor

I changed it as you suggested. At first, I was able to use that returned value nicely in PyUnicode_AsWideCharString, but then additional checks made it unnecessary.

the memory. Return NULL and raise exception on conversion or memory
allocation error. */
wchar_t *
_Py_ConvertWCharFormToUCS4(const wchar_t *native, Py_ssize_t size)

This comment has been minimized.

@vstinner

vstinner Apr 29, 2021
Member

I don't understand "Form" in "_Py_ConvertWCharFormToUCS4" name. I suggest the name "_Py_ConvertWCharToUCS4" or "_Py_DecodeNonUnicodeWchar".

Same remark for "Form" in "_Py_ConvertWCharFormToNative_InPlace" name.

This comment has been minimized.

@kulikjak

kulikjak Apr 30, 2021
Author Contributor

Changed to decode/encode variant of _Py_DecodeNonUnicodeWchar.

Python/fileutils.c Show resolved Hide resolved
@vstinner vstinner merged commit 9032cf5 into python:master Apr 30, 2021
10 of 11 checks passed
10 of 11 checks passed
@github-actions
Check for source changes
Details
@github-actions
Check if generated files are up to date
Details
@github-actions
Windows (x86)
Details
@github-actions
Windows (x64)
Details
@github-actions
macOS
Details
@github-actions
Ubuntu
Details
@github-actions
Ubuntu SSL tests with OpenSSL ${{ matrix.openssl_ver }}
Details
Azure Pipelines PR in progress
Details
@travis-ci
Travis CI - Pull Request Build Passed
Details
@bedevere-bot
bedevere/issue-number Issue number 43667 found
Details
@bedevere-bot
bedevere/news "skip news" label found
@vstinner
Copy link
Member

@vstinner vstinner commented Apr 30, 2021

Thanks @kulikjak! I recall many issues in the locale issue because of that. Python raised an exception since Unicode characters were outside the [U+0000; U+10ffff] range.

@miss-islington
Copy link
Contributor

@miss-islington miss-islington commented Apr 30, 2021

Thanks @kulikjak for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.9.
🐍🍒🤖

@miss-islington
Copy link
Contributor

@miss-islington miss-islington commented Apr 30, 2021

Sorry, @kulikjak and @vstinner, I could not cleanly backport this to 3.9 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker 9032cf5cb1e33c0349089cfb0f6bf11ed3c30e86 3.9

@kulikjak
Copy link
Contributor Author

@kulikjak kulikjak commented Apr 30, 2021

I thank you very much as well @vstinner! This was likely the biggest remaining issue with Python on Solaris (as you said, there were many [U+0000; U+10ffff] related problems reported before, and I hope this finally fixed that).

@vstinner
Copy link
Member

@vstinner vstinner commented Apr 30, 2021

The automated backport to 3.9 failed. You requested 3.8 and 3.9 backports in https://bugs.python.org/issue43667

Do you want to try to backport the change to 3.9? Maybe your manual 3.9 backport can be automatically backported later to 3.8 (we can try the bot).

@kulikjak
Copy link
Contributor Author

@kulikjak kulikjak commented Apr 30, 2021

Sure, I will backport this into 3.9.

I have two more questions:

  1. should I add a news entry for this change (via another PR)?
  2. I tried changing raised exceptions to UnicodeDecode/EncodeErrors instead of ValueError but when I force an exception, I am getting a very unexpected result:
ERROR: test_strftime (test.datetimetester.TestTimeTZ_Pure)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../cpython-master/Lib/test/datetimetester.py", line 3290, in test_strftime
    t.strftime('%H\ud800%M')
  File "/.../cpython-master/Lib/datetime.py", line 1457, in strftime
    return _wrap_strftime(self, fmt, timetuple)
  File "/.../cpython-master/Lib/datetime.py", line 262, in _wrap_strftime
    return _time.strftime(newformat, timetuple)
TypeError: function takes exactly 5 arguments (1 given)

Is this a known issue, or am I doing something wrong (I just replaced PyExc_ValueError with PyExc_UnicodeDecodeError)?

@vstinner
Copy link
Member

@vstinner vstinner commented Apr 30, 2021

should I add a news entry for this change (via another PR)?

Documenting changes is always a good idea.

UnicodeDecode/EncodeErrors have a complex API. I don't think that it's worth it to both with that.

What is the current exception message for t.strftime('%H\ud800%M')?

@kulikjak
Copy link
Contributor Author

@kulikjak kulikjak commented Apr 30, 2021

What is the current exception message for t.strftime('%H\ud800%M')?

With non-unicode locale it's ValueError: iconv() failed.

@vstinner
Copy link
Member

@vstinner vstinner commented Apr 30, 2021

Implementing UnicodeEncodeError would require very precise information about the error: which characters have been encoded, which characters are causing the encoding error, info about the error. I'm not sure that iconv provides all required information. Anyway, it can be enhanced later. This change is already a huge step to the right direction ;-)

kulikjak added a commit to kulikjak/cpython that referenced this pull request May 3, 2021
…laris (pythonGH-25096).

(cherry picked from commit 9032cf5)

Co-authored-by: Jakub Kulík <Kulikjak@gmail.com>
kulikjak added a commit to kulikjak/cpython that referenced this pull request May 3, 2021
…laris (pythonGH-25096).

(cherry picked from commit 9032cf5)

Co-authored-by: Jakub Kulík <Kulikjak@gmail.com>
kulikjak added a commit to kulikjak/cpython that referenced this pull request May 3, 2021
…laris (pythonGH-25096).

(cherry picked from commit 9032cf5)

Co-authored-by: Jakub Kulík <Kulikjak@gmail.com>
@bedevere-bot
Copy link

@bedevere-bot bedevere-bot commented May 3, 2021

GH-25847 is a backport of this pull request to the 3.9 branch.

@kulikjak
Copy link
Contributor Author

@kulikjak kulikjak commented May 3, 2021

Implementing UnicodeEncodeError would require very precise information about the error: which characters have been encoded, which characters are causing the encoding error, info about the error. I'm not sure that iconv provides all required information.

I see, I could have thought of that. Well, iconv doesn't provide such info so it's not trivial to do so. I might look into that later ;).

@kulikjak
Copy link
Contributor Author

@kulikjak kulikjak commented May 21, 2021

I wanted to create a PR with news entry, but I am unsure whether that is a good idea before this gets backported (my reasoning being that news entry will get most likely auto backported, and then there will be news entry before the actual changes).

vstinner pushed a commit that referenced this pull request May 21, 2021
…laris (GH-25096) (GH-25847)

(cherry picked from commit 9032cf5)

Co-authored-by: Jakub Kulík <Kulikjak@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants