Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_logging SMTPHandlerTest fails due to bad encoding #113650

Open
zooba opened this issue Jan 2, 2024 · 11 comments
Open

test_logging SMTPHandlerTest fails due to bad encoding #113650

zooba opened this issue Jan 2, 2024 · 11 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode

Comments

@zooba
Copy link
Member

zooba commented Jan 2, 2024

I get this error on one of my ARM64 test machines.

The s.encode('ascii', 'surrogateescape') looks strange to my eyes, since ASCII doesn't have a way to encode surrogates? I assume the handler has tried to do it (hence "position 6-8" not 6-7), but really this ought to just fail (perhaps explicitly) and email shouldn't be passing ascii and surrogateescape together.

I'm not sure why it doesn't occur on other platforms. This really ought to be an error all the time - ideas?

test_basic (test.test_logging.SMTPHandlerTest.test_basic) ... --- Logging error ---
Traceback (most recent call last):
  File "D:\agent\_work\3\b\layout\Lib\logging\handlers.py", line 1086, in emit
    smtp.send_message(msg)
    ~~~~~~~~~~~~~~~~~^^^^^
  File "D:\agent\_work\3\b\layout\Lib\smtplib.py", line 973, in send_message
    g.flatten(msg_copy, linesep='\r\n')
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\generator.py", line 115, in flatten
    self._write(msg)
    ~~~~~~~~~~~^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\generator.py", line 180, in _write
    self._dispatch(msg)
    ~~~~~~~~~~~~~~^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\generator.py", line 217, in _dispatch
    meth(msg)
    ~~~~^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\generator.py", line 433, in _handle_text
    self._write_lines(msg._payload)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\generator.py", line 154, in _write_lines
    self.write(line)
    ~~~~~~~~~~^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\generator.py", line 409, in write
    self._fp.write(s.encode('ascii', 'surrogateescape'))
                   ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-8: ordinal not in range(128)
[skip test infrastructure stack]
  File "D:\agent\_work\3\b\layout\Lib\test\test_logging.py", line 1126, in test_basic
    h.handle(r)
Message: 'Hello \u2713'
Arguments: ()
FAIL

Linked PRs

@zooba
Copy link
Member Author

zooba commented Jan 2, 2024

Same error occurs in other places:

======================================================================
ERROR: test_invalid_nonascii_header_as_string (test.test_mailbox.TestMMDF.test_invalid_nonascii_header_as_string)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\agent\_work\3\b\layout\Lib\test\test_mailbox.py", line 107, in test_invalid_nonascii_header_as_string
    self.assertEqual(self._box.get_string(key),
                     ~~~~~~~~~~~~~~~~~~~~^^^^^
  File "D:\agent\_work\3\b\layout\Lib\mailbox.py", line 842, in get_string
    self.get_bytes(key, from_)).as_string(unixfrom=from_)
                                ~~~~~~~~~^^^^^^^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\message.py", line 188, in as_string
    g.flatten(self, unixfrom=unixfrom)
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\generator.py", line 115, in flatten
    self._write(msg)
    ~~~~~~~~~~~^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\generator.py", line 198, in _write
    self._write_headers(msg)
    ~~~~~~~~~~~~~~~~~~~^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\generator.py", line 225, in _write_headers
    self.write(self.policy.fold(h, v))
               ~~~~~~~~~~~~~~~~^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\_policybase.py", line 326, in fold
    return self._fold(name, value, sanitize=True)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\_policybase.py", line 369, in _fold
    parts.append(h.encode(linesep=self.linesep, maxlinelen=maxlinelen))
                 ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\header.py", line 367, in encode
    formatter.feed('', lines[0], charset)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\header.py", line 454, in feed
    encoded_lines = charset.header_encode_lines(string, self._maxlengths())
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\charset.py", line 312, in header_encode_lines
    header_bytes = _encode(string, codec)
                   ~~~~~~~^^^^^^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\email\charset.py", line 157, in _encode
    return string.encode('ascii', 'surrogateescape')
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'ascii' codec can't encode character '\udbe1' in position 8: ordinal not in range(128)

@zooba
Copy link
Member Author

zooba commented Jan 2, 2024

My current theory is there's an ARM64 compiler/optimizer bug that's causing UTF-8 encoding to fail when it ought to pass, and so we aren't even supposed to be in this codepath.

I don't think that excuses passing surrogateescape here, that still seems wrong, even if we don't normally hit it.

@serhiy-storchaka
Copy link
Member

"surrogateescape" error handler encodes characters \udc80-\udcff as bytes \x80-\xff. It is normal to use it with "ascii" encoding. The valid input should include characters \x00-\x7f and \udc80-\udcff. But \udbe1 is outside of this range, and this is wrong.

@serhiy-storchaka
Copy link
Member

Do test_codecs or test_codeccallbacks fail?

@zooba
Copy link
Member Author

zooba commented Jan 3, 2024

Yeah, test_codecs.ASCIITest.test_decode_error fails as expected (I have some other failures due to a stack overflow issue that's ongoing, but this one seems most related).

Edit: Probably as expected. It's failing for some reason, and then failing to write the error in cp1252, which probably means it's trying to print the offending character...

======================================================================
FAIL: test_decode_error (test.test_codecs.ASCIITest.test_decode_error) (data=b'[\x80\xff]', error_handler='surrogateescape', expected='[\udc80\udcff]')
----------------------------------------------------------------------
test test_codecs crashed -- Traceback (most recent call last):
  File "D:\agent\_work\3\b\layout\Lib\test\libregrtest\single.py", line 178, in _runtest_env_changed_exc
    _load_run_test(result, runtests)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\test\libregrtest\single.py", line 135, in _load_run_test
    regrtest_runner(result, test_func, runtests)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\test\libregrtest\single.py", line 88, in regrtest_runner
    test_result = test_func()
                  ~~~~~~~~~^^
  File "D:\agent\_work\3\b\layout\Lib\test\libregrtest\single.py", line 132, in test_func
    return run_unittest(test_mod)
           ~~~~~~~~~~~~^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\test\libregrtest\single.py", line 37, in run_unittest
    return _run_suite(tests)
           ~~~~~~~~~~^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\test\libregrtest\single.py", line 57, in _run_suite
    result = runner.run(suite)
             ~~~~~~~~~~^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\unittest\runner.py", line 247, in run
    result.printErrors()
    ~~~~~~~~~~~~~~~~~~^^
  File "D:\agent\_work\3\b\layout\Lib\unittest\runner.py", line 147, in printErrors
    self.printErrorList('FAIL', self.failures)
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\unittest\runner.py", line 160, in printErrorList
    self.stream.writeln("%s" % err)
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "D:\agent\_work\3\b\layout\Lib\unittest\runner.py", line 26, in writeln
    self.write(arg)
    ~~~~~~~~~~^^^^^
  File "D:\agent\_work\3\b\layout\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 368-369: character maps to <undefined>

@zooba
Copy link
Member Author

zooba commented Jan 3, 2024

The valid input should include characters \x00-\x7f and \udc80-\udcff. But \udbe1 is outside of this range, and this is wrong.

So this would suggest that our test is broken? I can track down where that character is coming from, we do have some platform-specific code for making sure that a Unicode character is invalid on a given platform. It may be inappropriate to use that for a codec test.

@serhiy-storchaka
Copy link
Member

I suspect that ASCII and UTF-8 codecs may be broken. They use some clever code for optimization, and it may not work on a new platform. All other failures are derivative.

Does len(b'[\x80\xff]'.decode('ascii')) fail?

What are values of SIZEOF_SIZE_T, SIZEOF_VOID_P, ALIGNOF_SIZE_T, ALIGNOF_LONG on this platform?

Find all if (_Py_IS_ALIGNED(...)) in Objects/unicodeobject.c and Objects/stringlib/codecs.h and make the conditions always false (add 0&&). Does it help?

@serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) and removed topic-email labels Jan 3, 2024
@zooba
Copy link
Member Author

zooba commented Jan 3, 2024

size_t and void* are both 8 bytes and aligned to 8, and long is 4 and aligned to 4, the same as on x64. And we've been building and running on the platform for a while now, so unless something changed in the optimizations recently, I can only imagine it's a compiler bug (possibly one that's already fixed but hasn't been deployed to CI systems everywhere yet).

The codec tests are failing as below (I got here by setting PYTHONIOENCODING="ascii:backslashreplace" before running tests, and I wonder whether this is generally a good thing for our CI?):

======================================================================
FAIL: test_decode_error (test.test_codecs.ASCIITest.test_decode_error) (data=b'[\x80\xff]', error_handler='surrogateescape', expected='[\udc80\udcff]')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\agent\_work\4\b\layout\Lib\test\test_codecs.py", line 3390, in test_decode_error
    self.assertEqual(data.decode('ascii', error_handler),
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                     expected)
                     ^^^^^^^^^
AssertionError: '[\udb80\udbff]' != '[\udc80\udcff]'
- [\udb80\udbff]
+ [\udc80\udcff]


======================================================================
FAIL: test_ascii (test.test_codecs.SurrogateEscapeTest.test_ascii)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\agent\_work\4\b\layout\Lib\test\test_codecs.py", line 2751, in test_ascii
    self.assertEqual(b"foo\x80bar".decode("ascii", "surrogateescape"),
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                     "foo\udc80bar")
                     ^^^^^^^^^^^^^^^
AssertionError: 'foo\udb80bar' != 'foo\udc80bar'
- foo\udb80bar
?    ^
+ foo\udc80bar
?    ^

So apparently it's putting surrogates under DB00 instead of DC00.

@zooba
Copy link
Member Author

zooba commented Jan 4, 2024

Looks like a codegen issue that's been fixed in the next compiler version (14.38).

In PyUnicode_DecodeASCII with 14.37, the line PyUnicode_WRITE(kind, data, writer.pos, c + 0xdc00) generates ldrsb w8,[sp,#0x58] to read c back from the local, but it sign extends it despite it being an unsigned char, so we get a subtraction instead. Compiler 14.38 gets it right.

Just as well the ARM64 builds are still only experimental... I'll sleep on it, but I think the only thing we could do here is block compiling with that version, and all that's really going to achieve is to break our own CI, so there's probably nothing to do apart from recording the issue so people can find it (and start citing compiler instability as a reason to remain experimental...).

@zooba
Copy link
Member Author

zooba commented Jan 9, 2024

Looks like it's not the real fix, and we're probably going to have to wait for a new compiler release. There's no reason to rush into reverting this change, as we just have to block the whole compiler version so it may as well stay there. Slight chance we'll find another option to disable just this issue, but it may also have multiple causes...

I'm filing a specific issue for a minimal version of our code, and also looking at getting the MSVC team to use our test suite in their own regression tests.

@zooba
Copy link
Member Author

zooba commented Jan 9, 2024

FTR, here's the minimal code, and showing that it's fixed in the next MSVC version. So we're just waiting on that to be properly released so that GitHub/Azure Pipelines can pick it up.

https://godbolt.org/z/89T4W5M3j

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode
Projects
None yet
Development

No branches or pull requests

2 participants