Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-39087: Add PyUnicode_GetUTF8Buffer(). #17659

Open
wants to merge 1 commit into
base: master
from

Conversation

@methane
Copy link
Member

methane commented Dec 19, 2019

@methane methane requested a review from python/windows-team as a code owner Dec 19, 2019
@methane methane requested review from vstinner and serhiy-storchaka Dec 19, 2019
def test_getutf8buffer(self):
from _testcapi import unicode_getutf8buffer

ascii = "foo"

This comment has been minimized.

Copy link
@vstinner

vstinner Dec 19, 2019

Member

nitpick: "ascii" is the name of a builtin function, maybe rename it to "text".

self.assertEqual(c1 + 1, c2)

mv.release()
del mv

This comment has been minimized.

Copy link
@vstinner

vstinner Dec 19, 2019

Member

Is "del mv" really needed for following tests? If yes, you might add a support.gc_collect() call.

self.assertEqual(unicode_getutf8buffer(bmp).tobytes(), b'\xc4\x80')
self.assertEqual(unicode_getutf8buffer(bmp2).tobytes(), b'\xef\xbf\xbf')
self.assertEqual(unicode_getutf8buffer(nonbmp).tobytes(), b'\xf4\x8f\xbf\xbf')
self.assertRaises(UnicodeEncodeError, unicode_getutf8buffer, 'a\ud800b\udfffc')

This comment has been minimized.

Copy link
@vstinner

vstinner Dec 19, 2019

Member

Could you add a test using errors=surrogatepass?

}
// Instead of calling PyBuffer_Release(), move the ownership of the obj
// to the memoryview object.
((PyMemoryViewObject*)result)->mbuf->master.obj = buffer.obj;

This comment has been minimized.

Copy link
@vstinner

vstinner Dec 19, 2019

Member

It seems a little strange to me that the function returns a Py_buffer, but you create a memory view and "hacks" the memory view internal object to return a view instead of the Py_buffer.

Maybe add two tests:

  • one test in pure C which checks Py_buffer attributes
  • one function exposing PyUnicode_GetUTF8Buffer() but returns a bytes object, just to check the UTF-8 encoded string: this function would call PyBuffer_Release()
Successful calls to :c:func:`PyUnicode_GetUTF8Buffer` must be paired with
calls to :c:func:`PyBuffer_Release`, similar to :c:func:`PyObject_GetBuffer`.

This comment has been minimized.

Copy link
@vstinner

vstinner Dec 19, 2019

Member

Since the motivation to add this new function was performance, would you mind to elaborate a little bit on that? Explain if a memory copy is needed. Explain if the function fills the internal UTF-8 cache if it's not already computed?

1, PyBUF_SIMPLE);
}

PyObject *bytes = _PyUnicode_AsUTF8String(unicode, errors);

This comment has been minimized.

Copy link
@vstinner

vstinner Dec 19, 2019

Member

Would you mind to add a comment to explain that you don't fill the internal UTF-8 cache on purpose? We have so many ways to encode Unicode to UTF-8, I never recall which one fills the cache or not.

return -1;
}

if (PyUnicode_UTF8(unicode) != NULL) {

This comment has been minimized.

Copy link
@vstinner

vstinner Dec 19, 2019

Member

I like this code path :-D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.