New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-30717: add unicode grapheme cluster break algorithm #2673
base: main
Are you sure you want to change the base?
Conversation
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. Thanks again to your contribution and we look forward to looking at it! |
0f82f82
to
62fd6e0
Compare
62fd6e0
to
a47de54
Compare
Hello? Someone here? |
Modules/unicodedata.c
Outdated
++p->pos; | ||
} | ||
} else { | ||
/* Raising of standard StopIteration exception with empty value. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just return NULL without raising a StopIteration.
Modules/unicodedata.c
Outdated
GraphemeClusterIterator *p = (GraphemeClusterIterator *)self; | ||
int kind = PyUnicode_KIND(p->str); | ||
void *pstr = PyUnicode_DATA(p->str); | ||
if (PyUnicode_READ(kind, pstr, p->pos)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The iteration stops on the first NUL character. What about strings containing embedded NUL characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be a legal (text) string then? I assumed no.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not? NUL characters don't have special meaning in Python strings.
Silent truncating the string on the first NUL character would look like a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, it was just my C++ biais :) fixing.
STATE_Any, | ||
} GCBState; | ||
|
||
static GCBState GRAPH_CLUSTER_AUTOMATON[15][20] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be a const
array.
There are only 19 values in GraphemeClusterBreakType, but the size of the second dimension is 20.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Tools/unicode/makeunicodedata.py
Outdated
if not os.path.exists(local): | ||
import urllib.request | ||
if version == '3.2.0': | ||
# irregular url structure | ||
url = 'http://www.unicode.org/Public/3.2-Update/' + local | ||
else: | ||
url = ('http://www.unicode.org/Public/%s/ucd/'+template) % (version, '') | ||
print(url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debugging remnants?
Modules/unicodedata.c
Outdated
}; | ||
|
||
/*[clinic input] | ||
unicodedata.UCD.break_graphemes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I prefer iter_graphemes
because it's clear to return an iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it better too! Fixing!
Modules/unicodedata.c
Outdated
sizeof(GraphemeClusterIterator), /*tp_basicsize*/ | ||
0, /*tp_itemsize*/ | ||
(destructor)GCI_Del, /*tp_dealloc*/ | ||
0, /*tp_print*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use designated initializers for new types.
...
.tp_dealloc = (destructor)GCI_Del,
.tp_flags = Py_TPFLAGS_DEFAULT,
.tp_doc = "Internal grapheme cluster iterator object",
.tp_iter = GCI_iter,
.tp_iternext = GCI_iternext
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Modules/unicodedata.c
Outdated
typedef struct { | ||
PyObject_HEAD | ||
PyObject* str; | ||
int pos; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use py_ssize_t instead of int.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Modules/unicodedata.c
Outdated
GraphemeClusterIterator *p = (GraphemeClusterIterator *)self; | ||
int kind = PyUnicode_KIND(p->str); | ||
void *pstr = PyUnicode_DATA(p->str); | ||
if (PyUnicode_READ(kind, pstr, p->pos)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not? NUL characters don't have special meaning in Python strings.
Silent truncating the string on the first NUL character would look like a bug.
Modules/unicodedata.c
Outdated
typedef struct { | ||
PyObject_HEAD | ||
PyObject* str; | ||
int pos; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Py_ssize_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in the upcoming patch
Modules/unicodedata.c
Outdated
0, /* tp_clear */ | ||
0, /* tp_richcompare */ | ||
0, /* tp_weaklistoffset */ | ||
GCI_iter, /* tp_iter: __iter__() method */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyObject_SelfIter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in the upcoming patch
Modules/unicodedata.c
Outdated
int pos; | ||
} GraphemeClusterIterator; | ||
|
||
void GCI_Del(PyObject* x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The return type should be on separate line (see PEP 7).
- The function should be
static
. - Usually destructors have the
dealloc
suffix. - You can use the pointer to
GraphemeClusterIterator
. - And it is more common to add a space before
*
than after it.
static void
GCI_dealloc(GraphemeClusterIterator *it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in the upcoming patch
Modules/unicodedata.c
Outdated
|
||
#include "grapheme_cluster_break_automaton.h" | ||
|
||
PyObject* GCI_iternext(PyObject *self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most comments about the destructor are applicable to this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in the upcoming patch
Modules/unicodedata.c
Outdated
0, /*tp_getattro*/ | ||
0, /*tp_setattro*/ | ||
0, /*tp_as_buffer*/ | ||
Py_TPFLAGS_DEFAULT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the str
field can be not exact Unicode object, but an instance of the subclass of the str
class, the iterator should be a GC type. Add Py_TPFLAGS_HAVE_GC
, implement tp_traverse
, use PyObject_GC_New()
and PyObject_GC_Del()
, add PyObject_GC_Untrack()
in the destructor. See https://docs.python.org/3/c-api/gcsupport.html.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, should I remove the current tp_dealloc and replace it with a tp_clear? Should the tp_dealloc be kept? Reading the doc, it seems that tp_traverse and tp_clear are related. Can you make that a little bit clearer :) ? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done something on my own that seems to work, but you should definitely double check that.
Modules/unicodedata.c
Outdated
0, /*tp_setattro*/ | ||
0, /*tp_as_buffer*/ | ||
Py_TPFLAGS_DEFAULT, | ||
"Internal grapheme cluster iterator object.", /* tp_doc */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the words "internal" and "object" are redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Internal", "iterator" and "object" are all redundant. "Grapheme cluster iterator" seems just right. What do you think?
Modules/unicodedata.c
Outdated
}; | ||
|
||
/*[clinic input] | ||
unicodedata.UCD.break_graphemes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be useful to support iterating a substring. Add optional start
and end
parameters as in str.index()
or re.Pattern.finditer()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in the upcoming patch
Sorry for the long wait. Are we good concerning the changes? Anything to add? |
To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request. If/when the requested changes have been made, please leave a comment that says, |
@Vermeille, please take a look at the most recent comments on the bug tracker for this issue. It looks like the suggested path forward is different than the solution you proposed here. Thanks! |
This PR is stale because it has been open for 30 days with no activity. |
This PR is stale because it has been open for 30 days with no activity. |
I have added GraphemeBreakProperty to UnicodeData.
An automaton to compute the rules for breaking grapheme clusters according to TR29 is included. It passes all the tests provided in GraphemeBreakTests.txt.
https://bugs.python.org/issue30717