Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-30717: add unicode grapheme cluster break algorithm #2673

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

Vermeille
Copy link

@Vermeille Vermeille commented Jul 11, 2017

I have added GraphemeBreakProperty to UnicodeData.
An automaton to compute the rules for breaking grapheme clusters according to TR29 is included. It passes all the tests provided in GraphemeBreakTests.txt.

https://bugs.python.org/issue30717

@the-knights-who-say-ni
Copy link

the-knights-who-say-ni commented Jul 11, 2017

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

Thanks again to your contribution and we look forward to looking at it!

@Vermeille Vermeille changed the title WIP: add grapheme cluster break algorithm bpo-30717: WIP: add grapheme cluster break algorithm Jul 11, 2017
@Vermeille Vermeille changed the title bpo-30717: WIP: add grapheme cluster break algorithm bpo-30717: add grapheme cluster break algorithm Jul 14, 2017
@Vermeille Vermeille changed the title bpo-30717: add grapheme cluster break algorithm bpo-30717: add unicode grapheme cluster break algorithm Jul 14, 2017
@Vermeille
Copy link
Author

Vermeille commented Aug 2, 2017

Hello? Someone here?

++p->pos;
}
} else {
/* Raising of standard StopIteration exception with empty value. */
Copy link
Member

@serhiy-storchaka serhiy-storchaka Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just return NULL without raising a StopIteration.

GraphemeClusterIterator *p = (GraphemeClusterIterator *)self;
int kind = PyUnicode_KIND(p->str);
void *pstr = PyUnicode_DATA(p->str);
if (PyUnicode_READ(kind, pstr, p->pos)) {
Copy link
Member

@serhiy-storchaka serhiy-storchaka Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iteration stops on the first NUL character. What about strings containing embedded NUL characters?

Copy link
Author

@Vermeille Vermeille Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be a legal (text) string then? I assumed no.

Copy link
Member

@serhiy-storchaka serhiy-storchaka Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? NUL characters don't have special meaning in Python strings.

Silent truncating the string on the first NUL character would look like a bug.

Copy link
Author

@Vermeille Vermeille Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, it was just my C++ biais :) fixing.

STATE_Any,
} GCBState;

static GCBState GRAPH_CLUSTER_AUTOMATON[15][20] = {
Copy link
Member

@serhiy-storchaka serhiy-storchaka Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a const array.

There are only 19 values in GraphemeClusterBreakType, but the size of the second dimension is 20.

Copy link
Author

@Vermeille Vermeille Jan 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

if not os.path.exists(local):
import urllib.request
if version == '3.2.0':
# irregular url structure
url = 'http://www.unicode.org/Public/3.2-Update/' + local
else:
url = ('http://www.unicode.org/Public/%s/ucd/'+template) % (version, '')
print(url)
Copy link
Member

@serhiy-storchaka serhiy-storchaka Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debugging remnants?

@serhiy-storchaka serhiy-storchaka added the type-feature A feature request or enhancement label Aug 3, 2017
};

/*[clinic input]
unicodedata.UCD.break_graphemes
Copy link
Member

@methane methane Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I prefer iter_graphemes because it's clear to return an iterator.

Copy link
Author

@Vermeille Vermeille Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it better too! Fixing!

sizeof(GraphemeClusterIterator), /*tp_basicsize*/
0, /*tp_itemsize*/
(destructor)GCI_Del, /*tp_dealloc*/
0, /*tp_print*/
Copy link
Member

@methane methane Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use designated initializers for new types.

...
.tp_dealloc = (destructor)GCI_Del,
.tp_flags = Py_TPFLAGS_DEFAULT,
.tp_doc = "Internal grapheme cluster iterator object",
.tp_iter = GCI_iter,
.tp_iternext = GCI_iternext
};

Copy link
Author

@Vermeille Vermeille Jan 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

typedef struct {
PyObject_HEAD
PyObject* str;
int pos;
Copy link
Member

@methane methane Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use py_ssize_t instead of int.

Copy link
Author

@Vermeille Vermeille Jan 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

GraphemeClusterIterator *p = (GraphemeClusterIterator *)self;
int kind = PyUnicode_KIND(p->str);
void *pstr = PyUnicode_DATA(p->str);
if (PyUnicode_READ(kind, pstr, p->pos)) {
Copy link
Member

@serhiy-storchaka serhiy-storchaka Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? NUL characters don't have special meaning in Python strings.

Silent truncating the string on the first NUL character would look like a bug.

typedef struct {
PyObject_HEAD
PyObject* str;
int pos;
Copy link
Member

@serhiy-storchaka serhiy-storchaka Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Py_ssize_t

Copy link
Author

@Vermeille Vermeille Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in the upcoming patch

0, /* tp_clear */
0, /* tp_richcompare */
0, /* tp_weaklistoffset */
GCI_iter, /* tp_iter: __iter__() method */
Copy link
Member

@serhiy-storchaka serhiy-storchaka Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyObject_SelfIter

Copy link
Author

@Vermeille Vermeille Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in the upcoming patch

int pos;
} GraphemeClusterIterator;

void GCI_Del(PyObject* x)
Copy link
Member

@serhiy-storchaka serhiy-storchaka Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The return type should be on separate line (see PEP 7).
  • The function should be static.
  • Usually destructors have the dealloc suffix.
  • You can use the pointer to GraphemeClusterIterator.
  • And it is more common to add a space before * than after it.
static void
GCI_dealloc(GraphemeClusterIterator *it)

Copy link
Author

@Vermeille Vermeille Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in the upcoming patch


#include "grapheme_cluster_break_automaton.h"

PyObject* GCI_iternext(PyObject *self)
Copy link
Member

@serhiy-storchaka serhiy-storchaka Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most comments about the destructor are applicable to this method.

Copy link
Author

@Vermeille Vermeille Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in the upcoming patch

0, /*tp_getattro*/
0, /*tp_setattro*/
0, /*tp_as_buffer*/
Py_TPFLAGS_DEFAULT,
Copy link
Member

@serhiy-storchaka serhiy-storchaka Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the str field can be not exact Unicode object, but an instance of the subclass of the str class, the iterator should be a GC type. Add Py_TPFLAGS_HAVE_GC, implement tp_traverse, use PyObject_GC_New() and PyObject_GC_Del(), add PyObject_GC_Untrack() in the destructor. See https://docs.python.org/3/c-api/gcsupport.html.

Copy link
Author

@Vermeille Vermeille Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, should I remove the current tp_dealloc and replace it with a tp_clear? Should the tp_dealloc be kept? Reading the doc, it seems that tp_traverse and tp_clear are related. Can you make that a little bit clearer :) ? Thanks!

Copy link
Author

@Vermeille Vermeille Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done something on my own that seems to work, but you should definitely double check that.

0, /*tp_setattro*/
0, /*tp_as_buffer*/
Py_TPFLAGS_DEFAULT,
"Internal grapheme cluster iterator object.", /* tp_doc */
Copy link
Member

@serhiy-storchaka serhiy-storchaka Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the words "internal" and "object" are redundant.

Copy link
Author

@Vermeille Vermeille Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Internal", "iterator" and "object" are all redundant. "Grapheme cluster iterator" seems just right. What do you think?

};

/*[clinic input]
unicodedata.UCD.break_graphemes
Copy link
Member

@serhiy-storchaka serhiy-storchaka Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be useful to support iterating a substring. Add optional start and end parameters as in str.index() or re.Pattern.finditer().

Copy link
Author

@Vermeille Vermeille Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in the upcoming patch

@Vermeille
Copy link
Author

Vermeille commented Jan 11, 2018

Sorry for the long wait.

Are we good concerning the changes? Anything to add?

@brettcannon
Copy link
Member

brettcannon commented Feb 2, 2018

To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request.

If/when the requested changes have been made, please leave a comment that says, I have made the requested changes; please review again. That will trigger a bot to flag this pull request as ready for a follow-up review.

@csabella
Copy link
Contributor

csabella commented May 23, 2020

@Vermeille, please take a look at the most recent comments on the bug tracker for this issue. It looks like the suggested path forward is different than the solution you proposed here. Thanks!

@github-actions
Copy link

github-actions bot commented Feb 20, 2022

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Feb 20, 2022
@github-actions github-actions bot removed the stale Stale PR or inactive for long period of time. label Jul 28, 2022
@github-actions
Copy link

github-actions bot commented Aug 29, 2022

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Aug 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting changes stale Stale PR or inactive for long period of time. type-feature A feature request or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants