[doc] Clarify exactly what \w matches in UNICODE mode #69929
Comments
The
Empirically, this appears to mean "everything in Unicode general categories L* and N*, plus U+005F (underscore)". That is a perfectly sensible definition, and the documentation should state it in those terms. "Unicode word characters" could mean any number of different things; note, for instance, that UTS#18 gives a very different definition. (Further reading: https://gist.github.com/zackw/3077f387591376c7bf67 plus links therefrom.)
I would also like to request a clear explanation for the documentation in the 2.7 branch. From https://docs.python.org/2.7/library/re.html:

"\w ... If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database"

This is ambiguous. Does it mean the "Alphabetic" property from UAX#44? Does it mean something else?
FWIW, the actual behavior of \w matching "everything in Unicode general categories L* and N*, plus U+005F (underscore)" is consistent across all versions I can conveniently test (2.7, 3.4, 3.5). In 2.7, there are four characters in general category Nl that \w doesn't match, but I believe that is just a bug, not an intentional difference in behavior.
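For anyone who wants to repeat the empirical check described above, here is a minimal sketch for Python 3. The helper names `predicted_word_char` and `find_mismatches` are made up for illustration, and the script only tests the "L*, N*, plus underscore" hypothesis; it is not a statement of how `re` is implemented. Results may differ between Python versions and Unicode database versions.

```python
import re
import sys
import unicodedata

# In Python 3, str patterns are Unicode-aware by default (no re.UNICODE needed).
WORD = re.compile(r"\w")


def predicted_word_char(ch):
    """Hypothesis from this thread: general category L* or N*, or underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")


def find_mismatches(limit=sys.maxunicode + 1):
    """Return code points below `limit` where \\w and the hypothesis disagree."""
    return [
        cp
        for cp in range(limit)
        if bool(WORD.match(chr(cp))) != predicted_word_char(chr(cp))
    ]


if __name__ == "__main__":
    mismatches = find_mismatches()
    print(len(mismatches), "mismatching code points")
    # Show a few of them, with their general category, for inspection.
    for cp in mismatches[:20]:
        print("U+%04X" % cp, unicodedata.category(chr(cp)))
```

Any code points it prints are places where the observed behavior and the proposed wording diverge, which is the kind of detail the documentation change would need to account for.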
It's too late for the 2.7 docs, but the current docs can still be updated.
Would a change like this be accurate?

Matches Unicode word characters; this includes most alphanumeric characters as well as the underscore. In Unicode, alphanumeric characters are defined to be the general categories L + N (see https://unicode.org/reports/tr44/#General_Category_Values). If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.
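To illustrate the difference the proposed wording draws between the default mode and the `ASCII` flag, here is a small sketch for Python 3. The sample string is arbitrary and uses escapes so that the accented letters are unambiguous, precomposed code points.

```python
import re

# "naïve_café", an ARABIC-INDIC DIGIT THREE (U+0663, category Nd), then "x-1".
sample = "na\u00efve_caf\u00e9 \u0663 x-1"

# Default for str patterns: Unicode-aware \w.
unicode_words = re.findall(r"\w+", sample)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

In the default mode the accented letters and the non-ASCII digit count as word characters; with `re.ASCII` they act as separators, so the first word is split at each of them.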
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
…2015) Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
zwol mannequin commented Nov 27, 2015