[doc] Clarify exactly what \w matches in UNICODE mode #69929

Closed
zwol mannequin opened this issue Nov 27, 2015 · 5 comments
Labels
3.9 3.10 3.11 docs Documentation in the Doc dir easy expert-regex type-feature A feature request or enhancement

Comments

@zwol
Mannequin

zwol mannequin commented Nov 27, 2015

BPO 25743
Nosy @ezio-melotti, @iritkatriel, @slateny

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2015-11-27.15:50:58.768>
labels = ['expert-regex', 'easy', '3.9', '3.10', '3.11', 'type-feature', 'docs']
title = '[doc] Clarify exactly what \\w matches in UNICODE mode'
updated_at = <Date 2022-02-28.07:39:46.535>
user = 'https://bugs.python.org/zwol'

bugs.python.org fields:

activity = <Date 2022-02-28.07:39:46.535>
actor = 'slateny'
assignee = 'docs@python'
closed = False
closed_date = None
closer = None
components = ['Documentation', 'Regular Expressions']
creation = <Date 2015-11-27.15:50:58.768>
creator = 'zwol'
dependencies = []
files = []
hgrepos = []
issue_num = 25743
keywords = ['easy']
message_count = 5.0
messages = ['255463', '255464', '255465', '407440', '414180']
nosy_count = 7.0
nosy_names = ['ezio.melotti', 'mrabarnett', 'docs@python', 'zwol', 'Andi McClure', 'iritkatriel', 'slateny']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue25743'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']


@zwol
Mannequin Author

zwol mannequin commented Nov 27, 2015

The re module documentation does not do a good job of explaining exactly what \w matches. Quoting https://docs.python.org/3.5/library/re.html :

\w
For Unicode (str) patterns:
Matches Unicode word characters; this includes most characters
that can be part of a word in any language, as well as numbers
and the underscore.

Empirically, this appears to mean "everything in Unicode general categories L* and N*, plus U+005F (underscore)". That is a perfectly sensible definition and the documentation should state it in those terms. "Unicode word characters" could mean any number of different things; note for instance that UTS#18 gives a very different definition.

(Further reading: https://gist.github.com/zackw/3077f387591376c7bf67 plus links therefrom).
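The empirical claim above can be checked with a short script. This is a minimal sketch, assuming Python 3, where `str` patterns use Unicode matching by default; the helper name `categories_matched_by_word` is made up for illustration. The result depends on the Unicode database bundled with the interpreter, so no expected output is shown.

```python
import re
import sys
import unicodedata
from collections import Counter

# A str pattern, so Unicode matching is the default in Python 3.
WORD = re.compile(r"\w")


def categories_matched_by_word():
    """Count, per Unicode general category, the code points that \\w matches."""
    # Build one string holding every code point and let findall pick the matches.
    every_char = "".join(chr(cp) for cp in range(sys.maxunicode + 1))
    return Counter(unicodedata.category(ch) for ch in WORD.findall(every_char))


if __name__ == "__main__":
    for category, count in sorted(categories_matched_by_word().items()):
        print(category, count)
```

If the description in this comment holds, the printed categories should be the L* and N* ones, plus a single entry under Pc for the underscore.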

@zwol zwol mannequin assigned docspython Nov 27, 2015
@zwol zwol mannequin added the docs Documentation in the Doc dir label Nov 27, 2015
@AndiMcClure
Mannequin

AndiMcClure mannequin commented Nov 27, 2015

I would also like to request a clear explanation in the documentation for the 2.7 branch. From https://docs.python.org/2.7/library/re.html :

"\w ... If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database"

This is ambiguous. Does it mean the "Alphabetic" property from UAX#44? Does it mean something else?
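One way to probe this question is to test a few characters whose Alphabetic property and general category disagree. The sketch below runs on Python 3, not 2.7, so it only illustrates the approach. The probe characters are chosen for this example, and the `Alphabetic=` notes are hardcoded from the Unicode Character Database as I understand it, because the `unicodedata` module does not expose that property.

```python
import re
import unicodedata

# Hypothetical probe set: the Alphabetic notes are written by hand, not computed.
PROBES = {
    "\u24b6": "CIRCLED LATIN CAPITAL LETTER A, Alphabetic=Yes",
    "\u0345": "COMBINING GREEK YPOGEGRAMMENI, Alphabetic=Yes",
    "\u00bd": "VULGAR FRACTION ONE HALF, Alphabetic=No",
    "\u0663": "ARABIC-INDIC DIGIT THREE, Alphabetic=No",
}


def matches_word(ch):
    """Return True if the single character ch is matched by \\w."""
    return re.fullmatch(r"\w", ch) is not None


for ch, note in PROBES.items():
    category = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} {category} matched={matches_word(ch)}  ({note})")
```

In this sketch the two Alphabetic characters (categories So and Mn) are not matched, while the two non-Alphabetic ones (categories No and Nd) are, which suggests that `\w` does not follow the UAX#44 Alphabetic property.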

@zwol
Mannequin Author

zwol mannequin commented Nov 27, 2015

FWIW, the actual behavior (\w matches "everything in Unicode general categories L* and N*, plus U+005F (underscore)") is consistent across all versions I can conveniently test (2.7, 3.4, 3.5).

In 2.7, there are four characters in general category Nl that \w doesn't match, but I believe that is just a bug, not an intentional difference of behavior.

@ezio-melotti ezio-melotti added expert-regex type-feature A feature request or enhancement labels Jan 4, 2016
@iritkatriel
Member

iritkatriel commented Dec 1, 2021

It's too late for the 2.7 docs, but the current docs can still be updated.

@iritkatriel iritkatriel changed the title Clarify exactly what \w matches in UNICODE mode [doc] Clarify exactly what \w matches in UNICODE mode Dec 1, 2021
@slateny
Contributor

slateny commented Feb 28, 2022

Would a change like this be accurate?

Matches Unicode word characters; this includes most alphanumeric characters as well as the underscore. In Unicode, alphanumeric characters are defined to be the general categories L + N (see https://unicode.org/reports/tr44/#General_Category_Values). If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.
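The difference the proposed wording describes can be seen in a small example. This is a sketch assuming Python 3; the sample string is made up for illustration and uses escapes so the exact code points are unambiguous.

```python
import re

# "naïve_café", an Arabic-Indic digit three, a vulgar fraction one half, and "x1".
sample = "na\u00efve_caf\u00e9 \u0663 \u00bd x1"

# Default for str patterns: Unicode letters and numbers count as word characters.
unicode_words = re.findall(r"\w+", sample)

# With re.ASCII, only [a-zA-Z0-9_] is matched.
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_words)  # ['naïve_café', '٣', '½', 'x1']
print(ascii_words)    # ['na', 've_caf', 'x1']
```

Note that the fraction U+00BD is in category No rather than Nd, so it is a useful check that "N" here means all numeric categories and not only decimal digits.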

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
JelleZijlstra added a commit that referenced this issue Dec 20, 2022
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Dec 20, 2022
…-92015)

(cherry picked from commit 36a0b1d)

Co-authored-by: Stanley <46876382+slateny@users.noreply.github.com>
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
miss-islington added a commit that referenced this issue Dec 20, 2022
(cherry picked from commit 36a0b1d)

Co-authored-by: Stanley <46876382+slateny@users.noreply.github.com>
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
@slateny slateny closed this as completed Dec 20, 2022
jonburdo pushed a commit to jonburdo/cpython that referenced this issue Dec 20, 2022
…2015)

Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
3 participants