library/re.html: Definition of the behavior of character groups [] is incorrect. #94898

Closed as not planned
@fischbacher

Documentation

A general point here is that regular expressions are an extremely information-dense way to define Chomsky type 3 grammars.
As such, it matters that their definition is precise.

Currently, https://docs.python.org/3/library/re.html says:

"If the first character of the set is '^', all the characters that are not in the set will be matched."

Now, "If the first character of the set is '^'" implies that the character '^' is part of the set (welcome to the tautology club).
So, read literally, this specification says that such a character group could never match a '^'.

This of course conflicts with how everybody proficient with regexps understands this to work.

Commonly, the behavior below would be understood as "the universally expected correct behavior that is good to rely on":

>>> import re
>>> re.match('[^a]', '^')
<re.Match object; span=(0, 1), match='^'>

A correct statement would be:

If a character set definition starts with a '^', then it matches any character that is not matched by the character set definition obtained by stripping the leading '^' and then escaping the new leading character if it is itself a '^'. So, [^^] matches any character except the 'caret' symbol.
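The proposed wording can be checked mechanically. The following is a minimal sketch, not part of the `re` module: `inner_definition` and `check` are hypothetical helper names introduced here for illustration, and the check only covers printable ASCII characters. It builds the "stripped" set described above and confirms that each character matches exactly one of the two sets.

```python
import re


def inner_definition(negated_body: str) -> str:
    """Hypothetical helper: strip the leading '^' from a set body and
    escape the new leading character if it is itself a '^'."""
    if not negated_body.startswith('^'):
        raise ValueError("set body must start with '^'")
    body = negated_body[1:]
    if body.startswith('^'):
        body = '\\' + body
    return body


def check(negated_body: str) -> bool:
    """Return True if, for every printable ASCII character, exactly one of
    [negated_body] and [inner_definition(negated_body)] matches it."""
    negated = re.compile('[' + negated_body + ']')
    plain = re.compile('[' + inner_definition(negated_body) + ']')
    return all(
        (negated.fullmatch(ch) is None) != (plain.fullmatch(ch) is None)
        for ch in map(chr, range(32, 127))
    )


print(check('^a'))      # [^a] versus [a]
print(check('^^'))      # [^^] versus [\^]
print(check('^a-z^'))   # [^a-z^] versus [a-z^]

# The two cases discussed in this issue:
print(re.fullmatch('[^a]', '^') is not None)  # True: the caret is not excluded
print(re.fullmatch('[^^]', '^') is not None)  # False: the caret is excluded
```

The escaping step only matters when the stripped body would again begin with '^'. In every other position a '^' inside a set is already treated as a literal character.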

Labels: docs, pending, topic-regex
