
group names of bytes regexes are strings #85152

Closed
qwenger mannequin opened this issue Jun 14, 2020 · 29 comments
Labels
3.8 expert-regex type-bug

Comments


@qwenger qwenger mannequin commented Jun 14, 2020

BPO 40980
Nosy @animalize, @qwenger

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2020-06-14.21:03:34.705>
labels = ['expert-regex', 'type-bug', '3.8']
title = 'group names of bytes regexes are strings'
updated_at = <Date 2020-06-17.08:20:57.392>
user = 'https://github.com/qwenger'

bugs.python.org fields:

activity = <Date 2020-06-17.08:20:57.392>
actor = 'matpi'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Regular Expressions']
creation = <Date 2020-06-14.21:03:34.705>
creator = 'matpi'
dependencies = []
files = []
hgrepos = []
issue_num = 40980
keywords = []
message_count = 27.0
messages = ['371516', '371607', '371614', '371629', '371631', '371633', '371634', '371637', '371638', '371639', '371643', '371644', '371646', '371652', '371657', '371660', '371672', '371676', '371681', '371692', '371696', '371697', '371705', '371709', '371718', '371719', '371720']
nosy_count = 2.0
nosy_names = ['malin', 'matpi']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue40980'
versions = ['Python 3.8']


@qwenger qwenger mannequin commented Jun 14, 2020

I noticed that match.groupdict() returns string keys, even for a bytes regex:

>>> import re
>>> re.match(b"(?P<a>)", b"").groupdict()
{'a': b''}

This seems somewhat strange, because string and bytes matching in re are essentially two separate worlds, cf. the docs:

Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.

@qwenger qwenger mannequin added 3.8 expert-regex type-bug labels Jun 14, 2020

@qwenger qwenger mannequin commented Jun 15, 2020

This also affects functions/methods that take a group name as a parameter (e.g. match.group): the group name has to be passed as a string.
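
A minimal sketch of that behavior (the IndexError shown is what CPython 3.8 raises here; treat the exact exception as an assumption):

>>> import re
>>> m = re.match(b"(?P<a>abc)", b"abc")
>>> m.group("a")      # the name must be a str, even though everything else is bytes
b'abc'
>>> m.group(b"a")     # passing the name as bytes does not work
Traceback (most recent call last):
  ...
IndexError: no such group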


@animalize animalize mannequin commented Jun 16, 2020

Group names being str is very reasonable. Essentially it is just a name; it has nothing to do with bytes.

Other names in Python are also of str type, such as codec names and hashlib algorithm names.


@qwenger qwenger mannequin commented Jun 16, 2020

Agreed to some extent, but there is the difference that group names are embedded in the pattern, which has to be bytes if the target is bytes.

My use case is an all-bytes, no-string project where I construct a large regular expression at startup, with semi-dynamic group names.

So it seems natural to have everything in bytes when concatenating the regular expression, including the group names.

But then the group names I receive back are strings, so I cannot look them up directly in the set of group names that I used to create the expression in the first place.

Of course I can live with it by storing them as strings in the first place and encode()'ing them during concatenation, but it does not feel "natural".

Furthermore, even if it is "just a name", a non-ASCII group name will raise an error in bytes, even if encoded...:

>>> re.compile("(?P<" + "é" + ">)")
re.compile('(?P<é>)')
>>> re.compile(b"(?P<" + "é".encode() + b">)")
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    re.compile(b"(?P<" + "é".encode() + b">)")
  File "/usr/lib/python3.8/re.py", line 252, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
    raise source.error(msg, len(name) + 1)
re.error: bad character in group name 'é' at position 4

So no, it's not really "just a name", considering that in Python "é" should is a valid name.


@qwenger qwenger mannequin commented Jun 16, 2020

should *be a valid name


@animalize animalize mannequin commented Jun 16, 2020

a non-ASCII group name will raise an error in bytes, even if encoded

Looks like this is a language limitation:

    >>> b'é'
      File "<stdin>", line 1
    SyntaxError: bytes can only contain ASCII literal characters.

No problem if you use an escaped character:

    >>> re.match(b'(?P<\xe9>)', b'').groupdict()
    {'é': b''}

This may be an inconvenience in your program, but IMO there is nothing wrong here; maybe this issue can be closed.


@qwenger qwenger mannequin commented Jun 16, 2020

Of course an inconvenience in my program is not per se the reason to change the language. I just wanted to motivate that the current situation gives unexpected results.

"\xe9" doesn't look like proper utf-8 to me:

>>> "é".encode("latin-1")
b'\xe9'
>>> "é".encode()
b'\xc3\xa9'

Let's try another one: how would you go about using Δ ("\u0394") as a group name?

>>> "Δ".encode()
b'\xce\x94'
>>> "Δ".encode("latin-1")
Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    "Δ".encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0394' in position 0: ordinal not in range(256)
>>> re.match(b'(?P<\xce\x94>)', b'').groupdict()
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    re.match(b'(?P<\xce\x94>)', b'').groupdict()
  File "/usr/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
    raise source.error(msg, len(name) + 1)
re.error: bad character in group name 'Î\x94' at position 4
>>> re.match(b'(?P<\u0394>)', b'').groupdict()
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    re.match(b'(?P<\u0394>)', b'').groupdict()
  File "/usr/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
    raise source.error(msg, len(name) + 1)
re.error: bad character in group name '\\u0394' at position 4


@animalize animalize mannequin commented Jun 16, 2020

latin-1 is the character set covering Unicode code points \u0000 to \u00ff, and its characters are mapped directly to/from bytes.

So b'\xe9' is mapped to \u00e9, which is é.

Of course, characters with Unicode code point greater than 0xff are impossible to appear in bytes.
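
A small sketch of this direct mapping (pure codec behavior, nothing specific to re):

    >>> "é".encode("latin-1")      # code point U+00E9 becomes the single byte 0xE9
    b'\xe9'
    >>> b'\xe9'.decode("latin-1")  # and the byte maps straight back to the same code point
    'é'
    >>> bytes(range(256)).decode("latin-1") == "".join(map(chr, range(256)))
    True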


@qwenger qwenger mannequin commented Jun 16, 2020

So b'\xe9' is mapped to \u00e9, which is é.

Yes, but \xe9 is not strictly valid utf-8, or rather not the canonical utf-8 representation of "é". So there is no way to get \xe9 starting from é without leaving utf-8, and hence, starting with é as a group name, I cannot programmatically encode it into a bytes pattern.

Of course, characters with Unicode code point greater than 0xff are impossible to appear in bytes.

But \xce and \x94 are both lower than \xff, yet using \xce\x94 ("Δ".encode()) in a group name fails.

According to the doc, the sole constraint on group names is that they have to be valid and unique Python identifiers. So this should work:

# Δ is a valid identifier
>>> "Δ".isidentifier()
True
>>> Δ = 1
>>> Δ
1
>>> import re
>>> name = "Δ"
>>> re.match(b"(?P<" + name.encode() + b">)", b"")
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    re.match(b"(?P<" + name.encode() + b">)", b"")
  File "/usr/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
    raise source.error(msg, len(name) + 1)
re.error: bad character in group name 'Î\x94' at position 4


@animalize animalize mannequin commented Jun 16, 2020

In this case, you can only use 'latin-1', which directly maps one character (\u0000-\u00FF) to/from one byte.

If you use 'utf-8', it may map one character to multiple bytes, e.g. 'Δ' -> b'\xce\x94'.

'\x94' is not a valid identifier, so it raises an error:

    >>> '\xce'.isidentifier()   # '\xce' is 'Î'
    True
    >>> '\x94'.isidentifier()
    False

You may close this issue (I can't close it); we can continue the discussion.


@qwenger qwenger mannequin commented Jun 16, 2020

But Δ has no latin-1 representation. So Δ currently cannot be used as a group name in a bytes regex, although it is a valid Python identifier. So that's a bug.

I mean, if you insist on having group names as strings even for bytes regexes, then it is not reasonable to prevent them from going _in_.

b"(??<\xce\x94>)" is a valid utf-8-encoded bytestring, why wouldn't you accept it as a valid re pattern?

IMHO, either

  • group names from bytes regexes should be returned as bytes
  • or any utf-8-encoded representation of a valid Python identifier should be accepted as a group name of a bytes regex pattern.


@qwenger qwenger mannequin commented Jun 16, 2020

Sorry, b"(?P<\xce\x94>)"


@qwenger qwenger mannequin commented Jun 16, 2020

The issue with the second variant is that utf-8 is an arbitrary (although default) choice.

But re is already making that same arbitrary choice when decoding the group names into a string, which is my original complaint!


@animalize animalize mannequin commented Jun 16, 2020

It seems you are missing some knowledge of encodings.

Naturally, bytes cannot contain a character whose Unicode code point is greater than \u00ff. So you can only use the "latin-1" encoding, which maps each character to a byte (and back) directly.

"utf-8", "utf-16" and "utf-32" are all encoding codecs; "utf-8" should not have a special status here.


@qwenger qwenger mannequin commented Jun 16, 2020

It seems you are missing some knowledge of encodings.

I don't have to be ashamed of my knowledge of encodings. Yet you are right that I was missing a subtlety, namely that latin-1 is a strict subset of Unicode rather than a completely arbitrary encoding. Thank you for that.

So what you are saying is that group names in bytes regexes can only be specified directly (without *explicit* encoding), so de facto they are limited to the latin-1 subset.

Very well.

But then, once again:

  1. why convert them to string when spitting them out? bytes they were when going in, bytes they should remain... **By converting them you are choosing an arbitrary encoding, even if it is the "natural" one.**
  2. this limitation to the latin-1 subset is not compatible with the documentation, which says that valid Python identifiers are valid group names. If this was really the case, then I would expect to be able to use any string for which .isidentifier() is true as a group name, programmatically.


@qwenger qwenger mannequin commented Jun 16, 2020

Let me prove my point that the decoding to a string is arbitrary:

>>> import re
>>> orig_name = "Ř"
>>> orig_ch = orig_name.encode("cp1250") # Because why not?
>>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
>>> name == orig_name
False
>>> name
'Ø'
>>> name.encode("latin-1") == orig_ch
True

For any dynamically-constructed bytes regex pattern, a string group name as output is unusable. Only after latin-1-reencoding can it be safely compared. This latin-1 choice is arbitrary.
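
(To be concrete, the re-encoding I mean is something like this hypothetical helper; the name is made up for illustration:)

>>> import re
>>> def bytes_groupdict(match):
...     # undo the parser's internal latin-1 decoding so the keys are bytes again
...     return {name.encode("latin-1"): value
...             for name, value in match.groupdict().items()}
...
>>> m = re.match(b"(?P<" + "Ř".encode("cp1250") + b">x)", b"x")
>>> bytes_groupdict(m)
{b'\xd8': b'x'}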


@animalize animalize mannequin commented Jun 16, 2020

this limitation to the latin-1 subset is not compatible with the documentation, which says that valid Python identifiers are valid group names.

Not all latin-1 characters are valid identifiers, for example:

    >>> '\x94'.encode('latin1')
    b'\x94'
    >>> '\x94'.isidentifier()
    False

There is a workaround: you can convert the bytes to str with the "latin-1" codec before processing. IIRC there is no extra overhead (memory/speed) during processing, and then the names and the content are of the same type. :)
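
Something like this sketch (the pattern and data are just illustrative):

    >>> import re
    >>> data = b"key=value"
    >>> pattern = b"(?P<k>\\w+)=(?P<v>\\w+)".decode("latin-1")  # lossless for any bytes
    >>> m = re.match(pattern, data.decode("latin-1"))
    >>> m.groupdict()
    {'k': 'key', 'v': 'value'}
    >>> m.group("v").encode("latin-1")  # back to bytes if needed
    b'value'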


@animalize animalize mannequin commented Jun 16, 2020

Please look at these:

    >>> orig_name = "Ř"
    >>> orig_ch = orig_name.encode("cp1250") # Because why not?
    >>> orig_ch
    b'\xd8'
    >>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
    >>> name
    'Ø'  # '\xd8'
    >>> name == orig_name
    False
    >>> name.encode("latin-1")
    b'\xd8'
    >>> name.encode("latin-1") == orig_ch
    True

"Ř" (\u0158) --cp1250--> b'\xd8'
"Ø" (\u00d8) --latin-1--> b'\xd8'


@qwenger qwenger mannequin commented Jun 16, 2020

> this limitation to the latin-1 subset is not compatible with the documentation, which says that valid Python identifiers are valid group names.

Not all latin-1 characters are valid identifiers, for example:

    >>> '\x94'.encode('latin1')
    b'\x94'
    >>> '\x94'.isidentifier()
    False

True, but that's not the point. Δ is a valid Python identifier but not a valid group name in bytes regexes, because it is not in the latin-1 range. The documentation does not mention this.

There is a workaround: you can convert the bytes to str with the "latin-1" codec before processing. IIRC there is no extra overhead (memory/speed) during processing, and then the names and the content are of the same type. :)

I am not looking for a workaround for my current code.

And the simplest workaround is to latin-1-convert back to bytes, because re should not latin-1-convert to string in the first place.

Are you saying that the proper way to use bytes regexes is to use string regexes instead?

Please look at these:

    >>> orig_name = "Ř"
    >>> orig_ch = orig_name.encode("cp1250") # Because why not?
    >>> orig_ch
    b'\xd8'
    >>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
    >>> name
    'Ø'  # '\xd8'
    >>> name == orig_name
    False
    >>> name.encode("latin-1")
    b'\xd8'
    >>> name.encode("latin-1") == orig_ch
    True

"Ř" (\u0158) --cp1250--> b'\xd8'
"Ø" (\u00d8) --latin-1--> b'\xd8'

That's no surprise; I carefully crafted this example. :-)

Rather, that is exactly my point: several different strings (which can all be valid Python identifiers) can have the same single-byte representation, simply by means of different encodings (duh).

So why convert group names to strings when outputting them from matches, when you don't know where the bytes come from, or even whether they ever were strings? That should be left to the programmer.


@qwenger qwenger mannequin commented Jun 16, 2020

And there's no need for an exotic encoding like cp1250 for this problem to arise. Here is a simple example with Python's default encoding, utf-8:

>>> a = "ú"
>>> b = list(re.match(b"(?P<" + a.encode() + b">)", b"").groupdict())[0]
>>> a.isidentifier()
True
>>> b.isidentifier()
True
>>> b
'ú'
>>> a.encode() == b.encode("latin1")
True

For reference, here is the very source of the issue: https://github.com/python/cpython/blob/master/Lib/sre_parse.py#L228
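
What that code does with a bytes pattern is roughly this (a paraphrase, not a verbatim copy of sre_parse.py; the function name is illustrative):

# paraphrase of the linked tokenizer logic
def _decode_pattern(pattern):
    if isinstance(pattern, (bytes, bytearray)):
        # the whole pattern -- group names included -- is decoded with latin-1,
        # which is why groupdict()/groupindex keys come back as str
        return pattern.decode("latin-1")
    return pattern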


@qwenger qwenger mannequin commented Jun 16, 2020

The problem can also be played in reverse, maybe it is more telling:

# consider the following bytestring pattern
>>> p = b"(?P<\xc3\xba>)"

# what character does the group name correspond to?
# maybe we can try to infer it by decoding the bytestring?
# let's try to do it with the default encoding... that's natural, right?
>>> p.decode()
'(?P<ú>)'

# so we can reasonably expect the group name to be ú, right?
>>> list(re.compile(p).groupindex.keys()).pop()
'ú'

# Fail.


@qwenger qwenger mannequin commented Jun 16, 2020

You questioned my knowledge of encodings. Let's quote from one of the most famous introductory articles on the subject (https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/):

It does not make sense to have a string without knowing what encoding it uses

So I have a bytestring that comes from somewhere; maybe it was originally utf-8- or cp1250- or ...-encoded, but I won't tell or don't know; the only thing I swear is that it originally was a valid Python identifier.
Now I pass it as a group name in re.match (it was a valid Python identifier, so that has to be alright per the docs) and I get back a (unicode) string.
re.match, how dare you give me back a string when _you have no clue what my bytestring originally represented, resp. what it was originally encoded with_?
Maybe re.match will even crash, because it wrongly assumes the bytestring to have been latin-1-encoded!

So: latin-1 is an arbitrary choice that is no better than any other, and the fact that it "naturally" converts bytes to Unicode code points is an implementation detail.
If you want to keep it so, it ought (cf. the quote above) to be made clear in the docs that group names come out as latin-1-decoded strings, with all the restrictions that follow from that choice.
But the more logical way would be to renounce this arbitrary encoding altogether.


@qwenger qwenger mannequin commented Jun 17, 2020

I just had an "aha" moment: what re effectively claims is that, rather than doing as I suggested:

# consider the following bytestring pattern
>>> p = b"(?P<\xc3\xba>)"

# what character does the group name correspond to?
# maybe we can try to infer it by decoding the bytestring?
# let's try to do it with the default encoding... that's natural, right?
>>> p.decode()
'(?P<ú>)'

the actual way to know what group name is represented would be to look at the (unicode) string with the same "graphical representation":

# consider the following bytestring pattern
>>> p = b"(?P<\xc3\xba>)"

# what character does the group name correspond to?
# to discover it, we instead consider the string that "looks the same":
>>> "(?P<\xc3\xba>)"
'(?P<ú>)'

# ok so the group name will be "ú"

This way of going from bytes to strings _naively_ (which happens to be called latin-1) makes IMHO as much sense as saying that 0x10, 0b10 and 0o10 should be the same value, just because they "look the same" in the source code.

This is like throwing away everything we ever learned about Unicode and how a code point is fundamentally different from what is stored in memory.


@animalize animalize mannequin commented Jun 17, 2020

Why do you always want to use a "utf-8"-encoded identifier as a group name in a bytes pattern?

The direction is: a group name is written in the bytes pattern and gets converted to str. Not this direction: `str` group name -(utf-8)-> `bytes` pattern -> `str` group name.


@qwenger qwenger mannequin commented Jun 17, 2020

Because utf-8 is Python's default encoding, e.g. in source files, decode() and encode(). Literally everywhere.

If you ask around "I have a bytestring, I need a string, what do I do?", using latin-1 will not be the first answer (and moreover, the correct answer should be "it depends on the encoding", which re happily ignores by just asserting one).

Saying "just strip that b prefix, it's fine" cannot be taken seriously.

Yes, latin-1 will never give an error when converting a bytestring, because it has full coverage of the 256 byte values, but saying that this is the reason why it should be used instead of another encoding is forgetting why we have Unicode in the first place. **It is just pretending that Unicode never was a thing.** The fact that it can decode any bytestring does not mean it will not return garbage _when the bytestring was not latin-1-encoded in the first place_.

Take a look at the documentation: https://docs.python.org/3/howto/unicode.html
There are 7 references to latin-1, none saying that latin-1 is the way to go because it is so much better than anything else.

latin-1 used to be prominent in the 2.x world; it is time to recognize that this is over, and that we can no longer ignore that encoding matters.


@qwenger qwenger mannequin commented Jun 17, 2020

If I am not supposed to think about the str -> bytes direction, then re should first stop going in the other direction.

When I have bytes regexes I actually don't care about strings and would happily receive group names as bytes. But no, re decides that latin-1 is the way to go, and in doing so it 1) reduces my freedom in the choice of group names, and 2) forces me to go read the internals to find out that the encoding it arbitrarily chose is latin-1, so that I can undo it properly and get back what I always wanted: a bytes group name.


@qwenger qwenger mannequin commented Jun 17, 2020

bytes are _not_ Unicode code points, not even in the 0-255 range. End of story.
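
(In code, trivially:)

>>> b"\xe9"[0]       # indexing bytes gives an int, not a character
233
>>> "é"[0]           # indexing str gives a one-character str, i.e. a code point
'é'
>>> "é" == b"\xe9"   # and a str never compares equal to a bytes object
False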

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022

@serhiy-storchaka serhiy-storchaka commented Apr 18, 2022

Latin-1 is an implementation detail of the RE parser. There is no deep meaning in this.

Group names are meant to be human-readable; this is why they are limited to identifiers. Non-ASCII characters in a bytes pattern are not human-readable. I think we should only allow ASCII-only identifiers as group names in bytes patterns. The question is whether this needs a deprecation period.
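
A sketch of that stricter rule (a proposal, not current behavior; the helper name is made up):

def is_acceptable_bytes_group_name(name: str) -> bool:
    # only ASCII identifiers would be accepted as group names in bytes patterns
    return name.isidentifier() and name.isascii()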


@serhiy-storchaka serhiy-storchaka commented Apr 21, 2022

Closing this issue as "not a bug". See #91760 for stricter rules which can eliminate the confusion.
