Tokenize does not roundtrip {{ after \n #125008
Comments
Furthermore, here is the output of the following code:

```python
import tokenize, io

source_code = r'f"\n{{test}}"'
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
for t in tokens:
    print(t)
```

```
TokenInfo(type=61 (FSTRING_START), string='f"', start=(1, 0), end=(1, 2), line='f"\\n{{test}}"')
TokenInfo(type=62 (FSTRING_MIDDLE), string='\\n{', start=(1, 2), end=(1, 5), line='f"\\n{{test}}"')
TokenInfo(type=62 (FSTRING_MIDDLE), string='test}', start=(1, 6), end=(1, 11), line='f"\\n{{test}}"')
TokenInfo(type=63 (FSTRING_END), string='"', start=(1, 12), end=(1, 13), line='f"\\n{{test}}"')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 13), end=(1, 14), line='f"\\n{{test}}"')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
```

So it seems that the line is getting in alright, but the `\n{{` is getting turned into `\n{` in the tokenizer somehow. Same erroneous output for the bytes version (with an rb-string, BytesIO, and tokenize.tokenize).
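For completeness, here is the roundtrip through `tokenize.untokenize` itself (a minimal sketch following the same pattern; on affected 3.12 releases one `{` is dropped, while patched versions return the source unchanged):

```python
import io
import tokenize

source = r'f"\n{{test}}"'
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
# untokenize in 2-tuple ("compat") mode rebuilds source text from
# (type, string) pairs; this is where the brace re-escaping happens.
roundtripped = tokenize.untokenize((tok.type, tok.string) for tok in tokens)
print(repr(roundtripped))
```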
It looks like this was a regression in Python 3.12; I can't reproduce the behaviour with Python 3.11. I'm guessing it was caused by the PEP-701 changes.
Reproduced on the
This seems to happen with other escape characters as well:

```python
import tokenize, io

source_code = r'f"""\t{{test}}"""'
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
x = tokenize.untokenize((t, s) for t, s, *_ in tokens)
print(x)  # f"""\t{test}}"""
```

```python
import tokenize, io

source_code = r'f"""\r{{test}}"""'
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
x = tokenize.untokenize((t, s) for t, s, *_ in tokens)
print(x)  # f"""\r{test}}"""
```
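A quick sweep over a few single-character escapes before `{{` (a sketch, assuming the same generate_tokens/untokenize pattern as the snippets above) shows the failure is systematic, not specific to `\n`:

```python
import io
import tokenize

# Sweep several single-character escapes directly before "{{"; on
# affected 3.12 releases each of these loses one brace on roundtrip,
# while patched versions roundtrip unchanged.
results = []
for esc in (r"\n", r"\t", r"\r", r"\f"):
    src = 'f"""' + esc + '{{test}}"""'
    toks = tokenize.generate_tokens(io.StringIO(src).readline)
    out = tokenize.untokenize((t, s) for t, s, *_ in toks)
    results.append(out)
    print(src, "->", out)
```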
I think the issue is in this method (lines 187 to 208 at 16cd6cc):
This PR fixed the handling of Unicode literals (e.g. …):

```diff
 if character == "{":
     n_backslashes = sum(
         1 for char in _itertools.takewhile(
             "\\".__eq__,
             characters[-2::-1]
         )
     )
-    if n_backslashes % 2 == 0:
+    if n_backslashes % 2 == 0 or characters[-1] != "N":
         characters.append(character)
     else:
         consume_until_next_bracket = True
```
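To see why the added `characters[-1] != "N"` test matters: counting backslashes alone cannot distinguish `\n{` (a newline escape followed by a brace that must be doubled) from `\N{` (the opening of a named-character escape such as `\N{BULLET}`, whose braces must not be doubled). A standalone sketch of the corrected check (`is_named_escape_brace` is a hypothetical helper for illustration, not the CPython source):

```python
import itertools

def is_named_escape_brace(characters):
    """Return True if a '{' appended after `characters` opens a \\N{...} escape."""
    # Count backslashes immediately before the final character (the
    # candidate "N"); an even count means the backslash itself is escaped.
    n_backslashes = sum(
        1 for _ in itertools.takewhile("\\".__eq__, characters[-2::-1])
    )
    return n_backslashes % 2 == 1 and characters[-1:] == ["N"]

print(is_named_escape_brace(list(r"\N")))    # True  -> \N{ opens a named escape
print(is_named_escape_brace(list(r"\n")))    # False -> plain newline escape
print(is_named_escape_brace(list(r"\\N")))   # False -> the backslash is escaped
```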
…onGH-125013) (cherry picked from commit db23b8bb13863fcd88ff91bc22398f8e0312039e) Co-authored-by: Tomas R. <tomas.roun8@gmail.com>
Bug report
Bug description:
Expected:
Got:
Note the absence of a second { in the {{ after the \n — but in no other positions.
Unlike some other roundtrip failures of tokenize, some of which are minor infelicities, this one actually creates a syntactically invalid program on roundtrip, which is quite bad. You get a
SyntaxError: f-string: single '}' is not allowed
when trying to use the results.

CPython versions tested on:
3.12
Operating systems tested on:
Linux, Windows
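As the report notes, the mangled output is not merely different, it no longer compiles. A minimal check (using the broken string quoted in the report above):

```python
# Compiling the broken roundtrip output raises the SyntaxError quoted above.
broken = r'f"\n{test}}"'
err = None
try:
    compile(broken, "<roundtrip>", "eval")
except SyntaxError as exc:
    err = exc
    print("SyntaxError:", exc.msg)
```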
Linked PRs
- `tokenize.untokenize` roundtrip for `\n{{` #125013
- `tokenize.untokenize` roundtrip for `\n{{` (GH-125013) #125020
- `tokenize.untokenize` roundtrip for `\n{{` (GH-125013) #125021