
gh-102856: Initial implementation of PEP 701 #102855

Open · wants to merge 89 commits into base: main
Conversation

pablogsal
Member

@pablogsal pablogsal commented Mar 20, 2023

@pablogsal pablogsal changed the title Initial implementation of PEP 701 gh-102856: Initial implementation of PEP 701 Mar 20, 2023
@pablogsal pablogsal force-pushed the fstring-grammar-rebased-after-sprint branch from 7cb2e44 to ed0ef34 Compare March 20, 2023 22:29
Parser/tokenizer.h (outdated review thread, resolved)
Co-authored-by: Eclips4 <80244920+Eclips4@users.noreply.github.com>
Parser/tokenizer.h (outdated review thread, resolved)
Eclips4 and others added 21 commits March 21, 2023 19:29
Infer format specifier enclosing automatically
…-sprint' into fstring-grammar-rebased-after-sprint
…sprint

test: Fix fstring related tests in test_tokenize.py
@pablogsal pablogsal marked this pull request as ready for review April 13, 2023 10:05
Grammar/python.gram (outdated review thread, resolved)
@sunmy2019
Contributor

One issue is that, with the current grammar,

f"{lambda x:{123}}"

will be recognized as a valid lambda, but

f"{lambda x: {123}}"
f"{lambda x:{123} }"

won't. This will definitely confuse users.

I can't figure out an elegant way to fix this with the current tokens. Since the in_format_spec info only exists while the token is being tokenized, the information is lost once we exit that state.

One workaround is to emit an empty fstring_middle to prevent any further match by the lambdef.

Another workaround is to add 2 tokens, FSTRING_REPLACEMENT_FIELD_START/END; this preserves the in_format_spec info when passed to the parser.
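For reference, released CPython sidesteps the colon ambiguity by rejecting bare lambdas inside replacement fields entirely; a parenthesized lambda works, since its colon can no longer be confused with a format-spec colon. A quick check:

```python
# The colon ambiguity in practice: a bare lambda's ":" collides with the
# format-spec ":", so CPython rejects the bare form; parentheses disambiguate.
try:
    compile('f"{lambda x: x}"', "<test>", "eval")
    print("bare lambda: accepted")
except SyntaxError:
    print("bare lambda: SyntaxError")

print(f"{(lambda x: x)(41) + 1}")  # parenthesized lambda works: prints 42
```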

@pablogsal
Member Author

@sunmy2019 the changes may also make test_tokenize.TestRoundtrip.test_random_files fail for some cases, but that may be an older failure.

@sunmy2019
Contributor

> @sunmy2019 the changes may also make test_tokenize.TestRoundtrip.test_random_files fail for some cases, but that may be an older failure.

I ran the CPU-heavy tests yesterday and found this failure. See here: pablogsal#67 (comment)

Both the tokenize and the untokenize functions need a rewrite.
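The invariant that TestRoundtrip exercises can be sketched by hand; a minimal example on trivial input (plain code here, since the failing cases involve f-strings):

```python
import io
import tokenize

# Round-trip check in the spirit of test_tokenize.TestRoundtrip:
# untokenize over full 5-tuples must reproduce the source exactly.
src = "x = 1 + 2\n"
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
assert tokenize.untokenize(tokens) == src
print("round-trip ok")
```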

@pablogsal pablogsal added the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 13, 2023
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 18f69e6 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 13, 2023
@pablogsal
Member Author

We are almost there! We have a failing test in some buildbots:

https://buildbot.python.org/all/#/builders/802/builds/760

I cannot reproduce it on my Mac. Maybe someone has more luck on a Linux system.

@isidentical
Sponsor Member

> I cannot reproduce it on my Mac. Maybe someone has more luck on a Linux system.

No luck on my side either (with a Linux machine + debug build + refleaks) for test_ast/test_fstring/test_tokenize. Trying the whole test suite (is there a specific option I might be missing?)

@lysnikolaou
Contributor

I'm able to reproduce this on a Debian container using Docker on my macOS machine. The problem has to do with code like eval('f""'). When the f-string is too small, either start_char or peek1 (or both) here end up being EOF. For some reason, on this machine with this configuration they're not -1 (EOF), but rather 255, which means that the relevant check in tok_backup fails and we get a fatal error raised from here. I can't explain why they wouldn't be EOF until now, but I'm looking.
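To make the failure mode concrete, here is a minimal Python model of the tok_nextc/tok_backup interaction (a sketch of my own, not the real CPython code; the Tok class and its nextc/backup methods are hypothetical names): backing up the real EOF is a harmless no-op, but backing up a value that no longer compares equal to EOF trips the "wrong character" check.

```python
EOF = -1  # matches the int EOF from C's <stdio.h>

class Tok:
    """Toy model of struct tok_state: a buffer and a cursor."""
    def __init__(self, buf):
        self.buf = buf
        self.pos = 0

    def nextc(self):
        # tok_nextc: return the next character, or EOF at end of input.
        if self.pos >= len(self.buf):
            return EOF
        c = ord(self.buf[self.pos])
        self.pos += 1
        return c

    def backup(self, c):
        # tok_backup: a no-op for EOF; otherwise rewind and verify the
        # character being pushed back matches what is in the buffer.
        if c != EOF:
            self.pos -= 1
            if ord(self.buf[self.pos]) != c:
                raise SystemError("tok_backup: wrong character")

tok = Tok('f"a"')
while tok.nextc() != EOF:
    pass
tok.backup(EOF)      # fine: backing up EOF is a no-op
try:
    tok.backup(255)  # EOF that arrived as 255: cursor rewinds onto '"'
except SystemError as e:
    print(e)         # tok_backup: wrong character
```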

@lysnikolaou
Contributor

More info. When running it with Python, I get the following error:

root@9ee555036b0f:/usr/src/cpython# cat t.py
eval('f"a"')
root@9ee555036b0f:/usr/src/cpython# ./python t.py
Fatal Python error: tok_backup: tok_backup: wrong character
Python runtime state: initialized

Current thread 0x0000ffff9de38750 (most recent call first):
  File "/usr/src/cpython/t.py", line 1 in <module>
Aborted

Here's a simple step through tok_get_fstring_mode in gdb, in the last pass that generates the error:

(gdb) file ./python
Reading symbols from ./python...
warning: File "/usr/src/cpython/python-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /usr/src/cpython/python-gdb.py
line to your configuration file "/root/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/root/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
(gdb) break tok_get_fstring_mode
Breakpoint 1 at 0x17119c: file Parser/tokenizer.c, line 2442.
(gdb) r t.py
Starting program: /usr/src/cpython/python t.py
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".

Breakpoint 1, tok_get_fstring_mode (tok=0xaaab0ecfec60, current_tok=0xaaab0ecff7d0, token=0xffffc95f8278) at Parser/tokenizer.c:2442
2442	{
(gdb) c
Continuing.

Breakpoint 1, tok_get_fstring_mode (tok=0xaaab0ecfec60, current_tok=0xaaab0ecff7d0, token=0xffffc95f8278) at Parser/tokenizer.c:2442
2442	{
(gdb) p tok->cur
$1 = 0xffff883fc1e3 "\""
(gdb) p tok->buf
$2 = 0xffff883fc1e0 "f\"a\""
(gdb) n
2448	    tok->start = tok->cur;
(gdb)
2449	    tok->first_lineno = tok->lineno;
(gdb)
2450	    tok->starting_col_offset = tok->col_offset;
(gdb)
2454	    char start_char = tok_nextc(tok);
(gdb)
2455	    char peek1 = tok_nextc(tok);
(gdb) p start_char
$3 = 34 '"'
(gdb) s
tok_nextc (tok=0xaaab0ecfec60) at Parser/tokenizer.c:1169
1169	{
(gdb) n
1172	        if (tok->cur != tok->inp) {
(gdb)
1176	        if (tok->done != E_OK) {
(gdb)
1179	        if (tok->fp == NULL) {
(gdb)
1180	            rc = tok_underflow_string(tok);
(gdb) s
tok_underflow_string (tok=0xaaab0ecfec60) at Parser/tokenizer.c:965
965	tok_underflow_string(struct tok_state *tok) {
(gdb) list
960	    } while (tok->inp[-1] != '\n');
961	    return 1;
962	}
963
964	static int
965	tok_underflow_string(struct tok_state *tok) {
966	    char *end = strchr(tok->inp, '\n');
967	    if (end != NULL) {
968	        end++;
969	    }
(gdb)
970	    else {
971	        end = strchr(tok->inp, '\0');
972	        if (end == tok->inp) {
973	            tok->done = E_EOF;
974	            return 0;
975	        }
976	    }
977	    if (tok->start == NULL) {
978	        tok->buf = tok->cur;
979	    }
(gdb) n
966	    char *end = strchr(tok->inp, '\n');
(gdb)
967	    if (end != NULL) {
(gdb)
971	        end = strchr(tok->inp, '\0');
(gdb)
972	        if (end == tok->inp) {
(gdb)
973	            tok->done = E_EOF;
(gdb)
974	            return 0;
(gdb)
tok_nextc (tok=0xaaab0ecfec60) at Parser/tokenizer.c:1189
1189	        if (tok->debug) {
(gdb) list
1184	        }
1185	        else {
1186	            rc = tok_underflow_file(tok);
1187	        }
1188	#if defined(Py_DEBUG)
1189	        if (tok->debug) {
1190	            fprintf(stderr, "line[%d] = ", tok->lineno);
1191	            print_escape(stderr, tok->cur, tok->inp - tok->cur);
1192	            fprintf(stderr, "  tok->done = %d\n", tok->done);
1193	        }
(gdb)
1194	#endif
1195	        if (!rc) {
1196	            tok->cur = tok->inp;
1197	            return EOF;
1198	        }
1199	        tok->line_start = tok->cur;
1200
1201	        if (contains_null_bytes(tok->line_start, tok->inp - tok->line_start)) {
1202	            syntaxerror(tok, "source code cannot contain null bytes");
1203	            tok->cur = tok->inp;
(gdb) n
1195	        if (!rc) {
(gdb)
1196	            tok->cur = tok->inp;
(gdb)
1197	            return EOF;
(gdb)
tok_get_fstring_mode (tok=0xaaab0ecfec60, current_tok=0xaaab0ecff7d0, token=0xffffc95f8278) at Parser/tokenizer.c:2456
2456	    tok_backup(tok, peek1);
(gdb) p peek1
$4 = 255 '\377'

@isidentical
Sponsor Member

isidentical commented Apr 13, 2023

> When the f-string is too small, it results in either start_char or peek1 or both here to be EOF.

Oh, this kind of makes sense, at least as to how we got there. I wonder whether we could look at peek1 only when start_char is { or }. This would avoid the secondary tok_nextc/tok_backup pair when the string is too small.

E.g. something like this (just as a hack to test if it works):

diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index d88d737860..34f291cf89 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -2452,8 +2452,14 @@ tok_get_fstring_mode(struct tok_state *tok, tokenizer_mode* current_tok, struct
     // If we start with a bracket, we defer to the normal mode as there is nothing for us to tokenize
     // before it.
     char start_char = tok_nextc(tok);
-    char peek1 = tok_nextc(tok);
-    tok_backup(tok, peek1);
+    char peek1;
+    if (start_char == '{' || start_char == '}') {
+        peek1 = tok_nextc(tok);
+        tok_backup(tok, peek1);
+    }
+    else {
+        peek1 = '0';
+    }
     tok_backup(tok, start_char);
 
     if ((start_char == '{' && peek1 != '{') || (start_char == '}' && peek1 != '}')) {

For me, this fixes eval('f"a"').

@pablogsal pablogsal added the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 13, 2023
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit d28efe1 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 13, 2023
@lysnikolaou
Contributor

lysnikolaou commented Apr 13, 2023

>> When the f-string is too small, it results in either start_char or peek1 or both here to be EOF.
>
> Oh, this kind of makes sense. At least on how we got there. I wonder whether we could simply look at the peek1 if the start_char is {/}. This would prevent the secondary tok_nextc/tok_backup pair when in case the string is too small.

Not sure whether this is the actual problem though. tok_backup is able to handle EOF, and on the other platforms we're testing everything seems to work. The reason is that every check fails until we reach this, which should be able to handle things correctly.

The big question to me is how we end up with peek1 == 255 when it very clearly came from return EOF, and why the subsequent c == EOF check in tok_backup fails.
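One plausible answer (my inference, not confirmed in the thread): the gdb session above is on aarch64, where plain char is unsigned by default. tok_nextc returns the int EOF (-1), but storing it into char peek1 truncates it to one byte; that byte reads back as 255 when char is unsigned, so c == EOF is false in tok_backup. The truncation can be sketched in Python via struct:

```python
import struct

# EOF in C's <stdio.h> is typically the int -1. Storing it into a plain
# `char` truncates it to one byte (0xff); how that byte reads back depends
# on whether `char` is signed (e.g. x86) or unsigned (e.g. aarch64 Linux).
eof = -1
byte = struct.pack("b", eof)                # the truncated byte: b'\xff'
print(struct.unpack("b", byte)[0])          # signed char:   -1  (== EOF)
print(struct.unpack("B", byte)[0])          # unsigned char: 255 (!= EOF)
```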

8 participants