
gh-102856: Initial implementation of PEP 701 #102855

Open · wants to merge 89 commits into base: main
Conversation

pablogsal
Member

@pablogsal pablogsal commented Mar 20, 2023

@pablogsal pablogsal changed the title Initial implementation of PEP 701 gh-102856: Initial implementation of PEP 701 Mar 20, 2023
@pablogsal pablogsal force-pushed the fstring-grammar-rebased-after-sprint branch from 7cb2e44 to ed0ef34 Compare March 20, 2023 22:29
Parser/tokenizer.h (outdated review thread, resolved)
Co-authored-by: Eclips4 <80244920+Eclips4@users.noreply.github.com>
Parser/tokenizer.h (outdated review thread, resolved)
Eclips4 and others added 21 commits March 21, 2023 19:29
Infer format specifier enclosing automatically
…-sprint' into fstring-grammar-rebased-after-sprint
…sprint

test: Fix fstring related tests in test_tokenize.py
@pablogsal pablogsal marked this pull request as ready for review April 13, 2023 10:05
Grammar/python.gram (outdated review thread, resolved)
@sunmy2019
Contributor

One issue is that, with the current grammar,

f"{lambda x:{123}}"

will be recognized as a valid lambda, but

f"{lambda x: {123}}"
f"{lambda x:{123} }"

won't. This will definitely confuse users.

I can't figure out an elegant way to fix this with the current tokens. Since the in_format_spec info only exists while the token is being tokenized, the information is lost once we exit that state.

One workaround is to emit an empty fstring_middle to prevent any further match by the lambdef.

Another workaround is to add 2 tokens, FSTRING_REPLACEMENT_FIELD_START/END; this preserves the in_format_spec info when passed to the parser.
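For reference, released CPython sidesteps the colon ambiguity by rejecting bare lambdas inside replacement fields entirely; a parenthesized lambda works, since its colon can no longer be confused with a format-spec colon. A quick check:

```python
# The colon ambiguity in practice: a bare lambda's ":" collides with the
# format-spec ":", so CPython rejects the bare form; parentheses disambiguate.
try:
    compile('f"{lambda x: x}"', "<test>", "eval")
    print("bare lambda: accepted")
except SyntaxError:
    print("bare lambda: SyntaxError")

print(f"{(lambda x: x)(41) + 1}")  # parenthesized lambda works: prints 42
```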

@pablogsal
Member Author

@sunmy2019 the changes may also make test_tokenize.TestRoundtrip.test_random_files fail for some cases, but that may be an older failure.

@sunmy2019
Contributor

> @sunmy2019 the changes may also make test_tokenize.TestRoundtrip.test_random_files fail for some cases, but that may be an older failure.

I ran the CPU-heavy tests yesterday and found this failure. See here: pablogsal#67 (comment)

Both the tokenize and the untokenize functions need a rewrite.
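The invariant that TestRoundtrip exercises can be sketched by hand; a minimal example on trivial input (plain code here, since the failing cases involve f-strings):

```python
import io
import tokenize

# Round-trip check in the spirit of test_tokenize.TestRoundtrip:
# untokenize over full 5-tuples must reproduce the source exactly.
src = "x = 1 + 2\n"
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
assert tokenize.untokenize(tokens) == src
print("round-trip ok")
```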

@pablogsal pablogsal added the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 13, 2023
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 18f69e6 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 13, 2023
@pablogsal
Member Author

We are almost there! We have a failing test in some buildbots:

https://buildbot.python.org/all/#/builders/802/builds/760

I cannot reproduce it on my Mac. Maybe someone has more luck on a Linux system.

@isidentical
Sponsor Member

> I cannot reproduce it on my Mac. Maybe someone has more luck on a Linux system.

No luck on my side either (with a Linux machine + debug build + refleaks) for test_ast/test_fstring/test_tokenize. Trying the whole test suite (is there a specific option I might be missing?)

@lysnikolaou
Contributor

I'm able to reproduce this on a Debian container using Docker on my macOS machine. The problem has to do with code like eval('f""'). When the f-string is too small, either start_char or peek1 (or both) here end up being EOF. For some reason, on this machine with this configuration they're not -1 (EOF), but rather 255, which means that the relevant check in tok_backup fails and we get a fatal error raised from here. I can't explain why they wouldn't be EOF until now, but I'm looking.
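To make the failure mode concrete, here is a minimal Python model of the tok_nextc/tok_backup interaction (a sketch of my own, not the real CPython code; the Tok class and its nextc/backup methods are hypothetical names): backing up the real EOF is a harmless no-op, but backing up a value that no longer compares equal to EOF trips the "wrong character" check.

```python
EOF = -1  # matches the int EOF from C's <stdio.h>

class Tok:
    """Toy model of struct tok_state: a buffer and a cursor."""
    def __init__(self, buf):
        self.buf = buf
        self.pos = 0

    def nextc(self):
        # tok_nextc: return the next character, or EOF at end of input.
        if self.pos >= len(self.buf):
            return EOF
        c = ord(self.buf[self.pos])
        self.pos += 1
        return c

    def backup(self, c):
        # tok_backup: a no-op for EOF; otherwise rewind and verify the
        # character being pushed back matches what is in the buffer.
        if c != EOF:
            self.pos -= 1
            if ord(self.buf[self.pos]) != c:
                raise SystemError("tok_backup: wrong character")

tok = Tok('f"a"')
while tok.nextc() != EOF:
    pass
tok.backup(EOF)      # fine: backing up EOF is a no-op
try:
    tok.backup(255)  # EOF that arrived as 255: cursor rewinds onto '"'
except SystemError as e:
    print(e)         # tok_backup: wrong character
```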

@lysnikolaou
Contributor

More info. When running it with Python, I get the following error:

root@9ee555036b0f:/usr/src/cpython# cat t.py
eval('f"a"')
root@9ee555036b0f:/usr/src/cpython# ./python t.py
Fatal Python error: tok_backup: tok_backup: wrong character
Python runtime state: initialized

Current thread 0x0000ffff9de38750 (most recent call first):
  File "/usr/src/cpython/t.py", line 1 in <module>
Aborted

Here's a simple step through tok_get_fstring_mode in gdb, in the last pass that generates the error:

(gdb) file ./python
Reading symbols from ./python...
warning: File "/usr/src/cpython/python-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /usr/src/cpython/python-gdb.py
line to your configuration file "/root/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/root/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
(gdb) break tok_get_fstring_mode
Breakpoint 1 at 0x17119c: file Parser/tokenizer.c, line 2442.
(gdb) r t.py
Starting program: /usr/src/cpython/python t.py
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".

Breakpoint 1, tok_get_fstring_mode (tok=0xaaab0ecfec60, current_tok=0xaaab0ecff7d0, token=0xffffc95f8278) at Parser/tokenizer.c:2442
2442	{
(gdb) c
Continuing.

Breakpoint 1, tok_get_fstring_mode (tok=0xaaab0ecfec60, current_tok=0xaaab0ecff7d0, token=0xffffc95f8278) at Parser/tokenizer.c:2442
2442	{
(gdb) p tok->cur
$1 = 0xffff883fc1e3 "\""
(gdb) p tok->buf
$2 = 0xffff883fc1e0 "f\"a\""
(gdb) n
2448	    tok->start = tok->cur;
(gdb)
2449	    tok->first_lineno = tok->lineno;
(gdb)
2450	    tok->starting_col_offset = tok->col_offset;
(gdb)
2454	    char start_char = tok_nextc(tok);
(gdb)
2455	    char peek1 = tok_nextc(tok);
(gdb) p start_char
$3 = 34 '"'
(gdb) s
tok_nextc (tok=0xaaab0ecfec60) at Parser/tokenizer.c:1169
1169	{
(gdb) n
1172	        if (tok->cur != tok->inp) {
(gdb)
1176	        if (tok->done != E_OK) {
(gdb)
1179	        if (tok->fp == NULL) {
(gdb)
1180	            rc = tok_underflow_string(tok);
(gdb) s
tok_underflow_string (tok=0xaaab0ecfec60) at Parser/tokenizer.c:965
965	tok_underflow_string(struct tok_state *tok) {
(gdb) list
960	    } while (tok->inp[-1] != '\n');
961	    return 1;
962	}
963
964	static int
965	tok_underflow_string(struct tok_state *tok) {
966	    char *end = strchr(tok->inp, '\n');
967	    if (end != NULL) {
968	        end++;
969	    }
(gdb)
970	    else {
971	        end = strchr(tok->inp, '\0');
972	        if (end == tok->inp) {
973	            tok->done = E_EOF;
974	            return 0;
975	        }
976	    }
977	    if (tok->start == NULL) {
978	        tok->buf = tok->cur;
979	    }
(gdb) n
966	    char *end = strchr(tok->inp, '\n');
(gdb)
967	    if (end != NULL) {
(gdb)
971	        end = strchr(tok->inp, '\0');
(gdb)
972	        if (end == tok->inp) {
(gdb)
973	            tok->done = E_EOF;
(gdb)
974	            return 0;
(gdb)
tok_nextc (tok=0xaaab0ecfec60) at Parser/tokenizer.c:1189
1189	        if (tok->debug) {
(gdb) list
1184	        }
1185	        else {
1186	            rc = tok_underflow_file(tok);
1187	        }
1188	#if defined(Py_DEBUG)
1189	        if (tok->debug) {
1190	            fprintf(stderr, "line[%d] = ", tok->lineno);
1191	            print_escape(stderr, tok->cur, tok->inp - tok->cur);
1192	            fprintf(stderr, "  tok->done = %d\n", tok->done);
1193	        }
(gdb)
1194	#endif
1195	        if (!rc) {
1196	            tok->cur = tok->inp;
1197	            return EOF;
1198	        }
1199	        tok->line_start = tok->cur;
1200
1201	        if (contains_null_bytes(tok->line_start, tok->inp - tok->line_start)) {
1202	            syntaxerror(tok, "source code cannot contain null bytes");
1203	            tok->cur = tok->inp;
(gdb) n
1195	        if (!rc) {
(gdb)
1196	            tok->cur = tok->inp;
(gdb)
1197	            return EOF;
(gdb)
tok_get_fstring_mode (tok=0xaaab0ecfec60, current_tok=0xaaab0ecff7d0, token=0xffffc95f8278) at Parser/tokenizer.c:2456
2456	    tok_backup(tok, peek1);
(gdb) p peek1
$4 = 255 '\377'

@isidentical
Sponsor Member

isidentical commented Apr 13, 2023

> When the f-string is too small, it results in either start_char or peek1 or both here to be EOF.

Oh, this kind of makes sense, at least as to how we got there. I wonder whether we could look at peek1 only when start_char is { or }. This would avoid the secondary tok_nextc/tok_backup pair when the string is too small.

E.g. something like this (just as a hack to test if it works):

diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index d88d737860..34f291cf89 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -2452,8 +2452,14 @@ tok_get_fstring_mode(struct tok_state *tok, tokenizer_mode* current_tok, struct
     // If we start with a bracket, we defer to the normal mode as there is nothing for us to tokenize
     // before it.
     char start_char = tok_nextc(tok);
-    char peek1 = tok_nextc(tok);
-    tok_backup(tok, peek1);
+    char peek1;
+    if (start_char == '{' || start_char == '}') {
+        peek1 = tok_nextc(tok);
+        tok_backup(tok, peek1);
+    }
+    else {
+        peek1 = '0';
+    }
     tok_backup(tok, start_char);
 
     if ((start_char == '{' && peek1 != '{') || (start_char == '}' && peek1 != '}')) {

For me, this fixes eval('f"a"').

@pablogsal pablogsal added the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 13, 2023
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit d28efe1 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Apr 13, 2023
@lysnikolaou
Contributor

lysnikolaou commented Apr 13, 2023

>> When the f-string is too small, it results in either start_char or peek1 or both here to be EOF.
>
> Oh, this kind of makes sense. At least on how we got there. I wonder whether we could simply look at the peek1 if the start_char is {/}. This would prevent the secondary tok_nextc/tok_backup pair when in case the string is too small.

Not sure whether this is the actual problem though. tok_backup is able to handle EOF, and on the other platforms we're testing everything seems to work. The reason is that every check fails until we reach this, which should be able to handle things correctly.

The big question to me is how we end up with peek1 == 255 when it very clearly came from return EOF, and why the subsequent c == EOF check in tok_backup fails.
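One plausible answer (my inference, not confirmed in the thread): the gdb session above is on aarch64, where plain char is unsigned by default. tok_nextc returns the int EOF (-1), but storing it into char peek1 truncates it to one byte; that byte reads back as 255 when char is unsigned, so c == EOF is false in tok_backup. The truncation can be sketched in Python via struct:

```python
import struct

# EOF in C's <stdio.h> is typically the int -1. Storing it into a plain
# `char` truncates it to one byte (0xff); how that byte reads back depends
# on whether `char` is signed (e.g. x86) or unsigned (e.g. aarch64 Linux).
eof = -1
byte = struct.pack("b", eof)                # the truncated byte: b'\xff'
print(struct.unpack("b", byte)[0])          # signed char:   -1  (== EOF)
print(struct.unpack("B", byte)[0])          # unsigned char: 255 (!= EOF)
```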

8 participants