bpo-35808: Retire pgen and use pgen2 to generate the parser #11814
Conversation
pablogsal requested a review from gvanrossum on Feb 11, 2019
the-knights-who-say-ni added the CLA signed label on Feb 11, 2019
bedevere-bot added the awaiting merge label on Feb 11, 2019
pablogsal self-assigned this on Feb 11, 2019
pablogsal force-pushed the pablogsal:pgen2 branch 3 times, most recently from 23ff976 to 20fcc02 on Feb 11, 2019
pablogsal requested a review from python/windows-team as a code owner on Feb 11, 2019
pablogsal force-pushed the pablogsal:pgen2 branch 2 times, most recently from d60bfe5 to adc73e6 on Feb 11, 2019
ned-deily reviewed on Feb 11, 2019
It doesn't appear that there are changes to configure.ac, so is there a need to churn aclocal.m4 and configure?
pablogsal force-pushed the pablogsal:pgen2 branch from adc73e6 to 5448767 on Feb 11, 2019
Do you need my review? I'm super behind on stuff, so it may be a while, and I don't want to block you.
pablogsal requested a review from ambv on Feb 11, 2019
Hi, in the next line (Line 1736 in 5448767) there is a reference to $(PGEN), which does not exist. By the way, that doesn't give any error because it is empty.
Good catch!
@gvanrossum Don't worry. I will try to get someone else to review, but the fact that the interpreter builds and passes the tests is good assurance that the patch is not fundamentally wrong.
pablogsal requested reviews from vstinner and serhiy-storchaka on Feb 11, 2019
Hum. I'm not super excited by duplicating Lib/lib2to3/pgen2/ as Parser/pgen2/ "with a few changes". It sounds like a good recipe for duplicated maintenance. Would it be possible to modify Lib/lib2to3/pgen2/ to respect your constraints instead? Maybe even with "#ifdef"? (Maybe my question doesn't make sense; the code diverges enough to justify having two "copies" of the "same code".)
@vstinner We could add some branching to the original pgen2, but IMHO I think having the code in ... That being said, if we think it is OK to add branching to pgen2 and invoke that from ... P.S. Thanks for reviewing this big PR!
eamanu reviewed on Feb 11, 2019
Searching for PGEN in the CPython root, I found some issues:
vstinner reviewed on Feb 11, 2019
There should be a way to modify lib2to3.pgen2 just enough to allow a user to change some constants and functions, no? I compared Parser/pgen2/ and Lib/lib2to3/pgen2/ using meld. Most code is duplicated, but "some" is modified. That's my worry.

The 194 lines of grammar.py are almost all the same, except for the opmap_raw constant, which is different. That's why I talked about "#ifdef". Would it be possible to modify lib2to3 to allow modifying opmap_raw? For this specific case, it looks like lib2to3 doesn't support PEP 572 whereas your change adds support for ":="... It means that we have 2 copies of the same file, but one is outdated? What's the point of having an outdated parser which doesn't support PEP 572 in lib2to3? lib2to3 should also be modified to support ":=", no? Can't Parser/pgen2/ reuse Lib/lib2to3/pgen2/grammar.py?

For a practical reason, I would suggest renaming Parser/pgen2/ to Parser/pgen/: it would avoid conflicts with the lib2to3.pgen2 package and it reuses the "pgen" name :-)

Parser/pgen2/pgen.py: multiple changes are there to add support for older Python versions. Would it be acceptable to modify lib2to3 and reuse its pgen.py in Parser/pgen2/?

Parser/pgen2/tokenize.py: most code is the same, but you replace " with ' and make other coding style changes. I'm not against coding style changes, except that here it makes it harder to keep Parser/pgen2/tokenize.py in sync with Lib/lib2to3/pgen2/tokenize.py. Either revert the coding style changes or also update Lib/lib2to3/pgen2/tokenize.py.

Parser/pgen2/token.py: you added BACKQUOTE, removed ELLIPSIS, etc. That's maybe one of the trickiest differences, which may prevent code reuse? Or should the other modules be made configurable to accept a "custom" token module, rather than being "hardcoded" to use a specific "token" module? I'm talking about
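A sketch of the "custom token module" direction mentioned at the end, using the same module-injection trick the PR itself applies further down; the paths here are placeholders, not the PR's actual values:

import importlib.machinery
import sys

# Load a specific token.py and register it under the name 'token', so a
# later "import tokenize" resolves to it instead of the stdlib copy of
# the interpreter running this script.
token = importlib.machinery.SourceFileLoader('token', 'Lib/token.py').load_module()
sys.modules['token'] = token

# tokenize now tokenizes with the injected token numbering.
tokenize = importlib.machinery.SourceFileLoader('tokenize', 'Lib/tokenize.py').load_module()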
bedevere-bot commented on Feb 11, 2019
When you're done making the requested changes, leave the comment:
bedevere-bot added the awaiting changes label and removed the awaiting merge label on Feb 11, 2019
@vstinner I will work on changes that address these issues, trying to reuse pgen2 as much as possible. Will ping you when it is ready :)
zooba reviewed on Feb 11, 2019
Signing off on the changes to the Windows build files. Haven't looked at the rest.
pablogsal reviewed on Feb 11, 2019
**{f'{prefix}"""': double3prog for prefix in _strprefixes},
**{prefix: None for prefix in _strprefixes}}
"'''": single3prog, '"""': double3prog}
endprogs.update({"{}'''".format(prefix): single3prog for prefix in _strprefixes})
pablogsal (Author, Member) on Feb 11, 2019
These are needed because Travis uses an old version of Python for the regeneration step.
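The downgrade in question replaces f-strings and '**'-unpacking in dict displays with '.format()' and '.update()' calls; a self-contained sketch with placeholder values (not the PR's exact code):

import re

# Placeholders standing in for tokenize's compiled triple-quote patterns.
single3prog = re.compile(r"'''")
double3prog = re.compile(r'"""')
_strprefixes = {'r', 'b', 'f'}  # placeholder prefix set

# Python 3.6+ spelling (what the code looked like before the downgrade):
# endprogs = {"'''": single3prog, '"""': double3prog,
#             **{f"{prefix}'''": single3prog for prefix in _strprefixes},
#             **{f'{prefix}"""': double3prog for prefix in _strprefixes}}

# Equivalent spelling that also runs on the older Python Travis uses:
endprogs = {"'''": single3prog, '"""': double3prog}
endprogs.update({"{}'''".format(prefix): single3prog for prefix in _strprefixes})
endprogs.update({'{}"""'.format(prefix): double3prog for prefix in _strprefixes})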
I have made some commits trying to use pgen2, but there are some problems that only show up on Travis and Windows... I will try to find out what is happening.
pablogsal force-pushed the pablogsal:pgen2 branch from 09cef2e to e2ed0d1 on Feb 20, 2019
As I have rebased, the old conversations that were attached to commits can be found here: 80c1217#commitcomment-32381091 and 104b8e7#commitcomment-32380993.

@gvanrossum Ok, I have decided to go with the option you initially suggested and leave
The reasons there is a full implementation of the parser generator in
gvanrossum reviewed on Feb 21, 2019
(Honestly I think at this point you might as well have started over with a new PR. But alas.)
pablogsal force-pushed the pablogsal:pgen2 branch from f386c4f to c986980 on Feb 21, 2019
pablogsal force-pushed the pablogsal:pgen2 branch from c986980 to 3d593ef on Feb 21, 2019
gvanrossum reviewed on Feb 21, 2019
Getting closer. Maybe we can be completely independent from lib2to3.
"grammar", type=str, help="The file with the grammar definition in EBNF format" | ||
) | ||
parser.add_argument( | ||
"gramminit_h", |
# Add token to the module cache so tokenize.py uses that exact one instead of
# the one in the stdlib of the interpreter executing this file.
sys.modules['token'] = token
tokenize = importlib.machinery.SourceFileLoader('tokenize',
gvanrossum (Member) on Feb 21, 2019
This still looks fragile. Why do we need to use the latest tokenize to parse the Grammar file? The “meta” grammar is super simple, it just has NAME, string literals, and some basic punctuation and operators. The tokenize module from Python 2.4 can handle this. :-)
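For context, the meta-grammar being discussed is indeed tiny; Parser/metagrammar.c encodes approximately the following rules (reproduced here as a reference, not quoted from the PR):

MSTART: (NEWLINE | RULE)* ENDMARKER
RULE: NAME ':' RHS NEWLINE
RHS: ALT ('|' ALT)*
ALT: ITEM+
ITEM: '[' RHS ']' | ATOM ['+' | '*']
ATOM: NAME | STRING | '(' RHS ')'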
pablogsal (Author, Member) on Feb 21, 2019

> The tokenize module from Python 2.4 can handle this.

:)

> Why do we need to use the latest tokenize to parse the Grammar file?

It's not that tokenize cannot handle the grammar; it's that if the tokenizer uses different values for the tokens, it fails when constructing the dfas when calling self.parse():
Traceback (most recent call last):
File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/src/Parser/pgen/__main__.py", line 36, in <module>
main()
File "/src/Parser/pgen/__main__.py", line 29, in main
p = ParserGenerator(args.grammar, token_lines, verbose=args.verbose)
File "/src/Parser/pgen/pgen.py", line 20, in __init__
self.dfas, self.startsymbol = self.parse()
File "/src/Parser/pgen/pgen.py", line 173, in parse
self.expect(self.tokens['OP'], ":")
File "/src/Parser/pgen/pgen.py", line 337, in expect
type, value, self.type, self.value)
File "/src/Parser/pgen/pgen.py", line 356, in raise_error
self.end[1], self.line))
File "./Grammar/Grammar", line 13
single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
^
SyntaxError: expected 54/:, got 52/:
This is because OP has the value of 52 in Python 3.5 (in this example) and 54 in the tokens that we construct from Grammar/Tokens (or in Lib/token.py). The value of 52 is yielded by tokenize (from Python 3.5) when calling next(self.generator) in gettoken. Maybe I am missing something here, but that is the problem I found when trying to use the tokenize from the running Python :(
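The version skew is easy to demonstrate with a quick check (a hypothetical illustration, not code from the PR):

import token

# OP's number belongs to the running interpreter's Lib/token.py, so it
# drifts across versions as tokens are added; this is exactly the
# 52 (host Python 3.5) vs. 54 (Grammar/Tokens) clash in the traceback above.
print(token.OP)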
@@ -0,0 +1,97 @@
from lib2to3.pgen2 import grammar
import collections
import importlib.machinery

# Use Lib/token.py and Lib/tokenize.py to obtain the tokens. To maintain this
# compatible with older versions of Python, we need to make sure that we only
# import these two files (and not any of the dependencies of these files).
OK I will try to play with this myself.
I think I have the solution, at least for
Of course! Yeah, I can confirm that this works perfectly. Thanks for your patience in going through this problem with me. :) With this, if we read from
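The fix converged on here matches what Parser/pgen/__main__.py in the traceback above suggests: build the token numbering from the Grammar/Tokens file of the tree being built, not from the host interpreter's token module. A minimal sketch of such a step follows; the file-format handling is an assumption, not the PR's verbatim code:

import itertools

def generate_tokens(lines):
    # Build a name -> number mapping from Grammar/Tokens-style lines,
    # numbering token names in file order and skipping blanks and
    # comments, so the numbering matches the target tree, not the host.
    numbers = itertools.count()
    tokens = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        tokens[line.split()[0]] = next(numbers)
    return tokens

# Hypothetical input lines:
print(generate_tokens(['ENDMARKER', 'NAME', '# comment', 'OP']))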
pablogsal force-pushed the pablogsal:pgen2 branch from 8ff1e44 to b1fae75 on Feb 22, 2019
gvanrossum approved these changes on Feb 22, 2019
Ship it!
bedevere-bot added the awaiting merge label and removed the awaiting changes label on Feb 22, 2019
@vstinner can you re-review this?
@pablogsal You can remove the other requested reviewers who haven't said anything yet.
pablogsal removed review requests for ambv, benjaminp and serhiy-storchaka on Feb 27, 2019
vstinner dismissed their stale review on Mar 1, 2019: "Code changed a lot since I reviewed the PR."
@gvanrossum @pablogsal: Sorry, I was busy with other stuff these last days, so I failed to follow this PR. Well, I reviewed an earlier version of the code, and the code changed a lot in the meanwhile. I trust Guido to be able to review such a change properly. Pablo: please don't wait for me if you want to merge this PR.
@pablogsal, feel free to merge.
pablogsal merged commit 1f24a71 into python:master on Mar 1, 2019. 5 checks passed.
bedevere-bot removed the awaiting merge label on Mar 1, 2019
Whee!
pablogsal commented on Feb 11, 2019 (edited)
This problem turned out to be a bit more complex because pgen2 needed some modifications to make it compatible with the old pgen output (apart from the extra glue to generate graminit.c and graminit.h). I also had to downgrade some f-strings and '**'-unpacking because Travis uses an old Python for the regeneration step.

This PR implements the following:
- Removes pgen and all related files from the source tree and Makefile.
- Replaces pgen with a modified and reduced (only the pgen functionality is included) version of Lib/lib2to3/pgen2:
  - Adds new tokens (like COLONEQUAL, TYPE_IGNORE, etc.) to synchronize with the current Grammar.
  - All dfa names are included in the label list (this has the effect of including 'file_input', 'eval_input', ... in the list of labels, to guarantee compatibility with the old pgen-generated files).
  - The order in which dfas are generated is preserved, to maintain compatibility with old pgen-generated files and avoid empty stacks when the parser processes a rule. If this change is not made, the parser fails with:
"""' ... It's a token we know
DFA 'and_expr', state 0: Push 'shift_expr'
DFA 'shift_expr', state 0: Push 'arith_expr'
DFA 'arith_expr', state 0: Push 'term'
DFA 'term', state 0: Push 'factor'
DFA 'factor', state 0: Push 'power'
DFA 'power', state 0: Push 'atom_expr'
DFA 'atom_expr', state 0: Push 'atom'
DFA 'atom', state 0: Shift.
Token NEWLINE/'' ... It's a token we know
DFA 'atom', state 5: Pop ...
DFA 'atom_expr', state 2: Pop ...
DFA 'power', state 1: Pop ...
DFA 'factor', state 2: Pop ...
DFA 'term', state 1: Pop ...
DFA 'arith_expr', state 1: Pop ...
DFA 'shift_expr', state 1: Pop ...
DFA 'and_expr', state 1: Pop ...
Error: bottom of stack.
    This seems to happen because instead of starting with 'single_input' / 'file_input' etc., it starts with 'and_expr', just because it is the second element in the dfa list in gramminit.c. This makes the parser push tokens from 'and_expr', so before the shift can happen it arrives at an empty stack when popping the nonterminals.
  - First sets are represented with Python sets instead of dictionaries with 1 as the values.
  - Added (optional) verbose output that dumps different elements of the grammar and displays the first sets of every nonterminal.
  - pgen2 is now invoked as: python -m Parser.pgen2 Grammar/Grammar Include/graminit.h Python/graminit.c (optionally add '-v' for verbose output).
  - pgen included "<>" in the list of labels but pgen2 does not, as it thinks it is already in the list of tokens (it collides with the value of "!="). This is not a problem because this label is a terminal that is handled manually, and Parser/token.c produces the token NOTEQUAL anyway. This is why there are now 183 labels where pgen generated 184 (see the check after this list).
- Changes the regen-grammar target to use pgen2 instead of pgen.
- Regenerates Makefiles using autoreconf.
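The "<>" collision described above can be verified directly against lib2to3 on an interpreter that still ships it; this illustrative check is not part of the PR:

from lib2to3.pgen2 import grammar, token

# opmap maps both spellings of "not equal" to the same token number,
# which is why pgen2 declines to give "<>" a label entry of its own.
assert grammar.opmap['<>'] == grammar.opmap['!='] == token.NOTEQUAL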
https://bugs.python.org/issue35808
This is just a (working) first proposal to start discussing the issue.