Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.
Sign upbpo-31650: PEP 552 (Deterministic pycs) implementation #4575
Conversation
benjaminp
requested a review
from
Nov 26, 2017
the-knights-who-say-ni
added
the
CLA signed
label
Nov 26, 2017
bedevere-bot
added
the
awaiting merge
label
Nov 26, 2017
benjaminp
force-pushed the
benjamin-pep-552
branch
4 times, most recently
from
780bf48
to
cd28db3
Nov 26, 2017
benjaminp
requested review from
gpshead,
Yhg1s and
brettcannon
Nov 27, 2017
against the source at runtime to determine if the pyc needs to be | ||
regenerated. | ||
|
||
.. attribute:: UNCHECKED_HASH |
This comment has been minimized.
This comment has been minimized.
gpshead
Nov 28, 2017
Member
The text below merely says "a hash is included but not validated" but makes no mention of what is validated in this situation. the pyc-invalidation docs cover this in more detail. It seems worth repeating that this just means a hash is generated when generating a pyc but is not checked by default here.
It'd also be good to include a statement of when this might be useful (packaging distribution installed pre-generated pyc files where something else is guaranteeing they cannot go out of sync for example?). Though perhaps that description belongs in import.rst's pyc-invalidation section.
This comment has been minimized.
This comment has been minimized.
benjaminp
Nov 28, 2017
Author
Contributor
The text below merely says "a hash is included but not validated" but makes no mention of what is validated in this situation. the pyc-invalidation docs cover this in more detail. It seems worth repeating that this just means a hash is generated when generating a pyc but is not checked by default here.
Yeah, I was trying to funnel everything into import.rst
. I will beef up this part a big, though.
It'd also be good to include a statement of when this might be useful (packaging distribution installed pre-generated pyc files where something else is guaranteeing they cannot go out of sync for example?). Though perhaps that description belongs in import.rst's pyc-invalidation section.
I put some commentary about why you might want this in the what's new? entry. Maybe, it should go somewhere more "permanent", too, though.
@@ -266,23 +284,29 @@ def main(): | |||
if args.workers is not None: | |||
args.workers = args.workers or None | |||
|
|||
ivl_mode = args.invalidation_mode.replace('-', '_').upper() |
This comment has been minimized.
This comment has been minimized.
gpshead
Nov 28, 2017
Member
Since you use the names in choices=
above to generate the mode name, perhaps use a similar reverse transformation to generate a (sorted/stable-ordered) list of names in the choices tuple above?
choices=sorted(mode.lower().replace('_', '-') for mode in args.invalidation_mode),
or similar?
This comment has been minimized.
This comment has been minimized.
def _validate_bytecode_header(data, source_stats=None, name=None, path=None): | ||
"""Validate the header of the passed-in bytecode against source_stats (if | ||
given) and returning the bytecode that can be compiled by compile(). | ||
def _classify_pyc(data, name, exc_details): |
This comment has been minimized.
This comment has been minimized.
gpshead
Nov 28, 2017
Member
Can you document what the arguments are?
name
is not obvious - could it be given a better variable name? is it the pyc file's basename or the full pathname?
exc_details
is unobvious.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
return flags | ||
|
||
|
||
def _validate_timestamp_pyc(data, source_stats, name, exc_details): |
This comment has been minimized.
This comment has been minimized.
gpshead
Nov 28, 2017
Member
It seems suspicious that this raises no error when the source_stats
dict-like object fails to contain mtime
or size
members.
If this is intentional, document why.
This comment has been minimized.
This comment has been minimized.
file.""" | ||
|
||
def _code_to_timestamp_pyc(code, mtime=0, source_size=0): | ||
"Produce the data for a timestamp-based pyc." | ||
data = bytearray(MAGIC_NUMBER) |
This comment has been minimized.
This comment has been minimized.
gpshead
Nov 28, 2017
Member
It is unfortunate that the constant is called MAGIC_NUMBER
when in fact it is a bytes object containing the binary magic data. Can it be renamed as this is "just" _bootstrap_external or would that break some API somewhere?
bytearray(MAGIC_NUMBER)
looks suspicious if a reader naively assumes that MAGIC_NUMBER is an integer...
This comment has been minimized.
This comment has been minimized.
benjaminp
Nov 28, 2017
Author
Contributor
Yeah, it turns out the integer version is actually called _RAW_MAGIC_NUMBER
, which a priori sounds more likely to be a bytearray. oops
At any rate, I think it's correct that this constant is exposed as some bytes not a number, since the only valid operation is to write it into a file or compare it to some bytes from a file.
source_stats=st, name=fullname, | ||
path=bytecode_path) | ||
flags = _classify_pyc(data, fullname, exc_details) | ||
bytes_data = data[16:] |
This comment has been minimized.
This comment has been minimized.
gpshead
Nov 28, 2017
Member
Consider using memoryview(data)[16:]
to avoid a large copy assuming _compile_bytecode is happy with that (it appears to be, marshal.loads()
accepts memoryview
)
This comment has been minimized.
This comment has been minimized.
'path': path, | ||
} | ||
_classify_pyc(data, fullname, exc_details) | ||
return _compile_bytecode(data[16:], name=fullname, bytecode_path=path) |
This comment has been minimized.
This comment has been minimized.
gpshead
Nov 28, 2017
Member
memoryview(data)[16:]
if you also do this in the other place i made this comment.
This comment has been minimized.
This comment has been minimized.
@@ -5,18 +5,25 @@ | |||
from ._bootstrap import spec_from_loader | |||
from ._bootstrap import _find_spec | |||
from ._bootstrap_external import MAGIC_NUMBER |
This comment has been minimized.
This comment has been minimized.
gpshead
Nov 28, 2017
Member
This is a public API exposing MAGIC_NUMBER
so it's probably best to ignore my previous comment about changing the name of the thing to avoid confusion over its type.
except ImportError as exc: | ||
self.msgout(2, "raise ImportError: " + str(exc), pathname) | ||
raise | ||
co = marshal.loads(marshal_data) | ||
co = marshal.loads(data[16:]) |
This comment has been minimized.
This comment has been minimized.
bytecode = importlib._bootstrap_external._code_to_hash_pyc( | ||
code, | ||
source_hash, | ||
invalidation_mode == PycInvalidationMode.CHECKED_HASH, |
This comment has been minimized.
This comment has been minimized.
gpshead
Nov 28, 2017
Member
A ==
test to get a boolean value in a list of args is easy to misread as a typo for a keyword argument being passed. Disambiguate by adding ()s around it or passing this by name?
Thanks for the comments! Update incoming. |
benjaminp
force-pushed the
benjamin-pep-552
branch
from
cd28db3
to
737f925
Nov 28, 2017
gpshead
assigned
gpshead and
brettcannon
Nov 30, 2017
Pretty much all of my comments are minor, so no need to have to double-check any changes made based on my feedback. |
.. cmdoption:: --invalidation-mode [timestamp|checked-hash|unchecked-hash] | ||
|
||
Control how the generated pycs will be invalidated at runtime. The default | ||
setting, ``timestamp``, means that pyc files with the source timestamp and |
This comment has been minimized.
This comment has been minimized.
brettcannon
Dec 1, 2017
Member
I just wanted to double-check that we are referring to the files as "pyc" and "pycs" throughout the rest of the docs instead of ".pyc
files" or something? Basically just checking we're being consistent.
@@ -1327,6 +1330,14 @@ an :term:`importer`. | |||
.. versionchanged:: 3.6 | |||
Accepts a :term:`path-like object`. | |||
|
|||
.. function:: source_hash(source_bytes) | |||
|
|||
Return the hash of *source_bytes* as byte string. A hash-based pyc embeds the |
This comment has been minimized.
This comment has been minimized.
metadata. | ||
|
||
Python also supports "hash-based" cache files, which store a hash of a source | ||
file contents rather than its metadata. There are two variants of hash-based |
This comment has been minimized.
This comment has been minimized.
hash in the cache file. If a checked hash-based cache file is found to be | ||
invalid, Python regenerates it and writes a new checked hash-based cache | ||
file. For unchecked hash-based pycs, Python simply assumes the cache file is | ||
valid if it exists. Hash-based pyc validation behavior may be override with the |
This comment has been minimized.
This comment has been minimized.
elif len(raw_timestamp) != 4: | ||
message = 'reached EOF while reading timestamp in {!r}'.format(name) | ||
if len(data) < 16: | ||
message = 'reached EOF while reading pyc header of {!r}'.format(name) |
This comment has been minimized.
This comment has been minimized.
flags = _r_long(data[4:8]) | ||
# Only the first two flags are defined. | ||
if flags & ~0b11: | ||
raise ImportError('invalid flags {!r} in {!r}'.format(flags, name), |
This comment has been minimized.
This comment has been minimized.
""" | ||
if _r_long(data[8:12]) != (source_mtime & 0xFFFFFFFF): | ||
message = 'bytecode is stale for {!r}'.format(name) |
This comment has been minimized.
This comment has been minimized.
raise ImportError(message, **exc_details) | ||
if (source_size is not None and | ||
_r_long(data[12:16]) != (source_size & 0xFFFFFFFF)): | ||
raise ImportError('bytecode is stale for {!r}'.format(name), |
This comment has been minimized.
This comment has been minimized.
del_source=del_source) | ||
test('_temp', mapping, bc_path) | ||
|
||
def _test_no_marshal(self, *, del_source=False): | ||
with util.create_modules('_temp') as mapping: | ||
bc_path = self.manipulate_bytecode('_temp', mapping, | ||
lambda bc: bc[:12], | ||
lambda bc: bc[:16], |
This comment has been minimized.
This comment has been minimized.
brettcannon
Dec 1, 2017
Member
I should have made these constants instead of using a literal when I originally wrote this code. Sorry about that. :(
@@ -2216,6 +2249,14 @@ PyInit_imp(void) | |||
d = PyModule_GetDict(m); | |||
if (d == NULL) | |||
goto failure; | |||
PyObject *pyc_mode = PyUnicode_FromString(_Py_CheckHashBasedPycsMode); | |||
if (pyc_mode == NULL) |
This comment has been minimized.
This comment has been minimized.
benjaminp
force-pushed the
benjamin-pep-552
branch
from
737f925
to
b3a7070
Dec 3, 2017
some minor comments to address but overall i believe this is good to go in when you are ready. thanks! |
@@ -98,6 +104,8 @@ static const char usage_3[] = "\ | |||
also PYTHONWARNINGS=arg\n\ | |||
-x : skip first line of source, allowing use of non-Unix forms of #!cmd\n\ | |||
-X opt : set implementation-specific option\n\ | |||
--check-hash-based-pycs always|default|never:\n\ | |||
control how Python invalidates hash-based .pcy files\n\ |
This comment has been minimized.
This comment has been minimized.
}; | ||
|
||
PyDoc_STRVAR(module_doc, | ||
"perfmap module."); |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
benjaminp
Dec 9, 2017
Author
Contributor
eek, now you know what crud I have sitting about in my checkout.
if (flags != 0) { | ||
// Hash-based pyc. We currently refuse to handle checked hash-based | ||
// pycs. We could validate hash-based pycs against the source, but it | ||
// seems likely that most people putting hash-based pycs in a zipfile |
This comment has been minimized.
This comment has been minimized.
benjaminp
force-pushed the
benjamin-pep-552
branch
from
e5d54da
to
f498a91
Dec 9, 2017
benjaminp
merged commit 42aa93b
into
master
Dec 9, 2017
bedevere-bot
removed
the
awaiting merge
label
Dec 9, 2017
benjaminp
deleted the
benjamin-pep-552
branch
Dec 9, 2017
This comment has been minimized.
This comment has been minimized.
Thanks again for the reviews, Brett and Greg! |
benjaminp commentedNov 26, 2017
•
edited by bedevere-bot
Python now supports checking bytecode cache up-to-dateness with a hash of the
source contents rather than volatile source metadata. See the PEP for details.
While a fairly straightforward idea, quite a lot of code had to be modified due
to the pervasiveness of pyc implementation details in the codebase. Changes in
this commit include:
The core changes to importlib to understand how to read, validate, and
regenerate hash-based pycs.
Support for generating hash-based pycs in py_compile and compileall.
Modifications to our siphash implementation to support passing a custom
key. We then expose it to importlib through _imp.
Updates to all places in the interpreter, standard library, and tests that
manually generate or parse pyc files to grok the new format.
Support in the interpreter command line code for long options like
--check-hash-based-pycs.
Tests and documentation for all of the above.
https://bugs.python.org/issue31650