Skip to content
Please note that GitHub no longer supports your web browser.

We recommend upgrading to the latest Google Chrome or Firefox.

Learn more
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-31650: PEP 552 (Deterministic pycs) implementation #4575

Merged
merged 14 commits into from Dec 9, 2017

Conversation

@benjaminp
Copy link
Contributor

commented Nov 26, 2017

Python now supports checking bytecode cache up-to-dateness with a hash of the
source contents rather than volatile source metadata. See the PEP for details.

While a fairly straightforward idea, quite a lot of code had to be modified due
to the pervasiveness of pyc implementation details in the codebase. Changes in
this commit include:

  • The core changes to importlib to understand how to read, validate, and
    regenerate hash-based pycs.

  • Support for generating hash-based pycs in py_compile and compileall.

  • Modifications to our siphash implementation to support passing a custom
    key. We then expose it to importlib through _imp.

  • Updates to all places in the interpreter, standard library, and tests that
    manually generate or parse pyc files to grok the new format.

  • Support in the interpreter command line code for long options like
    --check-hash-based-pycs.

  • Tests and documentation for all of the above.

https://bugs.python.org/issue31650

@benjaminp benjaminp requested a review from Nov 26, 2017

@benjaminp benjaminp force-pushed the benjamin-pep-552 branch 4 times, most recently from 780bf48 to cd28db3 Nov 26, 2017

@benjaminp benjaminp requested review from gpshead, Yhg1s and brettcannon Nov 27, 2017

against the source at runtime to determine if the pyc needs to be
regenerated.

.. attribute:: UNCHECKED_HASH

This comment has been minimized.

Copy link
@gpshead

gpshead Nov 28, 2017

Member

The text below merely says "a hash is included but not validated" but makes no mention of what is validated in this situation. the pyc-invalidation docs cover this in more detail. It seems worth repeating that this just means a hash is generated when generating a pyc but is not checked by default here.

It'd also be good to include a statement of when this might be useful (packaging distribution installed pre-generated pyc files where something else is guaranteeing they cannot go out of sync for example?). Though perhaps that description belongs in import.rst's pyc-invalidation section.

This comment has been minimized.

Copy link
@benjaminp

benjaminp Nov 28, 2017

Author Contributor

The text below merely says "a hash is included but not validated" but makes no mention of what is validated in this situation. the pyc-invalidation docs cover this in more detail. It seems worth repeating that this just means a hash is generated when generating a pyc but is not checked by default here.

Yeah, I was trying to funnel everything into import.rst. I will beef up this part a big, though.

It'd also be good to include a statement of when this might be useful (packaging distribution installed pre-generated pyc files where something else is guaranteeing they cannot go out of sync for example?). Though perhaps that description belongs in import.rst's pyc-invalidation section.

I put some commentary about why you might want this in the what's new? entry. Maybe, it should go somewhere more "permanent", too, though.

@@ -266,23 +284,29 @@ def main():
if args.workers is not None:
args.workers = args.workers or None

ivl_mode = args.invalidation_mode.replace('-', '_').upper()

This comment has been minimized.

Copy link
@gpshead

gpshead Nov 28, 2017

Member

Since you use the names in choices= above to generate the mode name, perhaps use a similar reverse transformation to generate a (sorted/stable-ordered) list of names in the choices tuple above?

choices=sorted(mode.lower().replace('_', '-') for mode in args.invalidation_mode),

or similar?

This comment has been minimized.

Copy link
@benjaminp

benjaminp Nov 28, 2017

Author Contributor

sgtm

def _validate_bytecode_header(data, source_stats=None, name=None, path=None):
"""Validate the header of the passed-in bytecode against source_stats (if
given) and returning the bytecode that can be compiled by compile().
def _classify_pyc(data, name, exc_details):

This comment has been minimized.

Copy link
@gpshead

gpshead Nov 28, 2017

Member

Can you document what the arguments are?
name is not obvious - could it be given a better variable name? is it the pyc file's basename or the full pathname?
exc_details is unobvious.

This comment has been minimized.

Copy link
@gpshead

gpshead Nov 28, 2017

Member

A similar comment applies to all other new function def's below.

This comment has been minimized.

Copy link
@benjaminp

benjaminp Nov 28, 2017

Author Contributor

will do

return flags


def _validate_timestamp_pyc(data, source_stats, name, exc_details):

This comment has been minimized.

Copy link
@gpshead

gpshead Nov 28, 2017

Member

It seems suspicious that this raises no error when the source_stats dict-like object fails to contain mtime or size members.

If this is intentional, document why.

This comment has been minimized.

Copy link
@benjaminp

benjaminp Nov 28, 2017

Author Contributor

Yeah, I'll clean this up.

file."""

def _code_to_timestamp_pyc(code, mtime=0, source_size=0):
"Produce the data for a timestamp-based pyc."
data = bytearray(MAGIC_NUMBER)

This comment has been minimized.

Copy link
@gpshead

gpshead Nov 28, 2017

Member

It is unfortunate that the constant is called MAGIC_NUMBER when in fact it is a bytes object containing the binary magic data. Can it be renamed as this is "just" _bootstrap_external or would that break some API somewhere?

bytearray(MAGIC_NUMBER) looks suspicious if a reader naively assumes that MAGIC_NUMBER is an integer...

This comment has been minimized.

Copy link
@benjaminp

benjaminp Nov 28, 2017

Author Contributor

Yeah, it turns out the integer version is actually called _RAW_MAGIC_NUMBER, which a priori sounds more likely to be a bytearray. oops

At any rate, I think it's correct that this constant is exposed as some bytes not a number, since the only valid operation is to write it into a file or compare it to some bytes from a file.

source_stats=st, name=fullname,
path=bytecode_path)
flags = _classify_pyc(data, fullname, exc_details)
bytes_data = data[16:]

This comment has been minimized.

Copy link
@gpshead

gpshead Nov 28, 2017

Member

Consider using memoryview(data)[16:] to avoid a large copy assuming _compile_bytecode is happy with that (it appears to be, marshal.loads() accepts memoryview)

This comment has been minimized.

Copy link
@benjaminp

benjaminp Nov 28, 2017

Author Contributor

sgtm

'path': path,
}
_classify_pyc(data, fullname, exc_details)
return _compile_bytecode(data[16:], name=fullname, bytecode_path=path)

This comment has been minimized.

Copy link
@gpshead

gpshead Nov 28, 2017

Member

memoryview(data)[16:] if you also do this in the other place i made this comment.

This comment has been minimized.

Copy link
@benjaminp

benjaminp Nov 28, 2017

Author Contributor

will do

@@ -5,18 +5,25 @@
from ._bootstrap import spec_from_loader
from ._bootstrap import _find_spec
from ._bootstrap_external import MAGIC_NUMBER

This comment has been minimized.

Copy link
@gpshead

gpshead Nov 28, 2017

Member

This is a public API exposing MAGIC_NUMBER so it's probably best to ignore my previous comment about changing the name of the thing to avoid confusion over its type.

except ImportError as exc:
self.msgout(2, "raise ImportError: " + str(exc), pathname)
raise
co = marshal.loads(marshal_data)
co = marshal.loads(data[16:])

This comment has been minimized.

Copy link
@gpshead

gpshead Nov 28, 2017

Member

memoryview(data)[16:]

bytecode = importlib._bootstrap_external._code_to_hash_pyc(
code,
source_hash,
invalidation_mode == PycInvalidationMode.CHECKED_HASH,

This comment has been minimized.

Copy link
@gpshead

gpshead Nov 28, 2017

Member

A == test to get a boolean value in a list of args is easy to misread as a typo for a keyword argument being passed. Disambiguate by adding ()s around it or passing this by name?

@benjaminp
Copy link
Contributor Author

left a comment

Thanks for the comments! Update incoming.

@benjaminp benjaminp force-pushed the benjamin-pep-552 branch from cd28db3 to 737f925 Nov 28, 2017

@brettcannon
Copy link
Member

left a comment

Pretty much all of my comments are minor, so no need to have to double-check any changes made based on my feedback.

.. cmdoption:: --invalidation-mode [timestamp|checked-hash|unchecked-hash]

Control how the generated pycs will be invalidated at runtime. The default
setting, ``timestamp``, means that pyc files with the source timestamp and

This comment has been minimized.

Copy link
@brettcannon

brettcannon Dec 1, 2017

Member

I just wanted to double-check that we are referring to the files as "pyc" and "pycs" throughout the rest of the docs instead of ".pyc files" or something? Basically just checking we're being consistent.

@@ -1327,6 +1330,14 @@ an :term:`importer`.
.. versionchanged:: 3.6
Accepts a :term:`path-like object`.

.. function:: source_hash(source_bytes)

Return the hash of *source_bytes* as byte string. A hash-based pyc embeds the

This comment has been minimized.

Copy link
@brettcannon

brettcannon Dec 1, 2017

Member

"as byte string" -> "as bytes".

metadata.

Python also supports "hash-based" cache files, which store a hash of a source
file contents rather than its metadata. There are two variants of hash-based

This comment has been minimized.

Copy link
@brettcannon

brettcannon Dec 1, 2017

Member

"source file contents" -> "source file's contents"

hash in the cache file. If a checked hash-based cache file is found to be
invalid, Python regenerates it and writes a new checked hash-based cache
file. For unchecked hash-based pycs, Python simply assumes the cache file is
valid if it exists. Hash-based pyc validation behavior may be override with the

This comment has been minimized.

Copy link
@brettcannon

brettcannon Dec 1, 2017

Member

"override" -> "overridden"

elif len(raw_timestamp) != 4:
message = 'reached EOF while reading timestamp in {!r}'.format(name)
if len(data) < 16:
message = 'reached EOF while reading pyc header of {!r}'.format(name)

This comment has been minimized.

Copy link
@brettcannon

brettcannon Dec 1, 2017

Member

Might as well change this to an f-string.

flags = _r_long(data[4:8])
# Only the first two flags are defined.
if flags & ~0b11:
raise ImportError('invalid flags {!r} in {!r}'.format(flags, name),

This comment has been minimized.

Copy link
@brettcannon

brettcannon Dec 1, 2017

Member

f-string

"""
if _r_long(data[8:12]) != (source_mtime & 0xFFFFFFFF):
message = 'bytecode is stale for {!r}'.format(name)

This comment has been minimized.

Copy link
@brettcannon

brettcannon Dec 1, 2017

Member

f-string

raise ImportError(message, **exc_details)
if (source_size is not None and
_r_long(data[12:16]) != (source_size & 0xFFFFFFFF)):
raise ImportError('bytecode is stale for {!r}'.format(name),

This comment has been minimized.

Copy link
@brettcannon

brettcannon Dec 1, 2017

Member

f-string

del_source=del_source)
test('_temp', mapping, bc_path)

def _test_no_marshal(self, *, del_source=False):
with util.create_modules('_temp') as mapping:
bc_path = self.manipulate_bytecode('_temp', mapping,
lambda bc: bc[:12],
lambda bc: bc[:16],

This comment has been minimized.

Copy link
@brettcannon

brettcannon Dec 1, 2017

Member

I should have made these constants instead of using a literal when I originally wrote this code. Sorry about that. :(

@@ -2216,6 +2249,14 @@ PyInit_imp(void)
d = PyModule_GetDict(m);
if (d == NULL)
goto failure;
PyObject *pyc_mode = PyUnicode_FromString(_Py_CheckHashBasedPycsMode);
if (pyc_mode == NULL)

This comment has been minimized.

Copy link
@brettcannon

brettcannon Dec 1, 2017

Member

Missing curly braces.

@benjaminp benjaminp force-pushed the benjamin-pep-552 branch from 737f925 to b3a7070 Dec 3, 2017

@gpshead
gpshead approved these changes Dec 9, 2017
Copy link
Member

left a comment

some minor comments to address but overall i believe this is good to go in when you are ready.

thanks!

@@ -98,6 +104,8 @@ static const char usage_3[] = "\
also PYTHONWARNINGS=arg\n\
-x : skip first line of source, allowing use of non-Unix forms of #!cmd\n\
-X opt : set implementation-specific option\n\
--check-hash-based-pycs always|default|never:\n\
control how Python invalidates hash-based .pcy files\n\

This comment has been minimized.

Copy link
@gpshead

gpshead Dec 9, 2017

Member

.pyc files

};

PyDoc_STRVAR(module_doc,
"perfmap module.");

This comment has been minimized.

Copy link
@gpshead

gpshead Dec 9, 2017

Member

where'd this file come from?

This comment has been minimized.

Copy link
@benjaminp

benjaminp Dec 9, 2017

Author Contributor

eek, now you know what crud I have sitting about in my checkout.

if (flags != 0) {
// Hash-based pyc. We currently refuse to handle checked hash-based
// pycs. We could validate hash-based pycs against the source, but it
// seems likely that most people putting hash-based pycs in a zipfile

This comment has been minimized.

Copy link
@gpshead

gpshead Dec 9, 2017

Member

+1 agreed!

benjaminp added 12 commits Oct 1, 2017
bpo-31650: PEP 552 (Deterministic pycs) implementation
Python now supports checking bytecode cache up-to-dateness with a hash of the
source contents rather than volatile source metadata. See the PEP for details.

While a fairly straightforward idea, quite a lot of code had to be modified due
to the pervasiveness of pyc implementation details in the codebase. Changes in
this commit include:

- The core changes to importlib to understand how to read, validate, and
  regenerate hash-based pycs.

- Support for generating hash-based pycs in py_compile and compileall.

- Modifications to our siphash implementation to support passing a custom
  key. We then expose it to importlib through _imp.

- Updates to all places in the interpreter, standard library, and tests that
  manually generate or parse pyc files to grok the new format.

- Support in the interpreter command line code for long options like
  --check-hash-based-pycs.

- Tests and documentation for all of the above.
benjaminp added 2 commits Dec 2, 2017

@benjaminp benjaminp force-pushed the benjamin-pep-552 branch from e5d54da to f498a91 Dec 9, 2017

@benjaminp benjaminp merged commit 42aa93b into master Dec 9, 2017

4 checks passed

bedevere/issue-number Issue number 31650 found
Details
bedevere/news News entry found in Misc/NEWS.d
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@benjaminp benjaminp deleted the benjamin-pep-552 branch Dec 9, 2017

@benjaminp

This comment has been minimized.

Copy link
Contributor Author

commented Dec 9, 2017

Thanks again for the reviews, Brett and Greg!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.