Implement dataclass code caching #92650

Draft
wants to merge 10 commits into
base: main

Conversation

brandtbucher
Member

@brandtbucher brandtbucher commented May 11, 2022

This is a minimal working implementation of "code-caching" for dataclasses. It's heavily inspired by https://github.com/dabeaz/dataklasses, and works by reusing generated code objects for dataclasses that differ only in the names of their fields. "Template" code objects are lazily created with placeholder names (__field_0__, __field_1__) that are patched at method generation time using the code object's replace method. Annotations and default arguments for __init__ methods are assigned manually as well.

I thought I would stop here and gather feedback/review before going further. A bit more information:

For microbenchmarks on "simple" dataclasses with 1-10 fields and no "special" fields, this branch makes class generation 2x-3x faster. The test_dataclasses suite, which contains lots of examples of advanced use-cases and actually does some real work with them, runs about 40% faster vs. main.

I've also included some counters for measuring cache stats. These indicate that when running test_dataclasses, 1,428 methods are generated, but only 112 don't have suitable templates in the code cache yet and need to be constructed using exec. So even for the wide range of dataclasses present in this program, we're still able to maintain a hit rate above 90% (__init__ methods are, predictably, the source of most of the misses).
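As a quick check of the arithmetic, using only the two counts quoted above (the variable names are just for illustration):

```python
generated = 1428  # methods generated while running test_dataclasses
misses = 112      # methods with no suitable template, built with exec

hits = generated - misses
hit_rate = hits / generated

print(f"{hits} hits out of {generated} ({hit_rate:.1%})")
```

That works out to 1,316 hits, or roughly 92%.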

@brandtbucher brandtbucher added performance stdlib 3.12 labels May 11, 2022
@brandtbucher brandtbucher requested a review from ericvsmith May 11, 2022
@gpshead
Member

@gpshead gpshead commented May 14, 2022

Could this be sped up further by having dataclasses.py ship with a pre-seeded code cache, built from inlined code that would already be part of the .pyc file, thus avoiding runtime calls to exec() entirely for dataclasses that match its shapes?
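One way to read this suggestion, purely as a sketch: write a few common templates as ordinary functions in the module source, so their code objects are compiled along with the module, and seed the cache from them. The names `_init_1`, `_init_2`, `_SEEDED`, and `make_seeded_init` are hypothetical and are not part of dataclasses.py or this branch:

```python
import types


# Hypothetical hand-written templates. Because they are plain module-level
# functions, their code objects are compiled together with the module.
def _init_1(self, __field_0__):
    self.__field_0__ = __field_0__


def _init_2(self, __field_0__, __field_1__):
    self.__field_0__ = __field_0__
    self.__field_1__ = __field_1__


# Cache seeded at import time: field count -> template code object.
_SEEDED = {1: _init_1.__code__, 2: _init_2.__code__}


def make_seeded_init(names):
    """Return a patched __init__, or None if no seeded template fits."""
    template = _SEEDED.get(len(names))
    if template is None:
        return None  # the caller would fall back to the exec path
    renames = {f"__field_{i}__": name for i, name in enumerate(names)}

    def rename(seq):
        return tuple(renames.get(item, item) for item in seq)

    code = template.replace(
        co_names=rename(template.co_names),
        co_varnames=rename(template.co_varnames),
    )
    return types.FunctionType(code, globals(), "__init__")


class Config:
    pass


Config.__init__ = make_seeded_init(["host", "port"])
```

Only the shapes chosen in advance benefit from this; anything else still needs the lazily built templates.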

Labels
3.12 awaiting core review performance stdlib

3 participants