Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-91719: Make MSVC generate somewhat faster switch code #91718

Merged
merged 4 commits into from Apr 21, 2022

Conversation

Copy link
Member

@gvanrossum gvanrossum commented Apr 20, 2022

Apparently a switch on an 8-bit quantity where all cases are
present generates a more efficient jump (doing only one indexed
memory load instead of two).

See faster-cpython/ideas#321 (comment)

Apparently a switch on an 8-bit quantity where all cases are
present generates a more efficient jump (doing only one indexed
memory load instead of two).

See faster-cpython/ideas#321 (comment)
@gvanrossum gvanrossum changed the title Make MSVC generate somewhat faster switch code GH-91719: Make MSVC generate somewhat faster switch code Apr 20, 2022
@markshannon
Copy link
Member

@markshannon markshannon commented Apr 20, 2022

Would it make more sense to redefine opcode to be a uint8_t, rather than casting it?

We should probably make use_tracing an 8 bit unsigned integer as well.

@gvanrossum
Copy link
Member Author

@gvanrossum gvanrossum commented Apr 20, 2022

Would it make more sense to redefine opcode to be a uint8_t, rather than casting it?

Yeah, I had considered that, it makes sense. I'll confirm that it has the same effect.

We should probably make use_tracing an 8 bit unsigned integer as well.

I don't see why -- it's not used in a similar switch AFAICT, and it's not cramped for space in its struct. I assume for most other operations the cost of loading an int and loading a byte is effectively the same, since the CPU has to load a whole cache line (32 or 64 bytes) anyway.

@gvanrossum
Copy link
Member Author

@gvanrossum gvanrossum commented Apr 20, 2022

I don't believe this needs a news blurb.

@markshannon
Copy link
Member

@markshannon markshannon commented Apr 20, 2022

I don't see why -- it's not used in a similar switch AFAICT, and it's not cramped for space in its struct. I assume for most other operations the cost of loading an int and loading a byte is effectively the same, since the CPU has to load a whole cache line (32 or 64 bytes) anyway.

The dispatch sequence includes opcode |= cframe.use_tracing.
If cframe.use_tracing is a 32 bit int, then the compiler needs to add a cast.
If it is the same type as opcode, it does not.

@gvanrossum
Copy link
Member Author

@gvanrossum gvanrossum commented Apr 20, 2022

The dispatch sequence includes opcode |= cframe.use_tracing.
If cframe.use_tracing is a 32 bit int, then the compiler needs to add a cast.
If it is the same type as opcode, it does not.

Okay, I'll make that change.

@gvanrossum
Copy link
Member Author

@gvanrossum gvanrossum commented Apr 20, 2022

@markshannon, please re-review. I confirmed that the switch still uses a single indirection (goto *(base + offset_table[opcode])). I also found that the opcode |= use_tracing is a single instruction (but maybe it always was one?).

Python/ceval.c Outdated Show resolved Hide resolved
Tools/scripts/generate_opcode_h.py Outdated Show resolved Hide resolved
@markshannon
Copy link
Member

@markshannon markshannon commented Apr 21, 2022

Looks good to me

@gvanrossum gvanrossum merged commit f8dc618 into python:main Apr 21, 2022
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants