Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace built-in hashlib with verified implementations from HACL* #99108

Open
msprotz opened this issue Nov 4, 2022 · 2 comments
Open

Replace built-in hashlib with verified implementations from HACL* #99108

msprotz opened this issue Nov 4, 2022 · 2 comments
Labels
type-feature A feature request or enhancement

Comments

@msprotz
Copy link

msprotz commented Nov 4, 2022

Feature or enhancement

We propose to replace the non-OpenSSL cryptographic primitives in hashlib with high-assurance, verified versions from the HACL* project.

Pitch

As evidenced by the recent SHA3 buffer overflow, cryptographic primitives are tricky to implement correctly. There might be issues with memory management, exceeding lengths, incorrect buffer management, or worse, incorrect implementations in corner cases.

The HACL* project https://github.com/hacl-star/hacl-star provides verified implementations of cryptographic primitives. These implementations are mathematically shown to be:

  • memory safe (no buffer overflows, no use-after-free)
  • functionally correct (they always compute the right result)
  • side-channel resistant (the most egregious variants of side-channels, such as memory and timing leaks, are ruled out by construction).

See https://hacl-star.github.io/Overview.html#what-is-verified-software for a longer description of how formal methods can help write high-assurance software and rule out entire classes of bugs.

The performance of HACL* is competitive with, and sometimes exceeds, that of OpenSSL. HACL* is distributed as pure C, and therefore is portable. Parts of HACL* have been adopted in Mozilla, Linux, the Tezos blockchain, and many more, thereby demonstrating that formally verified code is ready for production-time.

Previous discussion

Tagging @alex with whom I've informally discussed this.

@msprotz msprotz added the type-feature A feature request or enhancement label Nov 4, 2022
@gpshead
Copy link
Member

gpshead commented Nov 7, 2022

(missing context: there was a private email thread cc'ing ~7 of us including myself, @tiran, @alex, @msprotz & others that led to this issue and PR) in response to #98517 CVE happening with our old XKCP sha3 vendored code.

One reason I might support this work across all of our required always supported hashes is that if these perform well, I'd be happy to see the _hashlib module wrapping OpenSSL deprecated entirely. (Others probably disagree with that, so we may never get there). One less thing depending on OpenSSL sounds nice - but the OpenSSL EVP interface and these algorithms are far less likely to be where OpenSSL CVE bugs lurk (famous last words). BUT... if the vendored third party code we use as a fallback for hash algorithms can be replaced with something nearly guaranteed not to be the next CVE as the hack-star project appears to aim to be (I cannot actually be the judge of that), and be high performance... Why should we bother involving OpenSSL for these at all?

If this sounds like the opposite of my statements over buried in the blake3 rejection thread - it's because I did not seek an alternative to the variety of vendored fallback hash algorithm implementations we carry there. So the goal in those statements was to lean towards simplicity instead of complexity. (which has already paid off via our XKCP removal in favor of tiny_sha3 by not having Python 3.11+ suffer from the sha3 XKCP CVE that the larger complex third party code in 3.6-3.10 did)

We provide and ship native implementations for the hash algorithms anyways as we support builds without OpenSSL present, and support building on systems with libssl versions or variants that do not provide everything.

Q: Would adopting this code increase our maintenance burden?

A: Probably, yes. We'd presumably want to update it with more recent versions from hacl-star from time to time. Though the theory is that there should never be a security need to do so? In the past we've been asked to update vendored hash algorithm code from time to time but have hesitated due to the complexity and perceived risk, and assumed lack of reward given people who care about performance would presumably have a modern OpenSSL to get that from. Though if we did get rid of the OpenSSL wrapper code that'd also be a reduced burden.

Q: What about performance?

A: Good question. That needs continual measuring on our Tiered supported platforms. Don't focus solely on thruput. OpenSSL's EVP APIs are known to have silly-slow setup time. So something hashing a pile of 234 byte objects may be slow no matter how good OpenSSL's bulk data thruput is. Benchmark small message thruput as well as bulk data thruput.

Parting thought: There is a broader desire to wean us off of our OpenSSL dependency. I realize that's a hard thing so long as the _ssl module exists, but it is a long term desire of many. This kind of change would be one necessary step, regardless of when it is taken.

@msprotz
Copy link
Author

msprotz commented Nov 8, 2022

Thanks for your comment, Greg. Some related thoughts.

  • We have support for all of the algorithms in hashlib (md5, sha1, sha2, sha3, blake2). So there is a clear pathway to consolidating those implementations together into a single vendored library that is side-channel resistant, and that has proofs of correctness.
  • Getting rid of OpenSSL is feasible, but not today (as you accurately pointed out on gh-99108: Import SHA2-224 and SHA2-256 from HACL* #99109). For now, we are proposing pure C that is easy to compile, portable, fast, and secure. We have a technology for mixing ASM and C, or even using C compiler intrinsics -- that's what we used to beat OpenSSL in a 2020 paper, on one specific algorithm. But then there are numerous drawbacks:
    • your build becomes more complicated: you need to probe the current toolchain and make sure it even knows about e.g. that fancy AVX512 intrinsic you're about to use; or that it can use inline ASM; etc.
    • your runtime becomes more complicated: once the compiled binary is copied to another computer, you need to perform run-time CPU detection, so that in the event that you did manage to compile that AVX512 version, then you do use it if the processor you're running on happens to have those fancy instructions
    • your maintenance burden becomes greater, and you need to decide how many ASM versions you want to support, for which architecture families, etc.

None of this is infeasible -- it's just a matter of deciding what is good for a Python-in-the-future where hashlib no longer depends on OpenSSL. For instance, you might say "let's maintain a portable C version, and one single ASM version that uses SHAEXT when available (fastest)", and this is something we could make happen within HACL*.

Just to give you a sense of what we have right now, allow me to go into greater detail:

  • MD5, SHA1: legacy broken algorithms, no particular performance optimizations have been performed
  • SHA2: on-par with OpenSSL noasm+nohw, option to use SHAEXT when available but currently not part of gh-99108: Import SHA2-224 and SHA2-256 from HACL* #99109 -- happy to do that in a followup PR
  • SHA3: a few low-hanging fruits on our side to make performance better, happy to work on that if this is what it takes for Python to integrate it
  • Blake2: regular portable C, AVX and AVX2 versions, just like the reference implementation; same as SHA3, probably needs a few performance tweaks before integrating.

Hope this helps. As you said, this kind of change is a first step in the right direction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants