15,000 lines of verified cryptography now in Python
In November 2022, I opened issue 99108 on Python’s GitHub repository, arguing that after a recent CVE in its implementation of SHA3, Python should embrace verified code for all of its hash-related infrastructure.
As of last week, this issue is now closed, and every single hash and HMAC algorithm exposed by default in Python is now provided by HACL*, the verified cryptographic library. There was no loss of functionality, and the transition was entirely transparent for Python users. Python now vendors (includes in its repository) 15,000 lines of verified C code from HACL*. Pulling newer versions from the upstream HACL* repository is entirely automated and is done by invoking a script. HACL* was able to successfully implement new features to meet all of the requirements of Python, such as additional modes for the Blake2 family of algorithms, a new API for SHA3 that covers all Keccak variants, strict abstraction patterns to deal with build system difficulties, proper error management (notably, allocation failures), and instantiating HACL’s generic “streaming” functionality with the HMAC algorithm, including an optimization that requires keeping two hash states at once.
This is the culmination of 2.5 years of work, and could not have happened without the invaluable help of Aymeric Fromherz, who shouldered a lot of the implementation burden. Son Ho had a key contribution early on, generalizing HACL’s “streaming” functionality to be able to handle block algorithms that require a “pre-input”. This slightly obscure generalization was actually essential to implement a suitable, optimized HMAC that keeps two hash states under the hood. On the Python side, Gregory P. Smith, Bénédikt Tran, and later Chris Eibl were big champions and provided a lot of help. Finally, the HACS series of workshops created connections (hello, Paul Kehrer!) and provided sufficient momentum to make this happen. A warm thank you to both the Python and verified cryptographic communities!
As I oftentimes like to do, I’ll provide a little bit of a look behind the scenes, and comment on some of the more interesting technical aspects that are too low-level for a research paper.
A Primer on Streaming APIs
Many cryptographic algorithms are block algorithms, meaning that they assume their input is provided block by block, with special treatment for the first and last blocks. Well-known block algorithms include hash algorithms, MAC algorithms (e.g. Poly1305, HMAC), and more. In practice, the block API is not user-friendly: rarely does one have the data already chunked in blocks; furthermore, computing a result (e.g., a hash) invalidates the state of the block algorithm, which makes e.g. computing the intermediary hashes of the TLS transcript difficult.
For these reasons, cryptographic libraries typically expose streaming APIs, meaning clients can provide inputs of any length; the library then takes care of buffering the input, flushing the buffer as soon as a full block is obtained. A streaming API typically also allows extracting intermediary hashes without invalidating the state.
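Concretely, a streaming hash API usually boils down to four functions, sketched below with hypothetical names (HACL*’s actual identifiers differ):
#include <stdint.h>

typedef struct hash_state_s hash_state;  // abstract streaming state

hash_state *hash_malloc(void);           // allocate and initialize a state
// Feed input of any length; the library buffers a partial block and
// flushes it as soon as a full block is available.
void hash_update(hash_state *st, const uint8_t *data, uint32_t len);
// Extract an intermediary hash without invalidating the state.
void hash_digest(hash_state *st, uint8_t *out);
void hash_free(hash_state *st);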
Streaming APIs are hard, because they manipulate long-lived state with complex invariants: the (unverified) reference implementation of SHA3 was hit with a bad CVE in 2022, which Python “inherited”, because it vendored that very same SHA3 implementation.
Streaming APIs are hard also because the underlying block algorithms differ in a myriad of ways: all hash algorithms accept an empty final block, except for Blake2; some need to retain the key at run-time (Poly1305), some can discard it after initialization (HMAC, optimized); some need some initial input before processing the data (Blake2, whose key block comes first), some don’t; and so on.
Given this inherent complexity, streaming algorithms are a good candidate for verification: we wrote a research paper in 2021 about this very problem.
Fully generic verification
The main idea from the paper is that one can capture, using dependent types, what a block algorithm is. Once that’s done, it suffices to author and verify a generic streaming construction once and for all over an abstract definition of block algorithms. Then, just like you instantiate a template in C++, you apply the generic streaming construction to a concrete block API and voilà – a streaming API for that one block algorithm, “for free”.
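In C terms, one can approximate the idea with a record of function pointers – a rough, illustrative analogue only, since in F* the instantiation happens at compile-time via dependent types, and the indirection is erased from the generated code:
#include <stdint.h>

// What "a block algorithm" boils down to, captured as a C record (illustrative).
typedef struct {
  uint32_t block_len;                                   // size of one block
  void (*init)(void *block_state);                      // set up the inner state
  void (*update_blocks)(void *block_state,
                        const uint8_t *blocks, uint32_t n_blocks);
  void (*finish)(void *block_state, uint8_t *digest);   // handle the final block
} block_algorithm;

// The generic streaming construction: written and verified once, over any
// block_algorithm, then instantiated per concrete algorithm.
typedef struct {
  const block_algorithm *alg;  // the "template parameter"
  void *block_state;           // algorithm-specific state
  uint8_t *buf;                // buffers partial input between updates
  uint32_t buf_len;            // number of bytes currently buffered
} streaming_state;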
The first hitch is that there’s quite a big delta between what we presented in the paper (Listing 12) and what actually lives in the repository. Specifically, capturing any block algorithm requires a lot of genericity.
- The user may or may not specify the length of the final digest – for instance, each SHA3 hash has a fixed output length, but the Shake variants produce a result whose length is user-provided.
- The block algorithm may expect to receive a pre-input before the blocks of data. For SHA2, for example, the pre-input is empty, but for Blake2 in keyed hash mode, the pre-input is the key block.
- Blocks cannot be processed eagerly, because some algorithms (Blake2) do not allow for the final block to be empty – this vastly complicates buffer management, and interacts with the pre-input.
- Some algorithms need to retain extra pieces of state: for instance, the key length for Blake2 can be tweaked at initialization-time, but needs to be retained in the long-lived state.
- To avoid invalidating the block state upon intermediary digest extraction, our streaming API copies the state under the hood – in some cases, it’s easier to stack-allocate this copy, but in other cases, a heap-allocated copy makes more sense.
- In some cases, we wanted one API per algorithm (we have four APIs for SHA2-{224,256,384,512}); in other cases, we wanted one API per algorithm family (we have one API that covers all 6 Keccak algorithms: 4 variants of SHA3, and 2 variants of Shake) – see the sketch after this list.
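To make the last point concrete, here is a hypothetical sketch, in C, of a single streaming API covering the whole Keccak family; it also illustrates the first bullet, since Shake digests take a user-provided output length. Names and signatures are illustrative, not HACL*’s actual identifiers.
#include <stdint.h>

typedef enum {
  SHA3_224, SHA3_256, SHA3_384, SHA3_512, Shake128, Shake256
} keccak_alg;

typedef struct keccak_state_s keccak_state;   // one state type for the family

keccak_state *keccak_malloc(keccak_alg alg);  // the algorithm is retained in the state
void keccak_update(keccak_state *st, const uint8_t *data, uint32_t len);
// For the four SHA3 variants, out_len is determined by the algorithm;
// for Shake128/Shake256, the caller chooses out_len.
void keccak_digest(keccak_state *st, uint8_t *out, uint32_t out_len);
void keccak_free(keccak_state *st);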
Getting to that level of genericity took multiple rounds, driven by the successive requirements of Python. Ultimately, for HMAC, which was the final algorithm we landed in Python, we realized that our proofs and definitions were generic enough that we did not need any further tweaks to “instantiate” our generic streaming API with HMAC.
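For the curious, the reason HMAC wants two hash states: HMAC(k, m) = H((k' ⊕ opad) ‖ H((k' ⊕ ipad) ‖ m)), where k' is the key padded to a full block. An optimized implementation absorbs each padded key once, at initialization – this is exactly a one-block “pre-input” – and then caches both resulting hash states for the rest of the object’s lifetime. A minimal sketch of such a state, assuming SHA-256 as the underlying hash (illustrative, not HACL*’s actual layout):
#include <stdint.h>

typedef struct {
  uint32_t inner[8];   // chaining state after absorbing the block (k' ^ ipad)
  uint32_t outer[8];   // chaining state after absorbing the block (k' ^ opad)
  uint8_t buf[64];     // streaming buffer for message data
  uint64_t total_len;  // bytes fed so far, needed for final padding
} hmac_sha256_state;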
A bulletproof build
One highlight of submitting a PR to Python is that their infrastructure has more CI coverage than we could possibly dream of: a complete build of Python runs across 50+ toolchains and architectures. The flipside? We discovered some pretty annoying corner cases.
One particularly tricky build issue surfaced when dealing with HMAC. As a reminder, HMAC is a generic construction that, given a hash algorithm, provides a keyed message authentication code – in short, there is a high-level HMAC piece of code that defers most of the work to individual hash algorithms. Each hash algorithm may itself come in a variety of implementations: for instance, HMAC-Blake2b is implemented both by HMAC-Blake2b-32 (regular implementation) and HMAC-Blake2b-256 (AVX2 wide-vector implementation).
This already causes problems: HMAC.c may call functions from Blake2b_256.c, if Python is running on a machine with AVX2. However, only Blake2b_256.c may be compiled with -mavx2: code from HMAC.c will execute on all machines, even those without AVX2, meaning it must not be compiled with -mavx2. So far, so good, and this is something we had done before.
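In code, the multiplexing looks roughly like this – a sketch with hypothetical names, where only Blake2b_256.c is compiled with -mavx2:
#include <stdbool.h>
#include <stdint.h>

extern bool cpu_has_avx2(void);  // run-time CPU feature detection (hypothetical)
extern void Blake2b_32_update(void *st, const uint8_t *data, uint32_t len);
extern void Blake2b_256_update(void *st, const uint8_t *data, uint32_t len);

// Lives in HMAC.c, compiled *without* -mavx2: it merely branches on the CPU
// feature and calls into the right translation unit.
static void hmac_blake2b_update(void *st, const uint8_t *data, uint32_t len) {
  if (cpu_has_avx2())
    Blake2b_256_update(st, data, len);  // defined in Blake2b_256.c (-mavx2)
  else
    Blake2b_32_update(st, data, len);   // portable implementation
}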
The problem came with HMAC.c creating the initial state for Blake2b_256.c:
#include <stdlib.h>     // aligned_alloc
#include <string.h>     // memset
#include <immintrin.h>  // __m256i
// ...
// Allocate and zero the state: four 256-bit vectors, 32-byte aligned.
__m256i *blake2b_256_state = aligned_alloc(32, sizeof(__m256i) * 4);
memset(blake2b_256_state, 0, sizeof(__m256i) * 4);  // (NULL check elided)
// ...
Most toolchains were happy with this code – immintrin.h defines the type __m256i, and even though HMAC.c cannot assume AVX2 instructions are available, it’s not too hard for a compiler to zero-initialize blake2b_256_state without resorting to AVX2 instructions… except, some older compilers refused to process the immintrin.h header unless -mavx2 was used, which defeated the whole purpose.
This required a considerable amount of refactoring to use the well-known “C abstract struct” pattern, which essentially defines an abstract type in C.
// Blake2b_256.h
typedef struct blake2b_256_st_s blake2b_256_st;  // abstract type: layout hidden
blake2b_256_st *blake2b_256_malloc(void);

// Blake2b_256.c – the only translation unit compiled with -mavx2
#include <stdlib.h>
#include <immintrin.h>
#include "Blake2b_256.h"
struct blake2b_256_st_s {
  __m256i contents[4];
};
blake2b_256_st *blake2b_256_malloc(void) {
  return aligned_alloc(32, sizeof(blake2b_256_st));
}

// HMAC.c – sees only the abstract type, never __m256i
#include "Blake2b_256.h"
blake2b_256_st *blake2b_256_state = blake2b_256_malloc();
What made this extra difficult is that the C code is auto-generated from F*, which has a very different notion of abstraction. The compiler that goes from F* to C, krml, had to be overhauled to perform a much more fine-grained analysis that handles various levels of visibility (public functions, library-internal functions, translation-unit internal functions) even in the presence of such “abstract structs”.
Handling memory allocation failures
While our original modeling of C in F* allowed reasoning about memory allocation failures, no one had ever bothered to do so in practice. For Python, it was desirable to be able to propagate memory allocation failures. This meant we had to refine our definition of a generic, mutable piece of state (such as the block state); our definition of a block algorithm (such as SHA2-256); and our generic streaming construction to all be able to propagate memory allocation failures all the way up to the caller. Thankfully, this didn’t turn out to be a huge deal: we inserted option types all along the way, and because we had one single generic streaming construction, the implementation and proofs had to be updated only once for the 15+ concrete instances of the streaming API.
The presence of option types in the source compiles to tagged unions in the generated C; this is a little verbose, and we may change our definition of a piece of state to feature a has_failed run-time function that can assess whether a memory allocation failed, at the expense of more complexity and verification effort.
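For illustration, here is roughly what an option-returning allocator looks like as a tagged union in C – names are made up for the sketch, and the real generated code differs:
#include <stdint.h>
#include <stdlib.h>

typedef struct {
  uint32_t hash[8];    // e.g. SHA-256 chaining words
  uint8_t buf[64];     // one buffered block
  uint64_t total_len;  // bytes processed so far
} hash_state;

typedef struct {
  enum { None, Some } tag;
  hash_state *v;       // only meaningful when tag == Some
} option_hash_state;

option_hash_state hash_malloc(void) {
  hash_state *st = malloc(sizeof(hash_state));  // may return NULL
  if (st == NULL)
    return (option_hash_state){ .tag = None };  // failure propagated to caller
  return (option_hash_state){ .tag = Some, .v = st };
}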
Propagating changes from upstream HACL* to Python
My initial Python PR contained a shell script that would fetch the required files from the upstream HACL* repository; ditch a bunch of superfluous definitions in headers via well-crafted sed invocations; and tweak a few include paths in-place, also using my favorite refactoring tool (yes, sed). The benefit was that the initial PR was lean and clean.
Later on, once it became clear that the upstream code was maintainable and pretty stable, that pile of seds was eliminated, on the basis that it’s not the end of the world if a header contains a few extra definitions, and doing away with the rewrites makes maintenance easier.
Now, anyone who wishes to refresh HACL* can run the shell script in their checkout of Python, and provided they tweak the expected hash in Python’s SBOM (software bill of materials), they are good to go and can integrate the latest improvements.
Conclusion
I’m delighted to see such a large-scale integration of verified cryptographic code in a flagship project like Python. This demonstrates that verified cryptography is not only ready from an academic perspective, but also mature enough to be integrated into real-world software while meeting all engineering expectations. Thanks to everyone who helped along this journey!