For the past five years or so, I have been exploring, along with many other collaborators, the space of verified cryptography and protocols. The premise is simple: to deliver critical code that comes with strong properties of correctness, safety and security. Some readers may be familiar with Project Everest: this is the umbrella under which a large chunk of this work happened.

This blog post focuses on the specific topic of using meta-programming and compile-time evaluation to scale up cryptographic verification. This theme has been a key technical ingredient in almost every paper since 2016, with each successive publication making greater use of one or both of these techniques. The topic should be of interest to my programming languages (PL)-oriented readers; and, as far as I know, I haven’t really written anything that highlights this unifying theme in the research arc of verified crypto and protocols.

These techniques all come together in our most recent paper on Noise*, in which we use a Futamura projection and a combination of deep and shallow embeddings to run a protocol compiler on the F* normalizer. The Noise* work is the culmination of five years of working on PL techniques for cryptography, and it will be presented this week at Oakland (S&P). This is great timing for a retrospective!

A disclaimer before I go any further: this is my non-professional blog, and this post will inevitably capture my personal views and individual recollections.

Basic Partial Evaluation in Low*

In 2017, we introduced the Low* toolchain. The core idea is that we can model a palatable subset of C directly within F*.

// Allocate a zero-initialized array of 32 bytes on the stack.
let digest = Array.alloca 32ul 0uy in
SHA2.hash digest input input_len

The special Array module is crafted so that F* enforces a C-like discipline, permitting reasoning about liveness, disjointness, array lengths, and so on. (There are many more such modules in Low*.) In the example above, F* would typically check that i) digest and input are live pointers, ii) input_len represents the length of input, iii) digest and input are disjoint, and so on. Provided your code verifies successfully and abides by further (mostly syntactic) restrictions, a special compiler called KaRaMeL (née KReMLin) will turn it into C code.

uint8_t digest[32];
SHA2_hash(digest, input, input_len);
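
To give a flavor of what these verification conditions look like, here is a hypothetical, much-simplified signature for SHA2.hash in this style. The predicate names (live, disjoint, modifies, as_seq) reflect general Low* vocabulary; the real HACL* signature is considerably richer.

// A hypothetical, much-simplified Low*-style signature. The pre-condition
// captures liveness, lengths and disjointness; the post-condition states that
// only the digest is modified, and that it ends up containing the hash of the
// input as defined by a pure specification.
val hash:
  digest: array uint8 ->
  input: array uint8 ->
  input_len: size_t ->
  Stack unit
    (requires fun h0 ->
      live h0 digest /\ live h0 input /\ disjoint digest input /\
      length digest == 32 /\ length input == v input_len)
    (ensures fun h0 () h1 ->
      modifies (loc digest) h0 h1 /\
      as_seq h1 digest == Spec.SHA2.hash (as_seq h0 input))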

As we were developing what would become HACL*, an issue quickly arose. For the purposes of verification, it is wise to split up a function into many small helpers, each with their own pre- and post-conditions. This makes verification more robust; code becomes more maintainable; and small helpers promote code reuse and modularity. However, at code-generation time, this produces a C file with a multitude of three-line functions scattered all over the place. This is not idiomatic, makes C code harder to follow, and generally diminishes our credibility when we want verification-agnostic practitioners to take us seriously. I wrote extensively about this a few years back: even though it’s auto-generated, producing quality C code matters!

The initial workaround (added Dec 2016) was to add an ad-hoc optimization pass to KaRaMeL, the F*-to-C compiler, driven by a new F* attribute dubbed [@substitute]. See this vintage example for a typical usage in HACL*: at call-site, KaRaMeL would simply replace any function marked as [@substitute] with its body, substituting actual arguments for formal parameters. This is simply an inlining step, also known as a beta-reduction step. (The careful reader might notice that this is of course unsound in the presence of effectful function arguments, which may be evaluated twice: fortunately, F* guarantees that effectful arguments are let-bound, and thus evaluated only once.)
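
For readers who do not want to chase the link, here is a made-up example in the same style; the helper xor_block and its caller are invented for this post.

// A made-up example in the style of the original HACL* code. The helper is
// marked [@substitute]; KaRaMeL replaces the call below with the helper's
// body, substituting actual arguments for formal parameters, so no tiny
// three-line function is emitted in the C code.
[@substitute]
let xor_block (dst: array uint8) (src: array uint8) =
  ...

let encrypt_block (dst: array uint8) (src: array uint8) =
  // After the KaRaMeL inlining pass, the body of xor_block appears here
  // directly in the generated C.
  xor_block dst src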

This was our first attempt at performing partial evaluation at compile-time in order to generate better code, and it got us going for a while. However, it had several drawbacks.

  • F* already had a normalizer, that is, an interpreter built into the type inference machinery for the purpose of higher-order unification. In other words, F* could already reduce terms at compile-time and, owing to its use of a Krivine Abstract Machine (KAM), was much better equipped than KaRaMeL to do so! The KAM is well-studied, reduces terms in de Bruijn form, and is formally described, which beats adding an ad-hoc reduction step somewhere as a KaRaMeL nanopass.
  • F*’s normalizer performs a host of other reduction steps; if beta-reductions happen only in KaRaMeL, then a large number of optimizations become impossible, because F*’s existing reduction steps and the desired extra beta-reductions need to interleave, rather than run one after the other. In other words, you lose out on how much partial evaluation you can trigger.

To illustrate the latter point, consider the example below. Circa 2016, the conditional could not be eliminated, because the beta-reduction (in KaRaMeL) happened after the conditional-elimination step (in F*’s existing normalizer).

// St for "stateful"
[@substitute]
let f (x: bool): St ... =
  if x then
    ...
  else
    ...

let main () =
  f true

This was fixed in September 2017, when F* allowed its inline_for_extraction keyword to apply to stateful beta-redexes. This suddenly “unlocked” many more potential use cases, which we quickly leveraged. The earlier example now uses the new keyword.
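
Rewritten with the new keyword, the earlier example reads as follows; the beta-reduction of f true and the elimination of the conditional now both happen within F*’s normalizer, so the dead branch never even reaches KaRaMeL.

// St for "stateful"
inline_for_extraction
let f (x: bool): St ... =
  if x then
    ...
  else
    ...

let main () =
  // At extraction-time, this reduces to the then-branch of f only.
  f true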

Overloaded Operators via Partial Evaluation

In 2018, HACL* underwent a deep refactoring that eventually culminated in all of the original code being rewritten. That rewrite led to more concise, reusable and effective code.

One of the key changes was overloaded operators. This is an interesting technical feature, as it relies on the conjunction of four F* features: compile-time reduction, implicit argument inference, fine-grained reduction hints and cross-module inlining.

The problem is as follows. The Low* model of C in F* has one module per integer type; effectively, we expose the base types found in C99’s <inttypes.h>. Unfortunately, this means each integer type (there are 9 of them) has its own set of operators. This prevents modularity (functions need to be duplicated for e.g. uint32 and uint64), and hinders readability and programmer productivity.

Overloaded operators are defined as follows:

// IntTypes.fsti. Obviously, a simplification!
type int_width = | U32 | U64 | ...

inline_for_extraction
type int_type (x: int_width) =
  match x with
  | U32 -> LowStar.UInt32.t
  | U64 -> LowStar.UInt64.t

[@@ strict_on_arguments [0]]
inline_for_extraction
val (+): #w:int_width -> int_type w -> int_type w -> int_type w

This interface file only introduces an abstract signature for +; the definition is in the corresponding implementation file.

// IntTypes.fst
let (+) #w x y =
  match w with
  | U32 -> LowStar.UInt32.add x y
  | U64 -> LowStar.UInt64.add x y

Compared to our earlier [@substitute] example, this is a lot more sophisticated! To make this work in practice, we need to combine several mechanisms; a small client-side example follows the list.

  • The inline_for_extraction qualifier indicates that this definition needs to be evaluated and reduced at compile-time by F*. Indeed, the dependent type int_type and the dependently-typed + do not compile to C, so any use of these definitions must be evaluated away at compile-time.
  • The strict_on_arguments attribute indicates that this definition should not be reduced eagerly, as it would lead to combinatorial explosion in certain cases. Rather, beta-reduction should take place only when the 0-th argument to + is a constant, which guarantees the match reduces immediately.
  • The definition is a val in the interface, meaning it is abstract; this is intentional, and prevents the SMT solver from “seeing” the body of the definition in client modules, and attempting a futile case analysis. To make this work, a special command-line flag must be passed to F*, called --cmi, for cross-module inlining. It makes sure inline_for_extraction traverses abstraction boundaries, but only at extraction-time, not verification-time.
  • Finally, in order for this to be pleasant to use, we indicate to F* that it ought to infer the width w, using # to denote an implicit argument.
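
Putting the four mechanisms together, client code can now be written once, generically over the width. Here is a hypothetical, simplified client (not taken from HACL*):

// A hypothetical client of IntTypes, written once for all widths.
inline_for_extraction
let double (#w: int_width) (x: int_type w): int_type w =
  x + x

// Specialization at a constant width: since w is the literal U32,
// strict_on_arguments lets the match inside (+) reduce, and --cmi allows that
// reduction to cross the IntTypes abstraction boundary at extraction-time.
// The generated C contains a plain 32-bit addition, with no trace of
// int_width or int_type.
let double_u32 (x: int_type U32): int_type U32 =
  double #U32 x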

These abstractions form the foundation of HACL* “v2”. They are briefly described in our CCS Paper.

(Note: this predates the type class mechanism in F*. The same effect would be achievable with type classes, no doubt.)

Hand-written Templates for Hashes

Operators enable code-sharing in the small; we want to enable code-sharing in the large. A prime example is the Merkle-Damgård (MD) family of hash functions (MD5, SHA1, SHA2). All of these functions follow the same high-level structure:

// VASTLY simplified!
let hash (dst: array uint8) (input: array uint8) (input_len: size_t) =
  // Initialize internal hash state
  let state = init () in
  // Compute how many complete blocks are in input, relying on truncating
  // integer division, then feed them into the internal hash state.
  let complete_blocks = (input_len / block_size) * block_size in
  update_many_blocks state input complete_blocks;
  // Compute how many bytes are left in the last partial block. Use pointer
  // arithmetic, then process those leftover bytes as the "last block".
  let partial_block_len = input_len - complete_blocks in
  update_final_block state (input + complete_blocks) partial_block_len;
  // Computation is over: we "extract" the internal hash state into the final
  // digest dst.
  extract dst state

The hash function is generic (identical) across all hash algorithms; furthermore, the update_final_block and update_many_blocks functions are themselves generic, and only depend on the block update function update_block, which is specific to each hash in the family.

We could follow the trick we used earlier, and write something along the lines of:

type alg = SHA2_256 | SHA2_512 | ...

let hash (a: alg) ... =
  let state = init a in
  update_many_blocks a state input; ...

Then, we would have to make sure none of the a parameters remain in the final C code; this in turn would require us to inline all of the helper functions such as init, update_many_blocks, etc. into hash, so as to be able to write:

let hash_sha2_256 = hash SHA2_256

This works, and does produce first-order, low-level C code that contains no run-time checks over the exact kind of MD algorithm; but this comes at the expense of readability: we end up with one huge, illegible hash function, with no trace left of the init/update/finish structure at the heart of the construction. (There are other issues with this approach, notably what happens when you have multiple implementations of e.g. SHA2-256.)

We can do much better than that, and generate quality low-level code without sacrificing readability. The trick is to write hash as a higher-order function, but perform enough partial evaluation that by extraction-time, the higher-order control-flow is gone.

inline_for_extraction
let mk_hash 
  (init: unit -> St internal_state ...)
  (update_many_blocks: internal_state -> array uint8 -> size_t -> St unit ...)
  ...
  (dst: array uint8)
  (input: array uint8)
  (input_len: size_t)
=
  // same code, but this time using the function "pointers" passed as parameters

This function is obviously not the kind of function that one wants to see in C! It uses function pointers and would be wildly inefficient, not to mention that its types would be inexpressible in C.

But one can partially apply mk_hash to its first few arguments:

let hash_sha2_256 = mk_hash init_sha2_256 update_many_blocks_sha2_256 ...

At compile-time, F* performs enough beta-reductions that hash_sha2_256 becomes a specialized hash function that calls init_sha2_256, update_many_blocks_sha2_256, etc., thus producing legible, compact code that calls specialized functions.

// After beta-reduction
let hash_sha2_256 (dst: array uint8) (input: array uint8) (input_len: size_t) =
  // The structure of the function calls is preserved; we get intelligible code
  // as opposed to an inscrutable 500-line function body where everything has
  // been inlined.
  let state = init_sha2_256 () in
  let complete_blocks = (input_len / block_size) * block_size in
  update_many_blocks_sha2_256 state input complete_blocks;
  update_final_block_sha2_256 state (input + complete_blocks) (input_len - complete_blocks);
  extract_sha2_256 dst state

The technique can be used recursively:

let update_many_blocks_sha2_256 = mk_update_many_blocks update_one_block_sha2_256

In short, to add a new algorithm in the MD family, it suffices to write and verify its update_one_block_* function. Then, instantiating the mk_* family of functions yields a completely specialized copy of the code that bears no trace of the original higher-order, generic infrastructure.
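
Concretely, adding a hypothetical new algorithm “xyz” to the family amounts to a couple of one-line instantiations, mirroring the SHA2-256 lines above.

// Hypothetical new MD-family algorithm "xyz": write and verify
// update_one_block_xyz (plus init_xyz, etc.), then obtain the rest by partial
// application of the existing templates.
let update_many_blocks_xyz = mk_update_many_blocks update_one_block_xyz
let hash_xyz = mk_hash init_xyz update_many_blocks_xyz ...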

This is akin to a template in C++, where hash<T> is specialized by the compiler into monomorphic instances for various values of T, except here, we don’t need any special support in the compiler! This is all done with beta-reductions and partial applications.

This technique was developed, settled, and adopted circa 2019, and remains in place to this day for hashes – it is briefly described in the EverCrypt Paper.

Automated Templates

This technique works well for algorithms whose call-graph is not very deep. Here, hash calls into update_many_blocks, which itself calls into update_one_block – there are no further levels of specialized calls. But for much larger pieces of code, manually restructuring many source files with a very deep call graph can prove extraordinarily tedious. No one wants to manually add all of those function parameters!

The solution is simply to perform this rewriting automatically with Meta-F*. Meta-F* allows the user to write regular F* code and have it executed at compile-time; F* exposes its internals via a set of safe-by-construction APIs, which in turn allows meta-programs to inspect definitions, perform proofs, generate new definitions, and much more. This is the technique known as elaborator reflection, pioneered by Lean and Idris.

In our case, the user writes their code “normally”, and adds attributes to indicate which functions need to be rewritten into the mk_* form. Then, a generic meta-program (written once) traverses the function graph, inspects definitions, rewrites them using the mk_* pattern, and inserts the higher-order-style functions into the program. This all happens automatically without user intervention. All that is left for the user is to “instantiate” the template for their particular choice of base functions (e.g., let update_many_blocks_sha2_256 = mk_update_many_blocks update_one_block_sha2_256).
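
To give a rough idea of what this looks like, here is a sketch with made-up attribute names (the actual attributes and splicing machinery in HACL* differ):

// A sketch with hypothetical attribute names. The generic meta-program reads
// the attributes, turns each annotated function into a higher-order mk_*
// version (adding one function parameter per call to a [@specialize]
// function), and splices the generated definitions back into the program.
[@specialize]
val update_one_block: internal_state -> array uint8 -> St unit

[@rewrite]
let update_many_blocks (s: internal_state) (input: array uint8) (len: size_t) =
  ... // calls update_one_block in a loop

[@rewrite]
let hash (dst: array uint8) (input: array uint8) (input_len: size_t) =
  ... // calls init, update_many_blocks, update_final_block, extract

// All that is left to write by hand is the instantiation:
let update_many_blocks_sha2_256 = mk_update_many_blocks update_one_block_sha2_256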

The tactic was developed throughout late 2019, and was adopted at scale in early 2020. At the time of this blog post, this “higher-order rewriting tactic” remains the second largest Meta-F* program ever written; most algorithms in HACL* now rely on the tactic. We use it to “instantiate” an algorithm with a choice of primitives (e.g. HPKE with ChachaPoly+Curve25519+SHA256), a choice of implementations (e.g. ChachaPoly with Chacha-AVX2+Poly-AVX2), or both. We describe this technique in great detail in a pre-print.

Futamura Projection with Recursion on the Normalizer

The final and most complex example in our partial evaluation journey is Noise*, a protocol compiler for the Noise Protocol Framework. Briefly, Noise is a DSL (domain-specific language) that describes a series of key establishment protocols, whose purpose is to establish a shared secret, hence a secure communication channel, between two parties, known as the initiator and the responder. A typical Noise program looks as follows:

IKpsk2:
 <- s
 ...
 -> e, es, s, ss
 <- e, ee, se, psk

Arrows indicate the flow of data: -> flows from initiator to responder, and <- the other way around. The first line, before the ..., indicates data that is available out-of-band: in this case, the initiator knows the responder’s static (s) key.

The actual handshake happens below the .... Each token indicates a particular cryptographic operation: es, for instance, indicates a Diffie-Hellman (DH) operation between the initiator’s ephemeral key and the responder’s static key; furthermore, it is implicitly understood that any further communication will be encrypted using a key derived from the DH secret.

There are 59 protocols in the Noise family; naturally, we want to write and verify an implementation only once! This new challenge is slightly more complicated than what we saw previously with the hashes: a line is a list of tokens, and a handshake is a list of lists of tokens. We thus need to operate over recursive data structures at compile-time – our code had better terminate!

We represent programs in the Noise DSL as a deep embedding:

type token = E | ES | S | SS
type step = list token
type handshake = list step

We write our code in what we call a “hybrid embedding” style. Parts of our code operate over the deep embedding, recurse over the steps of the handshake, and maintain compile-time data, such as “this is the i-th step of the handshake”. Other parts use the regular Low* shallow embedding and perform the actual run-time operations.

The part of the code that operates over the deep embedding executes at compile-time, and thus belongs to the first stage. After the first stage has executed, all that remains is code for the second stage, using the Low* shallow embedding, hence the name “hybrid embedding”.

Concretely, we write an interpreter for Noise programs, whose actions are in Low*:

let rec eval_token (t: token) (s: state) =
  match t with
  | ES ->
      // This is the Low* call and appears in the generated code. The
      // surroundings (let rec, match) are "compiler code" and reduce away at
      // compile-time.
      diffie_hellman s.encryption_key s.ephemeral s.remote_static
  | ...

and eval_step (tokens: step) (s: state) =
  match tokens with
  | t :: tokens ->
      eval_token t s;
      eval_step tokens s
  | [] ->
      ()

and ...

Thanks to the first Futamura projection, we can partially apply the eval_* series of functions to one specific program in the Noise DSL. In the case of the first step of IKpsk2, we obtain:

eval_step [ E; ES; S; SS ] s

~> // reduces to

eval_token E s;
eval_step [ ES; S; SS ] s

~> // reduces to

eval_token E s;
eval_token ES s;
eval_step [ S; SS ] s

~> // reduces to

generate_ephemeral s.ephemeral;
diffie_hellman s.encryption_key s.ephemeral s.remote_static
...

In other words, we have embedded a protocol compiler in F*’s normalizer; to compile a Noise program, it suffices to perform a partial application and the compiler runs at compile-time in F*. The compiler produces a complete protocol stack, including state machine, defensive API, data structures for managing the peer list, and state serialization/deserialization, fully specialized for the given Noise protocol.

This technique allows for a fair degree of sophistication: the example above shows expression-level compilation steps (the matches), but there are also type-level reduction steps. For instance, some fields of the state reduce to unit when not needed for the chosen Noise protocol, which in turn guarantees that they won’t appear in the resulting C code.
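
As a simplified, hypothetical illustration, assume the token type above is extended with a PSK constructor, and that the state is indexed by the handshake (a detail elided in the simplified interpreter above). The type of the psk field can then itself be computed from the deep embedding, and reduces to unit for any protocol that does not use a pre-shared key.

// Simplified, hypothetical sketch of a type-level reduction step. Whether the
// handshake uses a pre-shared key is computed at compile-time from the deep
// embedding; for a protocol without PSK, the field has type unit and is
// erased from the generated C struct.
let uses_psk (hs: handshake): bool =
  List.Tot.existsb (fun step -> List.Tot.mem PSK step) hs

inline_for_extraction
type psk_field (hs: handshake) =
  if uses_psk hs then array uint8 else unit

noeq
type state (hs: handshake) = {
  ephemeral: array uint8;
  remote_static: array uint8;
  encryption_key: array uint8;
  psk: psk_field hs;
}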

Naturally, making this work in practice requires a high degree of familiarity with F*’s internals; and I would be lying if I said that we never pulled our hair out trying to debug a normalizer loop! But the end result is conceptually quite satisfying, and thanks to Son Ho’s incredible work, yielded a very exciting paper that he will present this week at Oakland.

Looking back…

Verifying code is hard, labor-intensive, and exhausting. Any technique that makes the task easier yields substantial gains in productivity (and student happiness). Opportunities abound for leveraging PL techniques to automate the production of verified code! Instead of directly writing Low* code, it is oftentimes much more efficient to write code that produces Low* code. The extra degree of indirection pays off; and the most dramatic gains are achieved when the team’s expertise combines cryptography and PL.