For the past five years or so, I have been exploring, along with many other collaborators, the space of verified cryptography and protocols. The premise is simple: to deliver critical code that comes with strong properties of correctness, safety and security. Some readers may be familiar with Project Everest: this is the umbrella under which a large chunk of this work happened.
This blog post focuses on the specific topic of using meta-programming and compile-time evaluation to scale up cryptographic verification. This theme has been a key technical ingredient in almost every paper since 2016, with each successive publication making greater use of one or both of these techniques. The topic should be of interest to my programming languages (PL)-oriented readers; and, as far as I know, I haven't really written anything that highlights this unifying theme in the research arc of verified crypto and protocols.
These techniques all come together in our most recent paper on Noise*, in which we use a Futamura projection and a combination of deep and shallow embeddings to run a protocol compiler on the F* normalizer. The Noise* work is the culmination of five years of working on PL techniques for cryptography, and it will be presented this week at Oakland (S&P). This is great timing for a retrospective!
A disclaimer before I go any further: this is my non-professional blog, and this post will inevitably capture my personal views and individual recollections.
Basic Partial Evaluation in Low*
In 2017, we introduced the Low* toolchain. The core idea is that we can model a palatable subset of C directly within F*.
```fstar
// Allocate a zero-initialized array of 32 bytes on the stack.
let digest = Array.alloca 32ul 0uy in
SHA2.hash digest input input_len
```
The Array module is crafted so that F* enforces a C-like discipline,
permitting reasoning about liveness, disjointness, array lengths, and so on.
(There are many more such modules in Low*.) In the example above, F* would
typically check that i) digest and input are live pointers, that ii)
input_len represents the length of input, that iii) digest and input are
disjoint, and so on. Provided your code verifies successfully and abides by
further (mostly syntactic) restrictions, then a special compiler called
KaRaMeL (née KReMLin) will turn it into C code.
```c
uint8_t digest[32] = { 0 };
SHA2_hash(digest, input, input_len);
```
As we were developing what would become HACL*, an issue quickly arose. For the purposes of verification, it is wise to split up a function into many small helpers, each with their own pre- and post-conditions. This makes verification more robust; code becomes more maintainable; and small helpers promote code reuse and modularity. However, at code-generation time, this produces a C file with a multitude of three-line functions scattered all over the place. This is not idiomatic, makes C code harder to follow, and generally diminishes our credibility when we want verification-agnostic practitioners to take us seriously. I wrote extensively about this a few years back: even though it’s auto-generated, producing quality C code matters!
The initial workaround (added Dec 2016) was to add an ad-hoc optimization pass
to KaRaMeL, the F*-to-C compiler, using a new F* keyword, dubbed
[@substitute]. See this vintage example
for a typical usage in HACL*: at call-site, KaRaMeL would simply replace any
function marked as
[@substitute] with its body, substituting actual arguments
for formal parameters. This is simply an inlining step, also known as a
beta-reduction step. (The careful reader might notice that this is of course
unsound in the presence of effectful function arguments, which may be evaluated
twice: fortunately, F* guarantees that effectful arguments are let-bound, and
thus evaluated only once.)
This was the first attempt at performing partial evaluation at compile-time in order to generate better code, and got us going for a while. However, it had several drawbacks.
- F* already had a normalizer, that is, an interpreter built into the type inference machinery, for the purpose of higher-order unification. In other words, F* could already reduce terms at compile-time, and owing to its usage of a Krivine Abstract Machine (KAM), was much better equipped than KaRaMeL to do so! The KAM is well-studied, reduces terms in De Bruijn form, and is formally described, which is much better than adding an ad-hoc reduction step somewhere as a KaRaMeL nanopass.
- F*’s normalizer performs a host of other reduction steps; if beta-reductions happen only in KaRaMeL, then a large number of optimizations become impossible, because you need F*’s existing reduction steps and the desired extra beta-reductions to interleave, rather than run one after the other. In other words, you lose on how much partial evaluation you can trigger.
To illustrate the latter point, consider the example below. Circa 2016, the conditional could not be eliminated, because the beta-reduction (in KaRaMeL) happened after the conditional elimination step (in F*’s existing normalizer).
```fstar
// St for "stateful"
[@substitute]
let f (x: bool): St ... =
  if x then ... else ...

let main () =
  f true
```
This was fixed in September 2017, when F* allowed its inline_for_extraction
keyword to apply to stateful beta-redexes. This suddenly “unlocked” many
more potential use-cases, which we quickly leveraged. The earlier example now
goes through with the new keyword.
Overloaded Operators via Partial Evaluation
In 2018, HACL* entered a deep refactoring that eventually culminated in all of the original code being rewritten. That rewrite led to more concise, reusable and effective code.
One of the key changes was overloaded operators. This is an interesting technical feature, as it relies on the conjunction of four F* features: compile-time reduction, implicit argument inference, fine-grained reduction hints and cross-module inlining.
The problem is as follows. The Low* model of C in F* has one module per
integer type; effectively, we expose the base types found in C99’s stdint.h.
This means each integer type (there are 9 of them) has its own set of operators.
This prevents modularity (functions need to be duplicated for e.g. uint32 and
uint64), and hinders readability and programmer productivity.
Overloaded operators are defined as follows:
```fstar
// IntTypes.fsti -- obviously, a simplification!
type int_width = | U32 | U64 | ...

inline_for_extraction
type int_type (x: int_width) =
  match x with
  | U32 -> LowStar.UInt32.t
  | U64 -> LowStar.UInt64.t

[@@ strict_on_arguments [0]]
inline_for_extraction
val (+): #w:int_width -> int_type w -> int_type w -> int_type w
```
This interface file only introduces an abstract signature for +; the
definition is in the corresponding implementation file.
```fstar
// IntTypes.fst
let (+) #w x y =
  match w with
  | U32 -> LowStar.UInt32.add x y
  | U64 -> LowStar.UInt64.add x y
```
Compared to our earlier [@substitute] example, this is a lot more
sophisticated! To make this work in practice, we need to combine several
ingredients.

- The inline_for_extraction qualifier indicates that this definition needs to be evaluated and reduced at compile-time by F*. Indeed, the dependent type int_type and the dependently-typed + do not compile to C, so any use of these definitions must be evaluated away at compile-time.
- The strict_on_arguments attribute indicates that this definition should not be reduced eagerly, as that would lead to combinatorial explosion in certain cases. Rather, beta-reduction should take place only when the 0-th argument to + is a constant, which guarantees that the match reduces immediately.
- The definition is a val in the interface, meaning it is abstract; this is intentional, and prevents the SMT solver from “seeing” the body of the definition in client modules, and attempting a futile case analysis. To make this work, a special command-line flag must be passed to F*, called --cmi, for cross-module inlining. It makes sure inline_for_extraction traverses abstraction boundaries, but only at extraction-time, not verification-time.
- Finally, in order for this to be pleasant to use, we indicate to F* that it ought to infer the width w automatically; F* uses # to denote an implicit argument.
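The effect of this compile-time dispatch can be sketched in Python (the names and the masking trick below are mine, purely illustrative of the idea): the match on the width happens once, at specialization time, and the residual operator contains no dispatch at all.

```python
# MASK and make_add are illustrative names; real widths map to C types.
MASK = {"U32": 0xFFFFFFFF, "U64": 0xFFFFFFFFFFFFFFFF}

def make_add(w):
    """Dispatch on the width w now (a stand-in for F*'s compile-time
    reduction); the returned closure is the residual, width-specific +."""
    mask = MASK[w]             # the `match w with ...` happens here, once
    def add(x, y):
        return (x + y) & mask  # modular arithmetic, like C's unsigned types
    return add

add_u32 = make_add("U32")      # ~ what extraction sees after reduction
print(add_u32(0xFFFFFFFF, 1))  # 0: wraps around like a uint32_t
```

Once make_add has run, add_u32 is a plain two-argument function: the width parameter has been evaluated away, which is exactly what the strict_on_arguments discipline guarantees for a constant 0-th argument.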
These abstractions form the foundation of HACL* “v2”. They are briefly described in our CCS Paper.
(Note: this predates the type class mechanism in F*. The same effect would be achievable with type classes, no doubt.)
Hand-written Templates for Hashes
Operators enable code-sharing in the small; we want to enable code-sharing in the large. A prime example is the Merkle-Damgård (MD) family of hash functions (MD5, SHA1, SHA2). All of these functions follow the same high-level structure:
```fstar
// VASTLY simplified!
let hash (dst: array uint8) (input: array uint8) (input_len: size_t) =
  // Initialize internal hash state
  let state = init () in
  // Compute how many complete blocks are in input, relying on truncating
  // integer division, then feed them into the internal hash state.
  let complete_blocks = (input_len / block_size) * block_size in
  update_many_blocks state input complete_blocks;
  // Compute how many bytes are left in the last partial block. Use pointer
  // arithmetic, then process those leftover bytes as the "last block".
  let partial_block_len = input_len - complete_blocks in
  update_final_block state (input + complete_blocks) partial_block_len;
  // Computation is over: we "extract" the internal hash state into the final
  // digest dst.
  extract dst state
```
The hash function is generic (identical) across all hash algorithms; the
helper functions, such as update_many_blocks, are themselves generic, and
only depend on the block update function update_block, which is specific to
each hash in the family.
We could follow the trick we used earlier, and write something along the lines of:
```fstar
type alg = SHA2_256 | SHA2_512 | ...

let hash (a: alg) ... =
  let state = init a in
  update_many_blocks a state input;
  ...
```
Then, we would have to make sure none of the a parameters remain in the final
C code; this in turn would require us to inline all of the helper functions
update_many_blocks, etc. into hash, so as to be able to write:

```fstar
let hash_sha2_256 = hash SHA2_256
```
This works, and does produce first-order, low-level C code that contains no
run-time checks over the exact kind of MD algorithm; but this comes at the
expense of readability: we end up with one huge, illegible function, with no
trace left of the init/update/finish structure at the heart of the
construction. (There are other issues with this approach, notably what
happens when you have multiple implementations of e.g. SHA2/256.)
We can do much better than that, and generate quality low-level code without
sacrificing readability. The trick is to write
hash as a higher-order
function, but perform enough partial evaluation that by extraction-time, the
higher-order control-flow is gone.
```fstar
inline_for_extraction
let mk_hash
  (init: unit -> St internal_state ...)
  (update_many_blocks: internal_state -> array uint8 -> size_t -> St unit ...)
  ...
  (dst: array uint8)
  (input: array uint8)
=
  // same code, but this time using the function "pointers" passed as parameters
```
This function is obviously not the kind of function that one wants to see in C! It uses function pointers and would be wildly inefficient, not to mention that its types would be inexpressible in C.
But one can partially apply
mk_hash to its first few arguments:
```fstar
let hash_sha2_256 = mk_hash init_sha2_256 update_many_blocks_sha2_256 ...
```
At compile-time, F* performs enough beta-reductions that
hash_sha2_256 becomes a specialized hash function that calls
update_many_blocks_sha2_256, etc., thus producing legible,
compact code that calls specialized functions.
```fstar
// After beta-reduction
let hash_sha2_256 (dst: array uint8) (input: array uint8) =
  // The structure of the function calls is preserved; we get intelligible code
  // as opposed to an inscrutable 500-line function body where everything has
  // been inlined.
  let state = init_sha2_256 () in
  update_many_blocks_sha2_256 state input;
  update_final_block_sha2_256 state (input - input % block_size);
  extract_sha2_256 dst state
```
The technique can be used recursively:
```fstar
let update_many_blocks_sha2_256 = mk_update_many_blocks update_one_block_sha2_256
```
In short, to add a new algorithm in the MD family, it simply suffices to write
and verify its update_one_block_* function. Then, instantiating the mk_*
family of functions yields a completely specialized copy of the code, that bears
no trace of the original higher-order, generic infrastructure.
This is akin to a template in C++, where
hash<T> is specialized by the
compiler into monomorphic instances for various values of
T, except here, we
don’t need any special support in the compiler! This is all done with
beta-reductions and partial applications.
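As a hedged illustration, the same template-by-partial-application pattern can be transcribed into Python (mk_hash, the toy primitives, and the checksum "algorithm" below are hypothetical stand-ins, not HACL* code):

```python
def mk_hash(init, update_block, extract, block_size):
    """Generic MD-style driver: the primitives are parameters (function
    "pointers"), to be specialized away by partial application."""
    def hash_(data: bytes):
        state = init()
        complete = len(data) // block_size * block_size
        for i in range(0, complete, block_size):   # ~ update_many_blocks
            update_block(state, data[i:i + block_size])
        update_block(state, data[complete:])       # ~ update_final_block
        return extract(state)
    return hash_

# A toy "algorithm": the state accumulates a byte checksum (NOT a real hash).
def toy_init():
    return {"acc": 0}

def toy_update(state, block):
    state["acc"] = (state["acc"] + sum(block) + len(block)) & 0xFF

def toy_extract(state):
    return state["acc"]

# The analogue of `let hash_toy = mk_hash init_toy ...`: a specialized hash.
toy_hash = mk_hash(toy_init, toy_update, toy_extract, block_size=4)
print(toy_hash(b"abcdefg"))  # 195 (toy checksum, not a digest)
```

The difference with F* is that Python resolves the indirection at run-time, whereas F*'s normalizer performs the application at compile-time, so the extracted C code contains direct calls and no function pointers.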
This technique was developed, settled, and adopted circa 2019, and remains in place to this day for hashes – it is briefly described in the EverCrypt Paper.
This technique works well for algorithms whose call-graph is not very deep.
hash calls into
update_many_blocks, which itself calls into
update_one_block – there are no further levels of specialized calls. But for
much larger pieces of code, manually restructuring many source files with a very deep
call graph can prove extraordinarily tedious. No one wants to manually add all
of those function parameters!
The solution is simply to perform this rewriting automatically with Meta-F*. Meta-F* allows the user to write regular F* code and have it be executed at compile-time; F* exposes its internals via a set of safe-by-construction APIs, which in turn allows meta-programs to inspect definitions, perform proofs, generate new definitions, and much more. This is the technique known as elaborator reflection and pioneered by Lean and Idris.
In our case, the user writes their code “normally”, and adds attributes to
indicate which functions need to be rewritten into the
mk_* form. Then, a
generic meta-program (written once) traverses the function graph, inspects
definitions, rewrites them using the
mk_* pattern, and inserts the
higher-order-style functions into the program. This all happens automatically
without user intervention. All that is left for the user is to “instantiate” the
template for their particular choice of base functions (e.g.,
let update_many_blocks_sha2_256 = mk_update_many_blocks update_one_block_sha2_256).
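To give a flavor of what the rewriting buys you, here is a loose Python analogue (mk_template, update_block and update_many_blocks are hypothetical names; the real tactic manipulates F* terms, not Python functions): a helper that turns a function calling fixed global helpers into a template parameterized over them, without editing its source.

```python
import types

def mk_template(fn, helper_names):
    """Turn fn, which calls fixed global helpers, into a template: mk(...)
    rebuilds fn with its references to helper_names rebound to the supplied
    implementations. A crude stand-in for term rewriting in Meta-F*."""
    def mk(*helpers):
        env = dict(fn.__globals__)            # copy, don't mutate the module
        env.update(zip(helper_names, helpers))
        return types.FunctionType(fn.__code__, env, fn.__name__,
                                  fn.__defaults__, fn.__closure__)
    return mk

def update_block(state, block):               # "reference" helper
    return state + sum(block)

def update_many_blocks(state, blocks):        # calls the global helper by name
    for b in blocks:
        state = update_block(state, b)
    return state

mk_update_many_blocks = mk_template(update_many_blocks, ["update_block"])
# "Instantiate" the template with a swapped-in implementation:
fast = mk_update_many_blocks(lambda st, b: st + 2 * sum(b))
print(update_many_blocks(0, [[1, 2], [3]]))   # 6, original untouched
print(fast(0, [[1, 2], [3]]))                 # 12, uses the new helper
```

The meta-program plays the role of mk_template here: it abstracts a definition over its callees so that one verified body can be instantiated with many choices of primitives or implementations.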
The tactic was developed throughout late 2019, and was adopted at scale in early 2020. At the time of this blog post, this “higher-order rewriting tactic” remains the second largest Meta-F* program ever written; most algorithms in HACL now rely on the tactic. We use it to “instantiate” an algorithm with a choice of primitives (e.g. HPKE with ChachaPoly+Curve25519+SHA256), a choice of implementations (e.g. ChachaPoly with Chacha-AVX2+Poly-AVX2), or both. We describe this technique in great detail in a pre-print.
Futamura projection with Recursion on the Normalizer
The final and most complex example in our partial evaluation journey is Noise*, a protocol compiler for the Noise Protocol Framework. Briefly, Noise is a DSL (domain-specific language) that describes a series of key establishment protocols, whose purpose is to establish a shared secret, hence a secure communication channel, between two parties, known as the initiator and the responder. A typical Noise program looks as follows:
```
IKpsk2:
  <- s
  ...
  -> e, es, s, ss
  <- e, ee, se, psk
```
Arrows indicate the flow of data: -> flows from initiator to responder, and
<- conversely. The first line, before the ..., indicates data that is available
out-of-band: in this case, the initiator knows the server’s static key (s).
The actual handshake happens below the .... Each token indicates a particular
cryptographic operation: es, for instance, indicates a Diffie-Hellman (DH)
operation between the initiator’s ephemeral key and the responder’s static key;
furthermore, it is implicitly understood that any further communication will be
encrypted using a key derived from the DH secret.
There are 59 protocols in the Noise family; naturally, we want to write and verify an implementation only once! This new challenge is slightly more complicated than what we saw previously with the hashes: a line is a list of tokens, and a handshake is a list of lists of tokens. We thus need to operate over recursive data structures at compile-time – our code had better terminate!
We represent programs in the Noise DSL as a deep embedding:
```fstar
type token = E | ES | S | SS
type step = list token
type handshake = list step
```
We write our code in what we call a “hybrid embedding” style. Parts of our code
operate over the deep embedding, recurse over the steps of the handshake, and
maintain compile-time data, such as “this is the i-th step of the handshake”.
Other parts use the regular Low* shallow embedding and perform the actual
cryptographic operations. The part of the code that operates over the deep
embedding executes at compile-time, and thus belongs to the first stage. After
the first stage has executed, all that remains is code for the second stage,
using the Low* shallow embedding, hence the name “hybrid embedding”.
Concretely, we write an interpreter for Noise programs, whose actions are in Low*:
```fstar
let rec eval_token (t: token) (s: state) =
  match t with
  | ES ->
      // This is the Low* call and appears in the generated code. The
      // surroundings (let rec, match) are "compiler code" and reduce away at
      // compile-time.
      diffie_hellman s.encryption_key s.ephemeral s.remote_static
  | ...

and eval_step (tokens: step) (s: state) =
  match tokens with
  | t :: tokens ->
      eval_token t s;
      eval_step tokens s
  | [] -> ()

and ...
```
Thanks to the first Futamura projection, we can partially apply the eval_*
series of functions to one specific program in the Noise* DSL. In the case of
the first step of IKpsk2, we obtain:
```fstar
eval_step [ E; ES; S; SS ] s
~> // reduces to
eval_token E s;
eval_step [ ES; S; SS ] s
~> // reduces to
eval_token E s;
eval_token ES s;
eval_step [ S; SS ] s
~> // reduces to
generate_ephemeral s.ephemeral;
diffie_hellman s.encryption_key s.ephemeral s.remote_static
...
```
In other words, we have embedded a protocol compiler in F*’s normalizer; to compile a Noise program, it suffices to perform a partial application and the compiler runs at compile-time in F*. The compiler produces a complete protocol stack, including state machine, defensive API, data structures for managing the peer list, and state serialization/deserialization, fully specialized for the given Noise protocol.
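A hedged Python sketch of this first Futamura projection (token names borrowed from Noise; the actions are trace strings of my own invention, standing in for Low* calls) shows the same staging: the interpreter's recursion and dispatch run at specialization time, leaving straight-line residual code.

```python
def compile_token(t):
    """Dispatch on t at specialization time; only the returned action (a
    stand-in for a Low* call) survives in the residual program."""
    if t == "E":
        return lambda s: s.append("generate_ephemeral")
    if t == "ES":
        return lambda s: s.append("dh(ephemeral, remote_static)")
    raise ValueError(t)

def specialize_step(tokens):
    """First Futamura projection, by hand: run the interpreter's control
    structure now, and return a program specialized to this token list."""
    actions = [compile_token(t) for t in tokens]   # recursion unrolled here
    def step(s):
        for a in actions:                          # straight-line residual code
            a(s)
    return step

ik_first_tokens = ["E", "ES"]     # a prefix of IKpsk2's first message
step = specialize_step(ik_first_tokens)
trace = []
step(trace)
print(trace)  # ['generate_ephemeral', 'dh(ephemeral, remote_static)']
```

In F*, the normalizer performs this specialization for free: partially applying eval_step to a concrete token list reduces away all the matches and recursion, exactly as specialize_step does by hand here.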
This technique allows for a fair degree of sophistication: we have
expression-level compiler steps (the matches above), but we also have
type-level reduction steps: for instance, some fields of the state reduce to
unit when not needed for the chosen Noise protocol, which in turn guarantees
that they won’t appear in the resulting C code.
Naturally, making this work in practice requires a high degree of familiarity with F*’s internals; and I would be lying if I said that we never pulled our hair out trying to debug a normalizer loop! But the end result is conceptually quite satisfying, and thanks to Son Ho’s incredible work, yielded a very exciting paper that he will present this week at Oakland.
Verifying code is hard, labor-intensive, and exhausting. Any technique that makes the task easier yields substantial gains of productivity (and student happiness). Opportunities abound for leveraging PL techniques to automate the production of verified code! Instead of directly writing Low* code, it is oftentimes much more efficient to write code that produces Low* code. The extra degree of indirection pays off; and the most dramatic gains are achieved when the team’s expertise combines cryptography and PL.