5 Years of Meta-Programming Cryptography
For the past five years or so, I have been exploring, along with many other collaborators, the space of verified cryptography and protocols. The premise is simple: to deliver critical code that comes with strong properties of correctness, safety and security. Some readers may be familiar with Project Everest: this is the umbrella under which a large chunk of this work happened.
This blog post focuses on the specific topic of using meta-programming and compile-time evaluation to scale up cryptographic verification. This theme has been a key technical ingredient in almost every paper since 2016, with each successive publication making greater use of one or both of these techniques. The topic should be of interest to my programming languages (PL)-oriented readers; and, as far as I know, I haven’t really written anything that highlights this unifying theme in the research arc of verified crypto and protocols.
These techniques all come together in our most recent paper on Noise*, in which we use a Futamura projection and a combination of deep and shallow embeddings to run a protocol compiler on the F* normalizer. The Noise* work is the culmination of five years of working on PL techniques for cryptography, and it will be presented this week at Oakland (S&P). This is great timing for a retrospective!
A disclaimer before I go any further: this is my non-professional blog, and this post will inevitably capture my personal views and individual recollections.
Basic Partial Evaluation in Low*
In 2017, we introduced the Low* toolchain. The core idea is that we can model a palatable subset of C directly within F*.
// Allocate a zero-initialized array of 32 bytes on the stack.
let digest = Array.alloca 32ul 0uy in
SHA2.hash digest input input_len
The special Array module is crafted so that F* enforces a C-like discipline, permitting reasoning about liveness, disjointness, array lengths, and so on. (There are many more such modules in Low*.) In the example above, F* would typically check that i) digest and input are live pointers, that ii) input_len represents the length of input, that iii) digest and input are disjoint, and so on. Provided your code verifies successfully and abides by further (mostly syntactic) restrictions, a special compiler called KaRaMeL (née KReMLin) will turn it into C code.
uint8_t digest[32];
SHA2_hash(digest, input, input_len);
As we were developing what would become HACL*, an issue quickly arose. For the purposes of verification, it is wise to split up a function into many small helpers, each with their own pre- and post-conditions. This makes verification more robust; code becomes more maintainable; and small helpers promote code reuse and modularity. However, at code-generation time, this produces a C file with a multitude of three-line functions scattered all over the place. This is not idiomatic, makes C code harder to follow, and generally diminishes our credibility when we want verification-agnostic practitioners to take us seriously. I wrote extensively about this a few years back: even though it’s auto-generated, producing quality C code matters!
The initial workaround (added Dec 2016) was to add an ad-hoc optimization pass to KaRaMeL, the F*-to-C compiler, using a new F* keyword, dubbed [@substitute]. See this vintage example for a typical usage in HACL*: at call-site, KaRaMeL would simply replace any function marked as [@substitute] with its body, substituting actual arguments for formal parameters. This is simply an inlining step, also known as a beta-reduction step. (The careful reader might notice that this is of course unsound in the presence of effectful function arguments, which may be evaluated twice: fortunately, F* guarantees that effectful arguments are let-bound, and thus evaluated only once.)
This was the first attempt at performing partial evaluation at compile-time in order to generate better code, and it got us going for a while. However, it had several drawbacks.
- F* already had a normalizer, that is, an interpreter built into the type inference machinery, for the purpose of higher-order unification. In other words, F* could already reduce terms at compile-time, and owing to its usage of a Krivine Abstract Machine (KAM), was much better equipped than KaRaMeL to do so! The KAM is well-studied, reduces terms in De Bruijn form, and is formally described, which is much better than adding an ad-hoc reduction step somewhere as a KaRaMeL nanopass.
- F*’s normalizer performs a host of other reduction steps; if beta-reductions happen only in KaRaMeL, then a large number of optimizations become impossible, because you need F*’s existing reduction steps and the desired extra beta-reductions to compose and interleave, rather than run one after the other. In other words, you lose out on how much partial evaluation you can trigger.
To illustrate the latter point, consider the example below. Circa 2016, the conditional could not be eliminated, because the beta-reduction (in KaRaMeL) happened after the conditional-elimination step (in F*’s existing normalizer).
// St for "stateful"
[@substitute]
let f (x: bool): St ... =
  if x then
    ...
  else
    ...

let main () =
  f true
This was fixed in September 2017, when F* allowed its inline_for_extraction keyword to apply to stateful beta-redexes. This suddenly “unlocked” many more potential use-cases, which we quickly leveraged. The earlier example now uses the new keyword.
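For illustration, here is roughly what the fixed-up example looks like (a sketch: do_this and do_that are hypothetical stand-ins for the elided branches).
// Same example as above, now relying on F*'s normalizer rather than KaRaMeL.
// do_this and do_that are hypothetical placeholders for the elided branches.
assume val do_this: unit -> St unit
assume val do_that: unit -> St unit

inline_for_extraction
let f (x: bool): St unit =
  if x then
    do_this ()
  else
    do_that ()

let main () =
  f true
Because f is now inlined by F* itself, the conditional-elimination step sees if true then ... else ... and reduces it away; one would expect the generated C for main to be a single call to do_this.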
Overloaded Operators via Partial Evaluation
In 2018, HACL* entered a deep refactoring that eventually culminated in all of the original code being rewritten. That rewrite led to more concise, reusable and effective code.
One of the key changes was overloaded operators. This is an interesting technical feature, as it relies on the conjunction of four F* features: compile-time reduction, implicit argument inference, fine-grained reduction hints and cross-module inlining.
The problem is as follows. The Low* model of C in F* has one module per integer type; effectively, we expose the base types found in C99’s <inttypes.h>. Unfortunately, this means each integer type (there are 9 of them) has its own set of operators. This prevents modularity (functions need to be duplicated for e.g. uint32 and uint64), and hinders readability and programmer productivity.
Overloaded operators are defined as follows:
// IntTypes.fsti. Obviously, a simplification!
type int_width = | U32 | U64 | ...

inline_for_extraction
type int_type (x: int_width) =
  match x with
  | U32 -> LowStar.UInt32.t
  | U64 -> LowStar.UInt64.t

[@@ strict_on_arguments [0]]
inline_for_extraction
val (+): #w:int_width -> int_type w -> int_type w -> int_type w
This interface file only introduces an abstract signature for +; the definition is in the corresponding implementation file.
// IntTypes.fst
let (+) #w x y =
  match w with
  | U32 -> LowStar.UInt32.add x y
  | U64 -> LowStar.UInt64.add x y
Compared to our earlier [@substitute] example, this is a lot more sophisticated! To make this work in practice, we need to combine several mechanisms.
- The inline_for_extraction qualifier indicates that this definition needs to be evaluated and reduced at compile-time by F*. Indeed, the dependent type int_type and the dependently-typed + do not compile to C, so any use of these definitions must be evaluated away at compile-time.
- The strict_on_arguments attribute indicates that this definition should not be reduced eagerly, as it would lead to combinatorial explosion in certain cases. Rather, beta-reduction should take place only when the 0-th argument to + is a constant, which guarantees the match reduces immediately.
- The definition is a val in the interface, meaning it is abstract; this is intentional, and prevents the SMT solver from “seeing” the body of the definition in client modules, and attempting a futile case analysis. To make this work, a special command-line flag must be passed to F*, called --cmi, for cross-module inlining. It makes sure inline_for_extraction traverses abstraction boundaries, but only at extraction-time, not verification-time.
- Finally, in order for this to be pleasant to use, we indicate to F* that it ought to infer the width w, using # to denote an implicit argument.
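Putting it all together, client code can now be written once, generically over the width. Here is a hypothetical sketch (not actual HACL* code) of what a client looks like, and of the C one would expect after partial evaluation.
// A hypothetical client of the overloaded operator (a sketch, not HACL* code).
inline_for_extraction
let double (#w: int_width) (x: int_type w): int_type w =
  x + x

// Instantiating at a concrete width...
let double_32 (x: int_type U32): int_type U32 =
  double #U32 x

// ...should extract to first-order C with no trace of the width parameter:
//   uint32_t double_32(uint32_t x) { return x + x; }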
These abstractions form the foundation of HACL* “v2”. They are briefly described in our CCS Paper.
(Note: this predates the type class mechanism in F*. The same effect would be achievable with type classes, no doubt.)
Hand-written Templates for Hashes
Operators enable code-sharing in the small; we want to enable code-sharing in the large. A prime example is the Merkle-Damgård (MD) family of hash functions (MD5, SHA1, SHA2). All of these functions follow the same high-level structure:
// VASTLY simplified!
let hash (dst: array uint8) (input: array uint8) (input_len: size_t) =
  // Initialize internal hash state
  let state = init () in
  // Compute how many complete blocks are in input, relying on truncating
  // integer division, then feed them into the internal hash state.
  let complete_blocks = (input_len / block_size) * block_size in
  update_many_blocks state input complete_blocks;
  // Compute how many bytes are left in the last partial block. Use pointer
  // arithmetic, then process those leftover bytes as the "last block".
  let partial_block_len = input_len - complete_blocks in
  update_final_block state (input + complete_blocks) partial_block_len;
  // Computation is over: we "extract" the internal hash state into the final
  // digest dst.
  extract dst state
The hash function is generic (identical) across all hash algorithms; furthermore, the update_final_block and update_many_blocks functions are themselves generic, and only depend on the block update function update_block, which is specific to each hash in the family.
We could follow the trick we used earlier, and write something along the lines of:
type alg = SHA2_256 | SHA2_512 | ...

let hash (a: alg) ... =
  let state = init a in
  update_many_blocks a state input; ...
Then, we would have to make sure none of the a parameters remain in the final C code; this in turn would require us to inline all of the helper functions such as init, update_many_blocks, etc. into hash, so as to be able to write:
let hash_sha2_256 = hash SHA2_256
This works, and does produce first-order, low-level C code that contains no run-time checks over the exact kind of MD algorithm; but this comes at the expense of readability: we end up with one huge, illegible hash function, with no trace left of the init/update/finish structure at the heart of the construction. (There are other issues with this approach, notably what happens when you have multiple implementations of e.g. SHA2/256.)
We can do much better than that, and generate quality low-level code without sacrificing readability. The trick is to write hash as a higher-order function, but perform enough partial evaluation that by extraction-time, the higher-order control-flow is gone.
inline_for_extraction
let mk_hash
  (init: unit -> St internal_state ...)
  (update_many_blocks: internal_state -> array uint8 -> size_t -> St unit ...)
  ...
  (dst: array uint8)
  (input: array uint8)
=
  // same code, but this time using the function "pointers" passed as parameters
This function is obviously not the kind of function that one wants to see in C! It uses function pointers and would be wildly inefficient, not to mention that its types would be inexpressible in C.
But one can partially apply mk_hash to its first few arguments:
let hash_sha2_256 = mk_hash init_sha2_256 update_many_blocks_sha2_256 ...
At compile-time, F* performs enough beta-reductions that hash_sha2_256 becomes a specialized hash function that calls init_sha2_256, update_many_blocks_sha2_256, etc., thus producing legible, compact code that calls specialized functions.
// After beta-reduction
let hash_sha2_256 (dst: array uint8) (input: array uint8) (input_len: size_t) =
  // The structure of the function calls is preserved; we get intelligible code
  // as opposed to an inscrutable 500-line function body where everything has
  // been inlined.
  let state = init_sha2_256 () in
  let complete_blocks = (input_len / block_size) * block_size in
  update_many_blocks_sha2_256 state input complete_blocks;
  update_final_block_sha2_256 state (input + complete_blocks) (input_len - complete_blocks);
  extract_sha2_256 dst state
The technique can be used recursively:
let update_many_blocks_sha2_256 = mk_update_many_blocks update_one_block_sha2_256
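Here, mk_update_many_blocks is itself a higher-order template over the per-block function, roughly along these lines (a sketch with simplified, hypothetical signatures, in the same style as mk_hash above).
inline_for_extraction
let mk_update_many_blocks
  (update_one_block: internal_state -> array uint8 -> St unit)
  (state: internal_state)
  (input: array uint8)
  (n_blocks: size_t)
=
  // loop over the n_blocks complete blocks, calling update_one_block on each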
In short, to add a new algorithm in the MD family, it suffices to write and verify its update_one_block_* function. Then, instantiating the mk_* family of functions yields a completely specialized copy of the code that bears no trace of the original higher-order, generic infrastructure.
This is akin to a template in C++, where hash<T> is specialized by the compiler into monomorphic instances for various values of T, except here, we don’t need any special support in the compiler! This is all done with beta-reductions and partial applications.
This technique was developed, settled, and adopted circa 2019, and remains in place to this day for hashes – it is briefly described in the EverCrypt Paper.
Automated Templates
This technique works well for algorithms whose call-graph is not very deep. Here, hash calls into update_many_blocks, which itself calls into update_one_block – there are no further levels of specialized calls. But for much larger pieces of code, manually restructuring many source files with a very deep call graph can prove extraordinarily tedious. No one wants to manually add all of those function parameters!
The solution is simply to perform this rewriting automatically with Meta-F*. Meta-F* allows the user to write regular F* code and have it be executed at compile-time; F* exposes its internals via a set of safe-by-construction APIs, which in turn allows meta-programs to inspect definitions, perform proofs, generate new definitions, and much more. This is the technique known as elaborator reflection and pioneered by Lean and Idris.
In our case, the user writes their code “normally”, and adds attributes to indicate which functions need to be rewritten into the mk_* form. Then, a generic meta-program (written once) traverses the function graph, inspects definitions, rewrites them using the mk_* pattern, and inserts the higher-order-style functions into the program. This all happens automatically without user intervention. All that is left for the user is to “instantiate” the template for their particular choice of base functions (e.g., let update_many_blocks_sha2_256 = mk_update_many_blocks update_one_block_sha2_256).
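Schematically, the user-facing side looks something like the sketch below (the attribute names and generated definitions are illustrative; the actual ones in HACL* differ).
// An illustrative sketch; attribute names are made up for exposition.
[@@ specialize ]   // leave calls to me as a parameter of the generated template
val update_one_block: internal_state -> array uint8 -> St unit

[@@ rewrite ]      // rewrite me into the higher-order mk_* form
let update_many_blocks (state: internal_state) (input: array uint8) (n: size_t) =
  ... // calls update_one_block; the meta-program abstracts those calls away

// The meta-program generates mk_update_many_blocks at compile-time; the user
// only writes the instantiation:
let update_many_blocks_sha2_256 = mk_update_many_blocks update_one_block_sha2_256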
The tactic was developed throughout late 2019, and was adopted at scale in early 2020. At the time of this blog post, this “higher-order rewriting tactic” remains the second largest Meta-F* program ever written; most algorithms in HACL now rely on the tactic. We use it to “instantiate” an algorithm with a choice of primitives (e.g. HPKE with ChachaPoly+Curve25519+SHA256), a choice of implementations (e.g. ChachaPoly with Chacha-AVX2+Poly-AVX2), or both. We describe this technique in great detail in a pre-print.
Futamura projection with Recursion on the Normalizer
The final and most complex example in our partial evaluation journey is Noise*, a protocol compiler for the Noise Protocol Framework. Briefly, Noise is a DSL (domain-specific language) that describes a series of key establishment protocols, whose purpose is to establish a shared secret, hence a secure communication channel, between two parties, known as the initiator and the responder. A typical Noise program looks as follows:
IKpsk2:
  <- s
  ...
  -> e, es, s, ss
  <- e, ee, se, psk
Arrows indicate the flow of data: -> flows from initiator to responder, and conversely. The first line, before the ... separator, indicates data that is available out-of-band: in this case, the initiator knows the responder’s static (s) key. The actual handshake happens below the ... separator. Each token indicates a particular cryptographic operation: es, for instance, indicates a Diffie-Hellman (DH) operation between the initiator’s ephemeral key and the responder’s static key; furthermore, it is implicitly understood that any further communication will be encrypted using a key derived from the DH secret.
There are 59 protocols in the Noise family; naturally, we want to write and verify an implementation only once! This new challenge is slightly more complicated than what we saw previously with the hashes: a line is a list of tokens, and a handshake is a list of lists of tokens. We thus need to operate over recursive data structures at compile-time – our code had better terminate!
We represent programs in the Noise DSL as a deep embedding:
type token = E | ES | S | SS
type step = list token
type handshake = list step
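For instance, the handshake portion of IKpsk2 above is just a value of this type (a sketch, assuming the token type is extended with the EE, SE and PSK constructors that IKpsk2 also uses).
// IKpsk2's two handshake messages, as a value of the deep embedding.
let ikpsk2: handshake = [
  [ E; ES; S; SS ];
  [ E; EE; SE; PSK ];
]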
We write our code in what we call a “hybrid embedding” style. Parts of our code operate over the deep embedding, recurse over the steps of the handshake, and maintain compile-time data, such as “this is the i-th step of the handshake”. Other parts use the regular Low* shallow embedding and perform the actual run-time operations.
The part of the code that operates over the deep embedding executes at compile-time, and thus belongs to the first stage. After the first stage has executed, all that remains is code for the second stage, using the Low* shallow embedding, hence the name “hybrid embedding”.
Concretely, we write an interpreter for Noise programs, whose actions are in Low*:
let rec eval_token (t: token) (s: state) =
  match t with
  | ES ->
      // This is the Low* call and appears in the generated code. The
      // surroundings (let rec, match) are "compiler code" and reduce away at
      // compile-time.
      diffie_hellman s.encryption_key s.ephemeral s.remote_static
  | ...

and eval_step (tokens: step) (s: state) =
  match tokens with
  | t :: tokens ->
      eval_token t s;
      eval_step tokens s
  | [] ->
      ()

and ...
Thanks to the first Futamura projection, we can partially apply the eval_* series of functions to one specific program in the Noise* DSL. In the case of the first step of IKpsk2, we obtain:
eval_step [ E; ES; S; SS ] s
~> // reduces to
eval_token E s;
eval_step [ ES; S; SS ] s
~> // reduces to
eval_token E s;
eval_token ES s;
eval_step [ S; SS ] s
~> // reduces to
generate_ephemeral s.ephemeral;
diffie_hellman s.encryption_key s.ephemeral s.remote_static
...
In other words, we have embedded a protocol compiler in F*’s normalizer; to compile a Noise program, it suffices to perform a partial application and the compiler runs at compile-time in F*. The compiler produces a complete protocol stack, including state machine, defensive API, data structures for managing the peer list, and state serialization/deserialization, fully specialized for the given Noise protocol.
This technique allows for a fair degree of sophistication: we have expression-level compiler steps (the matches above), but we also have type-level reduction steps. For instance, some fields of the state reduce to unit when not needed for the chosen Noise protocol, which in turn guarantees that they won’t appear in the resulting C code.
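As an illustration of the latter, a type-level reduction can be written in the same style as the value-level interpreter. The sketch below is hypothetical (it is not the actual Noise* state type), but shows the idea: a field only exists if the chosen handshake ever uses the corresponding token.
// A hypothetical sketch, not the actual Noise* state type.
inline_for_extraction
let remote_static_field (pattern: handshake): Type0 =
  if List.Tot.existsb (List.Tot.mem S) pattern
  then array uint8  // the protocol transmits a static key: keep the field
  else unit         // the field reduces to unit and vanishes from the C struct

noeq type session_state (pattern: handshake) = {
  ephemeral: array uint8;
  remote_static: remote_static_field pattern;
}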
Naturally, making this work in practice requires a high degree of familiarity with F*’s internals; and I would be lying if I said that we never pulled our hair out trying to debug a normalizer loop! But the end result is conceptually quite satisfying, and thanks to Son Ho’s incredible work, it yielded a very exciting paper that he will present this week at Oakland.
Looking back…
Verifying code is hard, labor-intensive, and exhausting. Any technique that makes the task easier yields substantial gains in productivity (and student happiness). Opportunities abound for leveraging PL techniques to automate the production of verified code! Instead of directly writing Low* code, it is oftentimes much more efficient to write code that produces Low* code. The extra degree of indirection pays off; and the most dramatic gains are achieved when the team’s expertise combines cryptography and PL.