Jonathan Protzenko

First alpha release of HACL* in Rust

2024-03-20T08:00:00-07:00

I recently wrote about ongoing efforts to retarget the compilation of HACL* from C to Rust. Today, Aymeric, the entire HACL* team and I are happy to announce that we have a first alpha release of HACL-rs! (Right on time for the HACS workshop.)

The goal of HACL-rs is to provide a fast, verified, pure, safe Rust library of cryptographic primitives. In the long run, we simply expect HACL-rs to replace the current HACL C code in libcrux; this will in turn remove the C FFI bindings and make it possible to use libcrux as a pure safe Rust library.

We will present this work later in the year at Rust Verify 2024, but we are making an early announcement to gather initial feedback.

So far, the following algorithms are known to pass our test vectors:

hashes: sha1, sha2, sha3, blake2
stream ciphers: chacha20, salsa20
MACs: poly1305, hmac
AEAD: chacha-poly
bignums (all variants)
signature: Ed25519, ECDSA-P256, RSA-PSS, FFDHE

This is pretty much all of HACL, minus the multiplexing/agile EverCrypt APIs, minus vectorized variants, and minus a few stray algorithms that we haven’t gotten around to fixing yet (K256, HKDF).

The code is here. This is all extremely rough, and we are looking for the following kind of feedback:

performance: notably regressions from HACL-C
API feedback: we understand that none of these are Rust-native APIs, but we’d love to know about dealbreakers (e.g., too many &muts) as soon as possible, as this will also shape the final libcrux API
functional bugs: there is still the possibility of runtime failures, as I was mentioning in my previous blog post; while we have plans to fix this once and for all before the final release, any help finding those will save us time

Please file issues, send emails, or find us at HACS!

Rust verification and backwards compatibility

2024-01-05T07:00:00-08:00

Over the past few years, a lot of eyes in the software verification community have turned towards Rust. That’s hardly a surprise: programs written in Rust are easier to verify, owing to the language’s strong ownership discipline; the absence of undefined behaviors; and its strong notion of value. As a result, we now have a plethora of tools for Rust verification: Creusot, Prusti, Verus, and of course our very own Aeneas… not to mention tools like Kani, Gillian-Rust, and many more. In short, 2024, I believe, shall be the year of Rust verification! (And the year of Linux on the desktop, too, of course.)

This is all great and exciting, but in practice, the transition to an all-Rust ecosystem will take time and needs to happen in a gradual fashion. Neither practitioners nor researchers are going to switch entire systems, codebases and toolchains to Rust overnight. Specifically, two transitions need to happen: existing verified codebases need to be ported to Rust, and new verified Rust code needs a backwards-compatibility story for those users who can’t unconditionally adopt Rust just yet.

This blog post introduces two new projects that address the transition challenges above. The first project allows compiling the HACL* verified crypto library to safe Rust (instead of C), thus bringing old code into the new ecosystem. The second project, named Eurydice, compiles Rust to C (yes, you read that right), in order to bring new code into old ecosystems. The two projects are complementary, and connect in both directions our old, legacy toolchains with the new world in which code is authored in Rust directly.

Modernizing old codebases

HACL*, everyone’s favorite crypto library (or so I’m told), currently amounts to 160k lines of verified code. As Churchill famously said, that’s a lot of code, toil, tears and sweat. Simplifying a bit, HACL* is written in a subset of F* called Low*, for low-level F*. Low* models C memory, C concepts (machine integers, loops), and programs written in the Low* subset can be compiled to C via a dedicated compiler called KaRaMeL (“K&R meets ML”).

In spite of all the success of HACL* (parts of which have been integrated into Linux, Python, Firefox, and many more), there are two fundamental limitations to the way HACL* is currently authored.

First, reasoning about C memory is hard, and a lot of time is wasted on mundane, boring memory reasoning, such as “these pointers are not aliased”, or “the footprint of this data structure does not overlap with that other piece of state on the heap”.
Second, generated code remains, ultimately, unpalatable to downstream consumers, no matter how much effort you put in the quality of your parentheses.

Perhaps you, the ever optimistic reader, are thinking that it’s fine, and that these issues will be addressed for the next big project, such as verified MLS. I wish I could share your enthusiasm: if there’s one message that emerged loud and clear from the HACS workshop series, it’s that users want a pure Rust crypto library – no bindings, just safe Rust!

Perhaps you, still the optimistic reader, might think that we just designed a new toolchain, Aeneas, that aims to address these issues once and for all, and not just for crypto. Indeed, by taking Rust code written by programmers, Aeneas avoids the unsavory code-generation step; by doing a pure translation, Aeneas relieves the programmer of unnecessary memory reasoning.

But sadly, there is no world in which we have the resources, manpower and justification to rewrite all of HACL* in Rust/Aeneas. This leaves us with one option: tweak the code-generation to produce Rust code instead of C. This is exactly what Aymeric and I have been up to. The new project is called hacl-rs, and encompasses changes to both HACL* and KaRaMeL.

The translation, for the time being, can be described as follows:

it happens after all of the compilation passes of KaRaMeL, such as monomorphization, data type elimination, and so on – eventually, we want to leverage Rust’s support for these features, but for now, the priority is to get working code
Low*’s arrays (a.k.a. the buffer type) compile to mutable borrowed slices; Low*’s const arrays (a.k.a. the const_buffer type) compile to shared borrowed slices
as an optimization, stack allocations are initially given a Rust array type, and a type-directed translation inserts a borrow to turn it into a slice when the context expects it
the Rust backend of KaRaMeL does a better job at choosing names and laying out files in a way that is amenable to easy cargo compilation.
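To make the type mapping concrete, here is a hand-written sketch of what Rust following these rules might look like. It is not actual KaRaMeL output: the names xor_into and mask_with_pad are hypothetical, and the bodies are invented purely for illustration.

```rust
/// A Low* `buffer` argument becomes `&mut [u8]`; a `const_buffer` argument
/// becomes `&[u8]`. (Hypothetical helper, not taken from HACL*.)
fn xor_into(dst: &mut [u8], src: &[u8]) {
    for (d, s) in dst.iter_mut().zip(src.iter()) {
        *d ^= *s;
    }
}

/// Hypothetical caller illustrating the stack-allocation rule.
fn mask_with_pad(out: &mut [u8]) {
    // A stack allocation keeps its array type `[u8; 4]`...
    let mut pad = [0x0fu8; 4];
    pad[0] = 0xf0;
    // ...and a borrow is inserted where the callee expects a slice.
    xor_into(out, &pad);
}
```

Keeping the array type for as long as possible means the length stays known to the Rust compiler until a slice is actually required.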

Naturally, generating Rust out of Low* is not as trivial as adding a new backend to the existing KaRaMeL compiler.

The first series of problems arises on the HACL* side, which has a slightly irritating (but totally justified) pattern of “maybe in place” functions. Those functions, once compiled to Rust, are certain to trip the borrow-checker. To make things concrete, consider a simplified, albeit representative example: the addition function for fixed-size, 256-bit bignums, as expressed in Low*.

(* 256-bit bignums are represented as an array of 8 32-bit unsigned integers *)
let bignum256_t = buffer uint32_t 8

(* addition takes a pre-allocated destination bignum, and returns the carry as a
 * uint32_t; the Stack annotation means that the function does not
 * heap-allocate *)
val bn_add: dst:bignum256_t -> x:bignum256_t -> y:bignum256_t -> Stack uint32_t
  (fun h0 ->
    disjoint_or_equal dst x /\ (* <-- IMPORTANT BIT *)
    disjoint x y /\
    disjoint dst y)
  (fun h0 r h1 ->
    modifies dst h0 h1 /\
    let dst_nat = as_nat h1 dst in
    let x_nat = as_nat h0 x in
    let y_nat = as_nat h0 y in
    dst_nat == (x_nat + y_nat) % 2^256 /\
    r == (x_nat + y_nat) >> 256)

let ladder ... =
  ...
  bn_add dst foo bar;
  bn_add dst dst baz;
  ...

This signature allows callers to potentially pass the same argument for both x and dst. The reasoning behind this signature is that it allows for an efficient, in-place series of operations where the same bignum is modified through a sequence of operations. Alas, this is exactly the sort of pattern that Rust disallows! The function would compile and type-check per our translation scheme, but would implicitly require its arguments to be disjoint. This means that any call-site that passes aliased arguments will generate a borrow error. There are ways to work around this, such as using interior mutability, but this would be very inefficient.

Instead, there is a systematic rewriting pattern that works quite well, leveraging F*’s compile-time reduction facilities:

(* New wrapper around bn_add, for when we know dst and x are aliased *)
inline_for_extraction (* <-- IMPORTANT BIT *)
let bn_add_aliased (dst: ...) x y: ... = (* same signature as before *)
  let x_copy = copy x in
  bn_add dst x_copy y

let ladder ... =
  ...
  bn_add dst foo bar;
  bn_add_aliased dst dst baz;
  ...

The bn_add_aliased function sports the exact same signature as its non-aliased counterpart, meaning that call-sites remain unchanged, something we are adamant about, since proofs at those call-sites might be fragile and not easily fixable. Superficially, the bn_add_aliased function still violates, once translated, the laws of Rust’s borrow-checker. Fortunately, the inline_for_extraction keyword means that F* reduces this definition, leaving at call-site:

let ladder ... =
  ...
  bn_add dst foo bar;
  let dst_copy = copy dst in
  bn_add dst dst_copy baz;
  ...

As far as the Rust borrow-checker is concerned, this is perfectly fine.
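In Rust terms, the rewriting boils down to the following sketch. The body of bn_add is a plain schoolbook addition written for illustration, the ladder_step wrapper is hypothetical, and the read-only operands are given shared slices for simplicity; none of this is actual hacl-rs output.

```rust
/// Adds two 256-bit bignums (8 limbs of 32 bits, least significant first),
/// writes the result into `dst` and returns the final carry.
fn bn_add(dst: &mut [u32], x: &[u32], y: &[u32]) -> u32 {
    let mut carry = 0u32;
    for i in 0..8 {
        let sum = x[i] as u64 + y[i] as u64 + carry as u64;
        dst[i] = sum as u32;
        carry = (sum >> 32) as u32;
    }
    carry
}

/// Hypothetical call-site mirroring the `ladder` example.
fn ladder_step(dst: &mut [u32; 8], foo: &[u32; 8], bar: &[u32; 8], baz: &[u32; 8]) -> u32 {
    bn_add(dst, foo, bar);
    // `bn_add(dst, dst, baz)` is rejected by the borrow-checker: `dst` would
    // be borrowed mutably and immutably at the same time.
    // After inlining the aliased wrapper, the input is a copy instead:
    let dst_copy = *dst;
    bn_add(dst, &dst_copy, baz)
}
```

The copy costs eight words on the stack, which is the price paid for keeping the original call-sites and their proofs intact.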

The second series of problems concerns code-generation, that is, KaRaMeL’s compilation scheme. Consider the following program, which is fine in Low* (c comes before b, intentionally):

let ladder ... (abcd: array uint32 32) = (* four bignums side-by-side *)
  let a = abcd + 0 in
  let c = abcd + 16 in
  let b = abcd + 8 in
  let d = abcd + 24 in
  ...

One can’t perform arbitrary pointer arithmetic in Rust! The only primitive available is split_at (or split_at_mut, depending). For these cases, we perform a little bit of static analysis on the fly and record the history of pointer arithmetic operations with base abcd in a tree. After the first operation, we record that abcd was split at index 0.

LOW* (SOURCE)     RUST (DESTINATION)                    TREE

let a = abcd+0    let r_a = abcd.split_at_mut(0)            a @ 0

At this stage, Rust’s r_a is a pair of slices, meaning a reference to a in the source code compiles to r_a.1 in Rust.

LOW* (SOURCE)     RUST (DESTINATION)                    TREE

let a = abcd+0    let r_a = abcd.split_at_mut(0)            a @ 0
let c = abcd+16   let r_c = r_a.1.split_at_mut(16)             \
                                                              c @ 16

We keep going, extending the tree to keep track of the relationships between variables. Above, in the generated Rust, r_c is found in the right component of the pair r_a, which it splits at index 16. At that program point, a reference to a becomes r_c.0, while a reference to c becomes r_c.1. In other words, we assume that the intent of the programmer is that the slices be non-overlapping. This static analysis continues, and eventually yields:

LOW* (SOURCE)     RUST (DESTINATION)                    TREE

let a = abcd+0    let r_a = abcd.split_at_mut(0)            a @ 0
let c = abcd+16   let r_c = r_a.1.split_at_mut(16)             \
let b = abcd+8    let r_b = r_c.0.split_at_mut(8)             c @ 16
let d = abcd+24   let r_d = r_c.1.split_at_mut(8)              /   \
                                                             b @ 8  d @ 8

Eventually, a, b, c and d compile to r_b.0, r_b.1, r_d.0 and r_d.1 respectively.
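The final state of the table can be written out as a small self-contained function. This is a sketch for illustration: the wrapper split_abcd is hypothetical, and only the four split_at_mut calls come from the table above.

```rust
/// Splits a 32-word array into four non-overlapping 8-word slices
/// `(a, b, c, d)`, following the sequence of splits recorded in the tree.
fn split_abcd(abcd: &mut [u32]) -> (&mut [u32], &mut [u32], &mut [u32], &mut [u32]) {
    let r_a = abcd.split_at_mut(0); // a = abcd + 0
    let r_c = r_a.1.split_at_mut(16); // c = abcd + 16
    let r_b = r_c.0.split_at_mut(8); // b = abcd + 8
    let r_d = r_c.1.split_at_mut(8); // d = abcd + 24, i.e. 8 past c
    (r_b.0, r_b.1, r_d.0, r_d.1)
}
```

All four slices are mutable and usable at the same time, which is exactly what the original pointer arithmetic expressed, minus the possibility of overlap.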

This optimistic compilation scheme, as it turns out, can handle a very large number of cases in HACL* without requiring any modifications to the source, which is a huge win for us! Of course, this only works because the crypto code that lives in HACL* has a very specific shape and doesn’t perform general-purpose pointer arithmetic.

This scheme has several drawbacks.

It cannot detect the case of actually overlapping slices, because pointer arithmetic operations do not come with the length of the sub-array (technically it’s erased by the time it reaches KaRaMeL). This means that there might be out-of-bounds runtime errors in the generated Rust. This, naturally, is an unpleasant property, and we plan to address it by changing the Low* extraction scheme to retain the length of subarrays. In the meantime, we seem to be lucky, as no such cases appear to happen in the source HACL*.
The scheme needs to be extended in case indices are not statically-known integers; in practice, our code can account for symbolic terms, which may be compared (e.g. e < e + 8) and thus suitably related to one another in the tree. If two indices cannot be compared, the code makes the assumption that the user is splitting the array into chunks going left to right. Again, this is an unpleasant source of imprecision and we would like to rewrite the source code to get rid of all these cases.
In addition, our scheme needs to deal with the (frequent) case where the code restarts pointer arithmetic off of the base, with different indices. For instance, our earlier example might be followed by let ab = abcd + 0; let cd = abcd + 16, meaning we need to discard the previous tree and restart the static analysis. Support for this has been recently implemented.
Finally, we need to detect the intractable case where two different uses are interleaved (e.g. use a and b, use ab, then a and b again). Sadly, this happens in HACL*, but should be easy to flag as an error in the code-gen before we produce Rust code.
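The first drawback can be illustrated with a small sketch. Suppose the source program treated a as longer than 8 words, so that it overlaps b: the optimistic split still cuts a off at index 8, and an access beyond that bound fails at runtime instead of reading into b. The functions split_a_b and read_a are hypothetical and written only for this illustration.

```rust
/// Optimistic translation of `a = abcd + 0` followed by `b = abcd + 8`.
fn split_a_b(abcd: &mut [u32]) -> (&mut [u32], &mut [u32]) {
    let r_a = abcd.split_at_mut(0);
    let r_b = r_a.1.split_at_mut(8);
    (r_b.0, r_b.1)
}

/// Reads word `i` of `a`. In C, `a[10]` would silently read `abcd[10]`;
/// here the slice `a` only has 8 elements, so the checked access yields
/// `None` (and the unchecked form `a[i]` would panic).
fn read_a(abcd: &mut [u32], i: usize) -> Option<u32> {
    let (a, _b) = split_a_b(abcd);
    a.get(i).copied()
}
```

The failure is a clean panic rather than memory corruption, but it is still a runtime error that the verified source never exhibited.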

What about Vale?

Some readers might remember that the HACL* repository also hosts the Vale algorithms, written in a deeply-embedded Intel x64 assembly DSL. Those have their own compiler (more of a printer, really), which generates either assembly files (.S or .asm), or inline assembly headers, i.e., C headers with static inline functions containing __asm__ blocks, a GCC-ism that allows writing assembly directly within C code.

Our plan for those, currently implemented by Aymeric, is to retarget the printer to generate Rust inline assembly syntax. This means that algorithms like Curve25519’s 64-bit Intel ADX version, instead of generating a mixture of C and inline ASM (as is currently done in HACL*), will generate a mixture of Rust and Rust inline ASM. Rust and C inline assembly share many similarities; however, Rust imposes a few additional restrictions, such as forbidding the use of the rbx register for input and output operands, which will require small tweaks to the verified Vale assembly.
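For readers who have not seen Rust inline assembly, here is a minimal sketch of its shape, using the standard asm! macro. It is unrelated to the actual Vale output: the function add_u64 is hypothetical, and a portable fallback is included so the snippet builds on targets other than x86-64.

```rust
/// Wrapping 64-bit addition written with Rust inline assembly (x86-64 only).
#[cfg(target_arch = "x86_64")]
fn add_u64(a: u64, b: u64) -> u64 {
    let mut acc = a;
    // SAFETY: the block only touches the two named registers and the flags.
    unsafe {
        std::arch::asm!(
            "add {acc}, {b}",      // Intel syntax is the default on x86
            acc = inout(reg) acc,  // read-write operand, register picked by the compiler
            b = in(reg) b,         // read-only operand
            options(pure, nomem, nostack),
        );
    }
    acc
}

/// Fallback for other architectures, so that the sketch stays portable.
#[cfg(not(target_arch = "x86_64"))]
fn add_u64(a: u64, b: u64) -> u64 {
    a.wrapping_add(b)
}
```

Operands are declared by class (reg) or by explicit register name, and it is in the latter form that restrictions such as the one on rbx show up.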

Catering to legacy environments

As I mentioned above, Aeneas is the future! Crypto algorithms will be written in Rust by the programmer, and verification elves will confirm that the code exhibits all the required properties. But for all the excitement around Rust, a large variety of contexts still require the use of a legacy toolchain. Your project might be catering to vintage Unices from the 80s; or perhaps targets an embedded environment for which only proprietary C toolchains are available; or your management simply hasn’t reached enlightenment yet.

In spite of all of that, we still want to verify Rust, because it’s so much easier to work on a functional model rather than stateful pointer-wielding code. In order to get the best of both worlds, the Aeneas family is getting a new Greek-named tool: Eurydice. Eurydice connects to Charon, the same Rust compiler plug-in and frontend that Aeneas uses. Leveraging KaRaMeL as a library, Eurydice compiles MIR down to C.

The first challenge is to compile MIR into KaRaMeL’s internal AST. MIR is very regular, and has a clear distinction between arrays, which represent storage space, and pointers, which represent only a single word. The C language, on the other hand, is much murkier when it comes to semantics: there is no way to talk about the contents of an array, and there are no rvalues of array types – even though x may have type int[4], x as an rvalue always contains an implicit address-of operator, something which is explicit in MIR. Resolving that discrepancy requires a type-directed translation step, which happens first in the compilation pipeline.

Then, there are other semantic discrepancies – as I mentioned above, C arrays always decay, meaning they cannot be returned by value: one uses an “outparam” instead. Conversely, in Rust, one can very naturally return an array. This means that array-returning Rust functions need to be rewritten into outparam-taking C functions, unless the array is contained within a struct, in which case C changes its mind and lets you pass the whole thing by value.
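The three shapes at play can be shown side by side, here written in Rust for comparison. This is only a sketch of the idea: the function names are hypothetical and the code Eurydice actually emits is C, not Rust.

```rust
/// Natural Rust: the array is returned by value.
fn first_squares() -> [u32; 4] {
    let mut out = [0u32; 4];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = (i * i) as u32;
    }
    out
}

/// Outparam shape, mirroring a C function that fills a caller-provided
/// array instead of returning one.
fn first_squares_outparam(out: &mut [u32; 4]) {
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = (i * i) as u32;
    }
}

/// Struct-wrapped shape: once the array sits inside a struct, C also
/// allows returning it by value, so no rewriting is needed.
struct Squares {
    data: [u32; 4],
}

fn first_squares_wrapped() -> Squares {
    Squares { data: first_squares() }
}
```

The rewrite from the first shape to the second also has to touch every caller, which must now allocate the destination before the call.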

If anything, Eurydice has given me much greater appreciation for Rust’s clean semantics, and a lot more bitterness about the state of the software industry, seeing that C has been the de facto solution for so long. But I digress.

The rest of the 30 or so nano compilation passes in Eurydice are about code quality and compiling away Rust features (function and data type polymorphism, assigning arrays by value) that C does not support. Of those, half were already found in KaRaMeL (or could be reused with minimal tweaks), and half are written specifically for Eurydice.

One interesting tidbit was the compilation of slices: they compile to a C struct (defined by hand) containing a base pointer (of type void *) and a length. Every slice-using function is naturally polymorphic in Rust; I added support in KaRaMeL for polymorphic opaque functions, which compile in C as a function call receiving the type argument, intended to be implemented by hand as a macro. This means that a Rust function like copy_from_slice is compiled generically by Eurydice, then implemented by hand as follows:

#define core_slice___Slice_T___copy_from_slice(dst, src, t) memcpy(dst.ptr, src.ptr, dst.len * sizeof(t))

Similarly, indexing a slice with a range is compiled generically and can be hand-implemented as a macro that receives a slice s, a range r and a type t:

#define Eurydice_slice_subslice(s, r, t) \
  ((Eurydice_slice){ .ptr = (void*)((t*)s.ptr + r.start), .len = r.end - r.start })

Note that we rely on C compound literals (struct literals, standard since C99 and thus available in C11), so that we can have a struct as a value, and that the type is essential here in order to perform correct pointer arithmetic on a well-typed pointer.

Status

Currently, about a dozen modules from HACL-rs compile to Rust, and some like Curve25519 or Chacha-Poly have successfully been run and tested. This remains preliminary work and we hope to have a non-trivial number of algorithms running and passing test vectors very soon.

For the other direction, we have successfully extracted and compiled a complete implementation of Kyber from Rust to C, with only minimal rewrites related to ongoing support for traits.

Performance

To everyone’s great surprise, performance has not been a concern in either direction. When compiling HACL* to Rust, the performance is within 2% of the original C, and merging those modifications that rewrite alias-taking functions sometimes even improves performance of the C code by a little bit! When compiling Kyber to C via Eurydice, there is no measurable performance difference – intuitively, it probably all looks the same to LLVM.

Wrap up

The future is bright for Rust verification, and it looks like we have a plan for transitioning our ecosystems to Rust. Join the effort, and hit us up if you’d like to help out!

Verified Secure Group Messaging with MLS

2023-06-09T08:00:00-07:00

Long gone are the days of my youth where AOL Messenger and IRC ruled online communication spaces… in this day and age, people use WhatsApp, Signal, Facebook Messenger, or even Instagram Messages (or so I am told). This new generation of messenger apps provides some fundamental technological improvements, such as Unicode Emoji instead of ASCII smileys, or perhaps more interestingly to readers of this blog, End-to-End Encryption.

While I could write several blog posts about the many fascinating facets of Unicode, including segmentation, encodings, normalization, and the creative use of zero-width joiner codepoints to create new emojis… I would like to focus today on MLS, a new standard in the space of secure messaging protocols. Specifically, this blog post focuses on verifying MLS, in order to establish with high assurance that it does, indeed, provide End-to-End Encryption in the context of group messaging. This blog post is an informal version of our paper, which my student Théophile (co-advised with Karthik B at INRIA) will present at Usenix Security this summer.

Our paper sheds light on what it means to perform secure group management; identifies the sub-component of the standard in charge of this and formalizes it. This yields a reference implementation (for interop testing, or just for other implementors’ inspiration), and also yields bugs or shortcomings in the standard (which are now all fixed). Read on for more details.

A word about MLS

Théophile has written an excellent introduction to MLS. To recap briefly: two-party end-to-end secure messaging has been widely studied, notably through the Signal protocol, which powers many encrypted messengers (notably, those mentioned above, but not AOL Messenger or IRC, obviously). But oftentimes, users have multiple devices, and sharing keys between devices is bad™️. Ergo, every conversation is actually a group conversation between multiple devices, and for that we need something beyond Signal. Several solutions exist, but they’re suboptimal (see Théophile’s blog post), which is why “the industry” has been hard at work at the IETF devising a new standard called MLS that aims to provide end-to-end encryption, for groups.

If that all sounds familiar, it is: other protocols like TLS have been going through the same standardization process at the IETF. But unlike TLS, the academic community was involved from the get-go in the design and security analysis of MLS. Which raises the question: what properties are we trying to establish here?

First, some basics: MLS only takes care of processing messages and maintaining group data structures and corresponding cryptographic material. Crucially, MLS assumes the existence of two important components, along with some behavioral properties they must enjoy. First, a delivery service, which is responsible for carrying messages from one device to all the other participants in the group, retrying if need be, handling eventual consistency, and so on. Second, an identity service (sometimes called a directory) which associates some initial cryptographic material to a given identity, and somehow guarantees that the directory entry for Alice is indeed the correct one.

Now on to the expected properties of MLS. The functional requirements of MLS are perhaps unsurprising: it should be able to deal with group members going offline for periods of time then “catching up”; it should be scalable and avoid some quadratic issues currently found in some of the existing solutions; and it should support long-lived groups where people come and go over time.

The security requirements are the interesting part. We expect the usual guarantees: authenticity (message from Alice is indeed from Alice), confidentiality (Eve who isn’t in the group can’t decipher the messages), forward secrecy (decrypting at some point in time doesn’t compromise previous conversations) and post-compromise recovery (even if the key of Bob is compromised, after a period of healing, the cryptographic material has been refreshed and the attacker can’t eavesdrop any further).

Roster agreement, and its security

All of these guarantees make sense as long as we can trust the membership of the group. This is known as roster agreement, and states that everyone in the group is on the same page as to who exactly belongs to the group. This is a particularly tricky property to establish, and attacks have been found by others on exactly this. This brings us to our paper. In it, we tackle questions such as: what exactly is secure group management, how does one capture those security properties, and how does one go about proving them.

The first contribution of the paper is to extricate and disentangle the core group management from what is now a very large standard. We call it TreeSync, and present it as its own, generic protocol that operates independently of the other parts of MLS, which are concerned with epoch secrets and deriving message encryption keys from those secrets. This in itself is novel: the current standard does not identify TreeSync, and it can be difficult to understand what is happening in what is now a very large document. Our work makes it easier to understand the standard from a security and functional standpoint. Thus, TreeSync lives on, with its mandate being to make sure that everyone in the group has signed off (authenticated) the contents of the roster.

I called TreeSync a protocol: it operates on its own data structure (unsurprisingly, a tree), and can process messages to enact addition, deletion, or key refresh of the members currently in the group, meaning the tree evolves over time. Identifying TreeSync as a protocol on its own allows us to i) better understand past attacks as specifically targeting TreeSync, i.e. the group management part of MLS, and ii) to find new attacks that were previously not “visible” due to TreeSync being mixed up with the rest of MLS.

We formalize TreeSync in F*, sprinkle a generous amount of dependent types to encode a variety of invariants, then use the existing DY* framework to reason about protocol security in the symbolic model. As always, using formal reasoning forces us to think about the details: we ended up looking very closely at a core part of TreeSync called “ParentHash”, which computes a digest of the membership tree that TreeSync manages. Informally, the invariant of the membership tree is that each member (leaf) signs (authenticates) the subtree that is still in the same state as the leaf (hasn’t been updated since the leaf was itself changed). This is a crucial piece of information, since it allows new members to verify the signatures to ensure that membership is authenticated (signed) by group members, not by an external attacker.

There were several problems, related to two core optimizations. The first optimization is called “filtered nodes”, and essentially stems from the fact that the membership tree is a complete binary tree, but that in general, some of those nodes are empty. The second optimization is called “unmerged leaves”, and allows de-coupling addition from authentication. Essentially, unmerged leaves are waiting to be authenticated by the next refresh of secrets in the tree (see the paper for a more precise explanation). Overall, our work not only clarified the expected TreeSync invariant, using formal language, but also found several situations in which the invariant could be broken. These in turn defeated the security guarantees of MLS, and in particular could lead to a state of confusion where not all group members agree on whether Alice is in the group or not. All the fixes to the standard stemming from this work have been integrated in the RFC.

There are plenty of other interesting details in the paper, including a signature confusion attack, where the absence of a disambiguating label could allow a signature to be reused in another part of the protocol (this is also very bad™️). But I’ll just conclude this brief overview with a note about implementation.

Actually running the code

In this line of research, and especially in the sort of protocols-meet-the-real-world projects that Karthik and I have been pursuing over the past few years, having actual code has always been a priority. This paper is no exception, and we took great care to ensure that our reference implementation can actually be executed and is not just a bunch of lemmas. This is not just for fun: without serious interop testing, you might be proving great properties about a protocol that turns out to be MLS’, but not the actual MLS.

For this work, with the addition of byte-correct parsers and serializers (more on that in a later blog post), we were able to extract our entire MLS specification. Unlike previous work in Low* that extracts to C, this time, we extracted using F*’s regular extraction pipeline to OCaml. (Side-note: we would like to have a verified implementation of MLS in Rust using Aeneas, more on that also later.) As we had seen previously in other works (Signal*, Noise*…), the performance of the protocol is mostly dominated by its crypto primitives. What we did here is compile MLS to JavaScript using js_of_ocaml, then “plug” the protocol part on top of the efficient WebAssembly crypto primitives from our 2019 work on compiling HACL* to WASM. The result is a very decent implementation of MLS, which we actually integrated into a prototype version of Skype. We were able to successfully converse in secure group settings, and showed that one can be both specification-oriented and have very decent performance.

There is much more to learn about MLS: Théophile is planning an in-depth dive on his blog, meaning I can stop my overview here, and redirect the curious reader to our paper, or the blog. Hit us up if you want a copy of the implementation!

5 Years of Meta-Programming Cryptography

2022-05-22T08:00:00-07:00

For the past five years or so, I have been exploring, along with many other collaborators, the space of verified cryptography and protocols. The premise is simple: to deliver critical code that comes with strong properties of correctness, safety and security. Some readers may be familiar with Project Everest: this is the umbrella under which a large chunk of this work happened.

This blog post focuses on the specific topic of using meta-programming and compile-time evaluation to scale up cryptographic verification. This theme has been a key technical ingredient in almost every paper since 2016, with each successive publication making a greater use of either one of these techniques. The topic is of interest to my programming languages (PL)-oriented readers. And, as far as I know, I haven’t really written anything that highlights this unifying theme in the research arc of verified crypto and protocols.

These techniques all come together in our most recent paper on Noise*, in which we use a Futamura projection and a combination of deep and shallow embeddings to run a protocol compiler on the F* normalizer. The Noise* work is the culmination of five years of working on PL techniques for cryptography, and it will be presented this week at Oakland (S&P). This is great timing for a retrospective!

A disclaimer before I go any further: this is my non-professional blog, and this post will inevitably capture my personal views and individual recollections.

Basic Partial Evaluation in Low*

In 2017, we introduced the Low* toolchain. The core idea is that we can model a palatable subset of C directly within F*.

// Allocate a zero-initialized array of 32 bytes on the stack.
let digest = Array.alloca 32ul 0uy in
SHA2.hash digest input input_len

The special Array module is crafted so that F* enforces a C-like discipline, permitting reasoning about liveness, disjointness, array lengths, and so on. (There are many more such modules in Low*.) In the example above, F* would typically check that i) digest and input are live pointers, that ii) input_len represents the length of input, that iii) digest and input are disjoint, and so on. Provided your code verifies successfully and abides by further (mostly syntactic) restrictions, then a special compiler called KaRaMeL (née KReMLin) will turn it into C code.

uint8_t digest[32];
SHA2_hash(digest, input, input_len);

As we were developing what would become HACL*, an issue quickly arose. For the purposes of verification, it is wise to split up a function into many small helpers, each with their own pre- and post-conditions. This makes verification more robust; code becomes more maintainable; and small helpers promote code reuse and modularity. However, at code-generation time, this produces a C file with a multitude of three-line functions scattered all over the place. This is not idiomatic, makes C code harder to follow, and generally diminishes our credibility when we want verification-agnostic practitioners to take us seriously. I wrote extensively about this a few years back: even though it’s auto-generated, producing quality C code matters!

The initial workaround (added Dec 2016) was to add an ad-hoc optimization pass to KaRaMeL, the F*-to-C compiler, using a new F* keyword, dubbed [@substitute]. See this vintage example for a typical usage in HACL*: at call-site, KaRaMeL would simply replace any function marked as [@substitute] with its body, substituting effective arguments for formal parameters. This is simply an inlining step, also known as a beta-reduction step. (The careful reader might notice that this is of course unsound in the presence of effectful function arguments, which may be evaluated twice: fortunately, F* guarantees that effectful arguments are let-bound, and thus evaluated only once.)

This was the first attempt at performing partial evaluation at compile-time in order to generate better code, and got us going for a while. However, it had several drawbacks.

F* already had a normalizer, that is, an interpreter built into the type inference machinery, for the purpose of higher-order unification. In other words, F* could already reduce terms at compile-time, and owing to its usage of a Krivine Abstract Machine (KAM), was much better equipped than KaRaMeL to do so! The KAM is well-studied, reduces terms in De Bruijn form, and is formally described, which is much better than adding an ad-hoc reduction step somewhere as a KaRaMeL nanopass.
F*’s normalizer performs a host of other reduction steps; if beta-reductions happen only in KaRaMeL, then a large number of optimizations are actually impossible, because you need F*’s existing reduction steps and the desired extra beta-reductions to be interleaved, rather than run one after the other. In other words, you lose on how much partial evaluation you can trigger.

To illustrate the latter point, consider the example below. Circa 2016, the conditional could not be eliminated, because the beta-reduction (in KaRaMeL) happened after the conditional elimination step (in F*’s existing normalizer).

// St for "stateful"
[@substitute]
let f (x: bool): St ... =
  if x then
    ...
  else
    ...

let main () =
  f true

This was fixed in September 2017, when F* allowed its inline_for_extraction keyword to apply to stateful beta-redexes. This suddenly “unlocked” many more potential use-cases, which we quickly leveraged. The earlier example now uses the new keyword.
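The ordering problem can be reproduced on a toy scale. The sketch below is plain Python, not F* or KaRaMeL code; the tiny term language and the pass names `inline` and `fold_if` are invented for illustration. It only aims to show that folding conditionals before inlining leaves the `if` in place, while inlining first exposes it to folding.

```python
# Toy term language (hypothetical, for illustration only):
#   ("bool", b)        boolean literal
#   ("var", name)      variable
#   ("if", c, t, e)    conditional
#   ("call", f, arg)   call to a named one-argument function
#   ("prim", name)     opaque run-time operation

FUNCTIONS = {
    # Models: let f (x: bool) = if x then <prim a> else <prim b>
    "f": ("x", ("if", ("var", "x"), ("prim", "a"), ("prim", "b"))),
}

def subst(term, name, value):
    """Replace every occurrence of variable `name` in `term` by `value`."""
    tag = term[0]
    if tag == "var":
        return value if term[1] == name else term
    if tag == "if":
        return ("if",) + tuple(subst(t, name, value) for t in term[1:])
    if tag == "call":
        return ("call", term[1], subst(term[2], name, value))
    return term  # literals and primitives

def inline(term):
    """Beta-reduction: replace calls to known functions by their bodies."""
    tag = term[0]
    if tag == "call" and term[1] in FUNCTIONS:
        param, body = FUNCTIONS[term[1]]
        return inline(subst(body, param, inline(term[2])))
    if tag == "if":
        return ("if",) + tuple(inline(t) for t in term[1:])
    return term

def fold_if(term):
    """Conditional elimination: `if true then a else b` becomes `a`."""
    tag = term[0]
    if tag == "if":
        cond, then_, else_ = (fold_if(t) for t in term[1:])
        if cond[0] == "bool":
            return then_ if cond[1] else else_
        return ("if", cond, then_, else_)
    if tag == "call":
        return ("call", term[1], fold_if(term[2]))
    return term

main = ("call", "f", ("bool", True))

# Conditionals folded first, inlining afterwards: the `if` survives.
old_pipeline = inline(fold_if(main))
# Inlining first exposes the conditional to folding: only the branch remains.
new_pipeline = fold_if(inline(main))
```

In the first pipeline the conditional is still present in the output; in the second, only the `then` branch is left.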

Overloaded Operators via Partial Evaluation

In 2018, HACL* entered a deep refactoring that eventually culminated in all of the original code being rewritten. That rewrite led to more concise, reusable and effective code.

One of the key changes was overloaded operators. This is an interesting technical feature, as it relies on the conjunction of four F* features: compile-time reduction, implicit argument inference, fine-grained reduction hints and cross-module inlining.

The problem is as follows. The Low* model of C in F* has one module per integer type; effectively, we expose the base types found in C99’s <stdint.h>. Unfortunately, this means each integer type (there are 9 of them) has its own set of operators. This prevents modularity (functions need to be duplicated for e.g. uint32 and uint64), and hinders readability and programmer productivity.

Overloaded operators are defined as follows:

// IntTypes.fsti. Obviously, a simplification!
type int_width = | U32 | U64 | ...

inline_for_extraction
type int_type (x: int_width) =
  match x with
  | U32 -> LowStar.UInt32.t
  | U64 -> LowStar.UInt64.t

[@@ strict_on_arguments [0]]
inline_for_extraction
val (+): #w:int_width -> int_type w -> int_type w -> int_type w

This interface file only introduces an abstract signature for +; the definition is in the corresponding implementation file.

// IntTypes.fst
let (+) #w x y =
  match w with
  | U32 -> LowStar.UInt32.add x y
  | U64 -> LowStar.UInt64.add x y

Compared to our earlier [@substitute] example, this is a lot more sophisticated! To make this work in practice, we need to combine several mechanisms.

The inline_for_extraction qualifier indicates that this definition needs to be evaluated and reduced at compile-time by F*. Indeed, the dependent type int_type and the dependently-typed + do not compile to C, so any use of these definitions must be evaluated away at compile-time.
The strict_on_arguments attribute indicates that this definition should not be reduced eagerly, as it would lead to combinatorial explosion in certain cases. Rather, beta-reduction should take place only when the 0-th argument to + is a constant, which guarantees the match reduces immediately.
The definition is a val in the interface, meaning it is abstract; this is intentional, and prevents the SMT solver from “seeing” the body of the definition in client modules, and attempting a futile case analysis. To make this work, a special command-line flag must be passed to F*, called --cmi, for cross-module inlining. It makes sure inline_for_extraction traverses abstraction boundaries, but only at extraction-time, not verification-time.
Finally, in order for this to be pleasant to use, we indicate to F* that it ought to infer the width w, using # to denote an implicit argument.
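As a rough analogy only, the same shape can be mimicked in Python with closures. Python has no compile-time reduction, so the "specialization" below happens when `mk_double` is called rather than at extraction time; `IntWidth`, `add_u32`, `add_u64` and `mk_double` are names made up for this sketch, and the wrapping additions are mere stand-ins for the Low* operations.

```python
from enum import Enum

class IntWidth(Enum):
    U32 = 32
    U64 = 64

def add_u32(x, y):
    """Stand-in for the 32-bit addition (wrapping, for simplicity)."""
    return (x + y) & 0xFFFFFFFF

def add_u64(x, y):
    """Stand-in for the 64-bit addition (wrapping, for simplicity)."""
    return (x + y) & 0xFFFFFFFFFFFFFFFF

def add(w):
    """Generic `+`: the match on the width runs once, at specialization time."""
    if w is IntWidth.U32:
        return add_u32
    if w is IntWidth.U64:
        return add_u64
    raise ValueError(f"unsupported width: {w}")

def mk_double(w):
    """Width-generic client code, written only once."""
    plus = add(w)  # resolved here, not on every call to `double`

    def double(x):
        return plus(x, x)

    return double

# Two monomorphic instances of the same generic code.
double_u32 = mk_double(IntWidth.U32)
double_u64 = mk_double(IntWidth.U64)
```

Once `double_u32` has been built, no width check remains on its call path, which is the property the F* version obtains statically.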

These abstractions form the foundation of HACL* “v2”. They are briefly described in our CCS Paper.

(Note: this predates the type class mechanism in F*. The same effect would be achievable with type classes, no doubt.)

Hand-written Templates for Hashes

Operators enable code-sharing in the small; we want to enable code-sharing in the large. A prime example is the Merkle-Damgård (MD) family of hash functions (MD5, SHA1, SHA2). All of these functions follow the same high-level structure:

// VASTLY simplified!
let hash (dst: array uint8) (input: array uint8) (input_len: size_t) =
  // Initialize internal hash state
  let state = init () in
  // Compute how many bytes of input form complete blocks, relying on
  // truncating integer division, then feed them into the internal hash state.
  let complete_blocks = (input_len / block_size) * block_size in
  update_many_blocks state input complete_blocks;
  // Compute how many bytes are left in the last partial block. Use pointer
  // arithmetic, then process those leftover bytes as the "last block".
  let partial_block_len = input_len - complete_blocks in
  update_final_block state (input + complete_blocks) partial_block_len;
  // Computation is over: we "extract" the internal hash state into the final
  // digest dst.
  extract dst state

The hash function is generic (identical) across all hash algorithms; furthermore, the update_final_block and update_many_blocks functions are themselves generic, and only depend on the block update function update_block which is specific to each hash in the family.

We could follow the trick we used earlier, and write something along the lines of:

type alg = SHA2_256 | SHA2_512 | ...

let hash (a: alg) ... =
  let state = init a in
  update_many_blocks a state input; ...

Then, we would have to make sure none of the a parameters remain in the final C code; this in turn would require us to inline all of the helper functions such as init, update_many_blocks, etc. into hash, so as to be able to write:

let hash_sha2_256 = hash SHA2_256

This works, and does produce first-order, low-level C code that contains no run-time checks over the exact kind of MD algorithm; but this comes at the expense of readability: we end up with one huge, illegible hash function, with no trace left of the init/update/finish structure at the heart of the construction. (There are other issues with this approach, notably what happens when you have multiple implementations of e.g. SHA2/256.)

We can do much better than that, and generate quality low-level code without sacrificing readability. The trick is to write hash as a higher-order function, but perform enough partial evaluation that by extraction-time, the higher-order control-flow is gone.

inline_for_extraction
let mk_hash
  (init: unit -> St internal_state ...)
  (update_many_blocks: internal_state -> array uint8 -> size_t -> St unit ...)
  ...
  (dst: array uint8)
  (input: array uint8)
  (input_len: size_t)
=
  // same code, but this time using the function "pointers" passed as parameters

This function is obviously not the kind of function that one wants to see in C! It uses function pointers and would be wildly inefficient, not to mention that its types would be inexpressible in C.

But one can partially apply mk_hash to its first few arguments:

let hash_sha2_256 = mk_hash init_sha2_256 update_many_blocks_sha2_256 ...

At compile-time, F* performs enough beta-reductions that hash_sha2_256 becomes a specialized hash function that calls init_sha2_256, update_many_blocks_sha2_256, etc., thus producing legible, compact code that calls specialized functions.

// After beta-reduction
let hash_sha2_256 (dst: array uint8) (input: array uint8) (input_len: size_t) =
  // The structure of the function calls is preserved; we get intelligible code
  // as opposed to an inscrutable 500-line function body where everything has
  // been inlined.
  let state = init_sha2_256 () in
  let complete_blocks = (input_len / block_size) * block_size in
  update_many_blocks_sha2_256 state input complete_blocks;
  update_final_block_sha2_256 state (input + complete_blocks) (input_len - complete_blocks);
  extract_sha2_256 dst state

The technique can be used recursively:

let update_many_blocks_sha2_256 = mk_update_many_blocks update_one_block_sha2_256

In short, to add a new algorithm in the MD family, it simply suffices to write and verify its update_one_block_* function. Then, instantiating the mk_* family of functions yields a completely specialized copy of the code, that bears no trace of the original higher-order, generic infrastructure.

This is akin to a template in C++, where hash<T> is specialized by the compiler into monomorphic instances for various values of T, except here, we don’t need any special support in the compiler! This is all done with beta-reductions and partial applications.
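For readers who prefer something they can run, here is a loose Python rendition of the mk_* pattern. It is only a sketch: Python closures are resolved at run time rather than reduced away at extraction time, and the "toy" algorithm below is a made-up, non-cryptographic stand-in whose only purpose is to give the generic code something to call.

```python
BLOCK_SIZE = 4  # toy block size, in bytes

def mk_update_many_blocks(update_one_block):
    """Generic multi-block loop, parameterized over the block function."""
    def update_many_blocks(state, data, n_bytes):
        for i in range(0, n_bytes, BLOCK_SIZE):
            update_one_block(state, data[i:i + BLOCK_SIZE])
    return update_many_blocks

def mk_hash(init, update_many_blocks, update_final_block, extract):
    """Generic hash skeleton, parameterized over its four building blocks."""
    def hash_(data):
        state = init()
        complete = (len(data) // BLOCK_SIZE) * BLOCK_SIZE
        update_many_blocks(state, data, complete)
        update_final_block(state, data[complete:])
        return extract(state)
    return hash_

# A toy, NON-cryptographic instance: the state is a one-element list.
def init_toy():
    return [0]

def update_one_block_toy(state, block):
    state[0] = (state[0] * 31 + int.from_bytes(block, "big")) % 2**32

def update_final_block_toy(state, rest):
    # Pad the (strictly partial) last block to a full block.
    padded = rest + b"\x80" + bytes(BLOCK_SIZE - 1 - len(rest))
    update_one_block_toy(state, padded)

def extract_toy(state):
    return state[0].to_bytes(4, "big")

# Instantiation, applied recursively as in the text.
update_many_blocks_toy = mk_update_many_blocks(update_one_block_toy)
hash_toy = mk_hash(init_toy, update_many_blocks_toy,
                   update_final_block_toy, extract_toy)
```

Adding another toy algorithm would only require a new `update_one_block_*` and friends; `mk_hash` and `mk_update_many_blocks` stay untouched.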

This technique was developed, settled, and adopted circa 2019, and remains in place to this day for hashes – it is briefly described in the EverCrypt Paper.

Automated Templates

This technique works well for algorithms whose call-graph is not very deep. Here, hash calls into update_many_blocks, which itself calls into update_one_block – there are no further levels of specialized calls. But for much larger pieces of code, manually restructuring many source files with a very deep call graph can prove extraordinarily tedious. No one wants to manually add all of those function parameters!

The solution is simply to perform this rewriting automatically with Meta-F*. Meta-F* allows the user to write regular F* code and have it be executed at compile-time; F* exposes its internals via a set of safe-by-construction APIs, which in turn allows meta-programs to inspect definitions, perform proofs, generate new definitions, and much more. This is the technique known as elaborator reflection and pioneered by Lean and Idris.

In our case, the user writes their code “normally”, and adds attributes to indicate which functions need to be rewritten into the mk_* form. Then, a generic meta-program (written once) traverses the function graph, inspects definitions, rewrites them using the mk_* pattern, and inserts the higher-order-style functions into the program. This all happens automatically without user intervention. All that is left for the user is to “instantiate” the template for their particular choice of base functions (e.g., let update_many_blocks_sha2_256 = mk_update_many_blocks update_one_block_sha2_256).
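The flavor of this rewriting can be imitated in Python, with heavy caveats: the actual tactic is a Meta-F* program that rewrites definitions at compile-time, whereas the sketch below merely rebinds global names at run time. The `template` decorator, the `REGISTRY` table and the `instantiate` function are all invented for this illustration.

```python
import types

REGISTRY = {}

def template(fn):
    """Attribute marking a function as generic (to be specialized later)."""
    REGISTRY[fn.__name__] = fn
    return fn

def instantiate(name, base, cache=None):
    """Copy template `name`, redirecting every reference to a base function
    or to another template towards its specialized version."""
    cache = {} if cache is None else cache
    if name in base:
        return base[name]
    if name in cache:
        return cache[name]
    fn = REGISTRY[name]
    env = dict(fn.__globals__)
    # co_names lists the global names the function body refers to.
    for callee in fn.__code__.co_names:
        if callee in base or callee in REGISTRY:
            env[callee] = instantiate(callee, base, cache)
    specialized = types.FunctionType(
        fn.__code__, env, fn.__name__, fn.__defaults__, fn.__closure__)
    cache[name] = specialized
    return specialized

# User code, written "normally", with no extra function parameters.
BLOCK_SIZE = 4

def update_one_block(state, block):
    raise NotImplementedError("supplied at instantiation time")

@template
def update_many_blocks(state, data):
    for i in range(0, len(data), BLOCK_SIZE):
        update_one_block(state, data[i:i + BLOCK_SIZE])

@template
def hash_blocks(data):
    state = [0]
    update_many_blocks(state, data)
    return state[0]

# Two toy base functions, and the corresponding instances.
def update_one_block_xor(state, block):
    state[0] ^= int.from_bytes(block, "big")

def update_one_block_sum(state, block):
    state[0] = (state[0] + int.from_bytes(block, "big")) % 2**32

hash_xor = instantiate("hash_blocks", {"update_one_block": update_one_block_xor})
hash_sum = instantiate("hash_blocks", {"update_one_block": update_one_block_sum})
```

The user never threads `update_one_block` through `hash_blocks` by hand; the generic `instantiate` walks the call graph and does it for them, which is the part of the workflow this sketch tries to convey.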

The tactic was developed throughout late 2019, and was adopted at scale in early 2020. At the time of this blog post, this “higher-order rewriting tactic” remains the second largest Meta-F* program ever written; most algorithms in HACL now rely on the tactic. We use it to “instantiate” an algorithm with a choice of primitives (e.g. HPKE with ChachaPoly+Curve25519+SHA256), a choice of implementations (e.g. ChachaPoly with Chacha-AVX2+Poly-AVX2), or both. We describe this technique in great detail in a pre-print.

Futamura projection with Recursion on the Normalizer

The final and most complex example in our partial evaluation journey is Noise*, a protocol compiler for the Noise Protocol Framework. Briefly, Noise is a DSL (domain-specific language) that describes a series of key establishment protocols, whose purpose is to establish a shared secret, hence a secure communication channel, between two parties, known as the initiator and the responder. A typical Noise program looks as follows:

IKpsk2:
 <- s
 ...
 -> e, es, s, ss
 <- e, ee, se, psk

Arrows indicate the flow of data: -> flows from initiator to responder, and <- from responder to initiator. The first line, before the ..., indicates data that is available out-of-band: in this case, the initiator knows the responder’s static key (s).

The actual handshake happens below the .... Each token indicates a particular cryptographic operation: es, for instance, indicates a Diffie-Hellman (DH) operation between the initiator’s ephemeral key and the responder’s static key; furthermore, it is implicitly understood that any further communication will be encrypted using a key derived from the DH secret.

There are 59 protocols in the Noise family; naturally, we want to write and verify an implementation only once! This new challenge is slightly more complicated than what we saw previously with the hashes: a line is a list of tokens, and a handshake is a list of lists of tokens. We thus need to operate over recursive data structures at compile-time – our code had better terminate!

We represent programs in the Noise DSL as a deep embedding:

type token = E | ES | S | SS
type step = list token
type handshake = list step

We write our code in what we call a “hybrid embedding” style. Parts of our code operate over the deep-embedding, recurse over the steps of the handshake, and maintain compile-time data, such as “this is the i-th step of the handshake”. Other parts use the regular Low* shallow embedding and perform the actual run-time operations.

The part of the code that operates over the deep embedding executes at compile-time, and thus belongs to the first stage. After the first stage has executed, all that remains is code for the second stage, using the Low* shallow embedding, hence the name “hybrid embedding”.

Concretely, we write an interpreter for Noise programs, whose actions are in Low*:

let rec eval_token (t: token) (s: state) =
  match t with
  | ES ->
      // This is the Low* call and appears in the generated code. The
      // surroundings (let rec, match) are "compiler code" and reduce away at
      // compile-time.
      diffie_hellman s.encryption_key s.ephemeral s.remote_static
  | ...

and eval_step (tokens: step) (s: state) =
  match tokens with
  | t :: tokens ->
      eval_token t s;
      eval_step tokens s
  | [] ->
      ()

and ...

Thanks to the first Futamura projection, we can partially apply the eval_* series of functions to one specific program in the Noise* DSL. In the case of the first step of IKpsk2, we obtain:

eval_step [ E; ES; S; SS ] s

~> // reduces to

eval_token E s;
eval_step [ ES; S; SS ] s

~> // reduces to

eval_token E s;
eval_token ES s;
eval_step [ S; SS ] s

~> // reduces to

generate_ephemeral s.ephemeral;
diffie_hellman s.encryption_key s.ephemeral s.remote_static
...

In other words, we have embedded a protocol compiler in F*’s normalizer; to compile a Noise program, it suffices to perform a partial application and the compiler runs at compile-time in F*. The compiler produces a complete protocol stack, including state machine, defensive API, data structures for managing the peer list, and state serialization/deserialization, fully specialized for the given Noise protocol.
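The two stages can be imitated in Python, as a sketch only. Here the first stage (`compile_token`, `compile_step`, `specialize`) recurses over the deep embedding and returns a flat list of run-time actions; the second stage just runs them. All function names and the logging "state" are invented for the illustration, and unlike F*, Python keeps a small run-time loop instead of fully unrolling the code.

```python
# Second-stage (run-time) operations: toy stand-ins that only log a trace.
def generate_ephemeral(state):
    state["log"].append("generate_ephemeral")

def send_static(state):
    state["log"].append("send_static")

def diffie_hellman(state, local, remote):
    state["log"].append(f"dh({local}, {remote})")

# First-stage ("compiler") code: the matches and the recursion happen here,
# once per protocol, and leave no trace in the list of actions.
def compile_token(token):
    if token == "E":
        return [generate_ephemeral]
    if token == "ES":
        return [lambda s: diffie_hellman(s, "ephemeral", "remote_static")]
    if token == "S":
        return [send_static]
    if token == "SS":
        return [lambda s: diffie_hellman(s, "static", "remote_static")]
    raise ValueError(f"unknown token: {token}")

def compile_step(tokens):
    if not tokens:
        return []
    head, *rest = tokens
    return compile_token(head) + compile_step(rest)

def specialize(handshake):
    """Partially apply the interpreter to one specific handshake."""
    actions = [a for step in handshake for a in compile_step(step)]

    def run(state):
        for action in actions:
            action(state)

    return run

# The first step of the handshake above, from the initiator's point of view.
run_first_step = specialize([["E", "ES", "S", "SS"]])
state = {"log": []}
run_first_step(state)
```

After `specialize` returns, no token is ever inspected again: `run_first_step` only executes the four pre-selected operations, in order.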

This technique allows for a fair degree of sophistication; we have expression-level compiler steps (the matches) above; but we also have type-level reduction steps: for instance, some fields of the state reduce to unit when not needed for the chosen Noise protocol, which in turn guarantees that they won’t appear in the resulting C code.

Naturally, making this work in practice requires a high degree of familiarity with F*’s internals; and I would be lying if I said that we never pulled our hair out trying to debug a normalizer loop! But the end result is conceptually quite satisfying, and thanks to Son Ho’s incredible work, yielded a very exciting paper that he will present this week at Oakland.

Looking back…

Verifying code is hard, labor-intensive, and exhausting. Any technique that makes the task easier yields substantial gains in productivity (and student happiness). Opportunities abound for leveraging PL techniques to automate the production of verified code! Instead of directly writing Low* code, it is oftentimes much more efficient to write code that produces Low* code. The extra degree of indirection pays off; and the most dramatic gains are achieved when the team’s expertise combines cryptography and PL.

What’s new in Everest: Summer 2020

2020-08-13T12:07:00-07:00

In a valiant attempt to wean myself off of doomscrolling, I thought I’d try to write a few blog posts this summer. This one highlights some of the exciting things that happened over the past few months in Everest, and specifically around the HACL and EverCrypt projects.

New bindings for Everest cryptography

The big piece of news is that we now have official OCaml and JavaScript bindings for our cryptographic code, a long-standing request from many consumers.

The OCaml bindings were authored by Victor Dumitrescu. Thanks to Victor’s patches, KreMLin, in addition to generating C code from F*, now also generates matching ocaml-ctypes bindings for the generated C code. The resulting OCaml package is called hacl-star-raw. The API in hacl-star-raw is very low-level and does not enforce any of the preconditions that are present in the F* source code. To make things much more pleasant to use, Victor also authored a separate package that wraps hacl-star-raw with a set of very idiomatic bindings that use functors, nice high-level signatures and types. This latter package is called simply hacl-star and can be installed with opam install hacl-star.

One of the very nice things about calling HACL or EverCrypt from OCaml is that all of the static preconditions can be checked at run-time! The functions that HACL* or EverCrypt expose to clients have preconditions of the form disjoint x y, where x and y are arrays, or preconditions of the form length x = l. Because we bind arrays as bytes in OCaml, the former check boils down to == and the latter check boils down to Bytes.length. So, your OCaml program will not segfault if you misuse the API – it is completely safe.

The JavaScript bindings build on our work on WebAssembly, and were authored by Denis Merigoux. Last year, we published the first fully formalized compilation toolchain from F* to WebAssembly at IEEE S&P’19. In that work, we presented (among other things) a paper formalization of the Low* to WASM compilation, along with its implementation in KreMLin, and its application to HACL*. This gave us WHACL*, the compilation of HACL* to WASM via the formalized toolchain in the KreMLin compiler.

Of course, there’s quite a gap between writing a paper and finishing the packaging, integrating the work under CI, fixing the long tail of small bugs that prevent a smooth integration, etc. So, it wasn’t until a few months ago that we were able to finally declare victory and publish WHACL*.

Just like for OCaml, the JavaScript bindings are made up of two layers. The lower layer is “raw” WASM code as output by the KreMLin compiler. Using this code requires knowing not only the static preconditions that must be satisfied by the clients, but also being aware of the KreMLin calling-convention, i.e. how KreMLin-compiled WASM code expects JavaScript clients to lay out arguments in memory before jumping into the WASM code.

To make things easier for clients, Denis wrote a JavaScript wrapper that hides all of these complexities and offers an idiomatic, “native” JavaScript API based on ArrayBuffers. This has been published in the node package repository and can be installed with npm install hacl-wasm.

One of the highlights of our JavaScript package is that it now makes it easy to use verified cryptography on the web, on the desktop (e.g. Electron) or on the server (via node). You no longer need to wait for WebCrypto to include the newer algorithms!

The documentation for both of these packages is online.

Much improved packaging and distribution

Over the past few months, we’ve steadily made great progress ensuring our code can be consumed easily by anyone interested. This required work in many areas:

The generated C code is now under version control on the master branch and refreshed automatically every night (contribution by Franziskus Kiefer). This means that you can always get the latest and freshest code!
There is now a configure script that performs some amount of auto-detection for your toolchain. This was important to enable compilation on Arm, where some files just flat out don’t compile, because, say, they use 256-bit wide vector instructions, for which there exists no ARM implementation.
We’ve signed up for Travis CI to try compiling the generated C code. We are now testing six different configurations: Linux/ARM64, Linux/AMD64 (4 variants), Windows/AMD64. This was motivated by the wonderful and humbling cornucopia of compiler and toolchain bugs we found. Among the most delightful issues we had:
- a version of Xcode refused to compile our inline assembly because its register allocator bailed; no one was able to reproduce it, not even Apple, so it looks like Travis is using the one exact build of Xcode with the problem, and this build cannot be found anywhere else!
- GCC swapped the order of arguments of an intrinsic at some point throughout its lifetime, but clang, which also defines __GNUC__, always had the right order to begin with
- after fixing the order of arguments, GCC still miscompiled the intrinsic
- no single toolchain agrees on a uniform way to securely zero-out memory
- compiler intrinsics are not found in the same header in MSVC vs. the rest of the world
- and so many more…

As always, this is the most unrewarding kind of work: anyone would rather be doing cool proofs than battle with Travis and debug toolchain issues. But, it’s my firm belief that high-quality packaging is essential to ensure the success of Everest crypto. So, extra thanks to Victor Dumitrescu, Natalia Kulatova, Marina Polubelova and Santiago Zanella-Béguelin for their help debugging and nailing down some of these most vexing issues.

New incremental APIs

Of course, many things happened on the code side, with new algorithms, new vectorized APIs, and an increased usage of meta-programming that has significantly lowered our code-to-proof ratio. All of this is covered in our ePrint, submitted with amazing collaborators Marina Polubelova, Karthikeyan Bhargavan, Benjamin Beurdouche, Aymeric Fromherz, Natalia Kulatova and Santiago Zanella-Béguelin.

But rather than enumerate a long list of improvements, I’ll just focus on one piece of work that I’m particularly excited about.

Many cryptographic primitives fall within a given family of algorithms that share some common characteristics. For instance, Merkle-Damgård hashes, Poly1305 and Blake2 are all block algorithms. This means that they process exactly one block of data at a time; clients who can’t provide data block-by-block must perform buffering themselves, a classic source of bugs owing to modulo-computations. Furthermore, block algorithms require clients to follow a very precise state machine, with a sequence of functions oftentimes referred to as init/update_block/update_last/finish – it is easy for clients to get this wrong. Finally, there are very subtle but essential differences in the state machines; for instance, clients may never call Blake2’s update_last with an empty block, even though it’s ok to do so for SHA2.

In short, init/update_block/update_last/finish is a very error-prone, low-level API and clients are probably better off using higher-level APIs that take care of the buffering, state machine management, and abstract away the idiosyncrasies of each algorithm.

Naturally, this is where formal verification comes in. We decided to start looking higher up the stack, beyond bare cryptographic primitives. In our latest work, we tackle the verification of the high-level APIs that perform internal buffer management and rule out state machine errors by copying internal block state as needed.

We have over a dozen implementations in our tree that are eligible for a high-level API. Writing a copy of the high-level API for each would be bad engineering, a poor use of our time, and of course, not very much fun.

Instead, we wrote a functor that takes as an argument a block-based algorithm, and generates via meta-programming a Low* implementation of its corresponding high-level API. The high-level API has a trivial state machine that admits no user errors; clients can feed the data byte-by-byte thanks to internal buffering; and by virtue of being the application of a single functor, all high-level APIs are meant to be used in the exact same fashion, where unpleasant state machine differences have all been abstracted away.

Thanks to a judicious use of meta-programming and very fine-grained control of meta-time partial evaluation, the resulting code has no cruft. The functor has several tweaking knobs, controlling for instance whether the resulting Low* code will have to perform key management (e.g. Poly1305), or whether this is not needed (e.g. SHA2). In the latter case, the relevant struct fields and function parameters get eliminated via partial evaluation.
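To make the idea concrete, here is a small Python sketch of such a "functor": a function that takes a description of a block algorithm and returns a streaming class. It is not the Low* code; `BlockAlgorithm`, `mk_streaming` and the toy algorithm are names and definitions assumed for the example. The buffering policy keeps the most recent bytes around so that `update_last` is never handed an empty block unless the whole input was empty, and `digest` works on a fresh copy of the block state so that clients may keep feeding data afterwards.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class BlockAlgorithm:
    """Hypothetical description of a block algorithm with pure functions."""
    block_size: int
    init: Callable[[], Any]
    update_block: Callable[[Any, bytes], Any]  # exactly one full block
    update_last: Callable[[Any, bytes], Any]   # final block, possibly partial
    finish: Callable[[Any], bytes]

def mk_streaming(alg):
    """Build a streaming (byte-by-byte) API on top of a block algorithm."""
    class Streaming:
        def __init__(self):
            self._state = alg.init()
            self._buf = b""

        def update(self, data):
            self._buf += bytes(data)
            # Only flush a block when more data follows it, so the buffer
            # always holds the (non-empty) last block of a non-empty input.
            while len(self._buf) > alg.block_size:
                block = self._buf[:alg.block_size]
                self._buf = self._buf[alg.block_size:]
                self._state = alg.update_block(self._state, block)
            return self

        def digest(self):
            # update_last returns a new state; self._state is left untouched.
            return alg.finish(alg.update_last(self._state, self._buf))

    return Streaming

# A toy, NON-cryptographic block algorithm over 4-byte blocks.
def _toy_update_block(state, block):
    return (state * 31 + int.from_bytes(block, "big")) % 2**32

def _toy_update_last(state, rest):
    return (state * 31 + int.from_bytes(rest, "big") + len(rest)) % 2**32

TOY = BlockAlgorithm(4, lambda: 0, _toy_update_block, _toy_update_last,
                     lambda s: s.to_bytes(4, "big"))
ToyStreaming = mk_streaming(TOY)
```

Feeding the same message in one call or one byte at a time yields the same digest, and calling `digest` twice is harmless: those are the two properties the streaming API is meant to guarantee to clients.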

These new high-level APIs are now all available on master, and are known as the “Streaming” APIs. Thanks to Aymeric Fromherz and heroic work by Son Ho, Blake2 has been brought under the same API, meaning we can now trivially generate a byte-by-byte API for any implementation of SHA, Poly1305 or Blake2.

We plan to add SHA3 to the set of available streaming APIs, and also adopt the same approach for other families of algorithms, such as CTR encryption. And, of course, write a paper about this all. Stay tuned!

GitHub strange

2019-12-08T09:00:00-08:00

Earlier this year, I was paired up with my colleagues Chris, Tom, Madan and intern Danielle Gonzalez (from Rochester) for the annual Microsoft Hackathon. The goal was to examine and extract meaningful data from GHTorrent, a colossal data set that contains an exhaustive log of all GitHub events, for all repositories.

Nothing went quite as planned, and we (of course) hit more issues than we anticipated, but in the short span of a couple days we still managed to make some interesting discoveries. In particular, we found some really weird GitHub repositories that to this day raise more questions than answers… Here’s (finally) a writeup!

A monstrous dataset

The dataset was colossal: even on a fine, 20MB/s network connection, merely downloading 100GB of data still takes about two hours… and that’s just for a month’s worth of data. So, there went the first few hours of the hackathon.

The next issue was simply that my machine, even with 1TB of disk, just didn’t have enough space to load the downloaded SQL dump into a local MySQL database; even if I had had the space, according to our extrapolations, this would’ve taken several days. So, we had to come up with a plan B.

A quick language shootout ensued, where we each wrote a quick-and-dirty script that would parse a couple of the SQL dumped tables (in CSV format) and try to compute some basic queries by hand. Perl, Unix shell (grep & co), and OCaml were all attempted. We found that using the excellent CSV package for OCaml gave the fastest results (much more so than a hand-written lexer); they even had the exact option we needed to parse the specific escaping format used by SQL dumps.

The strange world of GitHub repositories

We then attempted to answer a very simple question in our remaining time: which repositories have the most commits? (Over the course of that month.)

It turns out that because we were aggregating and folding over the entire event stream of GitHub, instead of counting the number of commits currently in the tree, we appear to have counted the number of commits pushed to a given repository over the chosen time period. The results were not what we expected, and uncovered some repositories that are not what you can read about, e.g. on this Quora question.
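The difference between the two counts is easy to see on a made-up example. The sketch below is in Python rather than the OCaml we used, and the two-column layout (repository, commit hash) is an assumption for illustration, not the actual GHTorrent schema. Folding over the event stream counts every pushed commit, including ones that are pushed again after a history rewrite.

```python
import csv
import io
from collections import Counter

def count_pushed_commits(rows):
    """Fold over (repository, commit) events, counting every occurrence."""
    counts = Counter()
    for repo, _commit in rows:
        counts[repo] += 1
    return counts

# Made-up sample data, standing in for one dumped table.
SAMPLE = io.StringIO(
    '"alice/clock","c1"\n'
    '"alice/clock","c2"\n'
    '"bob/lists","c3"\n'
    '"alice/clock","c1"\n'  # the same commit, pushed a second time
)

# Backslash escaping, as commonly found in SQL dumps.
reader = csv.reader(SAMPLE, escapechar="\\", doublequote=False)
top = count_pushed_commits(reader).most_common(2)
```

Here `alice/clock` is credited with three pushed commits even though only two distinct commits exist, which is the effect that propelled the clock repositories to the top of our list.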

First place: `tmp_clock_repo`

We initially suspected an error in our script: the first repository on our list was https://github.com/efarbereger/tmp_clock_repo, with over 13 million commits. We immediately went to the project page, only to find a nondescript repository, with no files in it, only 1470 commits, the latest of which was several days ago. Only after we navigated back up to the author’s page did everything suddenly become clear. Eric Farber-Eger, an unsung hero of version control, has a cron job that every five minutes pushes an entire new history to his repo, crafted so that the GitHub heat map of his contributions forms a digital LCD clock. And, 31 * 24 * ( 60 / 5 ) * 1470 is about 13 million, so this checks out.
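Spelled out as a quick Python check, using only the figures quoted above (a 31-day month, one push every five minutes, 1470 commits per pushed history):

```python
days = 31
pushes_per_day = 24 * (60 // 5)   # one push every five minutes
commits_per_push = 1470           # size of each freshly pushed history

total_pushed_commits = days * pushes_per_day * commits_per_push
print(f"{total_pushed_commits:,}")
```

This prints 13,124,160, in line with the 13 million commits we measured.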

A quick aside: Git allows one to entirely rewrite the history by force-pushing, and commit metadata is only indicative; in particular, the date of a given commit can be chosen arbitrarily, either by using a Git library directly (e.g. libgit2) or via the --date option of the command-line frontend. Thus, the heatmap can be used as a virtual LCD display where each pixel is addressable by writing to the (fresh) history commits for that given time period.

I don’t know who Eric is; a Google search doesn’t seem to yield many results; but he has my eternal admiration.

For interested readers, there seems to be an entire set of libraries dedicated to the very task of pushing pixel art on GitHub heatmaps. Some people’s creativity is just astonishing.

Second place: `historyclockimage`

In second place was a now-defunct project called historyclockimage, at nearly 5 million commits pushed. Quite unsurprisingly, it was under username efarbereger. My curiosity and admiration for this mystery man only grew stronger.

Third place: `blocklist-ipsets`

In third place was https://github.com/firehol/blocklist-ipsets, with 3.5 million commits pushed over time, but only one commit in the history. It turns out that this is just one instance of people using GitHub as a cloud storage provider, to store a variety of files which, conceivably, can be easily updated by clients via the use of a simple Git pull.

I guess there is something to be said for the simplicity of the Git workflow? Or is it that setting up dedicated cloud storage is too much work for trivial use-cases?

Fourth place: `heartbeat`

In fourth place, with 1.6 million commits, was perhaps the most intriguing repository: https://github.com/19h/heartbeat. The description reads: “Emergency signed life insurance files.” The contents? Three files: a cryptic README, whose first line is GCM R 20/0c/400 L 20/0c/400 followed by some base64 data, which once decoded, does not seem to have any structure or contents. Beyond the README are two files, lkLocation and lkHeartbeat. I could find very little on this repository, except for a brief Reddit thread.

What is this mysterious project? Is this a cloud-distributed, modern version of the supposed dead-hand radio UVB-76?

Fifth place: `update`

In fifth place, with 1.2 million commits: https://github.com/shenzhouzd/update, which very simply reads: “This repository has been disabled. Access to this repository has been disabled by GitHub staff due to excessive use of resources”. What have you done, shenzhouzd? Why is https://github.com/shenzhouzd/update1 empty?

Special mention: CI logs

Some familiar faces appear further down the line (positions 16 and 52), with CI logs for the impressive infrastructure deployed by our friends at Cambridge Labs.

Very special mention: TV playlists?!!

Further down were a set of repositories all sharing the same characteristics: a single user, with a single repository of the same name, containing a single file: lists/plex.txt. The file appears to be a curated playlist of video channels from across the world; radio stations; and a weird mix of movies hosted on a public website. Is there some hotel, somewhere, in some part of the world, where the VOD system pulls its data from GitHub? Who has curated this list and decided that all the Matrix and Home Alone sequels should be included?

Update: my colleague Tom points out that this seems to be related to http://ccloudtv.org/. I’m unsure why there are many distributed playlists on remote GitHub repositories. Probably for the convenient cloud storage, once again?

Methodology & conclusion

Looking back on the data, this is only a very partial view; after all, it’s only over a given month of a given year. It says nothing about the “overall” importance of a given repository; Linux, for instance, is way down.

The goal, however, was not to have super-serious results anyhow, but just to experiment over a short span of time. I’m glad we found oddities and quirky repositories! There are many more mysteries in this list, which I’ve uploaded online. I’d be happy to hear readers’ theories on the mysterious playlists and the life insurance policy.

The EverCrypt verified cryptographic provider

2019-04-02T10:00:00-07:00

Today, we’re announcing a preview release of EverCrypt, a verified cryptographic provider that offers a comprehensive collection of cryptographic algorithms. EverCrypt automatically selects the best available implementation for your platform (C or assembly); offers unified APIs for families of algorithms (e.g. hashes); and its performance is on-par with what you’d find in, say, OpenSSL. In short, EverCrypt aims to offer a verified cryptographic library that offers the same convenience as existing, industrial-grade libraries, but with added verification guarantees.

A verified cryptographic library is of crucial importance: cryptographic libraries are found in any modern software stack, and they’re incredibly hard to get right. Bugs range from memory corruption and incorrect math to side-channel leaks and even illegal instruction errors. All of these can have incredibly painful consequences. With EverCrypt, we hope to demonstrate that one can rule out these errors with mathematical certainty, without compromising on the feature set or performance.

The technical details, including a precise description of what we verify, are available in the README. The high-level overview, with a human-readable explanation of what we hope to achieve, is on the MSR blog. For a more general perspective on software verification, Quanta Magazine just released a very accessible article that talks about our work.

What we’re hoping for with this alpha release

First, an obligatory disclaimer: this is an alpha release, and important features are missing, such as tests for non-X64 platforms; a C fallback implementation of the AES algorithms; and many others I’m sure. We also have a few admitted proofs here and there throughout our code which couldn’t be wrapped up in time for the release. These will be addressed by the end of this release cycle.

Nevertheless, the reason we’re doing an “informal” release is to gather feedback about the code, even in its current state. Things we’d love to hear about include:

ease-of-use of our library from C
whether the APIs are “idiomatic” or they could use some improvements
integration or build difficulties
things that could be improved to make this even more useful to C clients
most-wanted algorithms / optimized implementations (do you crave an AVX2 SHA256 or an AVX Chacha20?).

In terms of APIs, the most polished one is the hash API (EverCrypt_Hash.h); let us know if there are improvements to be enacted for this style.

Some of the improvements on the radar are: a unified error code in EverCrypt.Error; abstract C structs for all APIs, including Hash and AEAD; and many more.

Please get in touch via a GitHub issue! If sharing an experience report publicly is not possible, private emails also work.

A look behind the scenes

We call EverCrypt a “provider” because it unifies under a single interface two strands of work from Project Everest: HACL*, a cryptographic library written in Low* which generates pure C algorithms, and Vale-Crypto, a collection of assembly algorithms written in Vale.

EverCrypt is, perhaps, the first project to come out of Everest whose ownership is truly spread across all institutions currently working on Everest: Microsoft Research, INRIA, Carnegie Mellon University. It means it’s also the first instance in which we had to reconcile two independently-developed projects.

There were plenty of technical challenges which we will cover in an upcoming paper submission: verifying both the Vale and HACL implementations against the same specifications; crafting suitable interfaces to abstract away the implementation details; doing multiplexing in a style that is friendly to the C preprocessor; and many more.

But the challenge, perhaps, wasn’t so much in the verification itself, but in materializing the union of the two projects; and for that, a lot of work took place behind the scenes. We have, for the whole hacl-star repository, 110,000 lines of hand-written Low* and 25,000 lines of hand-written Vale; the latter translate to 70,000 lines of (generated) F*. Getting those in a single repository, efficiently verifying and building, was in itself a challenge. Several contributor-weeks were devoted to a unified build system; performance improvements at every level in F*; bringing back diverging abstractions into shared library modules that could guarantee safe interoperability; and many more mundane tasks that we’d all rather forget.

The work is still very much in progress and we have a long road towards a 1.0 release, but everyone has been admirably patient in the face of engineering and performance challenges. My hope is that as the technology stabilizes, we’ll be able to allocate more time to documentation and learning materials, and bring onboard more contributors who will help us grow our body of verified code.

Generating C code that people actually want to use

2019-01-04T08:57:33-08:00

Project Everest is a large, collaborative research effort that aims to verify and deploy a new, secure HTTPS stack. All of our code is verified using the F* programming language. Using KreMLin, a dedicated compiler, the verified F* code is compiled to readable C, meaning existing systems projects can readily integrate our verified code. Going to C is what allows people to use our code without having to buy into exotic, strange languages with lambdas.

Three years into the project, we have successfully integrated code into mainstream software, notably Firefox and Windows. Along the way, we learned what it takes to deliver quality C code that can be taken seriously and integrated into an actual source tree. We made many rookie mistakes and discovered the discrepancy between what we, researchers, thought was ok, and what was expected of us by actual software developers. None of these lessons will ever make it into a paper, which makes them a perfect fit for a blog post. I have shared many of these anecdotes in informal conversations or talks; apologies if you’ve heard some of this already.

Our first big “client” of Project Everest turned out to be Firefox. This initial success was in large part due to the well-known strategy of infiltrating PhD students at large companies or non-profits (e.g. Mozilla), hoping that after setting camp as interns, they would be in a position to push for verified code from the inside.

Software engineers review generated code!

We sure expected that generating idiomatic C code was important, and this informed a lot of early design choices in our toolchain. We were surprised, however, by how closely Mozilla reviewed and manually inspected the generated C code. This revealed a number of issues that we never thought were actual blockers.

Many temporaries were inserted by F*, to enforce the evaluation order of arguments, something which is guaranteed by neither C nor OCaml, the two main extraction backends of F*. Alas, this extra precision was not appreciated by reviewers, who very explicitly requested that variables named uu__123456 be eliminated whenever possible.
Computationally-irrelevant function arguments were erased as unit arguments, a correct but suboptimal compilation scheme that gives un-natural function signatures in C headers. Also a no-go.
I wrote the pretty-printer for our compiler, KReMLin, looking at the reference table of operator precedence in C, resulting in a minimal amount of parentheses being inserted; I happily thought I did the optimal thing, until it turned out that no one can remember the relative precedence of + and <<, or | and - – I have myself since then forgotten, and I’ve heard that it even differs across languages…

These shortcomings were fixed as follows.

The auto-generated names were removed using a custom pass in the KReMLin compiler that performs a def-use analysis via a syntactic criterion, and is aware of the C semantics.
The extra unit arguments were removed, first by KReMLin, then in F* itself, which now ensures that the OCaml extraction of F* also generates prettier code.
KReMLin became reasonable and added an option to generate extra parentheses.
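
To make the precedence issue concrete, here is a small sketch. The function names are mine, purely for illustration, and none of this is actual KReMLin output. Each pair of functions computes the same thing; the second variant is simply what a reviewer would rather read.

```c
#include <stdint.h>

/* Minimal parentheses: valid C, but the reader must remember that
   + binds tighter than <<, so this parses as (x + 1) << k. */
uint32_t shift_minimal(uint32_t x, uint32_t k) {
  return x + 1u << k;
}

/* Same computation, with the redundant parentheses spelled out. */
uint32_t shift_explicit(uint32_t x, uint32_t k) {
  return (x + 1u) << k;
}

/* Likewise, - binds tighter than |, so this parses as m | (a - b). */
uint32_t mask_minimal(uint32_t m, uint32_t a, uint32_t b) {
  return m | a - b;
}

/* Same computation, with the redundant parentheses spelled out. */
uint32_t mask_explicit(uint32_t m, uint32_t a, uint32_t b) {
  return m | (a - b);
}
```

GCC's -Wparentheses warning flags both "minimal" forms, which is a decent hint that human reviewers will stumble on them too.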

A myriad of other details were fixed, such as always adding curly braces around conditional blocks, indentation, unused variable elimination, etc.

In short, even if it doesn’t matter for a C compiler, the expectation was truly that the resulting C code would be as crisp, clean and readable as what a C programmer would have written. In hindsight, that’s fair.

Don’t go functional on a C compiler

In early experiments, as programmers of functional training and descent, we relied heavily on recursion: reasoning is easier with recursive calls (it’s easy to call some lemma after the recursive call returns); it also makes implementing our functional specifications easier, something important for students who need a progressive ramp-up with our toolchain. For all those scenarios, we relied on GCC 7’s excellent tail-call optimization, as a stop-gap measure in support of rapid prototyping.

However, once one leaves the cocoon of modern compilers and toolchains, and discovers the vast range of compiler versions and architectures under Mozilla’s CI, it becomes evident that relying on any sort of tail-call optimization would be unreasonable: very few compilers reliably perform it.

We were confronted with that reality when a (temporary) piece of code blew the stack using a non-GCC compiler. It was converting between two abstractions for arrays by performing a very idiomatic, functional, byte-by-byte recursive copy. It was later rewritten with a call to memcpy, and will soon disappear as we unify the two abstractions.
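
For the curious, here is a hedged reconstruction of the kind of code involved. This is not the actual Everest source, and the function names are hypothetical; it only illustrates the difference between the two styles.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Functional style: copy one byte, then recurse on the tail.
   Without tail-call optimization, this uses one stack frame per byte,
   so a large enough len overflows the stack. */
void copy_bytes_rec(uint8_t *dst, const uint8_t *src, size_t len) {
  if (len == 0) {
    return;
  }
  dst[0] = src[0];
  copy_bytes_rec(dst + 1, src + 1, len - 1);
}

/* C style: constant stack usage, regardless of the compiler.
   Like memcpy itself, this assumes dst and src do not overlap. */
void copy_bytes(uint8_t *dst, const uint8_t *src, size_t len) {
  memcpy(dst, src, len);
}
```

Both functions produce the same result on small inputs; only the second one keeps working when the input grows and the compiler declines to turn the recursion into a loop.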

Another gripe that we had was our heavy reliance on passing structures by value. KReMLin supports, as an extension, data types passed as values, using a tagged union compilation scheme. It is awfully convenient, since structures as values have no lifetime, and are hence manipulated as pure values within F*, incurring no extra verification cost.

Passing structures by value is allowed by the C standard, but this is costly. Most ABIs mandate that such structures, if larger than a few words, be passed by pushing their contents onto the stack. This showed up in our performance profiles; our new guideline is to essentially only use those when their size does not exceed four words on a 64-bit architecture.

Furthermore, even small structures caused issues. Due to multiple layers of type abstraction, and owing to the tagged union compilation scheme of KReMLin, we ended up with series of field accesses over 10 fields long, which hit a hardcoded limit in one of our supported compilers. This was later solved by extending KReMLin with three new compilation schemes for data types, optimizing special cases (inductive has one branch; no branch has more than one argument; etc.) to reduce the depth of field accesses.
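
As an illustration, here is roughly what a tagged union looks like in C. This is a hand-written sketch under my own naming assumptions (the type t, its tags and its field names are all hypothetical), not the exact layout KReMLin generates.

```c
#include <stdint.h>

/* Hypothetical source type: type t = Small of uint64 | Pair of uint64 * uint64 */
typedef enum { T_Small, T_Pair } t_tag;

typedef struct {
  t_tag tag;
  union {
    uint64_t case_Small;
    struct { uint64_t fst; uint64_t snd; } case_Pair;
  } val;
} t;

/* By value: the callee receives its own copy of the struct. Depending on
   the ABI and on the size of the struct, that copy may go through the stack. */
uint64_t total_by_value(t x) {
  switch (x.tag) {
    case T_Small: return x.val.case_Small;
    case T_Pair:  return x.val.case_Pair.fst + x.val.case_Pair.snd;
  }
  return 0; /* not reached for well-formed tags */
}

/* By reference: only a pointer is passed, whatever the size of t. */
uint64_t total_by_ref(const t *x) {
  switch (x->tag) {
    case T_Small: return x->val.case_Small;
    case T_Pair:  return x->val.case_Pair.fst + x->val.case_Pair.snd;
  }
  return 0; /* not reached for well-formed tags */
}

/* Special case: a type with a single constructor needs neither a tag nor
   a union, which removes two levels of field access. */
typedef struct { uint64_t fst; uint64_t snd; } pair;

uint64_t pair_total(pair p) {
  return p.fst + p.snd;
}
```

Note how reaching a payload takes three field accesses (x.val.case_Pair.fst) in the general scheme, but only one (p.fst) in the single-constructor case. Stack a few layers of type abstraction on top of the general scheme, and the access paths grow quickly.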

Forget about abstraction and modularity

Working with F*, we (unsurprisingly) split our code into separate modules, to leverage modularity, maintainability, and parallel verification. Within modules, larger functions were split into small, individual helpers, each adorned with their respective pre- and post-conditions, for robustness and readability.

Going to C, a priority for the KReMLin compiler was to get rid of this pesky abstraction and modularity.

First, no one wants to land in their repository an algorithm spread out across 20 different source files, no matter how beautiful and well-designed the original source code was.
Naïvely translating each F* module to an individual C file had a more pernicious effect: this generated a lot of calls across translation units (C files), meaning that a large chunk of the call graph was forced to abide by the rules of external linkage and the calling convention dictated by the compiler’s ABI. Specifically, this prevented the compiler from optimizing function calls by, say, passing more arguments in registers. This had a disastrous effect on performance.
Even splitting functions into smaller helpers prevented a lot of intra-procedural analyses from kicking in.

This was alleviated with two mechanisms:

KReMLin got equipped with a bundling mechanism driven by a micro-DSL, which allows the programmer to specify how to recombine multiple source F* modules into a single C translation unit (file), indicating which functions are the API entry points, leaving the rest of the module marked as static inline, which has a big effect on performance.
The programmer has the ability to mark functions as noextract inline_for_extraction (a feature initially implemented in KReMLin, now in F*), meaning that the helper will be inlined at call-site and will not itself be extracted to a separate function.

The bundling mechanism works in conjunction with a reachability analysis that removes anything not reachable from the API entry points, hence ensuring that no unused helper remains in the generated C code, something that also made our consumers queasy.
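
Here is a sketch of what a bundled translation unit might look like. It is hand-written for this post and is not generated HACL* code: the names rotl32, mix and Example_quarter_round are hypothetical, and I picked the ChaCha20 quarter round simply because it is a familiar example.

```c
#include <stdint.h>

/* Helpers that would have lived in separate F* modules. Once bundled into
   a single translation unit, they get internal linkage, so the compiler is
   free to inline them and ignore the external calling convention. */

/* Rotate left; requires 0 < n < 32. */
static inline uint32_t rotl32(uint32_t x, uint32_t n) {
  return (x << n) | (x >> (32u - n));
}

/* One add-xor-rotate step: x += y; z = (z ^ x) <<< n. */
static inline void mix(uint32_t *x, const uint32_t *y, uint32_t *z, uint32_t n) {
  *x += *y;
  *z = rotl32(*z ^ *x, n);
}

/* The only API entry point, hence the only symbol with external linkage.
   s holds the four words a, b, c, d of a ChaCha20 quarter round. */
void Example_quarter_round(uint32_t s[4]) {
  mix(&s[0], &s[1], &s[3], 16u); /* a += b; d ^= a; d <<<= 16 */
  mix(&s[2], &s[3], &s[1], 12u); /* c += d; b ^= c; b <<<= 12 */
  mix(&s[0], &s[1], &s[3], 8u);  /* a += b; d ^= a; d <<<= 8  */
  mix(&s[2], &s[3], &s[1], 7u);  /* c += d; b ^= c; b <<<= 7  */
}
```

Had rotl32 and mix been left in their own C files with external linkage, every one of those calls would have gone through the full calling convention, and an unused helper would have lingered in the object file.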

Reproducible builds

Our toolchain relies on OCaml, custom compilers, and a variety of tools that turned out to be very hard to set up. We have now switched entirely to a Docker-based CI infrastructure, which makes distributing our code trivial. Mozilla adopted this and runs a Docker build command in their own CI, which errors out if someone tries to modify the generated code instead of fixing the F* source file.

Audit all the non-verified code

This blog post wouldn’t be complete without a few words about the bugs we found. Apart from a compiler bug, two amusing bugs occurred in hand-written code and are worth noting.

We had a somewhat puzzling stack corruption. The reason turned out to be an external header improperly defining a macro. Now for the details…
C99 provides fixed-width integer types (uint32_t, etc.) via #include <inttypes.h>, along with corresponding macros for printing and scanning fixed-width integers via printf and scanf. We rely heavily on <inttypes.h>, but when compiling for Windows, the header didn’t seem to be available by default. We did find, however, a copy of this header somewhere, and used it to write a routine to convert a hex-encoded string into the corresponding byte array, using the SCNx8 macro for scanning a single hex-encoded byte into a temporary on the stack. Unbeknownst to us, the underlying C library had no runtime support for scanning a single byte, but this did not deter the author of the inttypes.h header we found from defining the macro nonetheless, scanning two bytes into the destination address, causing a buffer overrun. It looks like other people (e.g. Android) were careful not to define SCNx8 on Windows.
We regularly audit our hand-written code using Clang’s sanitizers. This caught a few undefined behaviors. The most blatant one was for code reading 4 little-endian bytes, returning the corresponding uint32_t. The author of the initial implementation wrote return *((uint32_t *)src);, not realizing that in the case that src is not aligned, this is undefined.
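
Both bugs have boring, portable alternatives. The sketch below is one way to write them, not necessarily the fix that landed in our tree; the names hex_to_byte and load32_le are used here for illustration only.

```c
#include <stdint.h>
#include <stdio.h>

/* Scan up to two hex digits into a full unsigned int, which is what "%2x"
   expects in standard C, then narrow to a byte. No reliance on SCNx8.
   Returns 1 on success, 0 if no hex digit could be read. */
int hex_to_byte(const char *s, uint8_t *out) {
  unsigned int tmp;
  if (sscanf(s, "%2x", &tmp) != 1) {
    return 0;
  }
  *out = (uint8_t)tmp;
  return 1;
}

/* Read 4 little-endian bytes one at a time. This works for any alignment
   of src, and gives the same result on big-endian machines. */
uint32_t load32_le(const uint8_t *src) {
  return (uint32_t)src[0]
       | ((uint32_t)src[1] << 8)
       | ((uint32_t)src[2] << 16)
       | ((uint32_t)src[3] << 24);
}
```

For instance, calling load32_le on a pointer at an odd offset into a byte buffer is perfectly fine, whereas the cast-based version would be undefined there. Modern compilers typically turn the byte-by-byte version into a single load on platforms where that is safe, so there is little reason to be clever.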

We now have a fairly solid process for distributing code, gathering the artifacts of Project Everest in self-contained directories that any client can copy and integrate into their project. It took us a long time to get there, but we are now ready to scale up to multiple consumers of our code, which is excellent news as we gear up towards more software releases, such as EverCrypt, which I shall cover in a later blog post.