Eurydice: a Rust to C compiler (yes)
Perhaps the greatest surprise of the last two years was, for me, the realization that people not only care about compiling C to Rust (for obvious reasons, such as, ahem, memory safety) – they also care about compiling Rust to C! Wait, what?
I wrote about this briefly a couple years ago, but the level of interest for the project, I must say, took me somewhat by surprise. So let’s talk about compiling Rust to C a little more today.
Barriers to Rust adoption
Rust is making big progress in terms of adoption, and represents a great value proposition, especially for new code. Both my former employer and my new employer, like pretty much everyone else these days, have big projects that are written in pure Rust or can have Rust components. Even Windows kernel drivers can be written in Rust now. Amazing stuff.
However, if your project is, say, an open-source library that gets compiled on a wonderfully diverse set of target architectures, OSes, distributions and toolchains, well, chances are… one of these is not going to support Rust. Think of a crypto library: there will be people out there with an obscure compiler for a weird embedded target, and they really want to compile your library, because they’ve been told not to roll out their own crypto. Or perhaps you have a format library ridden with memory errors and you want to port it to Rust. Or maybe your company has an in-house analysis that only runs on C code. Regardless of the scenario, there will always be that one legacy use-case that prevents you from switching to Rust until it’s 2035, all those LTS versions (looking at you RHEL) are finally retired, and you yourself are too close to retirement to even care anymore.
That is, unless you’re willing to use a Rust to C compiler.
Why?
Having a backwards-compat scenario where Rust can be compiled to C serves several purposes.
- It allows for a gradual transition. The codebase can be ported to Rust, and refactored / cleaned up / rewritten to use all the nice Rust things (data types, pattern-matching, polymorphism, memory safety), thus making you and your developers much, much happier. Meanwhile, the C version co-exists so that you don’t alienate your userbase.
- It only requires maintaining a single version. The Rust code is authoritative; the C code is derived from it automatically, either on CI, or at least with a CI job that checks that the two are in sync.
- It allows for a census of problematic scenarios. By making the Rust version
the default (and putting the fallback C behind a
--write-us-an-emailflag), there is finally a way to enumerate those mythical users who cannot switch to Rust just yet.
If that sounds appealing, meet Eurydice.
Eurydice
Eurydice is a compiler from Rust to C that aims to produce readable C code. Of course, readability is subjective; also, seeing that Rust relies on whole-program monomorphization, the C code is bound to be more verbose than the Rust code. But you can judge for yourself: here’s the result of compiling libcrux to C.
The output of the test suite is under version control, and there are a lot more tests to peruse. See for instance this bit, compared to the Rust original.
The design of Eurydice
Eurydice plugs in directly at the MIR level, using
Charon to avoid reimplementing the
wheel and paying the price of interacting with the guts of rustc. Our
paper on Charon says more about its
architecture.
The advantage of plugging in at the MIR level is that i) we do not have to interpret syntactic sugar, which means our translation is more faithful to the Rust semantics, and ii) we have way fewer constructs that need compiling to C. Even then, it’s no easy feat to translate Rust to C.
There is naturally, the need to perform whole-program monomorphization, over
types and const-generic arguments; the compilation of pattern matches into
tagged unions; recognizing instances of iterators that can be compiled to native
C for-loops. Then, there are more subtle things, such as compiling array
repeat expressions sensibly – zero-initializers when possible, initializer
lists otherwise, unless it generates too much code, in which case for-loops are
preferable. And finally, there are all the rules about visibility, static,
inline, etc. that are very C-specific and depend on how you want to lay out
your C files.
The translation is complicated by the constraint that the generated code
ought to be readable: for instance, we compile Rust structs to
C structs, including
DSTs, by
relying on flexible array
members.
We also
work hard to avoid using the fully-generic tagged union pattern when possible,
instead eliminating the tag when e.g. the Rust enum only has a single case.
Additionally, we rely on Charon to reconstruct control-flow, rather than compile
the MIR CFG to C code ridden
with gotos; again, this is for code quality.
At a low-level, there were many interesting tidbits.
- Because arrays in Rust are values, we wrap them within C structs to give them
value semantics in C, too; concretely,
[u32; 8]becomesstruct { uint32_t data[8]; }. (A previous version of Eurydice would emituint32_t *, and rely on variousmemcpys to implement value semantics, but this produced a translation that was not type-generic, and there were plenty of finicky corner cases. We revamped the compilation scheme recently.) - The notion of
lvaluein C means we need to insert more variable declarations than in Rust – for instance, you can’t trivially compile&[0u32; 1]without naming the array. - The fact that the evaluation order is so loosely defined in C means that intermediary computations need to be stored in intermediary variables to enforce the evaluation order.
- Rust relies on whole-program monomorphization; this means that the C code is inevitably going to contains multiple copies of the same types and functions, but for different choices of type and const generic argumnets. This is currently done with a builtin phase in Eurydice (for historical reasons), but in the long run, we want to rely on Charon’s support for monomorphization.
- There are plenty of peephole optimizations that are required for good code
quality, such as recognizing
array::from_fnand generating sensible code that initializes the array in-place (instead of relying on the fully-general compilation scheme for closures), or recognizing instances of theEqtrait that deserve dedicated treatment (such as usingmemcmpfor arrays and slices of flat data).
A final design choice is that for now, Eurydice may define more behaviors than Rust – for instance, Rust panics on integer overflow, but Eurydice-compiled code does not. This is because we assume the input code is verified, and therefore has been shown to be free of panics. This design choice can be easily changed, though.
In practice, as soon as you use traits, the C code becomes more voluminous than the Rust code. We rely on a configuration file mechanism to control the placement of monomorphized instances of a given function, rather than put everything in one big C file. This currently requires a lot of manual intervention to give good results on large projects.
Implementing of Eurydice
Eurydice starts by compiling the MIR AST obtained out of Charon into KaRaMeL’s internal AST. This is ~3000 lines of OCaml code, so that’s already pretty involved. A lot of the work revolves around trait methods and their monomorphization, given Rust’s expressive trait system.
Then, about 30 nanopasses simplify the KaRaMeL AST until it becomes eligible for compilation to C. Of those, a handful were originally written for KaRaMeL and were somewhat reusable; this includes compilation of data types, as well as monomorphization. The rest was written from scratch for Eurydice, and totals about ~5000 lines of OCaml code.
A particularly gnarly phase was eliminating MIR’s variable assignments as much as possible: in MIR, every variable starts out uninitialized at the beginning of the function; then, in lieu of the variable declaration, we have an assignment with the initial value. Naturally, having a variable declaration in the right spot is better for code quality, so an initial phase tries to reconstruct these assignments. That’s a drawback of using MIR, but we still firmly believe that sticking to something that has clear semantics is ultimately better.
Fun fact: because there are so many peephole optimizations, I got tired of
maintaining enormous
pattern-matches
that would try to catch every flavor of
Rust iterator that can be compiled to a C for-loop. Instead, a custom OCaml syntax
extension allows writing concrete
syntax
for the internal KaRaMeL language in OCaml patterns. Those magic patterns then get
compiled at compile-time to OCaml AST nodes for an actual OCaml pattern that
matches the (deeply-embedded) syntax of KaRaMeL’s AST. This relies on a ppx
that lexes, parses and compiles the concrete syntax.
Deploying Eurydice-generated code
Eurydice-generated code expects some hand-written glue that contains macros and
static inline functions; sometimes, it’s simply more convenient to write a
single macro that uses a type, rather than have Eurydice generate N copies of a
polymorphic function that gets specialized each time. A typical example is
compiling the Eq trait for arrays: it’s nicer to emit Eurydice_array_eq(a1, a2,
len, t), which macro-expands to !(memcmp(a1, a2, len*sizeof(t))), rather than
have N such functions, each containing a for-loop specialized for different
values of t.
Eurydice generates code that is either (C11 and C++20-compatible) or (C++-17
compatible, but not C-compatible). The reason for this is that Rust allows enum
values (e.g. Foo { bar: baz }) in any expression position. For simplicity,
Eurydice emits a compound initializer (Foo) { .tag = bar, .value = { .case_Foo
= { .bar = baz }}}, or a C++20 aggregate that uses designated initializers,
relying on a macro (not shown here) to hide the syntax differences between the
two. But C++17 does not have designated initializers, so there is an option for
Eurydice to emit different code that relies on member pointers to achieve
sensibly the same effect.
Limitations of Eurydice
Naturally, there are many limitations to this approach. Here are the main ones that come to mind:
- we cannot guarantee that the layout of objects will be the same in C as in Rust; conceivably, one could parse the layout information from MIR, then emit compiler-specific alignment directives to keep the two identical, but this is not done currently;
- the generated code violates strict
aliasing,
because creating a user-defined DST involves casting one pointer type (a
struct containing an array) to another (a struct with a flexible array
member instead); I’m not sure what the best fix is, so for now, please compile your
code with
-fno-strict-aliasing; - the code that Eurydice sees is MIR after applying
cfgtweaks; this means that for code that is intended to be multi-platform, some tricks need to be applied, otherwise, Eurydice will only “see” one version of the code (AVX2, or ARM64, or something else) - because monorphization is so pervasive, the configuration language needs to
express things such as “types that reference
__m256i, an AVX2-only type, need to go into a separate file to be compiled with-mavx2”; this can get tedious real fast but I’m not sure I know how to do better.
What’s next?
There is ongoing work to integrate Eurydice-generated code for both Microsoft and Google’s respective crypto libraries.
The community grew recently, with wonderful contributions by GitHub users
@ssyram and @lin23299. There are more in the pipeline, and I look forward to
seeing the supported subset of Rust grow even more. Next on the horizon is
support for dyn traits via vtables, and relying on Charon’s monomorphization
to get MIR exactly as the Rust compiler would monomorphize it, intead of relying
on a custom procedure in Eurydice.
An ambitious goal is for the whole standard library of Rust to be extractable via Eurydice in 2026. This is non-trivial, but I believe this achievement is within reach. Stay tuned.
PS: Why the name?
People keep asking about the name; because the project shares a large amount of infrastructure with Aeneas and Charon, I had to follow the Greek mythology theme. Specifically, the myth of Eurydice resonated with me: I thought I was saved from the hell of generating C code, and was going to go back to the world of the living, but alas, no.
