Optimized Cryptography with LLVM Non-Standard Bitwidth Types

Documents

Summary

The optimization target for the project is the X25519 cryptographic function implemented as plainly as possible in C with the _BitInt(255) type provided by the upcoming C23 standard.

Baseline

Benchmarks against the widely-deployed and well-optimized libsodium implementation of X25519 show a massive performance gap. Clang 17.0.3 out-of-the-box produces code that takes over 127x that of libsodium for a single X25519 call.

Implementation	Time (ns)	Iterations	Cycles
Baseline	4682.75	8962	15771906.96
`libsodium`	36.80	1141472	124794.13

This result suggests there may be some low hanging fruit, but we also think there's a lot of work to be done!

Crandall Reduction

The first optimization to be implemented and evaluated is Crandall reduction, which exploits a trick for modular reduction for moduli of the form 2^n - c for some small c. If our modulus m has this form then 2^n = c (mod m). Therefore if a value x to be reduced has low n bits l and high bits h, then we have that x = c*h + l (mod m).

The CrandallReductionPass detects this case and optimizes it for large integer types observed in cryptographic code like X25519. Despite its simplicity, it performs remarkably well on our benchmark, giving a massive 24x speedup.

Implementation	Time (ns)	Iterations	Cycles
Baseline	4682.75	8962	15771906.96
Crandall Single Step	195.29	215216	662345.84

Reduction Analysis

The ReductionAnalysis pass identifies self-contained arithmetic expression graphs which are guaranteed to be reduced modulo the same modulus. The purpose of this stage is to detect when intermediate reductions may not be necessary, and could be safely rewritten to use partially non-reduced forms.

The analysis is implemented as a worklist algorithm tracking lattice states for each value: UNDEF for no information, REDUCED(M) if all observed uses are also reduced modulo M and only involved in instructions that preserve the reduction property, and NOTREDUCED otherwise. The REDUCED(M) state is generated for operands of urem instructions by constant moduli, and propagated by instructions that preserve modular arithmetic.

Incomplete Reduction

Incomplete reduction utilizes the prior techniques to eliminate unnecessary urem instructions. Reduction analysis results allow us to determine when urem instructions are candidates for elimination, and instead Crandall reduction techniques can be used to leave it in an incompletely reduced form instead.

The CrandallReductionPass was extended to implement the incomplete reduction optimization, and now operates in three phases:

Plan Reductions: Determine how to handle each modular reduction in the function: what bound to reduce it to, which Crandall parameters to apply, and whether the reduction should be left incomplete.
Plan Ranges: Given a set of reductions which will be left incomplete, determine the possible range of values that every value in the transformed expression graph will now have. These ranges are used to determine new types.
Rewriting: Given reductions to be rewritten, and new types for the expression graph, actually replace the expression graph with a parallel rewritten version. Delete the old expression graph and fix up φ nodes.

Incomplete reduction gives an additional 2.1x performance boost when applied to a basic block, and the more advanced version handling φ nodes provided 3.3x.

Implementation	Time (ns)	Iterations	Cycles
Crandall Single Step	195.29	215216	662345.84
Incomplete Reduction	93.11	451088	315782.34
Incomplete Reduction over φ	58.46	718515	198282.05

Results

The combined results of these optimizations are effective, giving an 80.1x speedup over baseline. The end result is only 1.6x away from hand-tuned.

Implementation	Time (ns)	Iterations	Cycles
Baseline	4682.75	8962	15771906.96
Crandall Single Step	195.29	215216	662345.84
Incomplete Reduction	93.11	451088	315782.34
Incomplete Reduction over φ	58.46	718515	198282.05
`libsodium`	36.80	1141472	124794.13