+7.7 % on SPR AVX-512 SHA3-256 by shortening rounds 1 and 24

Two round-level specialisations on XKCP 8-way AVX-512 Keccak — 181 MH/s to 195 MH/s. 4-backend deploy matrix.

Two microarchitectural tricks that layer onto XKCP’s reference KeccakP1600times8_AVX512_PermuteAll_24rounds to squeeze the last few percent of throughput out of a fixed-prefix preimage workload. Four backends (CUDA / HIP / ARM SVE2 / WebAssembly) share one scalar Keccak reference. Open repository on Radicle, live browser demo, CC0 for math, MIT for code.


Overview

  • What: a SHA3-256 preimage-throughput benchmark on AVX-512. Two round-level specialisations (A2 on round 1, A1 on round 24) add up to +7.7 % over an already-well-tuned XKCP 8-way baseline.
  • Hardware under test: Intel Xeon Platinum 8488C (Sapphire Rapids, 16 hardware threads at SMT saturation, sustained 3.32 GHz under AVX-512-times-8 load — verified via turbostat).
  • End state: ~195 MH/s aggregate, 16 threads. The same scalar Keccak header is used by a CUDA backend, an AMD HIP backend, an ARM SVE2 backend, and a WebAssembly-via-emscripten browser demo. Every backend passes bit-for-bit hashlib parity on ≥ 1000 random inputs.
  • Where to find it: rad:z3PfFA3CHj64RkyY8tRkieX7mk94f. Clone via rad clone rad:z3PfFA3CHj64RkyY8tRkieX7mk94f or browse via the iris.radicle.xyz gateway. Radicle is content-addressed P2P git, no centralised forge.
  • Live browser demo: see backends/wasm/LIVE_DEMO.md in the repo for the current Cloudflare quick-tunnel URL (the subdomain rotates on tunnel restart; the LIVE_DEMO.md pointer updates automagically on each rotation via a repo-local helper).
  • Licensing: MIT for the benchmark code, CC0 for the scalar Keccak math. No fine print.

TL;DR

I benchmarked SHA3-256 preimage throughput on an Intel Xeon 8488C (Sapphire Rapids, 16 hardware threads at SMT saturation, AVX-512) and found that two round-level specialisations on top of XKCP’s stock 8-way batched Keccak kernel push the bench from ~181 MH/s to ~195 MH/s — +7.7 % end-to-end. The tricks are niche (fixed-prefix workloads + SHA3-256’s specific output-byte selection) but they compose multiplicatively, and they’re the only thing that moved the needle after a stretch of otherwise-futile optimisation attempts.

  • Trick 1 — round-1 theta partial precomputation (A2): ~28 ops per iteration saved by recognising that 4 of 5 column parities are constants given the fixed prefix.
  • Trick 2 — round-24 lane-(0,0)-only short-circuit (A1): ~56 ops per iteration saved by recognising that SHA3-256’s first 8 output bytes come from a single state lane and we only need those to decide “is this hash better than my current best?”.

Both are implemented in the reference repo at src/keccak_bench.c, MIT-licensed, AVX-512-only, no external dependencies beyond XKCP’s publicly-released round macros (CC0). See the avx512-keccak-bench repository on Radicle (content-addressable P2P git).

The workload

Repeatedly evaluate SHA3-256(PREFIX || salt || counter) where:

  • PREFIX is a fixed 11-byte string, identical across workers.
  • salt is 5 bytes fixed per worker (e.g. 2-byte worker ID + 3 process-random bytes).
  • counter is an 8-byte little-endian u64 that increments each iteration.

Total message: 24 bytes. This fits inside a single Keccak-f[1600] absorb block (rate = 136 bytes for SHA3-256). Padding: 0x06 at byte 24, 0x80 at byte 135.

This is the canonical shape of a proof-of-work or preimage-golf workload — header + nonce — so the optimisations apply to anything that hashes a short message with a fixed prefix via SHA3-256.
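The layout above maps directly onto the first three state lanes; here is a quick hashlib sketch of one iteration (the worker ID, salt bytes, and counter value are illustrative, not taken from the repo):

```python
import hashlib

PREFIX = b"bench_sha3:"                       # the fixed 11-byte prefix
salt = b"\x00\x01" + b"\xaa\xbb\xcc"          # 2-byte worker ID + 3 random bytes (illustrative)
counter = 42

msg = PREFIX + salt + counter.to_bytes(8, "little")
assert len(msg) == 24                         # one 136-byte SHA3-256 rate block, no second permute

digest = hashlib.sha3_256(msg).digest()       # 32 bytes

# Lane view of the absorb block: lane i holds message bytes 8i..8i+7, little-endian.
lane0 = int.from_bytes(msg[0:8], "little")    # PREFIX[0..7]          -- constant
lane1 = int.from_bytes(msg[8:16], "little")   # PREFIX[8..10] || salt -- constant per worker
lane2 = int.from_bytes(msg[16:24], "little")  # counter               -- varies each iteration
assert lane2 == counter and len(digest) == 32
```

Only lane 2 changes between iterations, which is what both tricks below exploit.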

Why “A2”? Why “A1”? (Non-confusing disambiguation.)

Internally I numbered the optimisations A1, A2, … in the order I thought of them. A1 is the round-24 short-circuit, A2 is the round-1 theta precompute. The numbering has no meaning to anyone else; I’m keeping it in this article only because the source code uses the labels in comments and it’s easier for a reader to match docs to commits.

Trick 1: round-1 θ partial precomputation

Keccak’s θ step takes the 25-lane state and computes:

C[i] = A[0][i] XOR A[1][i] XOR A[2][i] XOR A[3][i] XOR A[4][i]    for i in 0..4
D[i] = C[(i-1) mod 5] XOR ROL(C[(i+1) mod 5], 1)                  for i in 0..4
A'[y][x] = A[y][x] XOR D[x]                                        for all (x,y)

In our absorb state, only lane A[2] (the counter) varies across iterations. The other 24 lanes are constant per worker. That means:

C[0] = A[0][0]                              (= PREFIX[0..7] u64 LE)       constant
C[1] = A[0][1]                              (= PREFIX[8..10]||salt u64)   constant
       XOR A[3][1]                          (= 0x8000000000000000,
                                               SHA3 pad bit at byte 135)  constant
C[2] = A[0][2]                              (= counter)                   varies
C[3] = A[0][3]                              (= 0x06, SHA3 pad byte at 24) constant
C[4] = A[0][4]                              (= 0)                         constant

So 4 of 5 column parities are constants per worker. Likewise:

D[0] = ROL(C[1], 1) XOR C[4]   =  ROL(C[1], 1)           constant
D[1] = ROL(C[2], 1) XOR C[0]                             depends on counter
D[2] = ROL(C[3], 1) XOR C[1]   =  0x0C XOR C[1]          constant
D[3] = ROL(C[4], 1) XOR C[2]   =  counter                just the counter
D[4] = ROL(C[0], 1) XOR C[3]   =  ROL(C[0], 1) XOR 0x06  constant

Three of five deltas are constants per worker; the other two depend on the counter but are trivial to compute in the hot loop.
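The derivation above is easy to check in scalar Python (prefix and salt values are illustrative): build the padded absorb block for two counters and confirm that only C[2] moves.

```python
MASK = (1 << 64) - 1

def rol(v, n):
    return ((v << n) | (v >> (64 - n))) & MASK

def absorb_lanes(prefix, salt, counter):
    """The 17 rate lanes (136 bytes) of the SHA3-256 absorb block, as little-endian u64s."""
    block = bytearray(136)
    block[:24] = prefix + salt + counter.to_bytes(8, "little")
    block[24] = 0x06        # SHA3 domain/pad byte -> low byte of lane 3
    block[135] |= 0x80      # final pad bit        -> lane 16 = 0x8000000000000000
    return [int.from_bytes(block[8 * i:8 * i + 8], "little") for i in range(17)]

def column_parities(lanes):
    C = [0] * 5             # capacity lanes 17..24 are zero, so rate lanes suffice
    for i, lane in enumerate(lanes):
        C[i % 5] ^= lane
    return C

prefix, salt = b"bench_sha3:", b"\x00\x01\xaa\xbb\xcc"   # illustrative worker values
Ca = column_parities(absorb_lanes(prefix, salt, 1))
Cb = column_parities(absorb_lanes(prefix, salt, 2))

# Only C[2] (the counter column) moves between iterations.
assert [Ca[i] == Cb[i] for i in range(5)] == [True, True, False, True, True]
assert Ca[3] == 0x06 and Ca[4] == 0

# And of the deltas, D[3] is literally the counter.
Da = [Ca[(i - 1) % 5] ^ rol(Ca[(i + 1) % 5], 1) for i in range(5)]
assert Da[3] == 1        # counter was 1 for the Ca state
```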

What’s saved

Standard round 1 in the batched 8-way AVX-512 Keccak uses one XOR5(a,b,c,d,e) per column (implemented as two vpternlogq at worst) = 10 vpternlogq for C[0..4]. Plus 5 ROL and 5 vpxorq for D[0..4]. After A2 precomputation, the hot loop runs only:

D1 = XOR(C0_const, ROL(counter, 1))     -- 1 ROL + 1 XOR
D3 = counter                            -- free

Everything else is a broadcast-load of a precomputed constant. That’s ~28 V512 ops saved per iteration (counting the reduction chain inside each XOR5).

Round-level aggregate: ~28 / ~280 total ops per round ≈ 10 % per-round saving on round 1. Amortised over 24 rounds, the end-to-end throughput improvement is smaller — measured +3 % end-to-end vs a baseline that does a normal round 1 but otherwise uses the same embedded XKCP round macros.

Why this isn’t cheating

The round-1 output of the modified kernel is byte-for-byte identical to the standard round-1 output for the same input. We’re not trading accuracy for speed; we’re precomputing values that were otherwise being recomputed every iteration.

Cross-check: a 1000-input hashlib parity test + a 10-million-counter “no missed beats” test (re-deriving the “best” set via Python’s hashlib.sha3_256 and asserting the miner emits exactly that set) both pass with zero mismatches. After ~5 B candidates across v1/v2/v4 binaries in the production setup, still zero mismatches.
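The shape of that no-missed-beats test is easy to model in pure Python (this is a sketch of the idea, not the repo's harness; prefix and salt values are illustrative): walk a counter range with hashlib, track the running best both by full 32-byte comparison and by a first-8-bytes-only comparator, and assert the beat sequences agree.

```python
import hashlib

prefix, salt = b"bench_sha3:", b"\x00\x01\xaa\xbb\xcc"   # illustrative worker values

best_full = b"\xff" * 32
best_u64 = (1 << 64) - 1
beats_full, beats_lane00 = [], []

for counter in range(100_000):
    d = hashlib.sha3_256(prefix + salt + counter.to_bytes(8, "little")).digest()
    if d < best_full:                    # ground truth: full 32-byte lexicographic compare
        best_full = d
        beats_full.append(counter)
    u = int.from_bytes(d[:8], "big")     # big-endian u64 == lexicographic order on bytes 0..7
    if u < best_u64:                     # the lane-(0,0)-style 8-byte comparator
        best_u64 = u
        beats_lane00.append(counter)

# Absent a 64-bit tie (vanishingly unlikely at this scale), the two beat sequences agree.
assert beats_full == beats_lane00 and beats_full[0] == 0
```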

Trick 2: round-24 lane-(0,0)-only short-circuit

SHA3-256’s output bytes 0..7 come from state lane (0,0) — that’s the first output lane of the squeeze. If we’re doing preimage golf or PoW, we need to decide “is this candidate better than my current best?” — which is a lexicographic comparison over 32 bytes.

Ties in the first 8 bytes happen roughly once per ~16 B candidates (assuming uniform-ish hashes, which SHA3 gives us), so at any realistic candidate volume lane (0,0) alone serves as the comparator. That lets us short-circuit the last round to produce only lane (0,0):

  1. Full θ (all 25 columns contribute — skipping any C[i] would corrupt lane (0,0) via its own column parity).
  2. Rho + pi for only the 3 pre-pi lanes that end up at post-pi row-0 columns 0, 1, 2. These are (pre-pi) lanes (0,0), (1,1), (2,2) with ρ offsets 0, 44, 43.
  3. chi00 = r00 XOR ((NOT r11) AND r22) using vpternlogq(r00, r11, r22, 0xD2).
  4. XOR with ι round constant 23.

Skip: 22 of 25 post-θ lanes (rows 1..4 entirely + row-0 cols 3, 4), and all 24 non-(0,0) post-χ lanes. Savings: ~56 V512 ops per iteration. Measured: +4 % by itself.
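To make the short-circuit concrete outside AVX-512, here is a scalar Python model (not the production kernel; the message bytes are illustrative): a plain Keccak-f[1600], the 24-byte SHA3-256 absorb, and a final round that produces only lane (0,0), checked against hashlib.

```python
import hashlib

MASK = (1 << 64) - 1

def rol(v, n):
    n %= 64
    return v if n == 0 else ((v << n) | (v >> (64 - n))) & MASK

# Iota round constants for Keccak-f[1600].
RC = [0x0000000000000001, 0x0000000000008082, 0x800000000000808A,
      0x8000000080008000, 0x000000000000808B, 0x0000000080000001,
      0x8000000080008081, 0x8000000000008009, 0x000000000000008A,
      0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
      0x000000008000808B, 0x800000000000008B, 0x8000000000008089,
      0x8000000000008003, 0x8000000000008002, 0x8000000000000080,
      0x000000000000800A, 0x800000008000000A, 0x8000000080008081,
      0x8000000000008080, 0x0000000080000001, 0x8000000080008008]

# Rho offsets via the standard recurrence; RHO[x][y], x = column, y = row.
RHO = [[0] * 5 for _ in range(5)]
x, y = 1, 0
for t in range(24):
    RHO[x][y] = ((t + 1) * (t + 2) // 2) % 64
    x, y = y, (2 * x + 3 * y) % 5

def keccak_round(A, rc):
    """One full Keccak-f[1600] round; A holds 25 lanes, flat index x + 5*y."""
    C = [A[i] ^ A[i + 5] ^ A[i + 10] ^ A[i + 15] ^ A[i + 20] for i in range(5)]
    D = [C[(i - 1) % 5] ^ rol(C[(i + 1) % 5], 1) for i in range(5)]
    A = [A[i] ^ D[i % 5] for i in range(25)]                        # theta
    B = [0] * 25
    for xx in range(5):
        for yy in range(5):                                         # rho + pi
            B[yy + 5 * ((2 * xx + 3 * yy) % 5)] = rol(A[xx + 5 * yy], RHO[xx][yy])
    A = [B[xx + 5 * yy] ^ ((B[(xx + 1) % 5 + 5 * yy] ^ MASK) & B[(xx + 2) % 5 + 5 * yy])
         for yy in range(5) for xx in range(5)]                     # chi
    A[0] ^= rc                                                      # iota
    return A

def absorb(msg24):
    """SHA3-256 absorb state for a 24-byte message (single rate block)."""
    block = bytearray(136)
    block[:24] = msg24
    block[24] = 0x06       # SHA3 domain/pad byte
    block[135] |= 0x80     # final pad bit
    return [int.from_bytes(block[8 * i:8 * i + 8], "little") for i in range(17)] + [0] * 8

def lane00_last_round(A):
    """A1: given the state after 23 rounds, produce only post-chi lane (0,0) of round 24."""
    C = [A[i] ^ A[i + 5] ^ A[i + 10] ^ A[i + 15] ^ A[i + 20] for i in range(5)]  # full theta
    D = [C[(i - 1) % 5] ^ rol(C[(i + 1) % 5], 1) for i in range(5)]
    r00 = A[0] ^ D[0]                # pre-pi lane (0,0), rho offset 0
    r11 = rol(A[6] ^ D[1], 44)       # pre-pi lane (1,1), rho offset 44
    r22 = rol(A[12] ^ D[2], 43)      # pre-pi lane (2,2), rho offset 43
    return r00 ^ ((r11 ^ MASK) & r22) ^ RC[23]      # chi + iota, one lane only

msg = b"bench_sha3:" + b"\x00\x01\xaa\xbb\xcc" + (12345).to_bytes(8, "little")
A = absorb(msg)
for rnd in range(23):
    A = keccak_round(A, RC[rnd])

short = lane00_last_round(A)         # the 8-byte comparator, fast path
full = hashlib.sha3_256(msg).digest()
assert short == int.from_bytes(full[:8], "little")

# Cold path: finish the full 24th round instead, recovering all 32 output bytes.
A24 = keccak_round(A, RC[23])
assert b"".join(A24[i].to_bytes(8, "little") for i in range(4)) == full
```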

When a beat happens

On the extremely rare occasion that lane (0,0) is lower than the running best, we re-run the full standard 24-round permute on a fresh state for the same counter to obtain the full 32 bytes. This is a cold path: ~60 ns per call on SPR, firing maybe once per billion candidates, so its cost is noise.

It’s also the path we use for correctness cross-checks: on every beat, we verify that the first 8 bytes of the full digest match what the short-circuit told us. If they don’t, the miner exits with a loud {"kind":"internal_error", ...} line. This has never fired in production across ~5 B candidates.

Why ρ offsets 0, 44, 43?

Keccak’s π permutation maps post-pi[x][y] = pre-pi[(x + 3y) mod 5][x]. For post-pi row 0 columns 0/1/2 (the three lanes that feed χ to produce post-χ lane (0,0)):

  • post-pi[0][0] = pre-pi[(0+0) mod 5][0] = pre-pi[0][0] — ρ offset 0
  • post-pi[1][0] = pre-pi[(1+0) mod 5][1] = pre-pi[1][1] — ρ offset 44
  • post-pi[2][0] = pre-pi[(2+0) mod 5][2] = pre-pi[2][2] — ρ offset 43

These are the main-diagonal lanes (0,0), (1,1), (2,2). We load them, XOR each with its column’s theta delta, rotate by (0, 44, 43), and then combine via χ.
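Both facts can be re-derived mechanically; a small Python sketch using Keccak's standard ρ-offset recurrence and the π formula quoted above:

```python
# Re-derive the rho table from the standard recurrence:
# start at (x, y) = (1, 0); the offset at step t is (t+1)(t+2)/2 mod 64;
# then step (x, y) -> (y, (2x + 3y) mod 5).
RHO = {(0, 0): 0}        # lane (0,0) is never visited by the walk; its offset is 0
x, y = 1, 0
for t in range(24):
    RHO[(x, y)] = ((t + 1) * (t + 2) // 2) % 64
    x, y = y, (2 * x + 3 * y) % 5

# pi: post[x][y] = pre[(x + 3y) mod 5][x]; for post-pi row 0 (y = 0) the source
# lane is pre[x][x], i.e. the main diagonal.
sources = [((xp + 3 * 0) % 5, xp) for xp in range(3)]
assert sources == [(0, 0), (1, 1), (2, 2)]

# The three diagonal lanes carry exactly the offsets used in A1.
assert [RHO[s] for s in sources] == [0, 44, 43]
```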

Composing A2 + A1

A2 (round 1) and A1 (round 24) operate at opposite ends of the permute and touch disjoint sets of ops. Throughput gains compose multiplicatively: +3 % × +4 % ≈ +7 % theoretical; +7.7 % measured. Close enough that I didn’t chase the last 0.3 % — it’s inside measurement noise for a 300 s steady-state benchmark.

Final numbers (Xeon 8488C, Sapphire Rapids, all-core AVX-512 at sustained 3.32 GHz, 16 SMT threads):

Binary                                      MH/s   Notes
XKCP KeccakP1600times8_AVX512, 8 workers    ~146   baseline (no SMT, no prefix-fix)
Direct state init + 16-worker SMT           ~181   +24 % by pushing SMT saturation
+A2 round 1                                 ~187   +3 % round-1 theta precompute
+A2 +A1 (final)                             ~195   +4 % round-24 lane-0 short-circuit

What DIDN’T work

Most of the project’s optimisation effort went into things that didn’t move the needle. Listing them so the next person doesn’t waste the same time on the same ideas on SPR:

  • cpuminer-opt’s 8-way Keccak — 11-14 % SLOWER than XKCP on SPR. Their clever lane-complement trick for χ (pre-compute NOT a once and reuse it across the 5 chi outputs) is strictly obsolete once you have vpternlogq (1 instruction = χ step with zero NOTs, zero extra moves). I wrote this up as a separate article — the lane-complement trick was an ILP win on pre-Skylake-X hardware but is now a compute regression. See the cpuminer-opt article in this repo.
  • PGO — the inner loop is straight-line AVX-512. No branches for PGO to inform. Measured +0.08 % with gcc 13’s -fprofile-use, within noise.
  • BOLT — no LBR access on our hypervisor (perf record -j any refused). BOLT without LBR flies blind and mis-lays hot paths. Measured -1.9 %. Not a BOLT defect — a prerequisite we couldn’t meet.
  • gcc vs clang 18, full-LTO vs thin-LTO — ±0.5 %, drift rather than signal. At 16-thread SMT saturation the ALU-port throughput ceiling bites before any compiler difference matters.
  • 2-stream ILP — SPR exposes 32 ZMM registers. Pipelining 2 independent Keccak states per thread at 8 SMT threads would fit (2 × 25 = 50 ZMM ideal, overlap to ~35 used), but at 16 SMT threads the register pressure spills into stack. -0.35 % at 8-threads-no-SMT. Clean rule-out.
  • iTLB hugepages — measured iTLB miss rate on the hot loop: 0.00025 %. Hot code fits in L1i + ITLB comfortably. Any page-map change would be theatrics.
  • Round-23 short-circuit — same motivation as A1 but one round earlier. Turned out the round-24 θ globalises dependencies over ALL 25 round-23 χ outputs, so there’s no partial-output subset to exploit. Symbolic analysis: +3 vpternlogq + 6 vpxorq over the standard path = ~0.1-0.2 % regression on SPR where vpternlogq is the port-0/5 bottleneck.

What I’d try next if I had time

  • Hand-scheduled port-balance pass over the Keccak inner loop targeting SPR’s p0/p5 vpternlogq asymmetry. Intel’s optimisation guide calls out that vpternlogq can only issue on p0 or p5, and on SPR those ports are often over-subscribed by χ + θ. A hand-scheduled variant that interleaves p1/p2-bound ops (shifts / broadcasts / loads) between bursts of vpternlogq might realistically recover +2-5 %. Time-box the attempt and stop if it yields < 3 % after a reasonable exploration.
  • Upstream A2 + A1 as an XKCP variant kernel. These should go in the reference tree as KeccakP1600times8_PrefixFixed_24rounds_* or similar. Would benefit everyone who hashes with a fixed prefix (XMSS, SPHINCS+, Ethereum’s PoW days of yore, anyone benchmarking preimage throughput).
  • Port to Neoverse-N2 / Cobalt / Graviton 4 SVE2. The backends/sve2/ directory in the repo already does this — 2-way batching at 128-bit VL, same A2 + A1 derivations, validated under QEMU. First native-hardware runtime test is pending.
  • Port to CUDA. backends/cuda/ is a scalar GPU kernel with the same A2 + A1 specialisations. Compiles cleanly under nvcc 12; first-GPU runtime test is pending.

Reproducing

# Install radicle (P2P git; no account required).
curl -sSLf https://radicle.xyz/install | sh

# Clone the repo by its content-addressed RID.
rad clone rad:z3PfFA3CHj64RkyY8tRkieX7mk94f
cd avx512-keccak-bench

make              # builds src/keccak_bench + keccak_bench_lto
make test         # 3-part hashlib-parity test suite (~45 s)
make bench        # 60 s steady-state throughput

Running on a non-AVX-512 machine? Try a backend:

# CUDA
cd backends/cuda && make test                 # CPU reference 3/3 hashlib parity
# AMD HIP
cd backends/hip && make test                  # CPU reference 3/3 hashlib parity
# ARM SVE2 (cross-compile + QEMU)
cd backends/sve2 && make test                 # QEMU parity at VL=2/4/8
# WebAssembly (Node CLI + browser)
cd backends/wasm && source /path/to/emsdk_env.sh && make test

The repo is MIT-licensed. XKCP round macros + constants are CC0 (public domain) per XKCP’s release terms. Reproducer datasets + all parity vectors are in the repo under test/.

Is this a PoW chain? (no)

No. There is no currency, no mining reward, no token, no distributed consensus. The benchmark is a pure engineering measurement: given a fixed 11-byte string prefix bench_sha3:, how many SHA3-256 preimage evaluations per second can a modern Xeon do, and can the cross-arch deploy matrix match? The salt + counter layout is a standard proof-of-work-or-proof-of-useful-work input shape, but nothing here pays out.

Credits

Keccak is the work of Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche. The AVX-512 times-8 batched kernel is Ronny Van Keer’s. All I did was notice two cheap consequences of a prefix-fixed input layout plus SHA3-256’s top-byte-dominant output selector. The giants did the real work.

Discussion + tips

  • Discussion: comments / corrections / counter-benches welcome. Reply as a Nostr note quoting this article, or open an issue via rad issue open in a clone of the repo.
  • Counter-example standing invitation: if you have a faster AVX-512 Keccak kernel on SPR (faster than ~195 MH/s, 16 threads, same problem shape), I’d love to see it. Same invitation for ARM SVE2 on Neoverse-V2 / Cobalt 100.
  • Optional zaps: entirely opt-in. The project is CC0 for the math and MIT for the code and will stay that way regardless of whether anyone zaps. If you do want to throw a few sats at it, the BOLT12 offer below is the receive endpoint.

BOLT12 offer (receive-only; Phoenixd self-custodial)

lno1zrxq8pjw7qjlm68mtp7e3yvxee4y5xrgjhhyf2fxhlphpckrvevh50u0qwkq3t7m97k9xxd8qn5as88au9cjckzrxt9k85n6ee328pz5k5mzsqszqwhaugcu5aqshwxhv0cwcdtjytfcrvx63ua7q0as2tn5f6r7ex0sqvew2t4tuncgwqmc89m3vu4389r4mhha3krvyjv668ywjq4emjkjmpn42j5dq9nhdvj94wc6sxxz7w3rngmjqvzszv0y8rh75p6unxmavq7zaeqvfylahd27l6asffxsn268mkxmgqqskjqx0g43yva08cu8xgtykpz7qc

This is a reusable BOLT12 offer backed by a Phoenixd daemon; all payments settle directly to my node without intermediate custody. First-inbound-payment will auto-open a Lightning channel with the ACINQ liquidity-provisioner — the channel-open fee comes out of the first zap, not from my pocket. Smaller payments accumulate as fee credit toward the channel opening.

Once the channel opens, I’ll also publish a LUD-16 lightning address in a subsequent kind:0 profile update for wallets that don’t yet speak BOLT12. Until then, the BOLT12 offer is the only receive endpoint.

npub

npub1ptlnxv26w6f73myptaujgxyz8aq8dut63dmsuf4d5t2ycwnek2ss7k3hmc
