Improvements to the Linear Operations of LowMC: A Faster Picnic

Daniel Kales, Léo Perrin, Angela Promitzer, Sebastian Ramacher, and Christian Rechberger

2017 · Full Version · eprint 2017/1148

Disclaimer

This content was automatically converted from the original PDF and may have undergone post-processing. None of these steps have been reviewed or approved by the authors. Errors in formulas, definitions, proofs, or text may have been introduced during conversion. The authoritative version is the original paper on ePrint. Always cite and verify against the original publication.

Converted with: marker · 2026-02-14

Abstract

Picnic is a practical approach to digital signatures where the security is primarily based on the existence of a one-way function, and the signature size strongly depends on the number of multiplications in the circuit describing that one-way function. The highly parameterizable block cipher family LowMC has the most competitive properties with respect to this metric and is hence a standard choice. In this paper, we study various options for efficient implementations of LowMC in-depth. First, we investigate optimizations of the round key computation of LowMC independently of any implementation optimizations. By decomposing the round key computations based on the keys’ effect on the S-box layer and general optimizations, we reduce runtime costs by up to a factor of 2 and furthermore reduce the size of the LowMC matrices by around 45 % compared to the original Picnic implementation (CCS’17). Second, we propose two modifications to the remaining matrix multiplication in LowMC’s linear layer. The first modification decomposes the multiplication into parts depending on their effect on the S-box layer. While this requires the linear layer matrices to have an invertible submatrix, it reduces the runtime and memory costs significantly, both by up to a factor of 4 for instances used by Picnic and up to a factor of 25 for LowMC instances with only one S-box. The second modification proposes a Feistel structure using smaller matrices completely replacing the remaining large matrix multiplication in LowMC’s linear layer. With this approach, we achieve an operation count logarithmic in the block size but more importantly, improve over Picnic’s matrix multiplication by 60 % while retaining a constant-time algorithm. Furthermore, this technique also enables us to reduce the memory requirements for storing LowMC matrices by 60 %.

Keywords: LowMC · lightweight block cipher · Picnic · post-quantum digital signatures · efficient implementation

1. Introduction

Lightweight cryptographic primitives that only require a low number of multiplications have many applications ranging from reducing costs for countermeasures against side-channel attacks [DPVR00, GLSV14], over improving homomorphic encryption schemes [ARS+15, MJSC16, CCF+16, DSES14, NLV11] and multiparty computation [GRR+16, RSS17], to SNARKs [AGR+16]. Additionally, they also turned out to be useful to efficiently implement and reduce signature sizes of post-quantum signature schemes based on \Sigma-protocols [CDG+17a] without requiring any structured hardness assumptions. The latter in particular builds upon LowMC [ARS+15, ARS+16], a highly parameterizable block cipher with a low number of multiplications.

We focus on the use of LowMC in the post-quantum digital signature scheme Picnic [CDG+17a, CDG+17b] which is based on zero-knowledge proofs of knowledge of pre-images of one-way functions. There, the one-way functions are instantiated using LowMC. Picnic relies on a proof system called ZKB++, which is based on the “MPC-in-the-head” [IKOS07] paradigm. To compute proofs in ZKB++, the circuit of the one-way function is decomposed into three branches where XOR gates and AND gates involving constants can be computed locally, but AND gates require communication between the branches. Thus the signature size depends on the total number of AND gates required to describe the one-way function as a circuit.

The remaining introductory paragraphs discuss the constraints emerging from LowMC’s use in Picnic: the number of AND gates relates to signature size, while fewer S-boxes lead to more rounds and more linear layer operations (XORs). Any improvement reducing the number of XOR gates in LowMC’s linear layer allows targeting smaller signature sizes without sacrificing performance. The authors note this follows a long line of work on finding efficient alternative descriptions of ciphers, similar to the AES competition era.

1.1 Contribution

The contributions of this work can be summarized as follows:

An alternative description of LowMC with a new structure to compute round keys and apply round constants. The idea is to split the computation into linear and non-linear parts. This change allows replacing all round key computations only affecting the linear part by exactly one matrix multiplication. In the signature scheme use-case, this leads to performance improvements ranging from a factor of 1.5 for smaller block sizes to a factor of 2 for larger block sizes.
An optimization to the matrix multiplication in the linear layer of LowMC, employing similar techniques to split the computation into linear and non-linear parts. This reduces the number of operations for the linear layer computations as well as the memory requirements from r \cdot n^2 to r \cdot (n^2 - (n - 3 \cdot m)^2) + n^2, for instances of LowMC with r rounds, blocksize n and m S-boxes. This reduces runtime to a quarter of the original.
Fibonacci Feistel Networks (FFNs), a variant of Generalized Feistel Networks providing very fast diffusion. Instantiating the network with regular matrices yields a compact representation of a larger matrix multiplication with logarithmic complexity. This technique reduces the size of the LowMC matrices by up to 60 %.

2. Preliminaries

2.1 LowMC

LowMC [ARS+15, ARS+16] is a very parameterizable symmetric encryption scheme design enabling instantiation with low AND depth and low multiplicative complexity. Let n be the block size, m be the number of S-boxes, k the key size, and r the number of rounds. Round constants C_i \stackrel{R}{\leftarrow} \mathbb{F}_2^n for i \in [1, r], full rank matrices K_i \stackrel{R}{\leftarrow} \mathbb{F}_2^{n \times k} and regular matrices L_i \stackrel{R}{\leftarrow} \mathbb{F}_2^{n \times n} are chosen independently during instance generation and kept fixed. Keys are generated by sampling from \mathbb{F}_2^k uniformly at random. LowMC consists of key whitening in the beginning and multiple rounds composed of an S-box layer, a linear layer, addition with constants and addition of the round key.

Algorithm 1 (LowMC encryption): Given key matrices K_i \in \mathbb{F}_2^{n \times k} for i \in [0, r], linear layer matrices L_i \in \mathbb{F}_2^{n \times n} and round constants C_i \in \mathbb{F}_2^n for i \in [1, r], on input plaintext p \in \mathbb{F}_2^n and key y \in \mathbb{F}_2^k:

s \leftarrow K_0 \cdot y + p

For i \in [1, r]:

s \leftarrow \operatorname{SBOX}(s), \quad s \leftarrow L_i \cdot s, \quad s \leftarrow s + C_i + K_i \cdot y

Return s.

To reduce the multiplicative complexity, the number of S-boxes applied in parallel can be reduced, leaving part of the substitution layer as the identity mapping.

2.2 (2,3)-Decomposition of Circuits in Picnic

Circuit decomposition is a protocol for jointly computing a circuit, similar to an MPC protocol but with higher efficiency. In a (2,3)-decomposition there are three players and the protocol has 2-privacy, i.e., it remains secure even if two of the three players are corrupted.

Definition 1 ((2,3)-decomposition)

Let f be a function computed by an n-gate circuit \phi such that f(x) = \phi(x) = y. Let k_1, k_2, k_3 be tapes of length \kappa chosen uniformly at random from \{0,1\}^\kappa corresponding to players P_1, P_2, P_3. The tuple of algorithms (Share, Update, Output, Reconstruct) is defined as follows:

Share(x, k_1, k_2, k_3): On input of the secret value x, outputs the initial views for each player containing the secret share x_i of x.

Update: On input of the views and random tapes k_i, k_{i+1}, compute wire values for the next gate and return the updated view.

Output: On input of the final view, returns the output share y_i.

Reconstruct(y_1, y_2, y_3): On input of output shares, reconstructs and returns y.

Correctness requires that reconstructing a (2,3)-decomposed evaluation of a circuit \phi yields the same value as directly evaluating \phi on the input value. The 2-privacy property requires that revealing the values from two shares reveals nothing about the input value.

The \Sigma-protocol ZKB++ constructs the (2,3)-decomposition as follows: Let R be an arbitrary finite ring and \phi a function such that \phi: R^m \to R^\ell can be expressed by an n-gate arithmetic circuit over the ring using addition/multiplication by constants and binary addition/multiplication gates. The decomposition is then given by:

Addition by constant (w_b = w_a + c): w_b^{(i)} = w_a^{(i)} + c if i = 1 and w_b^{(i)} = w_a^{(i)} otherwise.
Multiplication by constant (w_b = c \cdot w_a): w_b^{(i)} = c \cdot w_a^{(i)}.
Binary addition (w_c = w_a + w_b): w_c^{(i)} = w_a^{(i)} + w_b^{(i)}.
Binary multiplication (w_c = w_a \cdot w_b): w_c^{(i)} = w_a^{(i)} \cdot w_b^{(i)} + w_a^{(i+1)} \cdot w_b^{(i)} + w_a^{(i)} \cdot w_b^{(i+1)} + R_i(c) - R_{i+1}(c) where R_i(c) is the c-th output of a pseudorandom generator seeded with k_i.

Note that P_i can compute all gate types locally except binary multiplication gates, as the latter requires inputs from P_{i+1}. Only outputs of binary multiplication gates need to be serialized, and thus the view size and signature size of Picnic depend on the size of the ring R and the number of multiplication gates.

3. Optimizing Linear Operations

From the use of LowMC in Picnic, we obtain some constraints on the optimizations we are allowed to perform. The S-box serves as a synchronization point on the first 3 \cdot m bits, i.e., the bits that are actually touched by the S-box. On the other n - 3 \cdot m bits the S-box is simply the identity map, and their actual values do not matter for S-box evaluations. Thus we have to ensure that the evaluation of all AND gates stays invariant under all our optimizations. Secondly, we have to assume that the secret key – or more precisely the shares representing it – changes on every encryption.

3.1 Splitting the Round Key Computation and Round Constant Addition

We start with the round key computation. Since the secret key is freshly shared for each LowMC evaluation in Picnic, the round keys cannot be pre-computed once during initialization. However, we can observe that due to the structure of the S-box layer, for n - 3 \cdot m bits of the round key (which coincide with the part of the state where the S-box acts as identity map), it does not matter whether those bits are added to the state before or after the application of the S-box.

Due to the linear nature of all operations involved in the computation after the S-box, we can simply change the order of adding the round key and multiplying the state with L_i. We modify each round as follows:

Modify s \leftarrow L_i \cdot s + K_i \cdot y + C_i to s \leftarrow L_i \cdot (L_i^{-1} \cdot K_i \cdot y + s) + C_i. Now split L_i^{-1} \cdot K_i \cdot y into the lower 3 \cdot m bits (the “non-linear part”) and the upper n - 3 \cdot m bits (the “linear part”) and move the addition of the upper n - 3 \cdot m bits before the S-box layer.

We denote by \rho_i^j: \mathbb{F}_2^n \to \mathbb{F}_2^{j - i + 1} the map sending n-dimensional vectors to a vector consisting of the i-th to j-th coordinates. In particular, we use \rho_L = \rho_{3 \cdot m + 1}^n to identify the linear part and \rho_N = \rho_1^{3 \cdot m} for the non-linear part.

After iterating this procedure, all linear parts of the round key are moved before the first round, leaving a reduced round key of 3 \cdot m bits per round. The linear part can be computed by calculating the matrix P_L defined as:

P_L = \overline{L_1^{-1}} \cdot K_1 + \sum_{j=2}^{r} \left( \prod_{k=1}^{j} \overline{L_k^{-1}} \right) \cdot K_j

This matrix P_L is precomputed from the LowMC matrices before any encryption. Only the multiplication with the secret key y and addition to the initial state s_0 is required at the beginning. For the non-linear part, we define:

P_{N,i} = \overline{L_i^{-1}} \cdot K_i + \sum_{j=i+1}^{r} \left( \prod_{k=i}^{j} \overline{L_k^{-1}} \right) \cdot K_j

The first 3 \cdot m rows of P_{N,i} from all rounds are combined into one matrix P_N of dimension (3 \cdot m \cdot r \times k). This matrix is multiplied with the secret key y before the first round, producing a 3 \cdot m \cdot r dimensional vector v. The appropriate 3 \cdot m bits of v are then added to the non-linear part of the state in each round.

Algorithm 2 (LowMC with split round key): Given key matrix K_0 \in \mathbb{F}_2^{n \times k}, linear layer matrices L_i \in \mathbb{F}_2^{n \times n}, and precomputed matrices P_L, P_N and vectors C_L, C_N, on input plaintext p \in \mathbb{F}_2^n and key y \in \mathbb{F}_2^k:

v \leftarrow P_N \cdot y + C_N

s \leftarrow (K_0 + P_L) \cdot y + p + C_L

For i \in [1, r]:

s \leftarrow \operatorname{SBOX}(s), \quad s \leftarrow \rho_{1+3 \cdot m \cdot (i-1)}^{3 \cdot m \cdot i}(v) + s, \quad s \leftarrow L_i \cdot s

Return s.

3.2 Reducing Linear Layer Computation

A similar modification can be applied to the linear layer matrices, where a substantial part of the linear layer computation can be moved to the first round. We take the linear layer matrix L_i and split it into 4 submatrices depending on the number of S-boxes:

L_i = \begin{bmatrix} \mathcal{N}_i & \mathcal{A}_i \\ \mathcal{B}_i & \mathcal{L}_i \end{bmatrix}

where \mathcal{N}_i is 3 \cdot m \times 3 \cdot m, \mathcal{A}_i is 3 \cdot m \times (n - 3 \cdot m), \mathcal{B}_i is (n - 3 \cdot m) \times 3 \cdot m, and \mathcal{L}_i is (n - 3 \cdot m) \times (n - 3 \cdot m). We also split the state into the non-linear part s_N (3 \cdot m bits) and the linear part s_L (n - 3 \cdot m bits). The linear layer multiplication is rewritten as:

\begin{bmatrix} s_N^{i+1} \\ s_L^{i+1} \end{bmatrix} = \begin{bmatrix} \mathcal{N}_i & \mathcal{A}_i \\ \mathcal{B}_i & \mathcal{L}_i \end{bmatrix} \cdot \begin{bmatrix} s_N^i \\ s_L^i \end{bmatrix} = \begin{bmatrix} \mathcal{N}_i \cdot s_N^i + \mathcal{A}_i \cdot s_L^i \\ \mathcal{B}_i \cdot s_N^i + \mathcal{L}_i \cdot s_L^i \end{bmatrix}

With the precondition that the submatrix \mathcal{L}_i is invertible, we can move the multiplication by \mathcal{L}_i before the other multiplications:

\begin{bmatrix} s_N^{i+1} \\ s_L^{i+1} \end{bmatrix} = \begin{bmatrix} \mathcal{N}_i & \mathcal{A}_i \cdot \mathcal{L}_i^{-1} \\ \mathcal{B}_i & I \end{bmatrix} \cdot \begin{bmatrix} s_N^i \\ \mathcal{L}_i \cdot s_L^i \end{bmatrix}

By repeating this process for multiple rounds, all multiplications by \mathcal{L}_i are grouped into a single matrix Z_0, precomputed and multiplied with the linear part of the state before the first round. The modified linear-layer matrices Z_i are:

Z_i = \begin{bmatrix} \mathcal{N}_i & \mathcal{A}_i \cdot \prod_{k=i}^{r} \mathcal{L}_k^{-1} \\ \left(\prod_{k=0}^{r-(i+1)} \mathcal{L}_{r-k}\right) \cdot \mathcal{B}_i & I \end{bmatrix}

and the combined matrix:

Z_0 = \begin{bmatrix} I & 0 \\ 0 & \prod_{k=0}^{r-1} \mathcal{L}_{r-k} \end{bmatrix}

These reduced matrices decrease the multiplications from an n \times n matrix-vector multiplication to an n \times 3 \cdot m and a 3 \cdot m \times (n - 3 \cdot m) matrix-vector multiplication. The improvements are especially noticeable when m is very small.

Algorithm 3 (LowMC with reduced linear layer): Given key matrix K_0 \in \mathbb{F}_2^{n \times k}, reduced linear layer matrices Z_i \in \mathbb{F}_2^{n \times n}, precomputed matrices P_L, Z_0, P_N, and vectors C_L, C_N, on input plaintext p and key y:

v \leftarrow P_N \cdot y + C_N

s \leftarrow (K_0 + P_L) \cdot y + p + C_L

s \leftarrow Z_0 \cdot s

For i \in [1, r]:

s \leftarrow \operatorname{SBOX}(s), \quad s \leftarrow \rho_{1+3 \cdot m \cdot (i-1)}^{3 \cdot m \cdot i}(v) + s, \quad s \leftarrow Z_i \cdot s

Return s.

One downside is that this approach is not compatible with standard LowMC, as the generation of random matrices does not ensure that \mathcal{L}_i is invertible. The authors therefore propose LowMC-I, where an additional rejection condition ensures invertibility of \mathcal{L}_i during matrix generation.

3.3 Fibonacci Feistel Network

Given that the cost of storing a binary matrix operating on n bits is proportional to n^2, it is possible to decrease the cost of an n \times n matrix when it can be implemented using several m \times m matrices with m < n. The authors propose a new variant called the Fibonacci Feistel Network (FFN), which provides very fast diffusion. This structure uses the Fibonacci sequence \{\phi_i\}_{i > 0} defined by:

\phi_0 = 0, \; \phi_1 = 1, \; \phi_{i+1} = \phi_i + \phi_{i-1}

The smallest integer i such that \phi_i > b is denoted i = \Lambda_\phi(b).

The Fibonacci-Feistel Structure. A FFN operates on 2 \times b \times w bits, where w \geq 4 and b \geq 2, using R rounds. Each round is a classical 2-branched Feistel round where the Feistel function maps b \times w bits to b \times w bits. The Feistel function used in round i, denoted F_i, works as follows: (1) the state is divided into b branches of w bits; (2) each branch goes through a w-bit L-Box; (3) the branches are rotated by \phi_i. The full round function is:

X_{j+b}^{i+1} = X_j^i, \quad X_j^{i+1} = X_{j+b}^i \oplus L_{j - \phi_i}^i\left(X_{j - \phi_i}^i\right)

where j - \phi_i is taken modulo b and L_j^i is a w-bit L-Box.

Lemma 1

Consider a FFN with i rounds, where \phi_{i+1} \leq b. Then the word X_0^0 influences all words X_j^i with indices 0 \leq j < \phi_{i+1} or b \leq j < b + \phi_i.

Proof. By induction. Before round 0, only X_0^0 depends on X_0^0. After round 0 (rotation by \phi_0 = 0), only X_0^1 and X_b^1 depend on X_0^0. Suppose X_j^i depends on X_0^0 iff 0 \leq j < \phi_i or b \leq j < b + \phi_{i-1}. The round function maps X^i to X^{i+1} such that:

\begin{cases} X_j^{i+1} = X_{j+b}^i \oplus L_{j - \phi_i}^i\left(X_{j - \phi_i}^i\right) & \text{if } j < b \\ X_j^{i+1} = X_{j-b}^i & \text{if } j \geq b \end{cases}

For j \geq b: X_j^{i+1} = X_{j-b}^i depends on X_0^0 iff 0 \leq j - b < \phi_i, i.e., b \leq j < b + \phi_i. For j < b: X_j^{i+1} depends on both X_{b+j}^i and X_{j - \phi_i}^i. The first depends on X_0^0 iff 0 \leq j < \phi_i. The second iff \phi_i \leq j < \phi_{i+2}. Thus for any j with 0 \leq j < \phi_{i+2}, X_j^{i+1} depends on X_0^0. \square

Lemma 2

Let R_k be a permutation of \{0, \ldots, 2b-1\} such that

R_k(j) = \begin{cases} (j + k) \bmod b & \text{if } j < b \\ b + ((j - b + k) \bmod b) & \text{if } j \geq b \end{cases}

meaning that applying R_k to the indices of the branches rotates separately the left and right branches by k. Let P be one round of FFN where all L-Boxes are the same and let (y_0, \ldots, y_{2b-1}) = P(x_0, \ldots, x_{2b-1}). Then the following always holds:

P(x_{R_k(0)}, \ldots, x_{R_k(2b-1)}) = (y_{R_k(0)}, \ldots, y_{R_k(2b-1)})

Proof. If the L-boxes are identical, then all operations in a FFN are invariant under word rotations applied to both halves of the internal state. \square

Corollary 1

Consider a FFN with i rounds, where \phi_i \geq b. Then each left input word X_j^0, where 0 \leq j < b, influences all output words X_j^i, where 0 \leq j < 2b.

Proof. Let k \leq i + 1 be such that \phi_{k-1} \leq b < \phi_k. By Lemma 1, after k - 2 rounds, X_0^0 influences all words with indices j if 0 \leq j < \phi_{k-1} or b \leq j < b + \phi_{k-2}. After round k - 1, the words on the left all depend on X_0^0, and after round k \leq i + 1, all output words depend on X_0^0. By Lemma 2, we generalize to all input words from the left side. \square

Lemma 3

Consider a FFN with \Lambda_\phi(b) + 1 rounds labelled -1, 0, 1, \ldots, \Lambda_\phi(b) - 1 and where we set \phi_{-1} = 0. Then each input word X_k^{-1}, where 0 \leq k < 2b, influences all output words X_j^{\Lambda_\phi(b)}, where 0 \leq j < 2b.

Proof. After the initial round, each input word X_k^{-1} (0 \leq k < 2b) influences one left word X_k^0 \pmod{b} which in turn (see Corollary 1) influences all output words. \square

Efficient Implementation. The linear layer can be evaluated using \mathcal{O}(\Lambda_\phi(b) \cdot w) operations on a modern processor, where 2 \times b \times w = n. The core trick is in the definition and implementation of the L-Box layer composed with the rotation by \phi_i. Using a bit-sliced strategy, the Feistel function can be decomposed as:

F_i = M^{-1} \circ \mathrm{ROT}_{\phi_i} \circ \mathcal{L}_i \circ M

where M is a bit permutation mapping the w-bit word with index j to bits with indices congruent to j modulo b, and \mathcal{L}_i is parameterized by w words of size bw computing:

\mathcal{L}_i(y) = \bigoplus_{j=0}^{w-1} \ell_j^i \wedge (y \lll (j \cdot b))

For a 256-bit linear layer (n = 256 = 2 \times 32 \times 4), a bijective linear layer with full diffusion uses 1 + \Lambda_\phi(32) = 10 rounds. In total, this requires 40 copies, 40 rotations, 40 XORs, 40 ANDs, and 10 swaps, all on 128-bit words.

4. Performance and Memory Evaluation

Both suggested optimizations were implemented and evaluated against the Picnic implementation available on GitHub. All measurements used the LowMC parameters from Picnic with blocksizes 128, 192, and 256, each with 10 S-boxes.

4.1 Reduced Round Key Computation and Constant Addition

Benchmarks on an Intel Core i7-4790 show improvements of 1.1x for the 128-bit case and up to 1.8x for larger instances. Results on Raspberry Pi 3 show even larger improvements of factors 1.6 to 2 for signing and verifying across all instances.

For the PICNIC-256 parameter set, the round key calculations shrink from 39 matrices of dimension 256 \times 256 (312 KB) to one 256 \times 256 and one 1140 \times 256 matrix. The 38 round constants (1.2 KB) are replaced with one 256-bit vector and one 1140-bit vector (0.17 KB). RRKC reduces storage for round key matrices and constants by more than 85 %. Including the linear layer, savings are still 43 %.

4.2 Reduced Linear Layer

When comparing an implementation without optimizations to one with RLL, performance improves by up to a factor of 4 with the instances used in Picnic. Compared to RRKC alone, RLL improves runtime by up to a factor of 3.

For LowMC instances with m = 1, RLL further reduces runtime to 1/12 and 1/25 of the original, respectively. Memory-wise, for Picnic-256 the linear layer shrinks from 304 KB to 37.69 KB (87 % reduction). For the 256-bit instance with m = 1, the linear layer is reduced from 2904 KB to 41.45 KB, representing only 2 % of original size.

4.3 Fibonacci Feistel Network

The FFN was benchmarked against a constant-time matrix multiplication. The constant-time implementation requires 2302 cycles for a 256 \times 256 matrix-vector multiplication. The FFN with 64-bit branches performs 10 % better while using 37 % less memory (5120 bytes vs 8192 bytes for the full matrix).

Various FFN instances for n = 256 bits were evaluated: with 2 \times 32 \times 4-bit configuration (10 rounds, 4463 cycles, 640 bytes), up to 2 \times 2 \times 64-bit configuration (5 rounds, 2047 cycles, 5120 bytes).

5. Discussion

The results suggest that all three optimizations yield even better results for larger block sizes, e.g., those used by LowMCHash-256 [AGR+16] in post-quantum ring signature schemes [DRS18]. For RRKC and RLL, the number of S-boxes is essential for the performance gain. If the number of S-boxes remains constant while block size increases, the impact is higher. If 3 \cdot m is almost as large as n, then no performance gain is expected.

The proposed FFN will also yield higher gains for larger block sizes. While the cost for constant-time multiplication quadruples when the block size doubles, the FFN less than triples the cost. For a 4096 = 2 \times 32 \times 64-bit instance, the FFN requires only 20,480 XOR operations compared to 262,144 for constant-time multiplication.

Both main contributions will likely also be useful for other cipher designs relying heavily on large linear layers, such as Rasta [DEG+18].

Acknowledgements

We thank Tyge Tiessen for interesting ideas and discussions on optimizing LowMC’s round key computation. We also thank Christoph Dobraunig, Maria Eichlseder and Eik List for comments on earlier versions. S. Ramacher and C. Rechberger have been supported by H2020 project Prismacloud, grant agreement n°644962. C. Rechberger has additionally been supported by EU H2020 project PQCRYPTO, grant agreement n°645622.

References

[ADKF70] V. Arlazarov, E. Dinic, M. Kronrod, and I. Faradzev. “On economical construction of the transitive closure of a directed graph.” In: Soviet Math Dokl., 1970.
[AGR+16] Martin R. Albrecht, Lorenzo Grassi, Christian Rechberger, Arnab Roy, and Tyge Tiessen. “MiMC: Efficient Encryption and Cryptographic Hashing with Minimal Multiplicative Complexity.” In: ASIACRYPT (1), Vol. 10031. LNCS, pp. 191–219, 2016.
[AL00] Kazumaro Aoki and Helger Lipmaa. “Fast implementations of AES candidates.” In: AES Candidate Conference, pp. 106–120, 2000.
[ARS+15] Martin R. Albrecht, Christian Rechberger, Thomas Schneider, Tyge Tiessen, and Michael Zohner. “Ciphers for MPC and FHE.” In: EUROCRYPT (1), Vol. 9056. LNCS. Springer, pp. 430–454, 2015.
[ARS+16] Martin R. Albrecht, Christian Rechberger, Thomas Schneider, Tyge Tiessen, and Michael Zohner. “Ciphers for MPC and FHE.” IACR Cryptology ePrint Archive, 2016:687, 2016.
[Bar06] Gregory V. Bard. “Accelerating cryptanalysis with the method of four russians.” IACR Cryptology ePrint Archive, 2006:251, 2006.
[BB02] Elad Barkan and Eli Biham. “In how many ways can you write rijndael?” In: ASIACRYPT, Vol. 2501. LNCS. Springer, pp. 160–175, 2002.
[BBF+02] Guido Bertoni, Luca Breveglieri, Pasqualina Fragneto, Marco Macchetti, and Stefano Marchesin. “Efficient software implementation of AES on 32-bit platforms.” In: CHES, Vol. 2523. LNCS. Springer, pp. 159–171, 2002.
[BEF18] Dan Boneh, Saba Eskandarian, and Ben Fisch. “Post-quantum group signatures from symmetric primitives.” IACR Cryptology ePrint Archive, 2018:261, 2018.
[Ber09] Daniel J. Bernstein. “Optimizing linear maps modulo 2.” 2009. url: binary.cr.yp.to/linearmod2.html.
[BS08] Daniel J. Bernstein and Peter Schwabe. “New AES software speed records.” In: INDOCRYPT, Vol. 5365. LNCS. Springer, pp. 322–336, 2008.
[CCF+16] Anne Canteaut, Sergiu Carpov, Caroline Fontaine, Tancrède Lepoint, María Naya-Plasencia, Pascal Paillier, and Renaud Sirdey. “Stream ciphers: A practical solution for efficient homomorphic-ciphertext compression.” In: FSE, Vol. 9783. LNCS. Springer, pp. 313–333, 2016.
[CDG+17a] Melissa Chase, David Derler, Steven Goldfeder, Claudio Orlandi, Sebastian Ramacher, Christian Rechberger, Daniel Slamanig, and Greg Zaverucha. “Post-quantum zero-knowledge and signatures from symmetric-key primitives.” In: CCS. ACM, pp. 1825–1842, 2017.
[CDG+17b] Melissa Chase, David Derler, Steven Goldfeder, Claudio Orlandi, Sebastian Ramacher, Christian Rechberger, Daniel Slamanig, and Greg Zaverucha. The Picnic Signature Algorithm Specification, 2017. url: github.com/Microsoft/Picnic.
[DEG+18] Christoph Dobraunig, Maria Eichlseder, Lorenzo Grassi, Virginie Lallemand, Gregor Leander, Eik List, Florian Mendel, and Christian Rechberger. “Rasta: A cipher with low AND depth and few ANDs per bit.” In: CRYPTO (1), Vol. 10991. LNCS. Springer, pp. 662–692, 2018.
[Din18] Itai Dinur. “Linear equivalence of block ciphers with partial non-linear layers: Application to LowMC.” Cryptology ePrint Archive, Report 2018/772, 2018. url: eprint.iacr.org/2018/772.
[DPVR00] Joan Daemen, Michaël Peeters, Gilles Van Assche, and Vincent Rijmen. Nessie Proposal: NOEKEON, 2000. url: gro.noekeon.org/Noekeon-spec.pdf.
[DRS18] David Derler, Sebastian Ramacher, and Daniel Slamanig. “Post-quantum zero-knowledge proofs for accumulators with applications to ring signatures from symmetric-key primitives.” In: PQCrypto, Vol. 10786. LNCS. Springer, pp. 419–440, 2018.
[DSES14] Yarkin Doröz, Aria Shahverdi, Thomas Eisenbarth, and Berk Sunar. “Toward practical homomorphic evaluation of block ciphers using prince.” In: Financial Cryptography Workshops, Vol. 8438. LNCS. Springer, pp. 208–220, 2014.
[GLSV14] Vincent Grosso, Gaëtan Leurent, François-Xavier Standaert, and Kerem Varici. “LS-designs: Bitslice encryption for efficient masked software implementations.” In: FSE, Vol. 8540. LNCS. Springer, pp. 18–37, 2014.
[GMO16] Irene Giacomelli, Jesper Madsen, and Claudio Orlandi. “ZKBoo: Faster zero-knowledge for boolean circuits.” In: USENIX Security Symposium. USENIX Association, pp. 1069–1083, 2016.
[GRR+16] Lorenzo Grassi, Christian Rechberger, Dragos Rotaru, Peter Scholl, and Nigel P. Smart. “MPC-friendly symmetric key primitives.” In: ACM CCS. ACM, pp. 430–443, 2016.
[IKOS07] Yuval Ishai, Eyal Kushilevitz, Rafail Ostrovsky, and Amit Sahai. “Zero-knowledge from secure multiparty computation.” In: STOC. ACM, pp. 21–30, 2007.
[MJSC16] Pierrick Méaux, Anthony Journault, François-Xavier Standaert, and Claude Carlet. “Towards stream ciphers for efficient FHE with low-noise ciphertexts.” In: EUROCRYPT (1), Vol. 9665. LNCS. Springer, pp. 311–343, 2016.
[NLV11] Michael Naehrig, Kristin E. Lauter, and Vinod Vaikuntanathan. “Can homomorphic encryption be practical?” In: CCSW. ACM, pp. 113–124, 2011.
[NPV17] Valérie Nachef, Jacques Patarin, and Emmanuel Volte. Feistel Ciphers – Security Proofs and Cryptanalysis. Springer, 2017.
[RSS17] Dragos Rotaru, Nigel P. Smart, and Martijn Stam. “Modes of operation suitable for computing on encrypted data.” In: IACR Trans. Symmetric Cryptol., 2017(3):294–324, 2017.
[SM10] Tomoyasu Suzaki and Kazuhiko Minematsu. “Improving the generalized Feistel.” In: FSE, Vol. 6147. LNCS. Springer, pp. 19–39, 2010.
[SME16] Shady Mohamed Soliman, Baher Magdy, and Mohamed A. Abd El-Ghany. “Efficient implementation of the AES algorithm for security applications.” In: SoCC. IEEE, pp. 206–210, 2016.
[SMMK12] Tomoyasu Suzaki, Kazuhiko Minematsu, Sumio Morioka, and Eita Kobayashi. “TWINE: A lightweight block cipher for multiple platforms.” In: Selected Areas in Cryptography, Vol. 7707. LNCS. Springer, pp. 339–354, 2012.

Appendix A: An Example of Fibonacci-Feistel Network

The appendix depicts the propagation of the left-most bit through a 6-round FFN structure with b = 8 branches. The rectangles correspond to distinct L-Box calls. A branch is highlighted if its value depends on the left-most word of the input. This illustrates the fast diffusion property described in Lemma 1.

(Figure 4 from the original paper: 6 rounds of the FFN structure for b = 8, showing how X_0^0 propagates through the network.)

Appendix B: Example of Binary Matrices Corresponding to a FFN

The matrix presented in the original paper has the following properties:

It has full rank.
33585 \approx 0.51 \times 2^{16} of its coefficients are equal to 1.
It can be evaluated using a 10-round FFN with 32 independent and random 4-bit linear permutations used as L-boxes in each round. A new L-box layer is used for each round.

Its inverse has similar properties.

(Figure 5 from the original paper: A matrix M and its inverse M^{-1} corresponding to a 10-round 256-bit FFN with b = 32, w = 4. Black means 1, white means 0.)

History

2026-02-17Add disclaimer: content not author-approved, eprint is authoritative6638546
2026-02-16Add CONVERTED_DATE to existing 47 paper pages7191c14
2026-02-16Add crawler metadata to all 47 paper pagesc6638f2
2026-02-14Add 36 new paper pages and update papers index6e99f38