Post-Quantum Cryptography for Embedded Systems: Challenges and Accelerator Integration over RISC-V Architecture
Introduction
Information security plays an increasingly crucial and pervasive role, particularly for embedded systems: low-power, resource-constrained devices used in critical applications such as automotive, IoT and industrial control systems. However, the advent of quantum computers threatens the security of current public-key cryptography based on classical algorithms such as RSA and ECC, which a sufficiently large quantum computer could break using Shor's algorithm. To address this issue, the National Institute of Standards and Technology (NIST) started a standardization process for quantum-resistant algorithms, better known as Post-Quantum Cryptography (PQC) algorithms. The transition to PQC faces several technical obstacles, especially in embedded systems, where the new cryptographic primitives introduce higher algorithmic complexity than traditional schemes.
The intensive mathematical operations involved in the newly selected schemes, such as large polynomial multiplications and hash routines, pose significant challenges in terms of performance and energy consumption, making optimized software/hardware strategies crucial. Given the limited resources available, one of the major aspects of the transition to PQC in embedded systems is the need to accelerate cryptographic operations while keeping hardware area occupation low. The CRYSTALS-Kyber and CRYSTALS-Dilithium algorithms, selected in the NIST process, are designed to withstand quantum attacks, but implementing them efficiently in embedded systems requires innovative solutions.
The role of hardware accelerators in embedded systems
Executing post-quantum algorithms like Kyber can be computationally demanding, especially on resource-constrained devices such as embedded systems. For this reason, adopting hardware accelerators is a key strategy to reduce execution time and energy consumption, enhancing overall efficiency. Hardware accelerators are specialized units designed to implement complex cryptographic routines efficiently. Compared with a general-purpose processor running software, accelerators perform specific computations, such as polynomial multiplications and cryptographic hashes, with a significant improvement in efficiency. One of the most suitable platforms for integrating PQC accelerators is RISC-V, an open-source architecture that supports custom instruction set extensions. Thanks to its flexibility, RISC-V allows dedicated hardware modules to be integrated to improve the computational efficiency of PQC implementations, adapting them to the specific needs of embedded systems. Depending on the level of integration with the main processor, accelerators can be more or less tightly coupled to the system architecture.
An interesting approach is represented by loosely-coupled accelerators, which work as independent modules, separate from the CPU, and can be reused across multiple applications. This solution is more flexible than tightly-coupled accelerators, which are integrated directly inside the CPU, but it introduces an inherent overhead due to CPU-accelerator communication.
Accelerator integration strategies on RISC-V
Loosely-coupled accelerators communicate with the processor via the system bus, the hardware interconnect that carries data between CPU, memory and peripherals, through register interfaces that follow standard protocols such as APB, AXI-Lite, OBI and AXI. This eases integration in RISC-V-based systems and ensures scalability across different application scenarios. The CPU-accelerator interaction is managed by dedicated software drivers, which handle initialization, DMA (Direct Memory Access) data transfers and hardware interrupts. DMA moves data between memory and peripherals without CPU involvement, enhancing overall system efficiency. In this article, we focus on the loosely-coupled approach, showing how hardware accelerators can significantly improve Kyber performance, reducing the computational cost compared with a purely software solution.
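The driver flow described above (program the registers, start the job, let the accelerator fetch and store data through DMA, wait for completion) can be sketched with a small software model. This is only an illustrative sketch: the register offsets, the bit definitions, and the names `MockAccelerator` and `run_job` are hypothetical and do not come from the article or from any real accelerator; the "kernel" is a placeholder byte-wise operation, and completion is polled instead of being signalled by an interrupt.

```python
# Hypothetical register map: none of these offsets or bits come from the article.
REG_CTRL, REG_STATUS, REG_SRC, REG_DST, REG_LEN = 0x00, 0x04, 0x08, 0x0C, 0x10
CTRL_START = 0x1
STATUS_DONE = 0x1


class MockAccelerator:
    """Software stand-in for a bus-attached accelerator with DMA access to memory."""

    def __init__(self, memory, kernel):
        self.memory = memory  # shared system memory (bytearray)
        self.kernel = kernel  # function modelling the hardware computation
        self.regs = {REG_CTRL: 0, REG_STATUS: 0, REG_SRC: 0, REG_DST: 0, REG_LEN: 0}

    def write(self, offset, value):
        """Model a register write arriving over the system bus."""
        self.regs[offset] = value
        if offset == REG_CTRL and value & CTRL_START:
            self._run()

    def read(self, offset):
        """Model a register read over the system bus."""
        return self.regs[offset]

    def _run(self):
        # DMA-like behaviour: the accelerator reads and writes memory itself,
        # so the CPU never copies the payload.
        src, dst, length = self.regs[REG_SRC], self.regs[REG_DST], self.regs[REG_LEN]
        result = self.kernel(bytes(self.memory[src:src + length]))
        self.memory[dst:dst + len(result)] = result
        self.regs[REG_STATUS] |= STATUS_DONE


def run_job(acc, src, dst, length):
    """Driver-side sequence: configure, start, then wait for completion."""
    acc.write(REG_STATUS, 0)
    acc.write(REG_SRC, src)
    acc.write(REG_DST, dst)
    acc.write(REG_LEN, length)
    acc.write(REG_CTRL, CTRL_START)
    while not (acc.read(REG_STATUS) & STATUS_DONE):
        pass  # a real driver would typically sleep until an interrupt fires


memory = bytearray(64)
memory[0:4] = b"\x01\x02\x03\x04"
acc = MockAccelerator(memory, lambda data: bytes(b ^ 0xFF for b in data))
run_job(acc, src=0, dst=32, length=4)
```

The CPU-side cost in this model is a handful of register accesses per job, which is the communication overhead that a loosely-coupled design trades for reusability.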
Acceleration of computationally demanding functions
The hardware implementation made it possible to optimize several fundamental operations, drastically reducing execution time, measured in clock cycles, compared with the software implementation. A clock cycle is one period of the clock signal that paces a processor or other digital circuit; simple operations, such as updating a value or performing an elementary arithmetic operation, typically take one or a few cycles. In other words, it is the unit of time that sets the rhythm of the system's operations.
Keccak
Figure 1. Keccak accelerator scheme
One of the most demanding operations, in terms of both clock cycles and energy consumption, is the Keccak permutation, the main building block of SHA-3, used to compute cryptographically secure hashes. Keccak is a hash function based on the sponge construction, selected by NIST as the SHA-3 standard for its resistance to cryptanalytic attacks and its hardware efficiency. The accelerator depicted in Figure 1 implements the Keccak-f[1600] permutation with some optimizations, such as a simplified round-constant generator. Moreover, storing the complete state in a single register (shown in gray in Figure 1) balances resource usage and maximizes efficiency. Hardware acceleration improves the performance of the permutation by more than an order of magnitude compared with a software approach: the number of clock cycles drops from 56,529 to 4,169, a reduction factor of about 13.56.
Operation | Original clock cycles | Optimized clock cycles | Improvement |
Keccak-f[1600] | 56,529 | 4,169 | 13.56× |
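As a software point of reference for what the accelerator computes, the following is a minimal, unoptimized sketch of the Keccak-f[1600] permutation, following the public Keccak specification. The round constants are produced by the small LFSR defined in that specification; whether the accelerator's "simplified generator" works the same way is not stated here, so treat that link as an assumption. The helper `sha3_256_single_block` is a hypothetical name and only handles messages shorter than one 136-byte block, which is enough to compare against Python's `hashlib`.

```python
import hashlib

MASK64 = (1 << 64) - 1


def rol64(value, shift):
    """Rotate a 64-bit lane left by `shift` bits."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64


def keccak_f1600(lanes):
    """Apply the 24 rounds of Keccak-f[1600] to a 5x5 array lanes[x][y] of 64-bit ints."""
    lanes = [column[:] for column in lanes]
    lfsr = 1  # state of the round-constant LFSR
    for _ in range(24):
        # theta: mix each column parity into the neighbouring columns
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate each lane and move it to a new position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi: the only non-linear step, applied row by row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: XOR the round constant, generated bit by bit by the LFSR
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def sha3_256_single_block(message):
    """SHA3-256 for messages shorter than one rate block (136 bytes)."""
    rate = 136
    assert len(message) < rate, "sketch handles a single block only"
    block = bytearray(200)
    block[:len(message)] = message
    block[len(message)] ^= 0x06  # SHA-3 domain separation and first padding bit
    block[rate - 1] ^= 0x80      # final padding bit
    lanes = [[int.from_bytes(block[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
              for y in range(5)] for x in range(5)]
    lanes = keccak_f1600(lanes)
    state = b"".join(lanes[x][y].to_bytes(8, "little")
                     for y in range(5) for x in range(5))
    return state[:32]


print(sha3_256_single_block(b"abc") == hashlib.sha3_256(b"abc").digest())
```

Each of the 24 rounds touches all 1600 state bits, which is why a word-oriented CPU needs tens of thousands of cycles for one permutation while dedicated logic can process the whole state in parallel.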
Number Theoretic Transform

Figure 2. NTT/INTT accelerator scheme
Another crucial operation is the Number Theoretic Transform (NTT), together with its inverse (INVNTT), which is the main tool for performing optimized polynomial multiplication. The proposed accelerator implements a butterfly unit able to perform both the NTT and its inverse. This unit is the fundamental element of transforms such as the NTT and the FFT (Fast Fourier Transform): it combines pairs of coefficients through modular additions, subtractions and multiplications by constants, which is what makes the transform computationally efficient. The constants involved in the transform are precomputed and stored in ROM to reduce the number of computations during execution. The overall architecture includes memories dedicated to the input and output polynomials, equipped with multiplexers to manage data storage and retrieval. The control unit orchestrates the processing, launching operations according to start signals. The hardware implementation of these transforms shows a significant acceleration, reducing the cost by a factor of roughly 15 to 20 compared with a CPU.
Operation | Original clock cycles | Optimized clock cycles | Improvement |
NTT | ≈24,000 | 1,531 | 15.68× |
INVNTT | ≈30,000 | 1,531 | 19.60× |
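To make the butterfly and the role of the precomputed constants concrete, here is a plain software sketch of the transform pair and of NTT-based polynomial multiplication. It assumes the standardized Kyber (ML-KEM) parameters q = 3329, n = 256 and root of unity 17, and follows the structure of the public specification rather than the accelerator's internal design, which the article does not detail. The function names and the `schoolbook_multiply` cross-check are illustrative.

```python
import random

Q, N, ZETA = 3329, 256, 17  # modulus, polynomial length, primitive 256th root of unity


def bitrev7(i):
    """Reverse the 7-bit binary representation of i."""
    return int(f"{i:07b}"[::-1], 2)


# Precomputed constants: the counterpart of the accelerator's ROM contents.
ZETAS = [pow(ZETA, bitrev7(i), Q) for i in range(128)]
GAMMAS = [pow(ZETA, 2 * bitrev7(i) + 1, Q) for i in range(128)]


def ntt(f):
    """Forward transform: seven layers of butterflies on 256 coefficients."""
    f, k, length = list(f), 1, 128
    while length >= 2:
        for start in range(0, N, 2 * length):
            zeta = ZETAS[k]
            k += 1
            for j in range(start, start + length):
                t = zeta * f[j + length] % Q          # modular multiplication
                f[j + length] = (f[j] - t) % Q        # modular subtraction
                f[j] = (f[j] + t) % Q                 # modular addition
        length //= 2
    return f


def intt(f):
    """Inverse transform, using the same constants in reverse order."""
    f, k, length = list(f), 127, 2
    while length <= 128:
        for start in range(0, N, 2 * length):
            zeta = ZETAS[k]
            k -= 1
            for j in range(start, start + length):
                t = f[j]
                f[j] = (t + f[j + length]) % Q
                f[j + length] = zeta * (f[j + length] - t) % Q
        length *= 2
    return [x * 3303 % Q for x in f]  # 3303 is the inverse of 128 modulo Q


def multiply_ntts(a, b):
    """Multiply in the NTT domain: 128 independent degree-1 products."""
    c = [0] * N
    for i in range(128):
        a0, a1, b0, b1 = a[2 * i], a[2 * i + 1], b[2 * i], b[2 * i + 1]
        c[2 * i] = (a0 * b0 + a1 * b1 * GAMMAS[i]) % Q
        c[2 * i + 1] = (a0 * b1 + a1 * b0) % Q
    return c


def schoolbook_multiply(a, b):
    """Quadratic-time product modulo X^256 + 1, used only as a cross-check."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] = (c[i + j] + a[i] * b[j]) % Q
            else:
                c[i + j - N] = (c[i + j - N] - a[i] * b[j]) % Q
    return c


rng = random.Random(0)
a = [rng.randrange(Q) for _ in range(N)]
b = [rng.randrange(Q) for _ in range(N)]
product = intt(multiply_ntts(ntt(a), ntt(b)))
print(product == schoolbook_multiply(a, b))
```

Each transform performs 7 × 128 = 896 butterflies, against 65,536 coefficient products for the schoolbook method, and the same butterfly datapath serves both directions, which matches the shared NTT/INVNTT unit described above.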
Results
The table below shows Kyber performance with and without the hardware accelerators described above. Thanks to the loosely-coupled accelerators, the results show a significant improvement in key generation, encapsulation and decapsulation.
Algorithm | Function | Original clock cycles | Optimized clock cycles | Improvement |
Kyber512 | KeyGen | 1,052,145 | 292,660 | 3.59× |
Kyber512 | Encaps | 1,106,228 | 365,167 | 3.03× |
Kyber512 | Decaps | 1,231,155 | 460,374 | 2.67× |
Kyber768 | KeyGen | 1,674,185 | 315,149 | 5.31× |
Kyber768 | Encaps | 1,789,912 | 418,661 | 4.29× |
Kyber768 | Decaps | 1,968,664 | 533,930 | 3.69× |
Kyber1024 | KeyGen | 2,612,887 | 609,948 | 4.29× |
Kyber1024 | Encaps | 2,757,344 | 563,515 | 4.89× |
Kyber1024 | Decaps | 2,983,573 | 741,062 | 4.02× |
In the case of Kyber512, the accelerators reduce the number of clock cycles for key generation from 1,052,145 to 292,660, a performance improvement factor of 3.59. For encapsulation, the number of clock cycles decreases from 1,106,228 to 365,167, a reduction of approximately 3.03 times, while for decapsulation it drops from 1,231,155 to 460,374, a reduction factor of 2.67. The analysis shows a similar trend for the other Kyber parameter sets, Kyber768 and Kyber1024, where hardware acceleration also considerably reduces the number of clock cycles.
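The improvement factors are simply the ratio between the original and the optimized cycle counts. A short sketch, using the Kyber512 figures quoted above (the function name `speedup` is illustrative), recomputes them; the results agree with the reported factors to within about 0.01, the small differences being due to rounding.

```python
def speedup(cycles_software, cycles_accelerated):
    """Improvement factor: how many times fewer cycles the accelerated run needs."""
    return cycles_software / cycles_accelerated


# Cycle counts quoted in the text for Kyber512: (software only, with accelerators)
kyber512 = {
    "KeyGen": (1_052_145, 292_660),
    "Encaps": (1_106_228, 365_167),
    "Decaps": (1_231_155, 460_374),
}

factors = {name: speedup(*cycles) for name, cycles in kyber512.items()}
for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x")
```

Note that the whole-scheme factors (roughly 3 to 5) are lower than the per-kernel factors (roughly 13 to 20), since only part of each Kyber operation runs on the accelerators.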
Conclusion
Because of its demanding resource requirements, integrating PQC into embedded systems is a challenging objective. The use of hardware accelerators on the RISC-V architecture offers an interesting trade-off in terms of performance, efficiency and scalability. Research in this field continues to explore solutions that reduce area occupation and improve the adaptability of accelerators to many application scenarios. This is fostering the transition to a secure digital future even in the quantum era.
This article belongs to a series of contributions, edited by the Telsy Cryptography Research Group, devoted to quantum computing and its implications on Cryptography. For reference to other articles, please refer to the index.
The author
Alessandra Dolmeta graduated in Electronic Engineering from the Politecnico di Torino and is currently pursuing a PhD in Electrical, Electronic, and Communications Engineering in collaboration with Telsy. Her research focuses on the development of hardware architectures for post-quantum cryptography on RISC-V, aiming to optimize and accelerate PQC algorithms.