Efficient finite field computations for elliptic curve cryptography

Wangchen Dai
University of Windsor
EFFICIENT FINITE FIELD COMPUTATIONS FOR ELLIPTIC CURVE CRYPTOGRAPHY

by

WANGCHEN DAI

APPROVED BY:

Dr. D. Wu
School of Computer Science

Dr. C. Chen
Department of Electrical and Computer Engineering

Dr. H. Wu, Advisor
Department of Electrical and Computer Engineering

December 11, 2013
Author’s Declaration of Originality

I hereby certify that I am the sole author of this thesis and that no part of this thesis has been published or submitted for publication.

I certify that, to the best of my knowledge, my thesis does not infringe upon anyone's copyright nor violate any proprietary rights and that any ideas, techniques, quotations, or any other material from the work of other people included in my thesis, published or otherwise, are fully acknowledged in accordance with the standard referencing practices. Furthermore, to the extent that I have included copyrighted material that surpasses the bounds of fair dealing within the meaning of the Canada Copyright Act, I certify that I have obtained a written permission from the copyright owner(s) to include such material(s) in my thesis and have included copies of such copyright clearances to my appendix.

I declare that this is a true copy of my thesis, including any final revisions, as approved by my thesis committee and the Graduate Studies office, and that this thesis has not been submitted for a higher degree to any other University or Institution.
Abstract

Finite field multiplication and inversion are two basic operations in elliptic curve cryptosystems (ECC), so high-performance field operations directly improve the efficiency of ECC computation. In this thesis, two classes of fields are proposed for multipliers with much reduced time delay. A most-significant-digit first and a least-significant-digit first digit-serial Montgomery multiplication are also proposed, using novel fixed elements $R(x)$ that differ from $x^m$ and $x^{m-1}$. Architectures of the proposed Montgomery multipliers are studied and obtained for the fields generated by irreducible pentanomials, which are selected based on the proposed special finite fields. The complexities of the Montgomery multipliers, in terms of critical path delay and gate count of the architectures, are investigated; the critical path delay of the proposed multipliers is found to be as good as or better than that of existing works for the same class of fields. An implementation of the proposed multipliers ($m = 233$) on a Field Programmable Gate Array (FPGA) is then provided. In addition, an FPGA implementation of an efficient normal basis inversion algorithm is presented ($m = 163$). The normal basis multiplication unit is implemented using a digit-level structure, and a C program is written to generate the first coordinate of the product of two normal basis elements for any field size $m$.

Key Words: Montgomery multiplication, digit-serial, Elliptic Curve Cryptography, normal basis inverse, FPGA.
Dedication

I dedicate this thesis to my parents for supporting me to accomplish my master’s degree at University of Windsor in Canada.
Acknowledgments

I would like to express my sincere gratitude and appreciation to everyone who helped make this thesis possible. I am deeply indebted to my supervisor Prof. Huapeng Wu, Professor of Electrical and Computer Engineering at University of Windsor, for guiding me throughout the writing of this thesis. As one of the best teachers I have ever had, Professor Wu impressed upon me that a good teacher instructs students in matters far beyond those in textbooks. His broad knowledge and logical way of thinking have been of great value; without his detailed and constructive comments on my research, none of this thesis would have been possible.

I am also grateful to my colleagues and friends, Yiruo He, Ya Tan, Ran Xiao and Shoaleh Hashemi Namin, for their time and support.

Finally, I wish to extend my gratitude to everyone at UWindsor’s Faculty of ECE for their efforts during my study in the M.A.Sc. Program. I also gratefully acknowledge the financial support from University of Windsor and Professor Huapeng Wu.
# Contents

Author’s Declaration of Originality  
Abstract  
Dedication  
Acknowledgments  
List of Figures  
List of Tables  
List of Appendices  
List of Abbreviations/Symbols  

1 Introduction  

2 Mathematical Preliminaries  
  2.1 Finite Field and Representations  
  2.2 Montgomery Multiplication over $GF(2^m)$  
  2.3 Elliptic Curve Cryptosystem  
    2.3.1 Elliptic Curves  
    2.3.2 Finite Field Inversion Using Normal Basis  
    2.3.3 Elliptic Curve Cryptosystem  

3 A Review of Existing Work  

4 Proposed Digit-serial Montgomery Multipliers  
  4.1 Proposed Digit-Serial MSD First Montgomery Multiplier  
    4.1.1 Algorithm  
    4.1.2 General Architecture  
    4.1.3 Advanced Architecture  
  4.2 Proposed Digit-Serial LSD First Montgomery Multiplier  
    4.2.1 Algorithm  
    4.2.2 General Architecture  
    4.2.3 LFSR-Based Architecture  
  4.3 Complexity Analysis  
  4.4 FPGA Implementation of the Proposed Multipliers  
    4.4.1 Summary of the MSD-First Multiplier Implementation  
    4.4.2 Summary of the LSD-First Multiplier Implementation  

5 FPGA Implementation of Inverse Generator  
  5.1 The Design of Inverse Generator  
    5.1.1 REG1 Module  
    5.1.2 REG2 Module  
    5.1.3 MUX Module  
    5.1.4 Digit-level Normal Basis Multiplier Module and Multiplication Algorithm  
    5.1.5 Top-Level  
  5.2 Simulation and Compilation  
    5.2.1 Simulation Results  
    5.2.2 Compilation Results  

6 Conclusions  

A C-code of $F(s)$ and the First Coordinate $c_0$ Generation  

B Generated VerilogHDL-code of the First Coordinate $c_0$  

Bibliography  

Vita Auctoris
List of Figures

2.1 Operations in an elliptic curve
2.2 Elliptic curve over binary field $GF(2^m)$
2.3 Encryption/decryption of elliptic curve cryptosystem
2.4 Computation structure of ECC over $GF(2^m)$
3.1 (a) Tang’s architecture of $GF(2^{233})$ multiplier [17] (b) Kumar’s architecture of $GF(2^m)$ multiplier [19]
3.2 Tang’s architecture of partial product multiplier, generates the product of $A_j \times B$ [17]
3.3 Meher’s block diagram of proposed field multiplier over $GF(2^m)$ [24]
3.4 Work reported in [28], (a) $R(x) = x^m$, (b) $R(x) = x^{m-1}$
4.1 Block diagram of proposed digit-serial MSD-first Montgomery multiplier when $R(x) = x^l$
4.2 General architecture of the proposed multiplier when $R(x) = x^u$
4.3 Implementation of equation (4.11)
4.4 Model 1: multiply by $x$ structure
4.5 Implementation of computation $A(x)x^2 \mod f(x)$
4.6 Model 2: multiply by $x^{-1}$ structure
4.7 Implementation of $A(x)B_{s-i-1}(x)x^{-l} \mod f(x)$
4.8 Advanced architecture of proposed multiplier
4.9 General architecture of the proposed digit-serial LSD first multiplier
4.10 LFSR-based architecture of the proposed LSD Montgomery multiplier
5.1 Architecture of the designed inverse generator
5.2 Block diagram of the inverse generator for FPGA implementation
5.3 REG1 module
5.4 REG2 module
5.5 MUX module
5.6 Digit-level normal basis multiplier module
5.7 Digit-level Normal Basis multiplier structure
5.8 Simulation result of the Inverse Generator
5.9 RTL of the design
5.10 Technology map viewer of the design
List of Tables

1.1 Key size comparison between RSA and ECC with same secure level
2.1 Algorithm of Binary Field Bit-Parallel Montgomery Multiplication
3.1 Algorithm of Bit-Serial Montgomery Multiplication
3.2 Algorithm of Digit-Serial Montgomery Multiplication, where $d$ is the digit size, $f_0(x) = 1 \mod x^d$, $C_0(x)$ and $f_0(x)$ are the least significant digits of $C(x)$ and $f(x)$, respectively
4.1 Digit-serial MSD-first Montgomery Multiplier ($R(x) = x^l$), where $0 \leq l \leq d - 1$
4.2 Complexity of each block of the proposed MSD-first Montgomery multiplier
4.3 Complexity of proposed digit-serial MSD-first Montgomery multiplication (Algorithm I, general architecture, when $k_{i+1} - k_i \geq d - 1, k_0 = 0, k_4 = m$ and $0 \leq l \leq d - 1$)
4.4 Complexity of proposed digit-serial MSD-first Montgomery multiplication (Algorithm I, advanced architecture, when $k_{i+1} - k_i \geq \max\{l, d - l - 1\}, i = 0, 1, 2, 3, k_0 = 0, k_4 = m$ and $0 \leq l \leq d - 1$)
4.5 Digit-serial LSD-first Montgomery Multiplier ($R(x) = x^l$), where $l \geq 0$
4.6
4.7 Complexity of digit-serial LSD Montgomery multiplication (Algorithm II, when $1 \leq l \leq d - 1$ and $k_{i+1} - k_i \geq d - 1, k_0 = 0, k_4 = m$)
4.8 Complexity of digit-level Montgomery multiplication (Algorithm II, when $l = d$, and $k_{i+1} - k_i \geq d - 1, k_0 = 0, k_4 = m$)
4.9 Degree range of each term of equation (4.22)
4.10 Value of $l$ in terms of XOR gate usage of block S1
4.11 Complexity of digit-level Montgomery multiplication (Algorithm II, when $l > d$, and $k_{i+1} - k_i \geq l$, $k_0 = 0$, $k_4 = m$)
4.12 LFSR-Based Digit-serial LSD-first Montgomery Multiplier ($R(x) = x^l$), where $0 \leq l \leq d - 1$
4.13 Complexity of digit-level Montgomery multiplication (Algorithm III, when $0 \leq l \leq d - 1$, and $k_{i+1} - k_i \geq \max\{l, d - l - 1\}$, $k_0 = 0$, $k_4 = m$)
4.14 Intrinsic delay of XOR2 and AND2 gate, we assume each gate could drive a maximum of two gates ($25^\circ C$, 1.8V, CMOSP18 Tech., $Y = A \cdot B$, or $Y = A \oplus B$)
4.15 Digit-serial Montgomery multipliers comparison ($f(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$, $s = m/d$)
4.16 Proposed multipliers compared with Polynomial Basis finite field multipliers (MSD cases, $f(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$, $s = \lceil m/d \rceil$, $T_{DFF}$ represents the time delay of a D-flipflop)
4.17 Proposed multipliers compared with Polynomial Basis finite field multipliers (LSD cases, $T_M$ represents the time delay of a $2 \times 1$ Multiplexer, $T_{TFF}$ represents the time delay of a T-flipflop)
4.18 Efficiency of the proposed multipliers and existing Montgomery multipliers ($m = 233$, $d = 8$, if $l < d$, then $l = 4$)
4.19 Efficiency of the proposed multipliers and existing PB multipliers ($m = 233$, $d = 8$, if $l < d$, then $l = 4$)
4.20 Cells usage of compilation ($m = 233$, $d = 8$, $u = 4$)
4.21 Gate count of each module ($m = 233$, $d = 8$, $u = 4$)
4.22 Time complexity of the design ($m = 233$, $d = 8$, $u = 4$)
4.23 Cells usage of compilation ($m = 233$, $d = 8$, $l = 4$)
4.24 Gate count of each module ($m = 233$, $d = 8$, $l = 4$)
4.25 Time complexity of the design ($m = 233$, $d = 8$, $l = 4$)
5.1 Description of Each Clock cycle
5.2 Cells usage of compilation
5.3 Area cost of each module
5.4 Operation delay of the design Inverse Generator over $GF(2^{163})$
List of Appendices

C-code of $F(s)$ and the First Coordinate $c_0$ Generation
Generated VerilogHDL-code of the First Coordinate $c_0$
**List of Abbreviations/Symbols**

<table>
<thead>
<tr>
<th>Abbreviation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>GF</td>
<td>Finite Field or Galois Field</td>
</tr>
<tr>
<td>PB</td>
<td>Polynomial Basis</td>
</tr>
<tr>
<td>NB</td>
<td>Normal Basis</td>
</tr>
<tr>
<td>EC</td>
<td>Elliptic Curve</td>
</tr>
<tr>
<td>ECC</td>
<td>Elliptic Curve Cryptosystems</td>
</tr>
<tr>
<td>RSA</td>
<td>Rivest, Shamir, Adleman</td>
</tr>
<tr>
<td>FPGA</td>
<td>Field Programmable Gate Array</td>
</tr>
<tr>
<td>ALUT</td>
<td>Adaptive Look Up Tables</td>
</tr>
<tr>
<td>MSD</td>
<td>Most Significant Digit</td>
</tr>
<tr>
<td>LSD</td>
<td>Least Significant Digit</td>
</tr>
<tr>
<td>LE</td>
<td>Logic Element</td>
</tr>
<tr>
<td>MUX</td>
<td>Multiplexer</td>
</tr>
<tr>
<td>XOR</td>
<td>Exclusive OR</td>
</tr>
<tr>
<td>TFF</td>
<td>T-Flipflop</td>
</tr>
<tr>
<td>DFF</td>
<td>D-Flipflop</td>
</tr>
<tr>
<td>LFSR</td>
<td>Linear Feedback Shift Register</td>
</tr>
<tr>
<td>VLSI</td>
<td>Very Large Scale Integrated Circuits</td>
</tr>
</tbody>
</table>

Chapter 1

Introduction

The development of cryptography can be divided into the following two stages [1]: classical cryptography, and modern cryptography. Classical cryptography was the study of the confidentiality of a message through encryption and decryption. An encryption operation can be described as the conversion of a message or a piece of information from comprehensible text into some incomprehensible form. Transposition cipher [3] and substitution cipher [2] are two representative classical ciphers.

Due to the rapid development of computer and network technologies, and the worldwide application of on-line trading services, mobile phones, and credit cards, the increasing threat to personal privacy and information security is becoming a significant challenge to security engineers. In this context, cryptography is no longer just a concern for governments, but for civilians as well. The field has therefore expanded far beyond communication confidentiality to include identity authentication, digital signatures, message integrity verification, etc. This extension has led to modern cryptography. Cipher algorithms in modern cryptography use a key to encrypt and decrypt information. The Data Encryption Standard (DES) and the Advanced Encryption Standard (AES) are two symmetric cipher algorithms of modern cryptography; in these algorithms, encryption and decryption share the same key. The problem is that, as more users come to know the key over time, the risk of a security breach increases: once the key is revealed by one of the users, the entire cryptosystem is no longer secure.

During the 1970s, the public-key cryptosystem, regarded as the most notable advance in the field of cryptography since World War II, was invented [1]. A public-key system is an asymmetric key system that encrypts with a public key and decrypts with a private key. The concept of public-key cryptography was first raised in 1976 by Diffie and Hellman [6]; they demonstrated that secure network communication is possible when the public key is widely distributed while its paired private key remains secret. RSA, first published in 1978 by Rivest, Shamir, and Adleman [7], is considered to be the most widely used public-key cryptosystem. To break RSA, a large-integer factorization problem must be solved. Later, the elliptic curve cryptosystem (ECC), another public-key system, was proposed by Koblitz [8] and Miller [9] between 1985 and 1987. Breaking an ECC is equivalent to solving a discrete logarithm problem on an elliptic curve. An RSA instance with a 768-bit key size was broken in 2010 [4], while the hardest ECC scheme broken at present has only a 112-bit key size [5]; for a given key size, ECC therefore appears to be harder to break than RSA. The following table lists the key sizes of these two public-key cryptosystems at equivalent security levels.

<table>
<thead>
<tr>
<th>RSA (bit)</th>
<th>ECC (bit)</th>
<th>Key size ratio (RSA : ECC)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1024</td>
<td>160</td>
<td>6 : 1</td>
</tr>
<tr>
<td>2048</td>
<td>224</td>
<td>9 : 1</td>
</tr>
<tr>
<td>3072</td>
<td>256</td>
<td>12 : 1</td>
</tr>
<tr>
<td>7680</td>
<td>384</td>
<td>20 : 1</td>
</tr>
</tbody>
</table>

ECC uses a binary field $GF(2^m)$ or a prime field $GF(p)$. Encryption and decryption speed is an important indicator for evaluating an ECC algorithm. The efficiency of finite field arithmetic has a great impact on the performance of an ECC, since an ECC computation consists of a set of point operations, and field multiplication and field inversion are the basic operations involved in each point operation. Because field inversion itself requires field multiplications, a large number of studies aim at high-speed and efficient implementations of field multiplication.

The binary field $GF(2^m)$ is widely used because it is very suitable for VLSI implementation. However, field multiplication is complicated and time-consuming, and its efficient computation is one of the critical issues of public-key based cipher algorithms. In 1985, Montgomery introduced a new method for integer modular multiplication [10] and showed that the time-consuming trial division operation can be avoided. Later, Koc [12] extended the method to binary fields and showed that binary field Montgomery multiplication can be implemented dramatically faster than standard multiplication. A number of Montgomery multipliers have been designed. In general, the existing Montgomery multipliers can be divided into two styles: general styles, which include bit-serial, bit-parallel, and digit-level sub-types, and systolic styles. Bit-serial multipliers have the lowest gate count but require the longest time to process one operation. In contrast, bit-parallel multipliers have the smallest time delay but require the largest implementation area. Digit-level multipliers combine the advantages of both and balance gate count against critical path delay by processing a fixed number of bits in each clock cycle.

The work reported in this thesis focuses mainly on the efficient computation and hardware implementation of digit-serial Montgomery multiplication. A most-significant-digit first and a least-significant-digit first digit-serial Montgomery multiplier are proposed; two novel fixed elements $R(x)$, which differ from the commonly used ones ($x^{m-1}$ and $x^m$), are applied. Two classes of fields for the multipliers with much reduced critical path delay are also proposed. Architectures of the proposed Montgomery multipliers are studied and obtained for the fields generated by irreducible pentanomials. The complexities of the proposed multipliers in terms of gate count and critical path delay of the architecture are investigated, and it is demonstrated that the critical path delay of the proposed multipliers can be further reduced by applying the special finite fields. The contributions of this research also include an FPGA implementation of the proposed Montgomery multipliers for the case $m = 233$. Furthermore, an FPGA implementation of a normal basis inversion algorithm in $GF(2^m)$ is presented in this thesis.

The outline of this thesis is as follows. Chapter 2 presents the mathematical background of the finite field, digit-level Montgomery multiplication, elliptic curve cryptosystem, and some other related equations and concepts. A review of the existing literature is presented in Chapter 3. Chapter 4 presents the details of the proposed digit-serial Montgomery multipliers, the comparison results of the proposed Montgomery multipliers in terms of cell usage and critical path delay, the FPGA simulation, and a compilation report of the proposed works. Chapter 5 describes the FPGA implementation of the normal basis inversion generator and provides the results. Finally, the last chapter discusses the conclusions and further work.
Chapter 2

Mathematical Preliminaries

This chapter introduces the relevant mathematical background. The definition of finite field as well as its two general representation methods, the definition of Montgomery multiplication, and the algorithm of elliptic curve cryptosystem (ECC) will be introduced in turn.

2.1 Finite Field and Representations

A finite field (or Galois field) is a set of finitely many elements on which addition and multiplication are defined and satisfy the usual algebraic laws (commutativity, associativity, and distributivity) [14]. The number of elements contained in a finite field is called the order of the field. A finite field is denoted $GF(q)$, where $q$ is a prime power. The order of a nonzero element $A \in GF(q)$ is defined as the smallest positive integer $k$ such that $A^k = 1$, and $k$ always divides $q - 1$. In cryptography, two kinds of finite fields are commonly used: the prime field $GF(p)$, where $p$ is prime, and the binary extension field $GF(2^m)$, where $m$ is a positive integer greater than or equal to two. The field $GF(p)$ is represented simply as the set of integers modulo $p$; the binary field, in contrast, has several frequently used representations. Polynomial basis representation and normal basis representation are the two methods commonly used to represent a binary field element.

In a polynomial basis representation, every element in $GF(2^m)$ is represented by a unique polynomial of degree less than $m$. For example, element $A$ of $GF(2^m)$ can be represented as $A(x) = a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \cdots + a_1x + a_0 = (a_{m-1}a_{m-2}\cdots a_1a_0)$, and the coefficient $a_i$ of each term equals either 0 or 1. The polynomial basis is the set:

$$PB = \{x^{m-1}, x^{m-2}, \ldots, x^2, x, 1\} \quad (2.1)$$

The polynomial basis is well suited to representing elements of the binary field $GF(2^m)$. With this representation, addition in the binary field is a coefficient-wise XOR, and multiplication is defined as the product of the corresponding polynomials reduced modulo $f(x)$, where $f(x)$ is an irreducible polynomial that generates the binary field $GF(2^m)$; see equation (2.2). If exactly three of the coefficients $f_i$, where $1 < i < m$, are equal to one, we have an irreducible pentanomial $f(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$, where $1 < k_1 < k_2 < k_3 < m$. The work reported in this thesis focuses on the binary field because of its efficient implementation in both hardware and software.

$$f(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_1x + 1 = 0, \text{ where } f_i = 0 \text{ or } 1 \quad (2.2)$$
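The polynomial basis operations described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the toy field $GF(2^4)$ with $f(x) = x^4 + x + 1$ is chosen for readability (the thesis itself targets pentanomials and much larger $m$), elements are stored as integers whose bit $i$ holds the coefficient $a_i$, and the names `gf_add` and `gf_mul` are hypothetical.

```python
# Elements of GF(2^m) in polynomial basis are stored as Python ints:
# bit i holds the coefficient a_i of x^i.
M = 4
F = 0b10011  # f(x) = x^4 + x + 1, irreducible over GF(2) (illustrative choice)

def gf_add(a, b):
    """Addition in GF(2^m): coefficient-wise XOR."""
    return a ^ b

def gf_mul(a, b, f=F, m=M):
    """Shift-and-add product of a(x) and b(x), reduced modulo f(x).

    Both inputs are assumed to have degree less than m.
    """
    result = 0
    while b:
        if b & 1:            # current coefficient of b is 1: add a(x) * x^i
            result ^= a
        b >>= 1
        a <<= 1              # a(x) <- a(x) * x
        if (a >> m) & 1:     # degree reached m: subtract (XOR) f(x)
            a ^= f
    return result

print(bin(gf_add(0b1010, 0b0110)))  # (x^3 + x) + (x^2 + x) = x^3 + x^2
print(bin(gf_mul(0b1000, 0b0010)))  # x^3 * x = x^4 = x + 1 (mod f)
```

Reducing after every shift keeps the intermediate value below degree $m$, which mirrors the interleaved reduction commonly used by serial hardware multipliers.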

In normal basis representation, we use the basis set

$$NB = \{\theta^{2^{m-1}}, \theta^{2^{m-2}}, \ldots, \theta^2, \theta\} \quad (2.3)$$

to represent elements in the binary field, and elements $\theta^{2^i}$, where $i \in [0, m - 1]$, in the basis set must be linearly independent. Using normal basis, a binary field element $A = (a_{m-1}a_{m-2} \ldots a_1a_0)$ can be represented by equation (2.4):

$$A = a_{m-1}\theta^{2^{m-1}} + a_{m-2}\theta^{2^{m-2}} + \cdots + a_1\theta^2 + a_0\theta \quad (2.4)$$

Normal basis representation has the computational advantage that raising an element to the power $2^i$ can be implemented very efficiently by a cyclic left shift of its coordinates by $i$ positions; see equation (2.5) for the case $i = 2$. Multiplication, however, is complicated and time-consuming (see [14], Sections A.3.8 and A.6.4). For this reason, a special class of normal bases called Gaussian normal bases has been studied in order to minimize the complexity of multiplication.

$$A^{2^2} = (A^2)^2 = \left(a_{m-1}\theta^{2^{m \bmod m}} + a_{m-2}\theta^{2^{m-1}} + \cdots + a_1\theta^{2^2} + a_0\theta^2\right)^2 = a_{m-3}\theta^{2^{m-1}} + a_{m-4}\theta^{2^{m-2}} + \cdots + a_0\theta^{2^2} + a_{m-1}\theta^2 + a_{m-2}\theta \quad (2.5)$$
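The shift property of equation (2.5) can be checked numerically with a small Python sketch. The setup is an assumption made for illustration only: the toy field $GF(2^3)$ generated by $f(x) = x^3 + x^2 + 1$, whose root $\theta = x$ yields the normal basis $\{\theta, \theta^2, \theta^4\}$. The helper names (`nb_pow2k`, `nb_to_pb`) are hypothetical, and the polynomial basis multiplier serves only as a reference to compare against.

```python
M = 3
F = 0b1101  # f(x) = x^3 + x^2 + 1 (illustrative; theta = x gives a normal basis)

def gf_mul(a, b):
    """Reference polynomial-basis multiplication modulo f(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= F
    return r

def nb_pow2k(a, k, m=M):
    """A^(2^k) in normal basis: cyclic left shift of the coordinates by k."""
    k %= m
    mask = (1 << m) - 1
    return ((a << k) | (a >> (m - k))) & mask

# Polynomial-basis images of theta^(2^i), i = 0 .. m-1
THETA_POWERS = [0b010]
for _ in range(M - 1):
    THETA_POWERS.append(gf_mul(THETA_POWERS[-1], THETA_POWERS[-1]))

def nb_to_pb(a):
    """Convert normal-basis coordinates (bit i = a_i) to polynomial basis."""
    out = 0
    for i in range(M):
        if (a >> i) & 1:
            out ^= THETA_POWERS[i]
    return out

# Squaring via a shift agrees with squaring via field multiplication.
for a in range(1 << M):
    pb = nb_to_pb(a)
    assert nb_to_pb(nb_pow2k(a, 1)) == gf_mul(pb, pb)
print(bin(nb_pow2k(0b011, 2)))  # coordinates of A^(2^2) for A = (0 1 1)
```

In hardware the shift costs only wiring, which is why normal bases are attractive whenever many squarings are needed.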

### 2.2 Montgomery Multiplication over \(GF(2^m)\)

Montgomery multiplication was first proposed by Montgomery in 1985 [10] and was extended to binary fields by Koc in 1998 [12]. Unlike standard modular multiplication, Montgomery multiplication avoids trial division operations.

Montgomery multiplication in \(GF(2^m)\) is defined by equation (2.6).

\[ C(x) = A(x) \times B(x) \times R(x)^{-1} \mod f(x) \]  

(2.6)

**Table 2.1: Algorithm of Binary Field Bit-Parallel Montgomery Multiplication**

<table>
<thead>
<tr>
<th>Algorithm</th>
<th>Binary Field Bit-Parallel Montgomery Multiplication</th>
</tr>
</thead>
<tbody>
<tr>
<td>Inputs:</td>
<td>\(A(x), B(x) \in GF(2^m), f(x), f'(x)\)</td>
</tr>
<tr>
<td>Outputs:</td>
<td>\(C(x) = A(x) \times B(x) \times R(x)^{-1} \mod f(x)\)</td>
</tr>
<tr>
<td>Step 1:</td>
<td>\(T(x) = A(x)B(x)\)</td>
</tr>
<tr>
<td>Step 2:</td>
<td>\(U(x) = T(x)f'(x) \mod R(x)\)</td>
</tr>
<tr>
<td>Step 3:</td>
<td>\(C(x) = [T(x) + U(x)f(x)]/R(x)\)</td>
</tr>
</table>

Instead of computing the product \(A(x)B(x) \mod f(x)\) directly, Montgomery multiplication introduces an extra polynomial \(R(x)\) and computes \(A(x)B(x)R(x)^{-1} \mod f(x)\). Here \(f(x)\) is the irreducible polynomial that generates the binary field \(GF(2^m)\), and \(R(x)\) is treated as a fixed element in \(GF(2^m)\). Montgomery multiplication requires that \(R(x)\) and \(f(x)\) be relatively prime. Under this condition, we have the property that \(R(x) \cdot R(x)^{-1} + f(x)f'(x) = 1\), and the two polynomials \(R(x)^{-1}\) and \(f'(x)\) can be computed using the extended Euclidean algorithm [12]. Table 2.1 presents a bit-parallel binary field Montgomery multiplication algorithm. It has been shown that choosing \(R(x) = x^m\) yields efficient implementations of the Montgomery multiplier [12]. For example, in the algorithm of Table 2.1 with \(R(x) = x^m\), the reduction modulo \(R(x)\) in Step 2 is accomplished simply by discarding the terms of degree \(m\) or higher, and the division in Step 3 is implemented by shifting the polynomial to the right by \( m \) bits. In addition, the work in [25] shows that \( R(x) = x^{m-1} \) is also a suitable Montgomery factor for efficient implementation of Montgomery multiplication.
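The three steps of Table 2.1 can be traced with the following Python sketch for the case \(R(x) = x^m\). It is a software illustration under assumptions, not the hardware architecture studied later: the toy field \(GF(2^4)\) with \(f(x) = x^4 + x + 1\) is used, \(f'(x)\) is obtained by a simple coefficient-by-coefficient construction instead of the extended Euclidean algorithm, and all function names are hypothetical.

```python
M = 4
F = 0b10011  # f(x) = x^4 + x + 1 (illustrative field polynomial)

def clmul(a, b):
    """Carry-less (polynomial) product over GF(2), without reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def poly_mod(a, f=F):
    """Remainder of a(x) divided by f(x) over GF(2)."""
    df = f.bit_length() - 1
    while a.bit_length() - 1 >= df:
        a ^= f << (a.bit_length() - 1 - df)
    return a

def f_prime(f=F, m=M):
    """f'(x) with f(x) f'(x) = 1 (mod x^m); it exists because f_0 = 1."""
    g = 1
    for i in range(1, m):
        if (clmul(f, g) >> i) & 1:   # coefficient i of f*g must be cleared
            g |= 1 << i
    return g

def mont_mul(a, b, f=F, m=M):
    """C(x) = A(x) B(x) R(x)^(-1) mod f(x) with R(x) = x^m (Table 2.1)."""
    mask = (1 << m) - 1
    t = clmul(a, b)                       # Step 1: T = A * B
    u = clmul(t, f_prime(f, m)) & mask    # Step 2: U = T * f' mod x^m
    return (t ^ clmul(u, f)) >> m         # Step 3: (T + U * f) / x^m

# Multiplying the result by R(x) mod f(x) recovers the ordinary product.
a, b = 0b0110, 0b1011
c = mont_mul(a, b)
assert poly_mod(clmul(c, poly_mod(1 << M))) == poly_mod(clmul(a, b))
print(bin(c))
```

In Step 3 the sum \(T(x) + U(x)f(x)\) has no terms of degree below \(m\), so the right shift is an exact division and the result already has degree less than \(m\).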

2.3 Elliptic Curve Cryptosystem

2.3.1 Elliptic Curves

Elliptic curves (EC) are a set of curves that satisfy equation (2.7), where \( b \neq 0 \), see Section A.9.1 in [14].

\[
y^2 + xy = x^3 + ax^2 + b
\]  

(2.7)

On an elliptic curve, two point operations can be defined [14]: point addition and point doubling. One special point called point at infinity or zero point is also defined, see Fig 2.1.

![Figure 2.1: Operations in an elliptic curve](image)

In Fig 2.1(a), the two points \( P = (x_1, y_1) \) and \( Q = (x_2, y_2) \) are distinct. The line \( PQ \) intersects the curve at a third point \(-R\); drawing a vertical line through \(-R\) gives its reflection point \( R = (x_3, y_3) \) on the curve. The point addition operation is then defined as \( R = P + Q \), and \( x_3 \) and \( y_3 \) can be calculated by the equations presented in (2.8).

\[
\begin{aligned}
x_3 &= a + \lambda^2 + \lambda + x_1 + x_2 \\
y_3 &= \lambda(x_1 + x_3) + x_3 + y_1 \\
\lambda &= (y_1 + y_2)/(x_1 + x_2)
\end{aligned}
\]

(2.8)

In Fig 2.1(b), points \( P \) and \( Q \) coincide at point \( P \). A tangent line drawn through \( P = (x_1, y_1) \) intersects the curve at point \( -R \), and the reflection of \( -R \) is \( R = (x_3, y_3) \). In this case, the point doubling operation is defined as \( R = 2P \). The equations in (2.9) give the coordinates of \( R \).

\[
\begin{aligned}
x_3 &= a + \lambda^2 + \lambda \\
y_3 &= \lambda x_3 + x_3 + x_1^2 \\
\lambda &= x_1 + y_1/x_1
\end{aligned}
\]

(2.9)

By combining point addition and point doubling operations, point scalar multiplication can be defined. For example, \( 5P = 4P + P = 2(2P) + P \), which shows that one point addition and two point doubling operations are required to obtain \( 5P \).

In the third case of Fig 2.1, the line \( P(-P) \) is perpendicular to the x-axis. Mathematically, we assume the line intersects the curve at a third point at infinity, and define this third point as the point at infinity or zero point, denoted as \( O \). According to this definition, we have: \( P + (-P) = O \), \( P + O = P \), \( O = -O \), and \( P + Q + (-R) = O \), where \( -R \) is the third intersection point of Fig 2.1(a). The set of points on the elliptic curve, together with \( O \), forms an Abelian group, which implies that the point operations satisfy the common algebraic laws of commutativity and associativity.

When the elliptic curve is defined over the binary field \( GF(2^m) \), we have \( a, b \in GF(2^m) \), and every operation in equations (2.8) and (2.9) is carried out modulo \( f(x) \), the irreducible polynomial that generates \( GF(2^m) \). In Fig 2.2, we see that the binary field elliptic curve is no longer a "curve" consisting of infinitely many points over the real numbers; instead, it consists of finitely many points, distributed separately in the first quadrant and on the non-negative axes of the coordinate plane. The number of points on the elliptic curve \( E \), including \( O \), is called the order of \( E \), denoted as \( \#E(GF(2^m)) \). The order of a single point \( P \) on curve \( E \) is defined as the smallest positive integer \( n \) such that \( nP = O \); every point on the curve has an order, and this order divides the order of the curve \( \#E(GF(2^m)) \). Commutativity and associativity still hold for point operations over binary fields.

![Figure 2.2: Elliptic curve over binary field \( GF(2^m) \)](image)

### 2.3.2 Finite Field Inversion Using Normal Basis

Finite field inversion operation is one of the basic operations of ECC computation. Assume \( \alpha \) belongs to the finite field \( GF(2^m) \) and \( \alpha \) is represented using normal basis:

\[
\alpha = a_{m-1}\theta^{2^{m-1}} + a_{m-2}\theta^{2^{m-2}} + \cdots + a_1\theta^{2} + a_0\theta
\]  

(2.10)

Every nonzero \( \alpha \in GF(2^m) \) has an order, denoted as \( ord(\alpha) \), and according to the definition of the order, we have:

\[
\alpha^{ord(\alpha)} = 1
\]  

(2.11)

Also, \( ord(\alpha) \) divides \( 2^m - 1 \). If we assume that \( n \times ord(\alpha) = 2^m - 1 \), then by raising both sides of equation (2.11) to the power \( n \), we have:

\[
(\alpha^{\text{ord}(\alpha)})^n = \alpha^{2^m-1} = 1^n = 1
\]  

(2.12)

By dividing both sides of equation (2.12) by \( \alpha \), we get the expression of the inverse of \( \alpha \):

\[
\alpha^{-1} = \alpha^{2^m-2}
\]  

(2.13)

Since raising an element to a power of the form \( 2^x \) only needs a (cyclic) left shift of the coordinates in normal basis representation (see equation (2.5)), we can take advantage of this property and obtain an efficient algorithm to compute finite field inversion using normal basis representation. In Chapter 5, an efficient computation and implementation of finite field inversion in \( GF(2^{163}) \) is provided based on equation (2.13), using normal basis.
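Equation (2.13) can be checked numerically on a small field. The sketch below is illustrative only: it uses polynomial basis (integers as bit vectors) rather than normal basis, a plain square-and-multiply loop rather than the efficient algorithm of Chapter 5, and the toy field \( GF(2^4) \) with \( f(x) = x^4 + x + 1 \); the function names are assumptions. It relies on \( 2^m - 2 = 2 + 4 + \cdots + 2^{m-1} \).

```python
def gf2m_mul(a, b, f, m):
    """Multiply two GF(2^m) elements (ints as bit vectors) modulo f(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:      # degree reached m: subtract (XOR) f(x)
            a ^= f
    return result


def gf2m_inverse(alpha, f, m):
    """Compute alpha^(2^m - 2) as the product of alpha^(2^i), i = 1..m-1."""
    if alpha == 0:
        raise ZeroDivisionError("0 has no inverse")
    result = 1
    power = alpha
    for _ in range(m - 1):
        power = gf2m_mul(power, power, f, m)    # alpha^(2^i) by one squaring
        result = gf2m_mul(result, power, f, m)
    return result


F, M = 0b10011, 4   # f(x) = x^4 + x + 1
inverses = {a: gf2m_inverse(a, F, M) for a in range(1, 1 << M)}
print(all(gf2m_mul(a, inv, F, M) == 1 for a, inv in inverses.items()))  # True
```

In normal basis each squaring in the loop would reduce to a shift, which is where the efficiency mentioned above comes from.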

### 2.3.3 Elliptic Curve Cryptosystem

Elliptic curve cryptosystem (ECC) is a public-key cryptosystem that has a shorter key size than RSA at the same security level. Suppose a base point \( G \) on elliptic curve \( E \) has order \( n \); then we can define the key pair as follows: the private key \( k \) is a positive integer smaller than \( n \); the corresponding public key \( K \) is a point on the curve \( E \), where \( K = kG \) and \( K \) is computed by point scalar multiplication. The encryption/decryption operations can be described as follows:

(1) Alice (known as User A) selects an elliptic curve \( E \), a base point \( G \) on \( E \), and she determines the key pair of private key \( k \) and public key \( K \). She then sends curve \( E \), base point \( G \) together with the public key \( K \) to Bob (known as User B) for private communication;

(2) Suppose Bob has a message (known as the plaintext) that he wants to send to Alice privately. First, he maps or encodes the text to a point on \( E \) and denotes this point as \( M \); different mapping methods can be found in [11], [23], [26] and [31]. Second, Bob chooses a random number \( r < n \) and encrypts \( M \) with public key \( K \) and base point \( G \), see equation (2.14). The two computed points \( (C_1, C_2) \) are known as the corresponding ciphertext of \( M \). Third, Bob sends the ciphertext \( (C_1, C_2) \) to Alice. This process is described as encryption;

(3) Alice receives the ciphertext \( (C_1, C_2) \) sent from Bob and decrypts it with the private key \( k \), see equation (2.15), and finally she can read the secret message Bob sent. This process is described as decryption.

Fig 2.3 shows the process of ECC encryption and decryption.

\[
C_1 = M + rK, \quad C_2 = rG
\]  
(2.14)

\[
C_1 - kC_2 = M + rK - k(rG) = M + rK - r(kG) = M
\]  
(2.15)
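The encryption and decryption steps can be exercised end to end on a toy curve. The following is a sketch under stated assumptions, not a usable cryptosystem: the field \( GF(2^4) \) with \( f(x) = x^4 + x + 1 \), the curve coefficients \( a = b = 1 \), the key values and all function names are chosen only for illustration, and the point addition formula is the standard one for curves of the form \( y^2 + xy = x^3 + ax^2 + b \), which is assumed to agree with equation (2.8).

```python
M_BITS, F = 4, 0b10011        # GF(2^4), f(x) = x^4 + x + 1
A_COEF, B_COEF = 1, 1         # hypothetical curve coefficients a and b
O = None                      # point at infinity


def mul(a, b):
    """Multiply in GF(2^4): shift-and-add with reduction modulo f(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M_BITS:
            a ^= F
    return r


def inv(a):
    """Inverse by exhaustive search (acceptable for a 16-element field)."""
    return next(c for c in range(1, 1 << M_BITS) if mul(a, c) == 1)


def on_curve(p):
    if p is O:
        return True
    x, y = p
    x2 = mul(x, x)
    return mul(y, y) ^ mul(x, y) == mul(x2, x) ^ mul(A_COEF, x2) ^ B_COEF


def neg(p):
    return O if p is O else (p[0], p[0] ^ p[1])


def add(p, q):
    if p is O:
        return q
    if q is O:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        if y2 == x1 ^ y1:              # q = -p (also covers doubling when x1 = 0)
            return O
        lam = x1 ^ mul(y1, inv(x1))    # point doubling, equation (2.9)
        x3 = mul(lam, lam) ^ lam ^ A_COEF
        y3 = mul(x1, x1) ^ mul(lam, x3) ^ x3
    else:                              # point addition (assumed form of (2.8))
        lam = mul(y1 ^ y2, inv(x1 ^ x2))
        x3 = mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A_COEF
        y3 = mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)


def scalar(k, p):
    """Double-and-add point scalar multiplication."""
    result = O
    for bit in bin(k)[2:]:
        result = add(result, result)
        if bit == "1":
            result = add(result, p)
    return result


def order(p):
    n, q = 1, p
    while q is not O:
        q, n = add(q, p), n + 1
    return n


points = [(x, y) for x in range(16) for y in range(16) if on_curve((x, y))]
G = max(points, key=order)    # base point: an affine point of largest order
n = order(G)


def encrypt(message_point, public_key, r):
    """Equation (2.14): C1 = M + rK, C2 = rG."""
    return add(message_point, scalar(r, public_key)), scalar(r, G)


def decrypt(c1, c2, k):
    """Equation (2.15): M = C1 - k C2."""
    return add(c1, neg(scalar(k, c2)))


k, r = 3, 5                   # illustrative private key and random number
K = scalar(k, G)
M = points[0]
C1, C2 = encrypt(M, K, r)
print(decrypt(C1, C2, k) == M)
```

In a real system the message-to-point mapping, the curve and the field size would follow the cited references and standards; here any affine point stands in for \( M \).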

From the above brief introduction of the ECC encryption/decryption operations, we can see that the key issue in breaking this cryptosystem is to recover the value of the private key $k$ from the equation $K = kG$. Computing $K$ from $k$ and $G$ is simple, since it only requires a sequence of point addition and point doubling operations; however, computing $k$ from $K$ and $G$ is extremely hard when the size of the selected binary field is large. For this reason, the security of this type of cryptosystem relies on the difficulty of the elliptic curve discrete logarithm problem.

The computation of ECC contains four levels, see Fig 2.4. The top level is the ECC itself. The major operation involved in ECC is the point scalar multiplication, and it is the second level of ECC computation. A point scalar multiplication can be efficiently calculated by a set of point doubling and point addition operations; for example, $9P = 2(2(2P)) + P$ can be decomposed into three point doubling operations plus one point addition operation. The computation of a point scalar multiplication is similar to the square-and-multiply algorithm used to calculate an exponentiation. Thus, the point operations are the third level of ECC computation. The coordinate computation of the point operations, see equations (2.8) and (2.9), indicates that the basic operations of ECC computation are finite field multiplication and finite field inversion.
Chapter 3

A Review of Existing Work

The existing Montgomery multipliers can be grouped into two types in terms of their architectures: the general style, which includes three sub-types, bit-serial [12], [22], [25], [28], bit-parallel [12], [15], [25] and digit-level [12], [28]; and the systolic style [18], [20], [16], [27]. Bit-serial multipliers load one operand bit-by-bit and the other operand in parallel. They usually have the lowest gate complexity but require the longest time to process one operation. In contrast, bit-parallel multipliers reach the fastest processing speed by loading and calculating both operands in parallel, but have the highest gate count. Digit-level multipliers combine the advantages of both and seek a balance between area and speed by processing a constant number of bits of one operand each clock cycle and the other operand in parallel. Systolic style multipliers consist of matrix-like rows of data processing units (cells) known as a systolic array. These units are similar to central processing units, and each unit shares information with its neighbors. Systolic style architectures are well suited to VLSI design due to the scalability, short inter-connection and highly repetitive nature of the units. Our work mainly focuses on the architecture of digit-serial Montgomery multipliers, so we first review some digit-serial polynomial multipliers and digit-serial Montgomery multipliers.

Montgomery multiplication was first applied to binary field multiplication in 1998 by Koc [12], who reported the general algorithms of bit-serial (Table 3.1), bit-parallel (Table 2.1) and digit-serial (Table 3.2) Montgomery multiplications. Koc [12] first showed that by selecting the Montgomery factor $R(x) = x^m$, the multiplication can be efficiently implemented in both bit-serial and digit-serial architectures. Besides, he showed that using the digit-level Montgomery method for finite field multiplication could offer a much faster processing speed compared with the standard digit-level multiplication.

Table 3.1: Algorithm of Bit-Serial Montgomery Multiplication

<table>
<thead>
<tr>
<th>Algorithm</th>
<th>Bit-Serial Montgomery Multiplication</th>
</tr>
</thead>
<tbody>
<tr>
<td>Inputs:</td>
<td>$A(x), B(x) \in GF(2^m), f(x)$</td>
</tr>
<tr>
<td>Outputs:</td>
<td>$C(x) = A(x) \times B(x) \times x^{-m} \mod f(x)$</td>
</tr>
<tr>
<td>Step 1:</td>
<td>$C(x) = 0$</td>
</tr>
<tr>
<td></td>
<td>for $i = 0$ to $m - 1$ do</td>
</tr>
<tr>
<td>Step 2:</td>
<td>$C(x) = C(x) + a_i B(x)$</td>
</tr>
<tr>
<td>Step 3:</td>
<td>$C(x) = C(x) + c_0 f(x)$</td>
</tr>
<tr>
<td>Step 4:</td>
<td>$C(x) = C(x) / x$</td>
</tr>
</tbody>
</table>
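The steps of Table 3.1 translate almost line by line into software. The following is a minimal sketch, assuming polynomials are stored as Python integers (bit $i$ holds the coefficient of $x^i$) and that $f(x)$ has constant term 1; the helper `gf2m_mul` and the toy field $GF(2^4)$ with $f(x) = x^4 + x + 1$ are only used to check the result and are not part of the algorithm.

```python
def montgomery_bit_serial(a, b, f, m):
    """Return A(x) * B(x) * x^(-m) mod f(x), following Table 3.1."""
    c = 0
    for i in range(m):
        if (a >> i) & 1:     # Step 2: C = C + a_i * B
            c ^= b
        if c & 1:            # Step 3: C = C + c_0 * f, clears the constant term
            c ^= f
        c >>= 1              # Step 4: C = C / x (exact, since c_0 is now 0)
    return c


def gf2m_mul(a, b, f, m):
    """Reference shift-and-add multiplication modulo f(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return result


F, M = 0b10011, 4            # f(x) = x^4 + x + 1
R = F ^ (1 << M)             # x^m mod f(x)

# Multiplying the Montgomery product by x^m must give back A * B mod f.
a, b = 0b1011, 0b0110
c = montgomery_bit_serial(a, b, F, M)
print(gf2m_mul(c, R, F, M) == gf2m_mul(a, b, F, M))  # True
```

Each loop iteration corresponds to one clock cycle of a bit-serial multiplier, so $m$ cycles are needed.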

Table 3.2: Algorithm of Digit-Serial Montgomery Multiplication, where $d$ is the digit size, $f_0'(x)f_0(x) = 1 \mod x^d$, $C_0(x)$ and $f_0(x)$ are the least significant digits of $C(x)$ and $f(x)$, respectively

<table>
<thead>
<tr>
<th>Algorithm</th>
<th>Digit-Serial Montgomery Multiplication</th>
</tr>
</thead>
<tbody>
<tr>
<td>Inputs:</td>
<td>$A(x), B(x) \in GF(2^m), f(x), f_0'(x)$</td>
</tr>
<tr>
<td>Outputs:</td>
<td>$C(x) = A(x) \times B(x) \times x^{-m} \mod f(x)$</td>
</tr>
<tr>
<td>Step 1:</td>
<td>$C(x) = 0$</td>
</tr>
<tr>
<td></td>
<td>for $i = 0$ to $s - 1$ do</td>
</tr>
<tr>
<td>Step 2:</td>
<td>$C(x) = C(x) + A_i(x) B(x)$</td>
</tr>
<tr>
<td>Step 3:</td>
<td>$M(x) = C_0(x)f_0'(x) \mod x^d$</td>
</tr>
<tr>
<td>Step 4:</td>
<td>$C(x) = C(x) + M(x)f(x)$</td>
</tr>
<tr>
<td>Step 5:</td>
<td>$C(x) = C(x) / x^d$</td>
</tr>
</tbody>
</table>
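A software sketch of the digit-serial algorithm in Table 3.2 is given below. It is illustrative only: polynomials are stored as Python integers, the helper names are assumptions, and $f_0'(x)$ is computed inside the function rather than passed as an input. Note that the loop divides by $x^d$ exactly $s = \lceil m/d \rceil$ times, so the sketch returns $A(x)B(x)x^{-sd} \mod f(x)$, which coincides with the $x^{-m}$ of the table when $d$ divides $m$.

```python
def clmul(a, b):
    """Carry-less (GF(2)[x]) multiplication of two polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def poly_mod(p, f, m):
    """Reduce p(x) modulo f(x), where deg f = m."""
    while p.bit_length() > m:
        p ^= f << (p.bit_length() - 1 - m)
    return p


def inverse_mod_xd(f0, d):
    """Find f0' with f0 * f0' = 1 mod x^d (requires constant term 1)."""
    inv = 1
    for i in range(1, d):
        if (clmul(f0, inv) >> i) & 1:   # fix coefficient i of the product
            inv |= 1 << i
    return inv


def montgomery_digit_serial(a, b, f, m, d):
    s = -(-m // d)                      # s = ceil(m / d)
    mask = (1 << d) - 1
    f0_inv = inverse_mod_xd(f & mask, d)
    c = 0
    for i in range(s):
        a_i = (a >> (i * d)) & mask     # i-th digit of A(x)
        c ^= clmul(a_i, b)              # C = C + A_i * B
        mu = clmul(c & mask, f0_inv) & mask   # M = C_0 * f0' mod x^d
        c ^= clmul(mu, f)               # C = C + M * f, clears the low d bits
        c >>= d                         # C = C / x^d
    return c


F, M, D = 0b100101, 5, 2                # f(x) = x^5 + x^2 + 1, digit size 2
S = -(-M // D)
R = poly_mod(1 << (S * D), F, M)        # x^(s d) mod f(x)
a, b = 0b10111, 0b01101
c = montgomery_digit_serial(a, b, F, M, D)
print(poly_mod(clmul(c, R), F, M) == poly_mod(clmul(a, b), F, M))  # True
```

Setting `d = 1` reduces the loop to the bit-serial case, and a larger `d` trades wider per-cycle logic for fewer iterations.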

In 1998, Song proposed two different polynomial multiplier architectures: least significant digit (LSD) first and most significant digit (MSD) first, respectively. In 2005, Tang reported a bit-parallel digit-serial multiplier in $GF(2^{233})$; the architecture of the proposed $GF(2^{233})$ multiplier is shown in Fig 3.1. Tang’s architecture contains three main modules: a multiplier module to generate the partial product $A_j \times B$, a register to store the value of $C_{30-j-1}$, and a constant multiplier to calculate the product $x^8 \times C_{30-j-1}$. The register module can be implemented by a D-flipflop array, and since Tang used an irreducible trinomial to generate $GF(2^{233})$, the constant multiplier can also be easily implemented. The most complicated module is the partial product multiplier, which computes $A_j \times B$. Fig 3.2(a) shows Tang’s design of this module. In Fig 3.2(a), we can see that Tang’s structure of the partial product multiplier includes an AND gate section that ANDs each bit of the digit $A_j$ with operand $B$, a left-shift modular section that computes the multiplication by $x^i$ modulo $f(x)$, and finally an XOR tree section that adds up all eight rows. Tang’s proposed digit-serial architecture can be treated as a landmark work, since subsequent works on digit-serial finite field multipliers are more or less optimizations or modifications of his work.

In 2006, Kumar proposed another polynomial multiplier in $GF(2^m)$ [19]. There are two major differences between Kumar’s work and Tang’s. The first is that in the partial product generator unit, after ANDing each bit of the digit $A_j$ with the other operand and left-shifting the resulting bit-string by the corresponding $i$ bits ($0 \leq i \leq d - 1$), Kumar directly added up all rows with no reduction operations; thus, the data-flow during processing has a bandwidth of $m + d - 1$ bits. As a consequence, Kumar added an extra module called the final reduction unit to perform the reduction modulo $f(x)$ when the whole computation is over. The other difference is that Tang begins the multiplication from the most-significant-digit while Kumar begins from the least-significant-digit. In that case, Kumar saves the cost of the modular reduction operation for all $d$ rows in the partial product module, but as a trade-off, one extra clock cycle is needed to complete the multiplication, another register for storing the value of $Ax^d \mod f(x)$ is required, and in addition, the bandwidth of the data-flow is enlarged by $d$ bits, see Fig 3.1(b).

Figure 3.1: (a) Tang’s architecture of $GF(2^{233})$ multiplier [17] (b) Kumar’s architecture of $GF(2^m)$ multiplier [19]

Figure 3.2: Tang’s architecture of partial product multiplier, generates the product of $A_j \times B$ [17]

In 2009, Meher [24] proposed a polynomial multiplier with a new structure of the finite field accumulator unit, which is the major difference between his work and the former works reviewed. The block diagram of Meher’s work is presented in Fig 3.3. In the finite field accumulator block, he used a T-flipflop array to implement the accumulate operation instead of the structure using XOR gates and a D-flipflop array. Besides, Meher also combined the constant multiplier and the partial product multiplier to generate $Ax^d \mod f(x)$ and $A \times B_j \mod f(x)$ in parallel in order to further reduce the number of blocks. However, this modification did not reduce the gate count or the critical path delay. In the same year, Hariri [25] showed that, besides \( R(x) = x^m \), \( R(x) = x^{m-1} \) could also be an efficient Montgomery factor in the bit-serial structure, and later it was shown in [28] that it can be applied to the digit-serial structure of Montgomery multiplications.

![Figure 3.3: Meher’s block diagram of proposed field multiplier over \( GF(2^m) \) [24]](image)

The most recent digit-serial Montgomery multiplication architecture was proposed in [28] in 2011. A Linear Feedback Shift Register (LFSR) was used as the main building block to implement the Montgomery multiplication. In this work, the cases \( R(x) = x^m \) and \( R(x) = x^{m-1} \) are discussed. As reported, the proposed multipliers can adapt to different classes of irreducible polynomials such as general polynomials, all-one polynomials, trinomials and pentanomials. By changing the value of the digit size \( d \), the reported multipliers can also work as bit-serial multipliers or bit-parallel multipliers. This high flexibility is the critical contribution of their work to the study of field multiplication. Fig 3.4 shows the architecture of [28].

In this thesis, a constraint condition is proposed to select the irreducible pentanomials for the generation of the finite field \( GF(2^m) \). A most-significant-digit first and a least-significant-digit first digit-serial Montgomery multiplication are also proposed. The architectures proposed in this work have some similarities with the works reported in [17] and [24]. However, those two architectures differ from the works proposed in this thesis in two major ways. First, the algorithms in [17] and [24] are about polynomial multiplication: they consider the product \( A(x) \times B(x) \mod f(x) \) rather than the product \( A(x) \times B(x) \times x^{-m} \mod f(x) \) (or \( A(x) \times B(x) \times x^{-(m-1)} \mod f(x) \)). Second, we propose novel fixed elements \( R(x) \) which are different from \( x^m \) and \( x^{m-1} \). By applying the proposed constraint condition, the critical path delay of the architectures is found to be as good as or better than the existing works for the fields generated by the irreducible pentanomials.
Chapter 4

Proposed Digit-serial Montgomery Multipliers

In this chapter, the detailed algorithms and architectures of the proposed digit-serial most-significant-digit first and least-significant-digit first Montgomery multipliers are introduced. The finite field is generated by irreducible pentanomials. The parameter selection of the irreducible pentanomials is discussed, and a general condition to further reduce the time delay of the multiplier is proposed. The gate count and time delay of the multiplier are analyzed when \( R(x) = x^u \), where the value of \( u \) is different from \( m \) or \( m - 1 \). Further discussions are included. After this, comparisons with other types of digit-serial multipliers are provided. Finally, the FPGA implementation of the proposed digit-serial Montgomery multipliers is given, as well as its simulation and compilation results.

4.1 Proposed Digit-Serial MSD First Montgomery Multiplier

In this section, a digit-serial MSD first Montgomery multiplier is proposed. Two different architectures of the proposed multiplier are presented. In the first architecture, the multiplication and reduction operations are processed in separate units; in the second architecture, the multiplication and reduction operations are combined and implemented in one circuit block, and its performance is shown to be more efficient than that of the first architecture.

### 4.1.1 Algorithm

Consider the field elements $A$, $B$, $R$, and their product $C$ over $GF(2^m)$. Using the polynomial representation, we have:

$$
A(x) = \sum_{i=0}^{m-1} a_i x^i, \quad B(x) = \sum_{i=0}^{m-1} b_i x^i, \quad R(x) = x^l, \quad C(x) = \sum_{i=0}^{m-1} c_i x^i. \tag{4.1}
$$

Here we use irreducible pentanomial $f(x)$ to generate $GF(2^m)$, and $f(x)$ is represented as:

$$
f(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1 \tag{4.2}
$$

The idea of a digit-level multiplier is to process a constant number $d$ of bits of $B(x)$ at each clock cycle, rather than one bit per clock cycle or all bits in parallel at the same time; in practice $d$ is usually a power of two (2, 4, 8, etc.). We therefore divide $B(x)$ into blocks of equal length $d$, such that $B(x)$ has $s$ blocks, $s = \lceil m/d \rceil$. Thus the digit-level polynomial representation of $B(x)$ can be written as:

$$
B(x) = \sum_{i=0}^{s-1} B_i(x) x^{id} = \sum_{i=0}^{s-1} \sum_{j=0}^{d-1} b_{id+j} x^{id+j} \tag{4.3}
$$

Note that, since $m$ is not always divisible by $d$, terms generated by equation (4.3) with degree larger than $m-1$ or smaller than 0 should be set to 0. For example, when $m = 233$, $d = 8$ and $s = 30$, for $i = 29$ we have $B_{29}(x)x^{232} = (b_{232} + b_{233}x^1 + \cdots + b_{239}x^7)x^{232} = b_{232}x^{232}$.
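The digit decomposition of equation (4.3) can be sketched in a few lines. This is an illustration under the assumption that $B(x)$ is stored as a Python integer whose bit $i$ is $b_i$; the function names are hypothetical.

```python
def split_into_digits(b, m, d):
    """Return [B_0, B_1, ..., B_{s-1}], the d-bit digits of B(x), lowest first."""
    s = -(-m // d)               # s = ceil(m / d)
    b &= (1 << m) - 1            # coefficients of degree > m - 1 are treated as 0
    mask = (1 << d) - 1
    return [(b >> (i * d)) & mask for i in range(s)]


def join_digits(digits, d):
    """Rebuild B(x) = sum_i B_i(x) x^(i d) from its digits."""
    b = 0
    for i, digit in enumerate(digits):
        b |= digit << (i * d)
    return b


m, d = 233, 8
b = (1 << 232) | 0b1011          # arbitrary example: x^232 + x^3 + x + 1
digits = split_into_digits(b, m, d)
print(len(digits), digits[29], digits[0])   # 30 1 11
```

With $m = 233$ and $d = 8$ the top digit $B_{29}(x)$ carries only the single coefficient $b_{232}$, as in the example above.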

Using the digit-level representation of $B(x)$, we can write $C(x)$ as:

$$
C(x) = A(x) \times \sum_{i=0}^{s-1} B_i(x) x^{id} \times R^{-1}(x) \mod f(x) \tag{4.4}
$$

Let the integer $l$ satisfy $0 \leq l \leq d - 1$, where $d$ is the digit size, and let $R(x) = x^l$. Then we can use equation (4.5) to compute $C(x)$:

$$
C(x) = ((\cdots((A(x)B_{s-1}(x)x^{-l})x^d + A(x)B_{s-2}(x)x^{-l})x^d + \cdots + A(x)B_1(x)x^{-l})x^d + A(x)B_0(x)x^{-l}) \mod f(x)
$$

(4.5)

Thus an algorithm of MSD-first Montgomery multiplication can be presented by Table 4.1.

**Table 4.1: Digit-serial MSD-first Montgomery Multiplier ($R(x) = x^l$), where $0 \leq l \leq d - 1$**

<table>
<thead>
<tr>
<th>Algorithm I</th>
<th>Digit-serial MSD-first Montgomery Multiplier</th>
</tr>
</thead>
<tbody>
<tr>
<td>Inputs:</td>
<td>$A(x), B_0(x), B_1(x), \ldots, B_{s-1}(x), f(x)$</td>
</tr>
<tr>
<td>Outputs:</td>
<td>$C(x) = A(x)B(x)x^{-l} \mod f(x), 0 \leq l \leq d - 1$</td>
</tr>
<tr>
<td>Step 1:</td>
<td>$C^{(0)}(x) = 0$</td>
</tr>
<tr>
<td></td>
<td>For $i = 0$ to $s - 1$</td>
</tr>
<tr>
<td>Step 2:</td>
<td>$T(x) = A(x)B_{s-1-i}(x)x^{-l} \mod f(x)$</td>
</tr>
<tr>
<td>Step 3:</td>
<td>$C^{(i+1)}(x) = C^{(i)}(x)x^d + T(x) \mod f(x)$</td>
</tr>
<tr>
<td>Step 4:</td>
<td>$C(x) = C^{(s)}(x)$</td>
</tr>
</tbody>
</table>

Step 1 is the initialization step: register $C$ is set to zero, $C^{(0)}(x) = 0$. In Step 2, the product of $A(x)$, $B_{s-1-i}(x)$ and $x^{-l}$ is computed, and the reduction operation is also processed in the same step. Then, in Step 3, the value generated in Step 2 is added to $C^{(i)}(x)x^d \mod f(x)$ and the result is stored back to the register. When $i = s - 1$, $C(x)$ is obtained and the multiplier provides the final result. Steps 2 and 3 are processed in the same cycle; also note that the calculation of Step 2 and of $C^{(i)}(x)x^d$ in Step 3 can be done in parallel.
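As a behavioural check of Algorithm I, the loop can be modelled in software. This is only a sketch: polynomials are Python integers, the reductions use generic long division instead of the dedicated circuits discussed in the following subsections, and the helper names and the toy field $GF(2^5)$ with $f(x) = x^5 + x^2 + 1$ (a trinomial, used here only because it is small) are assumptions.

```python
def clmul(a, b):
    """Carry-less (GF(2)[x]) multiplication."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def poly_mod(p, f, m):
    """Reduce the high-degree side of p(x) modulo f(x), deg f = m."""
    while p.bit_length() > m:
        p ^= f << (p.bit_length() - 1 - m)
    return p


def times_x_inv_pow(p, l, f):
    """Multiply by x^(-l) modulo f(x); requires f to have constant term 1."""
    for _ in range(l):
        if p & 1:
            p ^= f           # make the constant term zero
        p >>= 1              # exact division by x
    return p


def msd_first_montgomery(a, b, f, m, d, l):
    """Algorithm I: returns A(x) B(x) x^(-l) mod f(x), with 0 <= l <= d - 1."""
    s = -(-m // d)
    mask = (1 << d) - 1
    c = 0                                                   # Step 1
    for i in range(s):
        digit = (b >> ((s - 1 - i) * d)) & mask             # B_{s-1-i}(x)
        t = poly_mod(times_x_inv_pow(clmul(a, digit), l, f), f, m)   # Step 2
        c = poly_mod((c << d) ^ t, f, m)                    # Step 3
    return c                                                # Step 4


F, M, D = 0b100101, 5, 2     # f(x) = x^5 + x^2 + 1, digit size d = 2
a, b, l = 0b11010, 0b10011, 1
c = msd_first_montgomery(a, b, F, M, D, l)
# Multiplying the result by R(x) = x^l must give back A * B mod f.
print(poly_mod(clmul(c, 1 << l), F, M) == poly_mod(clmul(a, b), F, M))  # True
```

Each iteration of the loop corresponds to one clock cycle, so the multiplication takes $s$ cycles.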

### 4.1.2 General Architecture

Fig 4.1 presents the block diagram of the proposed multiplier. The Multiply Core unit implements Step 2; an XOR array is also included in the Multiply Core to implement the addition operation in Step 3. The Modular-Shift unit corresponds to $C^{(i)}(x)x^d \mod f(x)$ in Step 3. The final result is provided in the register unit, REG C. By computing Step 2 in different orders, different architectures of the proposed multiplier can be obtained. This subsection presents a general architecture.

In Step 2, suppose the product of $A(x)$ and $B_{s-i-1}(x)$ is computed first, and the result is then multiplied by $x^{-l}$, followed by the reduction operation $\mod f(x)$. The degree range of $A(x)B_{s-i-1}(x)x^{-l}$ is from $-l$ to $m + d - l - 2$, and the reduction operation needs to reduce the product from \([-l, m + d - l - 2]\) to \([0, m - 1]\). It is clearly a two-side reduction operation: both sides of the polynomial \(A(x)B_{s-i-1}(x)x^{-l}\) are beyond the bandwidth of \(GF(2^m)\). To further analyze the computation of Step 2, we let \(A(x)B_{s-i-1}(x)x^{-l} = T_H(x) + T_M(x) + T_L(x)\), where the degree ranges of \(T_H(x), T_M(x), T_L(x)\) are \([m, m + d - 2 - l], [0, m - 1], [-l, -1]\), respectively. The reduction operation can be calculated with the following equations:

Terms in \(T_H(x)\):

\[
\begin{align*}
x^{m+d-2-l} \mod f(x) &= x^{k_3+d-2-l} + x^{k_2+d-2-l} + x^{k_1+d-2-l} + x^{d-2-l} \\
& \vdots \\
x^{m+1} \mod f(x) &= x^{k_3+1} + x^{k_2+1} + x^{k_1+1} + x \\
x^{m} \mod f(x) &= x^{k_3} + x^{k_2} + x^{k_1} + 1
\end{align*}
\]

Terms in \(T_L(x)\):

\[
\begin{align*}
x^{-1} \mod f(x) &= x^{m-1} + x^{k_3-1} + x^{k_2-1} + x^{k_1-1} \\
x^{-2} \mod f(x) &= x^{m-2} + x^{k_3-2} + x^{k_2-2} + x^{k_1-2} \\
& \vdots \\
x^{-l} \mod f(x) &= x^{m-l} + x^{k_3-l} + x^{k_2-l} + x^{k_1-l}
\end{align*}
\]

(4.6)

According to equation (4.6), the $T_H(x)$ reduction generates four extra bit-strings, with degree ranges $[k_3, k_3 + d - 2 - l], [k_2, k_2 + d - 2 - l], [k_1, k_1 + d - 2 - l], [0, d - 2 - l]$. Similarly, another four bit-strings are generated by the $T_L(x)$ reduction operation: $[m - l, m - 1], [k_3 - l, k_3 - 1], [k_2 - l, k_2 - 1], [k_1 - l, k_1 - 1]$. In particular, bit-strings $[k_3, k_3 + d - 2 - l]$ and $[k_3 - l, k_3 - 1]$ can be combined into one bit-string with range $[k_3 - l, k_3 + d - 2 - l]$; in this way, all eight bit-strings can be transformed into five bit-strings with degree ranges $[m - l, m - 1], [k_3 - l, k_3 + d - 2 - l], [k_2 - l, k_2 + d - 2 - l], [k_1 - l, k_1 + d - 2 - l], [0, d - 2 - l]$, respectively. In order to avoid further reduction operations, the terms in equation (4.6) must satisfy the following conditions:

$$\begin{align*}
k_3 + d - 2 - l &\leq m - 1 \\
k_1 - l &\geq 0
\end{align*}$$

(4.7)
After simplifying equation (4.7):

\[
\begin{align*}
    k_3 & \leq m + 1 + l - d \\
    k_1 & \geq l
\end{align*}
\] (4.8)

From equation (4.6), we notice that five bit-strings need to be added to $T_M(x)$, and the gate usage is a constant number equal to $4(d - 1)$. This fact indicates that the computation of $A(x)B_{s-1-i}(x)x^{-l} \mod f(x)$ does not require extra gates when compared with the computation of $A(x)B_{s-1-i}(x) \mod f(x)$. However, the time delay varies with the values of $k_1, k_2, k_3$. To have the minimum time delay $T_X$, the five bit-strings should have no overlapping parts, so the following conditions must be satisfied:

\[
\begin{align*}
    m - l & > k_3 + d - 2 - l \\
    k_3 - l & > k_2 + d - 2 - l \\
    k_2 - l & > k_1 + d - 2 - l \\
    k_1 - l & > d - 2 - l
\end{align*}
\] (4.9)

To sum up, $k_i \ (i = 0, 1, 2, 3)$ must satisfy:

\[
k_{i+1} - k_i \geq d - 1
\] (4.10)

Here $k_0 = 0$ and $k_4 = m$; this condition is denoted as Constraint Condition 1. Besides, from equation (4.10), $k_1 \geq d - 1 \geq l$ and $m - k_3 \geq d - 1 \geq d - 1 - l$, which implies that if equation (4.10) is applied when selecting the irreducible pentanomials of $GF(2^m)$, equation (4.8) will also be satisfied. The general architecture of the proposed multiplier is presented in Fig 4.2.
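Constraint Condition 1 only involves the exponents of the pentanomial, so it is easy to test in software. The sketch below is illustrative: the function names are assumptions, the first exponent set is the well-known NIST reduction pentanomial $x^{163} + x^7 + x^6 + x^3 + 1$, and the second exponent set is purely hypothetical and has not been checked for irreducibility.

```python
def exponent_gaps(m, k3, k2, k1):
    """Return [k1 - k0, k2 - k1, k3 - k2, k4 - k3] with k0 = 0 and k4 = m."""
    ks = [0, k1, k2, k3, m]
    return [ks[i + 1] - ks[i] for i in range(4)]


def satisfies_condition_1(m, k3, k2, k1, d):
    """Constraint Condition 1: k_{i+1} - k_i >= d - 1 for i = 0, 1, 2, 3."""
    return all(gap >= d - 1 for gap in exponent_gaps(m, k3, k2, k1))


NIST_163 = (163, 7, 6, 3)          # x^163 + x^7 + x^6 + x^3 + 1
HYPOTHETICAL = (233, 201, 105, 9)  # exponents only; irreducibility NOT checked

print(exponent_gaps(*NIST_163))                    # [3, 3, 1, 156]
print(satisfies_condition_1(*NIST_163, d=2))       # True
print(satisfies_condition_1(*NIST_163, d=8))       # False
print(satisfies_condition_1(*HYPOTHETICAL, d=8))   # True
```

The example suggests why the condition acts as a selection rule: closely spaced middle exponents rule out larger digit sizes.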

In the Multiply Core unit, the implementation of $A(x)B_{s-1-i}(x)$ is simple: operand $A(x)$ is multiplied by each bit of $B_{s-1-i}(x)$, and the terms of the same degree are added up. This block costs $md$ AND gates in total for the multiplication operation, and $(m - 1)(d - 1)$ XOR gates for the field addition operations. The critical path delay of this unit is $\log_2 d\,T_X + T_A$. The reduction operation costs $4(d - 1)$ XOR gates and, if Constraint Condition 1 is applied, its time delay is $T_X$. Also, the XOR array needs $m$ XOR gates and its time delay is $T_X$. Here $T_X$ and $T_A$ denote the delays of a two-input XOR gate and a two-input AND gate, respectively. Thus, the Multiply Core unit costs \( md \) AND gates and \((md + 3d - 3)\) XOR gates, and its critical path delay is \( T_A + (2 + \log_2 d)T_X \).
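The totals for the Multiply Core follow directly from summing the three contributions listed above; the grouping labels below are added only for illustration:

```latex
\begin{align*}
\#\mathrm{XOR} &= \underbrace{(m-1)(d-1)}_{\text{partial products}}
                + \underbrace{4(d-1)}_{\text{reduction}}
                + \underbrace{m}_{\text{XOR array}}
                = md + 3d - 3, \\
T &= \underbrace{T_A + \log_2 d\, T_X}_{\text{multiplication}}
   + \underbrace{T_X}_{\text{reduction}}
   + \underbrace{T_X}_{\text{XOR array}}
   = T_A + (2 + \log_2 d)\, T_X .
\end{align*}
```

As a worked instance, taking $m = 233$ and $d = 8$ gives $md = 1864$ AND gates, $1624 + 28 + 233 = 1885$ XOR gates, and a delay of $T_A + 5T_X$.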

REG C unit updates the value of \( C^{(i)}(x) \) every clock cycle. This unit is implemented by a D-flipflop array, with \( m \) D-flipflops connected in parallel.

The Modular-Shift unit computes the modular multiplication \( C^{(i)}(x)x^d \mod f(x) \). If \( C^{(i)}_d(x) = \sum_{j=0}^{d-1} c^{(i)}_{m-d+j}x^j \) represents the most significant \( d \) bits of \( C^{(i)}(x) \), equation (4.11) can be used to present the computation of \( C^{(i)}(x)x^d \mod f(x) \).

\[
C^{(i)}(x)x^d \mod f(x) = C^{(i)}_d(x)(x^{k_3} + x^{k_2} + x^{k_1} + 1) + \sum_{j=d}^{m-1} c^{(i)}_{j-d}x^j \quad (4.11)
\]

To add up the five bit-strings, \( 3d \) XOR gates are needed in total; see Fig 4.3 for the implementation of equation (4.11). Consider the condition obtained from equation (4.10). When \( k_{i+1} - k_i \geq d \), the four bit-strings \( C^{(i)}_d(x)x^{k_3}, C^{(i)}_d(x)x^{k_2}, C^{(i)}_d(x)x^{k_1} \) and \( C^{(i)}_d(x) \) share no terms of the same degree, thus the time delay of this circuit is \( T_X \). For example, if we let \( k_2 = k_3 - d \), the degree range of \( C^{(i)}_d(x)x^{k_3} \) is \([k_3, k_3 + d - 1]\) while the degree range of \( C^{(i)}_d(x)x^{k_2} \) is \([k_2, k_2 + d - 1]\), which equals \([k_2, k_3 - 1]\). When \( k_{i+1} - k_i = d - 1 \), the degree ranges of \( C^{(i)}_d(x)x^{k_3}, C^{(i)}_d(x)x^{k_2}, C^{(i)}_d(x)x^{k_1} \) and \( C^{(i)}_d(x) \) are \([k_3, m]\), \([k_2, k_3]\), \([k_1, k_2]\) and \([0, k_1]\) respectively. Each adjacent pair shares one term of the same degree, so at most 2 of these four bit-strings overlap at any position. In addition, since the range of \( C^{(i)}_d(x)x^{k_3} \) is \([k_3, m]\), the term with degree \( m \) needs another reduction operation, which generates three more bits for XORing. As a consequence, at most 4 terms are XORed at any position in this unit, the time delay of this unit is at most \( \log_2 4\,T_X = 2T_X \), and the gate count is \( \leq 3d + 3 \).

![Figure 4.3: Implementation of equation (4.11)](image-url)
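Equation (4.11) can be cross-checked against a generic polynomial reduction. The sketch below is illustrative only: it covers the case \( m - k_3 \geq d \), where no second reduction is needed, the function names are assumptions, and the exponents \( (m, k_3, k_2, k_1) = (13, 9, 5, 2) \) are a hypothetical choice that has not been checked for irreducibility (the reduction identity itself holds for any \( f(x) \) of this shape).

```python
def modular_shift(c, m, d, k3, k2, k1):
    """Compute C(x) x^d mod f(x) via equation (4.11); assumes m - k3 >= d."""
    high = c >> (m - d)                    # C_d(x): the top d coefficients of C(x)
    low = (c << d) & ((1 << m) - 1)        # remaining coefficients shifted up by d
    return low ^ high ^ (high << k1) ^ (high << k2) ^ (high << k3)


def poly_mod(p, f, m):
    """Generic reduction of p(x) modulo f(x), deg f = m (reference only)."""
    while p.bit_length() > m:
        p ^= f << (p.bit_length() - 1 - m)
    return p


m, d = 13, 4
k3, k2, k1 = 9, 5, 2                       # hypothetical exponents
f = (1 << m) | (1 << k3) | (1 << k2) | (1 << k1) | 1

c = 0b1011001110101
print(modular_shift(c, m, d, k3, k2, k1) == poly_mod(c << d, f, m))  # True
```

When \( m - k_3 = d - 1 \), the degree-\( m \) term produced by \( C^{(i)}_d(x)x^{k_3} \) would need one more reduction step, which this sketch does not model.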
The critical path of this architecture is: Multiply Core → REG C. The complexity of each block is presented in Table 4.2, and the complexity of the proposed multiplier is presented in Table 4.3. In the tables, $T_{DFF}$ represents the delay of a D-flipflop.

### Table 4.2: Complexity of each block of the proposed MSD-first Montgomery multiplier

<table>
<thead>
<tr>
<th>Block</th>
<th>Gate Count</th>
<th>Time Delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>Multiply Core</td>
<td>$md$ AND, $md + 3d - 3$ XOR</td>
<td>$T_A + (2 + \log_2 d) T_X$</td>
</tr>
<tr>
<td>REG C</td>
<td>$m$ D-flipflop</td>
<td>$T_{DFF}$</td>
</tr>
<tr>
<td>Modular-Shift</td>
<td>$\leq 3d + 3$ XOR</td>
<td>$\leq 2T_X$</td>
</tr>
</tbody>
</table>

### Table 4.3: Complexity of proposed digit-serial MSD-first Montgomery multiplication (Algorithm I, general architecture, when $k_{i+1} - k_i \geq d - 1, k_0 = 0, k_4 = m$ and $0 \leq l \leq d - 1$)

<table>
<thead>
<tr>
<th>Work</th>
<th>#AND</th>
<th>#XOR</th>
<th>#FF/Reg</th>
<th>#CLK</th>
<th>Critical path delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>MSD(Arch.1)</td>
<td>$md$</td>
<td>$\leq md + 6d$</td>
<td>$m$</td>
<td>$s$</td>
<td>$T_A + (2 + \log_2 d) T_X + T_{DFF}$</td>
</tr>
</tbody>
</table>

### 4.1.3 Advanced Architecture

In this subsection, another architecture of the proposed multiplier is introduced, in which the structure of the Multiply Core unit is different from the previous one.

Alternatively, $A(x)B_{s-i-1}(x)x^{-l} \mod f(x)$ can be computed in a different order, see equation (4.12): $A(x)x^{-l} \mod f(x), A(x)x^{-l+1} \mod f(x), \cdots, A(x)x^{d-l-1} \mod f(x)$ are computed first, and each term is then multiplied by the corresponding bit of $B_{s-1-i}(x)$.

$$
\begin{aligned}
A(x)B_{s-i-1}(x)x^{-l} \mod f(x)
&= A(x) \sum_{j=0}^{d-1} b_{(s-i-1)d+j} x^j x^{-l} \mod f(x) \\
&= \sum_{j=0}^{d-1} (A(x)x^{j-l} \mod f(x))\, b_{(s-i-1)d+j}
\end{aligned}
$$

(4.12)

The reduction operation is provided in equation (4.6). Based on that equation, $A(x)x \mod f(x)$ can be computed in this way:

$$A(x)x \mod f(x) = a_{m-1}(x^{k_3} + x^{k_2} + x^{k_1} + 1) + \sum_{i=1}^{m-1} a_{i-1}x^i$$ \hspace{1cm} (4.13)

Fig 4.4 is a circuit diagram which implements equation (4.13): given the input $A(x)$, it outputs $A(x)x \mod f(x)$. If we connect two such models in series, the final output is $A(x)x^2 \mod f(x)$, see Fig 4.5. In the same way, we can obtain each value of $A(x)x^{j-l} \mod f(x)$ for $j = l + 1, l + 2, \ldots, d - 1$.

Similarly, the computation of $A(x)/x \mod f(x)$ is shown in equation (4.14):

$$A(x)/x \mod f(x) = a_0(x^{m-1} + x^{k_3-1} + x^{k_2-1} + x^{k_1-1}) + \sum_{i=0}^{m-2} a_{i+1}x^i$$ \hspace{1cm} (4.14)

The implementation of equation (4.14) is shown in Fig 4.6. As a consequence, by combining multiples of the same circuit unit shown in Fig 4.6, each value of $A(x)x^{j-l} \mod f(x)$ can be obtained, where $j = 0, 1, 2, \ldots, l - 1$.
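The two building blocks and their combination in equation (4.12) can be modelled in software as follows. This is a sketch under assumptions: polynomials are Python integers, the function names are illustrative, $k_1 \geq 1$ and $k_3 < m - 1$ are assumed, and the exponents $(m, k_3, k_2, k_1) = (13, 9, 5, 2)$ are hypothetical and not checked for irreducibility.

```python
def times_x(a, m, ks):
    """Model 1, equation (4.13): A(x) x mod f(x)."""
    k3, k2, k1 = ks
    top = (a >> (m - 1)) & 1                 # a_{m-1} wraps around
    a = (a << 1) & ((1 << m) - 1)
    if top:
        a ^= (1 << k3) | (1 << k2) | (1 << k1) | 1
    return a


def times_x_inv(a, m, ks):
    """Model 2, equation (4.14): A(x) / x mod f(x)."""
    k3, k2, k1 = ks
    low = a & 1                              # a_0 is folded back into high terms
    a >>= 1
    if low:
        a ^= (1 << (m - 1)) | (1 << (k3 - 1)) | (1 << (k2 - 1)) | (1 << (k1 - 1))
    return a


def multiply_core(a, digit, m, d, l, ks):
    """Equation (4.12): sum over j of (A(x) x^(j-l) mod f(x)) * b_j."""
    shifted = [0] * d
    shifted[l] = a                           # j = l gives A(x) itself
    for j in range(l + 1, d):                # chain of Model 1 blocks
        shifted[j] = times_x(shifted[j - 1], m, ks)
    for j in range(l - 1, -1, -1):           # chain of Model 2 blocks
        shifted[j] = times_x_inv(shifted[j + 1], m, ks)
    t = 0
    for j in range(d):
        if (digit >> j) & 1:                 # AND with bit b_j, then XOR tree
            t ^= shifted[j]
    return t


m, d, ks = 13, 4, (9, 5, 2)                  # hypothetical parameters
a, digit, l = 0b1001110100011, 0b1101, 2
t = multiply_core(a, digit, m, d, l, ks)
for _ in range(l):                           # multiply back by x^l
    t = times_x(t, m, ks)
print(t == multiply_core(a, digit, m, d, 0, ks))   # True
```

The two `for` loops that fill `shifted` correspond to the two separate reduction branches: one chain handles degrees above $m - 1$ and the other handles negative degrees.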

If we apply both Model 1 shown in Fig 4.4 and Model 2 shown in Fig 4.6 to the implementation of equation (4.12), then the reduction operation is divided into two separate branches: one for the reduction of degrees larger than $m - 1$, another for the reduction of degrees smaller than 0. The architecture is presented in Fig 4.7. Note that blocks marked with \( x \) represent the circuit structure of Model 1, and blocks marked with \( x^{-1} \) represent Model 2. The architecture of the proposed multiplier is given in Fig 4.8. The XOR tree has \( d + 1 \) inputs: it adds up all \( d \) products \( A(x)x^{j-l} b_{(s-1-i)d+j} \mod f(x) \) plus the value of REG C.

Figure 4.5: Implementation of computation \( A(x)x^2 \mod f(x) \)

In order to have the least critical path delay of the Multiply Core unit, two groups of conditions must be satisfied at the same time:

Condition 1:

\[
\begin{align*}
    m - l &> k_3 - 1 \\
    k_3 - l &> k_2 - 1 \\
    k_2 - l &> k_1 - 1 \\
    k_1 - l &> -1
\end{align*}
\] (4.15)

Figure 4.6: Model 2: multiply by $x^{-1}$ structure

Condition 2:

\[
\begin{align*}
    m &> k_3 + d - 2 - l \\
    k_3 &> k_2 + d - 2 - l \\
    k_2 &> k_1 + d - 2 - l \\
    k_1 &> d - 2 - l
\end{align*}
\] (4.16)

To sum up, Condition 1 is $k_{i+1} - k_i > l - 1$, and Condition 2 is $k_{i+1} - k_i > d - l - 2$, where $i = 0, 1, 2, 3, k_0 = 0, k_4 = m$. Thus, to have the least time delay, $k_3, k_2, k_1$ must satisfy:

$$k_{i+1} - k_i \geq \max\{l, d - l - 1\}, \quad i = 0, 1, 2, 3 \text{ and } k_0 = 0, k_4 = m$$

(4.17)

This condition is denoted as Constraint Condition 2. More specifically, when $l > (d - 1)/2$, $k_{i+1} - k_i \geq l$; when $l < (d - 1)/2$, $k_{i+1} - k_i \geq d - l - 1$; when $l = (d - 1)/2$, $k_{i+1} - k_i \geq d/2 - 1/2$, and since $d$ is usually an even number, this becomes $k_{i+1} - k_i \geq d/2$.
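The dependence of Constraint Condition 2 on $l$ can be tabulated with a short script. The sketch is illustrative: the function names are assumptions, and the exponent set $(m, k_3, k_2, k_1) = (233, 12, 8, 4)$ is hypothetical and has not been checked for irreducibility.

```python
def required_gap(d, l):
    """Minimum value of k_{i+1} - k_i demanded by equation (4.17)."""
    return max(l, d - l - 1)


def admissible_l(m, k3, k2, k1, d):
    """List every l in [0, d - 1] for which the exponents satisfy (4.17)."""
    ks = [0, k1, k2, k3, m]
    min_gap = min(ks[i + 1] - ks[i] for i in range(4))
    return [l for l in range(d) if min_gap >= required_gap(d, l)]


print([required_gap(8, l) for l in range(8)])   # [7, 6, 5, 4, 4, 5, 6, 7]
print(admissible_l(233, 12, 8, 4, d=8))         # [3, 4]
```

For $d = 8$ the requirement is weakest at $l = 3$ and $l = 4$, where a gap of 4 suffices, whereas $l = 0$ or $l = 7$ requires a gap of 7.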

The remaining two units are identical to those of the architecture shown in Fig 4.2. REG C is implemented by a D-flipflop array, and the Modular Shift unit costs a maximum of $5d$ XOR gates and has a time delay of $2T_X$. The complexity of this architecture is shown in Table 4.4.

Comparing the general architecture with the advanced architecture, the gate counts of both architectures are nearly the same, and both need $s = \lceil \frac{m}{d} \rceil$ clock cycles to complete the computation. However, the time delay of the latter is shorter. Besides, the latter architecture relaxes the constraint condition on $k_i$ from $k_{i+1} - k_i \geq d - 1$ to $k_{i+1} - k_i \geq \max\{l, d - l - 1\}$, which indicates that more irreducible pentanomials can be applied to such Montgomery multiplication.

Table 4.4: Complexity of proposed digit-serial MSD-first Montgomery multiplication (Algorithm I, advanced architecture, when $k_{i+1} - k_i \geq \max\{l, d - l - 1\}$, $i = 0, 1, 2, 3$, $k_0 = 0$, $k_4 = m$ and $0 \leq l \leq d - 1$)

<table>
<thead>
<tr>
<th>Work</th>
<th>#AND</th>
<th>#XOR</th>
<th>#FF/Reg</th>
<th>#CLK</th>
<th>Critical path delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>MSD(Arch.2)</td>
<td>$md$</td>
<td>$\leq md + 8d - 3$</td>
<td>$m$</td>
<td>$s$</td>
<td>$T_A + (1 + \log_2(d + 1))T_X + T_{DFF}$</td>
</tr>
</tbody>
</table>

### 4.2 Proposed Digit-Serial LSD First Montgomery Multiplier

In this section, a digit-serial LSD first Montgomery Multiplier is proposed, and two different architectures are discussed when implementing the proposed multiplier. One of the architectures uses separate multiplication and reduction units, while the other one uses a linear-feedback-shift-register (LFSR) based structure.

#### 4.2.1 Algorithm

Suppose $A(x), B(x) \in GF(2^m)$, in polynomial representation, $B(x)$ is divided into digits of the same size:

$$B(x) = \sum_{i=0}^{s-1} x^{id} B_i(x), \text{ where } s = \lceil m/d \rceil$$

(4.18)
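
The digit decomposition in equation (4.18) can be sketched as follows, assuming the polynomial $B(x)$ is stored as a Python integer whose bit $j$ is the coefficient $b_j$. The helper names `split_digits` and `join_digits` are hypothetical.

```python
def split_digits(b, m, d):
    """Return [B_0, ..., B_{s-1}] with s = ceil(m/d); B_i holds bits id .. id+d-1."""
    s = -(-m // d)                # ceiling division
    mask = (1 << d) - 1
    return [(b >> (i * d)) & mask for i in range(s)]


def join_digits(digits, d):
    """Rebuild B(x) = sum over i of x^{id} * B_i(x)."""
    b = 0
    for i, digit in enumerate(digits):
        b |= digit << (i * d)
    return b


b = (1 << 232) | 0xABCDEF         # an arbitrary element of degree 232
digits = split_digits(b, m=233, d=8)
print(len(digits), join_digits(digits, 8) == b)
```

When $d$ does not divide $m$, the most significant digit is simply padded with zero coefficients.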

Let $C(x)$ be the product of $A(x)$, $B(x)$, and a fixed element $R^{-1}(x) = x^{-u} = x^{-sl}$, where $l \geq 0$ is an integer. The Montgomery multiplication can then be computed as shown in equation (4.19). Based on equation (4.19), an algorithm for the digit-serial LSD-first Montgomery multiplier can be proposed; see Table 4.5.

In the algorithm, Step 2 computes the product of $A^{(i)}(x)$ and $B_i(x)$; the result of Step 2 is forwarded to Step 3, where, after the value of register $C$ is added, a shift-to-right modular operation is performed; Step 4 generates $A^{(i+1)}(x)$ as the operand of the next clock cycle; when $i = s - 1$, register $C$ outputs the final result at the end of the clock cycle.

#### 4.2.2 General Architecture

The structure of the multiplier is shown in Fig 4.9. From top to bottom, block S1 updates $A^{(i)}(x)$ for the next clock cycle (Step 4), and the Multiply Core computes the product of $A^{(i)}(x)$ and $B_i(x)$; note that the output width of the core is $m + d - 1$ bits. The XOR symbol represents the operation $T_i(x) + C^{(i)}(x)$; block S2 computes the multiplication by $x^{-l}$ modulo $f(x)$; finally, REG C stores the result of each clock cycle and holds the final product.

![Diagram](image)

**Figure 4.9:** General architecture of the proposed digit-serial LSD first multiplier

When $0 \leq l \leq d - 1$, the complexity of blocks S1 and S2 changes with $l$, while the rest of the blocks remain the same. The Multiply Core simply ANDs the bits of the two operands and then adds up the terms that have the same degree. The two register units contain only D-flipflops. Table 4.6 shows the complexity of the Multiply Core, REG A, REG C, and the XOR array:

<table>
<thead>
<tr>
<th>Block</th>
<th>Gate Count</th>
<th>Time Delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>Multiply Core &amp; XOR array</td>
<td>$md$ AND, $md - d + 1$ XOR</td>
<td>$T_A + \log_2(d+1)T_X$</td>
</tr>
<tr>
<td>REG A</td>
<td>$m$ D-flipflop</td>
<td>$T_{DFF}$</td>
</tr>
<tr>
<td>REG C</td>
<td>$m$ D-flipflop</td>
<td>$T_{DFF}$</td>
</tr>
</tbody>
</table>

As in the proposed MSD-first multiplier, the reduction operation in Step 3 is also a two-side reduction. The computation follows equation (4.6):

\[
x^{m+d-l-2} \mod f(x) = x^{k_3+d-l-2} + x^{k_2+d-l-2} + x^{k_1+d-l-2} + x^{d-l-2}
\]

\[\vdots\]

\[
x^m \mod f(x) = x^{k_3} + x^{k_2} + x^{k_1} + 1
\]

\[
x^{-1} \mod f(x) = x^{m-1} + x^{k_3-1} + x^{k_2-1} + x^{k_1-1}
\]

\[
x^{-2} \mod f(x) = x^{m-2} + x^{k_3-2} + x^{k_2-2} + x^{k_1-2}
\]

\[\vdots\]

\[
x^{-l} \mod f(x) = x^{m-l} + x^{k_3-l} + x^{k_2-l} + x^{k_1-l}
\]

(4.20)
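
The negative-power identities in equation (4.20) can be checked numerically. The sketch below assumes polynomials over $GF(2)$ are stored as Python integers (bit $j$ is the coefficient of $x^j$) and uses the pentanomial of Section 4.4 as an example; the helpers `gf2_mul`, `gf2_mod` and `x_neg_pow` are hypothetical names. It verifies that $x^{j}$ times the right-hand side of (4.20) reduces to 1, which relies only on $f(x)$ having constant term 1 and on $j \leq k_1$.

```python
def gf2_mul(a, b):
    """Carry-less (GF(2)[x]) product of a and b."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def gf2_mod(a, f):
    """Remainder of a(x) divided by f(x) over GF(2)."""
    m = f.bit_length() - 1
    while a.bit_length() > m:
        a ^= f << (a.bit_length() - 1 - m)
    return a


m, k3, k2, k1 = 233, 185, 121, 105
f = (1 << m) | (1 << k3) | (1 << k2) | (1 << k1) | 1


def x_neg_pow(j):
    """Right-hand side of (4.20) for x^{-j} mod f(x), valid while j <= k1."""
    return (1 << (m - j)) | (1 << (k3 - j)) | (1 << (k2 - j)) | (1 << (k1 - j))


# x^j * x^{-j} must reduce to 1 for every j used by the multiplier.
ok = all(gf2_mod(gf2_mul(1 << j, x_neg_pow(j)), f) == 1 for j in range(1, 9))
print(ok)
```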

In order to further optimize the time delay, \(k_i, i = 1, 2, 3\), must satisfy the condition given in equation (4.10). Specifically, when \(l = 0\), the proposed multiplier becomes a standard polynomial multiplier; when \(l = d - 1\), the XOR gate cost is the lowest, equal to \(md + 3d\).

When \(l = d\), the architecture of the multiplier can be further simplified: since \(d - l = 0\), REG A and S1 can be omitted, S2 computes the multiplication by \(x^{-d} \mod f(x)\), and the reduction operation is only one-side. Table 4.8 gives the complexity summary when \(l = d\).

<table>
<thead>
<tr>
<th>Work</th>
<th>#AND</th>
<th>#XOR</th>
<th>#FF/Reg</th>
<th>#CLK</th>
<th>Critical path delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>LSD ($1 \leq l \leq d$)</td>
<td>$md$</td>
<td>$md + 3(2d - l - 1)$</td>
<td>$2m$</td>
<td>$s$</td>
<td>$T_A + (1 + \log_2 (d+1))T_X + T_{DFF}$</td>
</tr>
</tbody>
</table>

When \(l \geq d + 1\), since \(l > l - d\), we could predict that if we avoid multiple reduction operations in block S2, we could also avoid the multiple reduction in block S1. Besides, since \(l \geq d + 1\), the modulo \(f(x)\) operation is only one-side reduction. \(l\) must satisfy \(l \leq k_1\) to avoid multiple reduction.
Assume we divide \((C^{(i)}(x) + T_i(x))/x^l\) into two parts:

\[
(C^{(i)}(x) + T_i(x))/x^l = T(x) + T_L(x)
\] (4.21)

Then the reduction operation would be:

\[
(C^{(i)}(x) + T_i(x))/x^l \mod f(x) = T_L(x)x^{m} + T_L(x)x^{k_3} + T_L(x)x^{k_2} + T_L(x)x^{k_1} + T(x)
\] (4.22)

The degree range of each product in equation (4.22) is presented in Table 4.9.

<table>
<thead>
<tr>
<th>Terms</th>
<th>Degree Range</th>
</tr>
</thead>
<tbody>
<tr>
<td>$T_L(x)x^m$</td>
<td>$[m - l, m - 1]$</td>
</tr>
<tr>
<td>$T_L(x)x^{k_3}$</td>
<td>$[k_3 - l, k_3 - 1]$</td>
</tr>
<tr>
<td>$T_L(x)x^{k_2}$</td>
<td>$[k_2 - l, k_2 - 1]$</td>
</tr>
<tr>
<td>$T_L(x)x^{k_1}$</td>
<td>$[k_1 - l, k_1 - 1]$</td>
</tr>
<tr>
<td>$T(x)$</td>
<td>$[0, m + d - l - 2]$</td>
</tr>
</tbody>
</table>

From Table 4.9, obviously, \(m + d - l - 2 < m - 1\); thus, instead of using \(l\) XOR gates to add the term \(T_L(x)x^{m}\) to \(T(x)\), we only need \(d - 1\) XOR gates. Similarly, if \(m + d - l - 2 < k_3 - 1\), the XOR gate count of block \(S2\) can be further reduced. This result is presented in Table 4.10.

<table>
<thead>
<tr>
<th>Conditions</th>
<th>XOR Gate Count of Block S1</th>
</tr>
</thead>
<tbody>
<tr>
<td>$d + 1 \leq l \leq \min\{m + d - k_3 - 1, k_1\}$</td>
<td>$3l + d - 1$</td>
</tr>
<tr>
<td>$m + d - k_3 \leq l \leq \min\{m + d - k_2 - 1, k_1\}$</td>
<td>$2l + 2(d - 1) + (m - k_3)$</td>
</tr>
<tr>
<td>$m + d - k_2 \leq l \leq \min\{m + d - k_1 - 1, k_1\}$</td>
<td>$l + 3(d - 1) + (m - k_3) + (m - k_2)$</td>
</tr>
<tr>
<td>$m + d - k_1 \leq l \leq k_1$</td>
<td>$4(d - 1) + (m - k_3) + (m - k_2) + (m - k_1)$</td>
</tr>
</tbody>
</table>

Considering the time delay of block \(S2\), when conditions \(k_3 < m - l + 1, k_2 < k_3 - l + 1,\) and \(k_1 < k_2 - l + 1\) are all satisfied, the time delay is \(T_X\). To sum up, \(k_i\) must satisfy:

\[
k_{i+1} - k_i \geq l
\] (4.23)

where \(i = 0, 1, 2, 3\) and \(k_0 = 0, k_4 = m\). However, compared with equation (4.10), equation (4.23) narrows the condition. The complexity of the multiplier when \( l > d \) is given in Table 4.11.

Table 4.11: Complexity of digit-level Montgomery multiplication (Algorithm II, when \( l > d \), and \( k_{i+1} - k_i \geq l \), \( k_0 = 0 \), \( k_4 = m \))

<table>
<thead>
<tr>
<th>Work</th>
<th>#AND</th>
<th>#XOR</th>
<th>#FF/Reg</th>
<th>#CLK</th>
<th>Critical path delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>LSD ($l &gt; d$)</td>
<td>$md$</td>
<td>$\leq md + 6l - 3d$</td>
<td>$2m$</td>
<td>$s$</td>
<td>$T_A + (1 + \log_2(d + 1))T_X + T_{DFF}$</td>
</tr>
</tbody>
</table>

#### 4.2.3 LFSR-Based Architecture

When \( 0 \leq l \leq d - 1 \), an LFSR-based architecture can be provided.

Table 4.12: LFSR-Based Digit-serial LSD-first Montgomery Multiplier (\( R(x) = x^{sl} \)), where \( 0 \leq l \leq d - 1 \)

<table>
<thead>
<tr>
<th>Algorithm III</th>
<th>LFSR-Based Digit-serial LSD-first Montgomery Multiplier</th>
</tr>
</thead>
<tbody>
<tr>
<td>Input: $A(x)$, $B_i(x)$, $f(x)$, $i = 0, 1, \ldots, s - 1$</td>
<td></td>
</tr>
<tr>
<td>Output: $C(x) = A(x)B(x)x^{-sl} \mod f(x)$, where $s = \lceil m/d \rceil$</td>
<td></td>
</tr>
<tr>
<td>Step 1: $A^{(0)}(x) = A(x)$, $C^{(0)}(x) = 0$</td>
<td></td>
</tr>
<tr>
<td>For $i = 0$ to $s - 1$</td>
<td></td>
</tr>
<tr>
<td>Step 2: $T_i(x) = A^{(i)}(x)B_i(x)/x^l \mod f(x)$</td>
<td></td>
</tr>
<tr>
<td>Step 3: $C^{(i+1)}(x) = C^{(i)}(x)/x^l \mod f(x) + T_i(x)$</td>
<td></td>
</tr>
<tr>
<td>Step 4: $A^{(i+1)}(x) = A^{(i)}(x)x^{d-l} \mod f(x)$</td>
<td></td>
</tr>
<tr>
<td>Step 5: $C(x) = C^{(s)}(x)$</td>
<td></td>
</tr>
</tbody>
</table>
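
A software model can help confirm that the steps of Algorithm III produce $A(x)B(x)x^{-sl} \bmod f(x)$. The following is a minimal bit-level sketch, not the hardware architecture itself: polynomials are stored as Python integers, the field is the one used in Section 4.4, and the helper names (`mul_x`, `div_x`, `times_x_pow`, `lfsr_lsd_montgomery`, `reference`) are hypothetical. The single-step functions play the roles of one Model 1 unit (multiply by $x$) and one Model 2 unit (multiply by $x^{-1}$).

```python
import random

# f(x) = x^233 + x^185 + x^121 + x^105 + 1 (the example field of Section 4.4)
M, K3, K2, K1 = 233, 185, 121, 105
F = (1 << M) | (1 << K3) | (1 << K2) | (1 << K1) | 1


def mul_x(a):
    """a(x) * x mod f(x), for deg a < M."""
    a <<= 1
    return a ^ F if a >> M else a


def div_x(a):
    """a(x) * x^{-1} mod f(x): add f(x) if the constant term is 1, then shift."""
    return (a ^ F) >> 1 if a & 1 else a >> 1


def times_x_pow(a, e):
    """a(x) * x^e mod f(x) for any integer e, one single-step unit at a time."""
    step = mul_x if e >= 0 else div_x
    for _ in range(abs(e)):
        a = step(a)
    return a


def lfsr_lsd_montgomery(a, b, d, l):
    s = -(-M // d)                                   # s = ceil(m/d)
    c = 0                                            # Step 1: C^(0) = 0
    for i in range(s):
        t = 0
        for j in range(d):                           # Step 2, bit by bit
            if (b >> (i * d + j)) & 1:
                t ^= times_x_pow(a, j - l)
        c = times_x_pow(c, -l) ^ t                   # Step 3
        a = times_x_pow(a, d - l)                    # Step 4
    return c                                         # Step 5


def reference(a, b, d, l):
    """A(x)B(x)x^{-sl} mod f(x) by plain shift-and-add, for comparison."""
    s = -(-M // d)
    prod = 0
    for j in range(M):
        if (b >> j) & 1:
            prod ^= times_x_pow(a, j)
    return times_x_pow(prod, -s * l)


random.seed(1)
a, b = random.getrandbits(M), random.getrandbits(M)
print(lfsr_lsd_montgomery(a, b, d=8, l=4) == reference(a, b, d=8, l=4))
```

With $l = 0$ the same routine reduces to an ordinary LSD-first polynomial basis multiplication, which gives a second sanity check.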

A minor change is applied to Steps 2 and 3 of Algorithm II, and Table 4.12 presents the new algorithm. In Algorithm III, Step 2 is computed as follows:

\[
T_i(x) = \sum_{j=0}^{d-1} \left( A^{(i)}(x)x^{j-l} \mod f(x) \right) \cdot b_{id+j} \tag{4.24}
\]

In equation (4.24), \( A^{(i)}(x)x^{j-l} \mod f(x) \) is computed first and then ANDed with the corresponding bit of \( B_i(x) \). Since \( j = 0, 1, 2, \ldots, d - 1 \), when \( j = d - 1 \) the corresponding term \( A^{(i)}(x)x^{j-l} \mod f(x) \) equals \( A^{(i)}(x)x^{d-l-1} \mod f(x) \); also note that in Step 4 of Algorithm III, \( A^{(i)}(x)x^{d-l} = A^{(i)}(x)x^{d-l-1} \cdot x \). Thus, by applying the circuit structures provided in Fig 4.4 and Fig 4.6, an LFSR-based architecture can be obtained; see Fig 4.10.

In the architecture, register A and the $d - l$ Model 1 units constitute a linear feedback shift circuit; in addition, the modular multiplication by $x$ and the modular multiplication by $x^{-1}$ are divided into two separate parts. Each Model 1 or Model 2 unit costs 3 XOR gates, giving $3d$ XOR gates in total. To have the minimum time delay of the architecture, $k_i$ must satisfy:

$$k_{i+1} - k_i \geq \max\{l, d - l - 1\} \quad (4.25)$$

where $i = 0, 1, 2, 3$, $k_0 = 0$, and $k_4 = m$. By applying the condition described in equation (4.25), the time delay of the Multiply $x^{-l}$ mod $f(x)$ block is $T_X$, and the block costs $3l$ XOR gates. The remaining blocks, REG A and REG C, have the same structure as in the general architecture reported in subsection 4.2.2. Table 4.13 gives the complexity of this architecture. Compared with the general architecture, the architecture proposed in this subsection has the same critical path delay. However, the LFSR-based architecture broadens the condition for irreducible pentanomial selection: the condition is extended from equation (4.10) to equation (4.25). Also note that if parameter \( l \) is changed to \( u \), equation (4.25) is identical to equation (4.17).

Table 4.13: Complexity of digit-level Montgomery multiplication (Algorithm III, when \( 0 \leq l \leq d - 1 \), and \( k_{i+1} - k_i \geq \max\{l, d - l - 1\} \), \( k_0 = 0 \), \( k_4 = m \))

<table>
<thead>
<tr>
<th>Work</th>
<th>#AND</th>
<th>#XOR</th>
<th>#FF/Reg</th>
<th>#CLK</th>
<th>Critical path delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>LSD (LFSR)</td>
<td>$md$</td>
<td>$md + 3d + 3l$</td>
<td>$2m$</td>
<td>$s$</td>
<td>$T_A + (1 + \log_2(d + 1))T_X + T_{DFF}$</td>
</tr>
</tbody>
</table>

### 4.3 Complexity Analysis

In this section, complexities of the proposed work in terms of gate count and time delay will be investigated and compared with several types of digit-level multipliers. Table 4.14 gives the practical time delay of 2-input AND gate and 2-input XOR gate at 25°C, 1.8V based on CMOSP18 technology as a reference.

Table 4.14: Intrinsic delay of XOR2 and AND2 gate, we assume each gate could drive a maximum of two gates (25°C, 1.8V, CMOSP18 Tech., \( Y = A \cdot B \), or \( Y = A \oplus B \))

<table>
<thead>
<tr>
<th>Description</th>
<th>Delay of AND2(ns)</th>
<th>Delay of XOR2(ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>( A \rightarrow Y \uparrow )</td>
<td>0.0720</td>
<td>0.1351</td>
</tr>
<tr>
<td>( A \rightarrow Y \downarrow )</td>
<td>0.0970</td>
<td>0.1294</td>
</tr>
<tr>
<td>( B \rightarrow Y \uparrow )</td>
<td>0.0763</td>
<td>0.1209</td>
</tr>
<tr>
<td>( B \rightarrow Y \downarrow )</td>
<td>0.1091</td>
<td>0.1475</td>
</tr>
</tbody>
</table>

Table 4.15 compares the work reported in [28] with our proposed work; both are digit-serial Montgomery multipliers. From the table, it can be seen that the critical path delay of the proposed works is better than that of [28]. As a trade-off, the XOR gate count is greater than that of [28], except in the case \( l = d \), where the proposed work has an even better gate count than the works reported in [28].

The proposed works are Montgomery multipliers, whose definition differs from that of general polynomial basis multipliers. Since both multiplications can be done using a polynomial basis and they are similar at the architecture level, we consider them comparable.
Table 4.15: Digit-serial Montgomery multipliers comparison \( f(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1, s = m/d \)

<table>
<thead>
<tr>
<th>Type</th>
<th>Work</th>
<th>#AND</th>
<th>#XOR</th>
<th>#DFF</th>
<th>#CLK</th>
<th>Critical path delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>MSD</td>
<td>[28] ($x^m$)</td>
<td>$md$</td>
<td>$md + 5d$</td>
<td>$2m$</td>
<td>$s$</td>
<td>$T_A + (3 + \lceil \log_2 d \rceil)T_X + T_{DFF}$</td>
</tr>
<tr>
<td></td>
<td>[28] ($x^{m-1}$)</td>
<td>$md$</td>
<td>$md + 3d$</td>
<td>$2m$</td>
<td>$s$</td>
<td>$T_A + (3 + \lceil \log_2 d \rceil)T_X + T_{DFF}$</td>
</tr>
<tr>
<td></td>
<td colspan="6">$f(x)$ satisfying $k_{i+1} - k_i \geq d - 1$, $i = 0, 1, 2, 3$, $k_0 = 0$, $k_4 = m$</td>
</tr>
<tr>
<td></td>
<td>Proposed(Arch.1, x')</td>
<td>$md$</td>
<td>$\leq md + 6d$</td>
<td>$m$</td>
<td>$s$</td>
<td>$T_A + (2 + \lceil \log_2 (d+1) \rceil)T_X + T_{DFF}$</td>
</tr>
<tr>
<td></td>
<td>Proposed(Arch.2, x')</td>
<td>$md$</td>
<td>$\leq md + 8d - 3$</td>
<td>$m$</td>
<td>$s$</td>
<td>$T_A + (1 + \lceil \log_2 (d+1) \rceil)T_X + T_{DFF}$</td>
</tr>
</tbody>
</table>

Table 4.16 compares the proposed MSD-first multipliers with a group of MSD-first polynomial basis finite field multipliers, where the field is generated by irreducible pentanomials. The table implies that the proposed MSD-first multipliers have the smallest gate count. The time delay of the proposed multipliers is smaller than that of [17] but larger than that of [13]; however, the multiplier in [13] needs one extra clock cycle to obtain the final result, since it applies a final reduction unit.

Table 4.16: Proposed multipliers compared with Polynomial Basis finite field multipliers (MSD cases, \( f(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1, s = \lceil m/d \rceil \), \( T_{DFF} \) represents the time delay of a D-flipflop)

<table>
<thead>
<tr>
<th>Work</th>
<th>#AND</th>
<th>#XOR</th>
<th>#DFF</th>
<th>#CLK</th>
<th>Critical path delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>[13] (MSD)</td>
<td>(md)</td>
<td>(2md - d + m)</td>
<td>(m + d)</td>
<td>(s + 1)</td>
<td>(T_A + \lceil \log_2 (2d + 1) \rceil T_X + T_{DFF})</td>
</tr>
<tr>
<td>([17])</td>
<td>(md)</td>
<td>(3(d^2 + d)/2 + md)</td>
<td>(m)</td>
<td>(s)</td>
<td>(T_A + (3 + \lceil \log_2 d \rceil)T_X + T_{DFF})</td>
</tr>
<tr>
<td>Proposed(Arch.1)</td>
<td>(md)</td>
<td>(\leq md + 6d)</td>
<td>(m)</td>
<td>(s)</td>
<td>(T_A + (2 + \lceil \log_2 d \rceil)T_X + T_{DFF})</td>
</tr>
<tr>
<td>Proposed(Arch.2)</td>
<td>(md)</td>
<td>(\leq md + 8d - 3)</td>
<td>(m)</td>
<td>(s)</td>
<td>(T_A + (1 + \lceil \log_2 (d+1) \rceil)T_X + T_{DFF})</td>
</tr>
</tbody>
</table>

Table 4.17 presents the comparison between the proposed LSD-first multipliers and LSD-first PB multipliers. When \( l = d \) and \( l = d - 1 \), the XOR gate usage of the proposed LSD-first multipliers is at its minimum. It can be seen that the proposed works use the fewest AND gates, registers, and MUX cells, but their XOR gate usage is higher than that of the architecture reported in [24]; this is because [24] uses a T-flipflop to implement the accumulator instead of a D-flipflop and an XOR gate. Also, [19] uses a final reduction unit instead of computing the reduction in each clock cycle; the critical path delay of [19] is therefore shorter than that of the proposed multipliers, but one more clock cycle is required to obtain the final result. Therefore, taking both gate count and time delay into consideration, the proposed LSD-first multipliers remain competitive.

Table 4.17: Proposed multipliers compared with Polynomial Basis finite field multipliers (LSD cases, \(T_M\) represents the time delay of a 2 \(\times\) 1 Multiplexer, \(T_{\text{FF}}\) represents the time delay of a T-flipflop)

<table>
<thead>
<tr>
<th>Work</th>
<th>#AND</th>
<th>#XOR</th>
<th>#DFF/FF</th>
<th>#MUX</th>
<th>#CLK</th>
<th>Critical path delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>[13] (LSD)</td>
<td>(md)</td>
<td>(2md - d + m)</td>
<td>(2m + d - 1)</td>
<td>(m)</td>
<td>(s + 1)</td>
<td>(T_X +</td>
</tr>
<tr>
<td>[19]</td>
<td>(md + 8d - 4)</td>
<td>(md + 7d - 4)</td>
<td>(2m + d - 1)</td>
<td>(0)</td>
<td>(s + 1)</td>
<td>(T_X +</td>
</tr>
<tr>
<td>[24]</td>
<td>(md)</td>
<td>(m(d - 1) + 3(d^2 + d)/2)</td>
<td>(2m + d)</td>
<td>(0)</td>
<td>(s)</td>
<td>(T_X + (2 +</td>
</tr>
<tr>
<td>Proposed((l &lt; d))</td>
<td>(md)</td>
<td>(md + 3(2d - l - 1))</td>
<td>(2m)</td>
<td>(0)</td>
<td>(s)</td>
<td>(T_A + (1 +</td>
</tr>
<tr>
<td>Proposed((l = d))</td>
<td>(md)</td>
<td>(md + 3d)</td>
<td>(m)</td>
<td>(0)</td>
<td>(s)</td>
<td>(T_A + (1 +</td>
</tr>
<tr>
<td>Proposed(LFSR)</td>
<td>(md)</td>
<td>(md + 3d + 3l)</td>
<td>(2m)</td>
<td>(0)</td>
<td>(s)</td>
<td>(T_A + (1 +</td>
</tr>
</tbody>
</table>

If we let \(m = 233\), \(d = 8\) and \(l = 4\), so that \(s = \lceil m/d \rceil = 30\), the area and latency can be estimated under the following assumptions: (1) the VLSI area of an XOR gate is approximately twice that of an AND gate (\(2\text{AND} = \text{XOR}\)), as is its gate delay (\(2T_A = T_X\)); (2) the VLSI area and time delay of a DFF are approximately three times those of an AND gate (\(3\text{AND} = \text{DFF}\), \(3T_A = T_{DFF}\)); (3) the VLSI area and time delay of a TFF are approximately 3.5 times those of an AND gate (\(3.5\text{AND} = \text{TFF}\), \(3.5T_A = T_{TFF}\)); (4) the VLSI area and time delay of a 2×1 multiplexer are approximately twice those of an AND gate (\(2\text{AND} = \text{MUX}\), \(2T_A = T_M\)). Based on these assumptions, the gate count and delay of an AND gate can be used to estimate the efficiency of the proposed and existing works; see Table 4.18 and Table 4.19.
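
To make the estimate concrete, the sketch below applies the stated weights (AND = 1, XOR = 2, DFF = 3, in units of one AND gate) to the gate-count formulas of the proposed architectures, taking the upper bounds where the tables give an inequality. The dictionary layout and function name `area` are illustrative choices; the resulting areas correspond to the values listed for the proposed designs in Table 4.18.

```python
from math import ceil, log2

# Relative weights from the stated assumptions, in units of one AND gate.
AREA_AND, AREA_XOR, AREA_DFF = 1, 2, 3
T_A, T_X, T_DFF = 1, 2, 3


def area(n_and, n_xor, n_dff):
    return n_and * AREA_AND + n_xor * AREA_XOR + n_dff * AREA_DFF


m, d, l = 233, 8, 4
md = m * d

# (#AND, #XOR, #DFF) taken from the complexity tables of this chapter.
designs = {
    "MSD Arch.1": (md, md + 6 * d, m),
    "MSD Arch.2": (md, md + 8 * d - 3, m),
    "LSD l<d": (md, md + 3 * (2 * d - l - 1), 2 * m),
    "LSD l=d": (md, md + 3 * d, m),
    "LSD LFSR": (md, md + 3 * d + 3 * l, 2 * m),
}
areas = {name: area(*gates) for name, gates in designs.items()}

# Critical path T_A + (1 + ceil(log2(d + 1))) T_X + T_DFF in AND-gate delays.
delay = T_A + (1 + ceil(log2(d + 1))) * T_X + T_DFF

for name, value in areas.items():
    print(name, value, f"{100 * value / areas['MSD Arch.1']:.2f}%")
print("delay:", delay)
```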

In Table 4.18, the area and delay of an AND gate are used to evaluate the Montgomery multipliers. The area and time efficiency of the proposed MSD multiplier are taken as 100%, and the efficiency of the other proposed architectures and of the existing Montgomery multipliers is calculated relative to it. Note that an efficiency value below 100% indicates an improvement. The result shows that when \(m = 233\) and \(d = 8\), the proposed MSD-first and LSD-first Montgomery multipliers reduce the time delay compared with [28] and [12], with a reduced area cost as well.

Table 4.18: Efficiency of the proposed multipliers and existing Montgomery multipliers \((m = 233, d = 8, \text{if } l < d, \text{then } l = 4)\)

<table>
<thead>
<tr>
<th>Work</th>
<th>Area</th>
<th>Area Efficiency</th>
<th>Time Delay</th>
<th>Time Efficiency</th>
</tr>
</thead>
<tbody>
<tr>
<td>[12]</td>
<td>423804</td>
<td>6635.42%</td>
<td>30</td>
<td>214.29%</td>
</tr>
<tr>
<td>[28] ($x^{m}$)</td>
<td>7038</td>
<td>110.19%</td>
<td>16</td>
<td>114.29%</td>
</tr>
<tr>
<td>[28] ($x^{m-1}$)</td>
<td>7038</td>
<td>110.19%</td>
<td>16</td>
<td>114.29%</td>
</tr>
<tr>
<td>Proposed(MSD, Arch.1, x^l)</td>
<td>6387</td>
<td>100%</td>
<td>14</td>
<td>100%</td>
</tr>
<tr>
<td>Proposed(MSD, Arch.2, x^l)</td>
<td>6413</td>
<td>100.41%</td>
<td>14</td>
<td>100%</td>
</tr>
<tr>
<td>Proposed(LSD, l &lt; d, x^{d})</td>
<td>7056</td>
<td>110.47%</td>
<td>14</td>
<td>100%</td>
</tr>
<tr>
<td>Proposed(LSD, l = d, x^{d})</td>
<td>6339</td>
<td>99.25%</td>
<td>14</td>
<td>100%</td>
</tr>
<tr>
<td>Proposed(LSD, LFSR, x^{d})</td>
<td>7062</td>
<td>110.57%</td>
<td>14</td>
<td>100%</td>
</tr>
</tbody>
</table>

Table 4.19: Efficiency of the proposed multipliers and existing PB multipliers \((m = 233, d = 8, \text{if } l < d, \text{then } l = 4)\)

<table>
<thead>
<tr>
<th>Work</th>
<th>Area</th>
<th>Area Efficiency</th>
<th>Time Delay</th>
<th>Time Efficiency</th>
</tr>
</thead>
<tbody>
<tr>
<td>[13] (MSD)*</td>
<td>10493</td>
<td>164.29%</td>
<td>14</td>
<td>100%</td>
</tr>
<tr>
<td>[13] (LSD)*</td>
<td>11655</td>
<td>182.48%</td>
<td>14</td>
<td>100%</td>
</tr>
<tr>
<td>[17]</td>
<td>6483</td>
<td>101.50%</td>
<td>16</td>
<td>114.29%</td>
</tr>
<tr>
<td>[19]*</td>
<td>7175</td>
<td>112.34%</td>
<td>12</td>
<td>85.71%</td>
</tr>
<tr>
<td>[24]</td>
<td>6470</td>
<td>101.30%</td>
<td>14.5</td>
<td>103.57%</td>
</tr>
<tr>
<td>Proposed(MSD, Arch.1, x^l)</td>
<td>6387</td>
<td>100%</td>
<td>14</td>
<td>100%</td>
</tr>
<tr>
<td>Proposed(MSD, Arch.2, x^l)</td>
<td>6413</td>
<td>100.41%</td>
<td>14</td>
<td>100%</td>
</tr>
<tr>
<td>Proposed(LSD, l &lt; d, x^{d})</td>
<td>7056</td>
<td>110.47%</td>
<td>14</td>
<td>100%</td>
</tr>
<tr>
<td>Proposed(LSD, l = d, x^{d})</td>
<td>6339</td>
<td>99.25%</td>
<td>14</td>
<td>100%</td>
</tr>
<tr>
<td>Proposed(LSD, LFSR, x^{d})</td>
<td>7062</td>
<td>110.57%</td>
<td>14</td>
<td>100%</td>
</tr>
</tbody>
</table>

In Table 4.19, works marked with “*” need one extra clock cycle to obtain the final result. We take the proposed MSD-first architecture as 100% efficiency and compare it with the other proposed Montgomery multipliers and the existing PB multipliers. In general, the proposed architectures further reduce the time delay, while the area cost remains comparable.

According to these comparisons, by applying the proposed two classes of fields, the proposed MSD-first and LSD-first digit-serial Montgomery multipliers have less time delay than the existing digit-level Montgomery multipliers, and less than most of the existing polynomial basis multipliers. The gate count of the proposed multipliers is also comparable with that of most existing works. The proposed work is remarkable in terms of the further reduction of the critical path delay.

### 4.4 FPGA Implementation of the Proposed Multipliers

In this section, the proposed MSD-first and LSD-first Montgomery multipliers are implemented on an FPGA. The advanced architecture is selected for the MSD-first multiplier, while the general architecture is selected for the LSD-first multiplier. The finite field size is set to $m = 233$, and the polynomial $f(x) = x^{233} + x^{185} + x^{121} + x^{105} + 1$ is chosen to generate $GF(2^{233})$. The digit size is $d = 8$; the integer $u$ of the MSD-first multiplier and the integer $l$ of the LSD-first multiplier are both set to 4. FPGA development tools: Quartus II v9.1 and ModelSim v6.5b. FPGA model: Stratix II, EP2S60F1020C3.

#### 4.4.1 Summary of the MSD-First Multiplier Implementation

Table 4.20: Cells usage of compilation ($m = 233$, $d = 8$, $u = 4$)

<table>
<thead>
<tr>
<th>Cells</th>
<th>Usage</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total registers</td>
<td>233</td>
</tr>
<tr>
<td>Total pins</td>
<td>476</td>
</tr>
<tr>
<td>≤ 3-input combinational ALUT</td>
<td>0</td>
</tr>
<tr>
<td>4-input combinational ALUT</td>
<td>227</td>
</tr>
<tr>
<td>5-input combinational ALUT</td>
<td>436</td>
</tr>
<tr>
<td>6-input combinational ALUT</td>
<td>275</td>
</tr>
<tr>
<td>Total combinational functions</td>
<td>938</td>
</tr>
</tbody>
</table>

Table 4.21: Gate count of each module ($m = 233$, $d = 8$, $u = 4$)

<table>
<thead>
<tr>
<th>Module</th>
<th>#Logic combinational functions</th>
<th>#Register</th>
</tr>
</thead>
<tbody>
<tr>
<td>REG C</td>
<td>0</td>
<td>233</td>
</tr>
<tr>
<td>Multiply Core</td>
<td>959</td>
<td>0</td>
</tr>
<tr>
<td>Unit of Multiplied by $x^d$</td>
<td>24</td>
<td>0</td>
</tr>
<tr>
<td>Top-level</td>
<td>983</td>
<td>233</td>
</tr>
</tbody>
</table>

Table 4.20 provides the usage of logic cells, including gates, pins, and registers, after compilation. The term logic unit in the table represents logic gates and other types of logic cells that are contained in an FPGA device. Table 4.21 shows the gate count of each module; since the compiler may optimize the structure during compilation, some modules may contain fewer logic elements than the designed architecture. Table 4.22 is a summary of the time complexity of the design. Clock setup is the maximum operating speed the design can reach; in this implementation, it is restricted to the maximum clock frequency of the selected FPGA, 500MHz. The number of clock cycles and the processing time for one multiplication are also included.

Table 4.22: Time complexity of the design ($m = 233$, $d = 8$, $u = 4$)

<table>
<thead>
<tr>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>Clock setup</td>
<td>Restricted to 500.00MHz</td>
</tr>
<tr>
<td>Clock period</td>
<td>2.000ns</td>
</tr>
<tr>
<td>Number of clock cycles for one multiplication</td>
<td>30</td>
</tr>
<tr>
<td>Total time cost for one multiplication</td>
<td>60.0ns</td>
</tr>
</tbody>
</table>

#### 4.4.2 Summary of the LSD-First Multiplier Implementation

Table 4.23: Cells usage of compilation ($m = 233$, $d = 8$, $l = 4$)

<table>
<thead>
<tr>
<th>Cells</th>
<th>Usage</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total registers</td>
<td>466</td>
</tr>
<tr>
<td>Total pins</td>
<td>476</td>
</tr>
<tr>
<td>$\leq$ 3-input combinational ALUT</td>
<td>12</td>
</tr>
<tr>
<td>4-input combinational ALUT</td>
<td>224</td>
</tr>
<tr>
<td>5-input combinational ALUT</td>
<td>451</td>
</tr>
<tr>
<td>6-input combinational ALUT</td>
<td>257</td>
</tr>
<tr>
<td>Total combinational functions</td>
<td>944</td>
</tr>
</tbody>
</table>

Table 4.24: Gate count of each module ($m = 233$, $d = 8$, $l = 4$)

<table>
<thead>
<tr>
<th>Module</th>
<th>#Logic combinational functions</th>
<th>#Register</th>
</tr>
</thead>
<tbody>
<tr>
<td>REG A</td>
<td>0</td>
<td>233</td>
</tr>
<tr>
<td>REG C</td>
<td>0</td>
<td>233</td>
</tr>
<tr>
<td>Multiply Core</td>
<td>911</td>
<td>0</td>
</tr>
<tr>
<td>S1</td>
<td>12</td>
<td>0</td>
</tr>
<tr>
<td>S2</td>
<td>21</td>
<td>0</td>
</tr>
<tr>
<td>Top-level</td>
<td>944</td>
<td>233</td>
</tr>
</tbody>
</table>
Table 4.25: Time complexity of the design ($m = 233, d = 8, l = 4$)

<table>
<thead>
<tr>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>Clock setup</td>
<td>326.16MHz</td>
</tr>
<tr>
<td>Clock period</td>
<td>3.066ns</td>
</tr>
<tr>
<td>Number of clock cycles for one multiplication</td>
<td>30</td>
</tr>
<tr>
<td>Total time cost for one multiplication</td>
<td>91.98ns</td>
</tr>
</tbody>
</table>
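
The timing figures in Tables 4.22 and 4.25 follow directly from the clock frequency and the cycle count $s = 30$. A small sketch of that arithmetic, with a hypothetical helper name `multiply_time_ns`:

```python
def multiply_time_ns(f_mhz, cycles):
    """Return (clock period, total time), both in ns, for a clock given in MHz."""
    period_ns = 1000.0 / f_mhz
    return period_ns, cycles * period_ns


for f_mhz in (500.00, 326.16):    # MSD-first and LSD-first clock setups
    period, total = multiply_time_ns(f_mhz, cycles=30)
    print(f"{f_mhz} MHz: period {period:.3f} ns, total {total:.2f} ns")
```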

Note that the fixed element $R(x)$ of the proposed digit-serial LSD-first Montgomery multiplier is $x^{sl}$. The compilation results show that the proposed LSD-first multiplier uses twice as many registers as the proposed MSD-first multiplier and runs at a lower clock frequency; however, its usage of logic elements is lower than that of the proposed MSD-first one.
Chapter 5

FPGA Implementation of Inverse Generator

In this chapter, we will introduce the FPGA implementation of a normal basis inverse generator. We first give the architecture of the inverse generator. Then, schemes for each module of the generator are provided and explained respectively, as well as an algorithm of the normal basis multiplication over $GF(2^m)$. We also obtain the simulation and compilation result of our designed inverse generator, the gate usage and clock setup are included in our implementation result. FPGA development tool: Quartus II v9.1 and ModelSim v6.5b. FPGA model: Stratix II, EP2S60F1020C3.

### 5.1 The Design of Inverse Generator

Fig 5.1 presents the architecture of the designed normal basis inverse generator [30], and Fig 5.2 shows the block diagram for the FPGA implementation. Comparing the two figures, the REG2 and $2^x$-power blocks are replaced by a shift register block, since the $2^x$ exponentiation in a normal basis is simply a shift operation; see equation (5.1) as an example. Besides, the normal basis multiplier module is implemented using a digit-level structure in order to reduce the gate count, at the cost of an increased number of clock cycles. We also add some control signals to control the operation of the design: the input "clk" signal provides the system clock; the "clk1" signal enables or disables the REG1 and REG2 modules; the input "rst" signal restarts or resets the inverse generator; the output "rdy" signal indicates that the final result has been generated; and finally, the output "ctrl" takes the place of the select signal of the MUX. In the following subsections, the design of each module is introduced.

\[(\theta^{2^i})^{2^x} = \theta^{2^{i+x}}\]  
(5.1)
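
Equation (5.1) means that raising a normal basis element to the power $2^x$ only permutes its coordinates cyclically. A minimal sketch, assuming the coordinates are stored in a Python list whose entry $i$ is the coefficient of $\theta^{2^i}$ (the function name `nb_pow2` and this index convention are illustrative; whether the hardware shift is a left or right rotation depends on the bit ordering used):

```python
def nb_pow2(coeffs, x):
    """Coordinates of a^(2^x), given the normal basis coordinates of a.

    coeffs[i] is the coefficient of theta^(2^i); since theta^(2^m) = theta,
    the coefficient of theta^(2^i) moves to position (i + x) mod m.
    """
    m = len(coeffs)
    x %= m
    return coeffs[-x:] + coeffs[:-x] if x else list(coeffs)


a = [1, 0, 0, 0, 0]               # the element theta in a toy field with m = 5
print(nb_pow2(a, 1))              # theta^2: the single 1 moves to index 1
```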

#### 5.1.1 REG1 Module

See Fig 5.3 for the REG1 module. The module contains two 163-bit registers, R0 and R1; "reg_out0" is the output of R0 and "reg_out1" is the output of R1. On each positive edge of the "clk1" signal, when "rst" is logic one, REG1 loads the data from port "reg_init" into R0; when "rst" is logic zero, REG1 loads the data from "reg_in" into R0, and in the same clock cycle R0 passes its data to R1. "rdy" acts as an enable signal: when "rdy" equals logic one, REG1 remains unchanged. The "ctrl" signal controls the data selection of the MUX module: when "rst" is one, "ctrl" is set to the low voltage level, which is logic zero, and when "rst" is zero, "ctrl" toggles on each positive clock edge; for example, when "rst" is zero, the output value of "ctrl" will be 1, 0, 1, 0, . . .

#### 5.1.2 REG2 Module

See Fig 5.4 for the REG2 module. A counter is included in this module, and "clk1" is the clock input. When "rst" is one, the register loads the data from "reg_init", circularly left-shifts it by one bit, and forwards the result to the output port "reg_out"; "rdy", "inverse_out" and the counter are all set to zero. When "rst" equals zero, the register loads the data from "reg_in", circularly left-shifts the data, and forwards the result to the output port "reg_out"; at the same time, the counter is increased by one. As the counter increases from 0 to 8, the register shifts the input data by 1, 3, 3, 9, 9, 27, 27, 81, and 1 bits, respectively. When the counter equals 9, the "rdy" signal is set to logic one, and "inverse_out" outputs the final result $\alpha^{-1}$ of the normal basis inverse generator.
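
One possible reading of this shift schedule, which the text does not spell out, is an Itoh-Tsujii-style exponent chain: a shift by $j$ bits raises $\alpha^{2^k - 1}$ to the power $2^j$, and a following multiplication by $\alpha^{2^j - 1}$ yields $\alpha^{2^{k+j} - 1}$. The sketch below checks only the exponent arithmetic under that assumption (including the assumption that the last shift is not followed by a multiplication); it is not a model of the actual datapath.

```python
m = 163
reset_shift = 1                                # shift applied when "rst" is one
schedule = [1, 3, 3, 9, 9, 27, 27, 81, 1]      # counter = 0 .. 8, from the text

k = 1                                          # alpha = alpha^(2^1 - 1)
for j in [reset_shift] + schedule[:-1]:
    # (alpha^(2^k - 1))^(2^j) * alpha^(2^j - 1) = alpha^(2^(k + j) - 1)
    k += j

# Assumed: the final shift is a plain squaring with no multiplication after it.
exponent = ((1 << k) - 1) << schedule[-1]

# alpha^(2^m - 1) = 1 for nonzero alpha, so exponent = 2^m - 2 gives alpha^(-1).
print(k, exponent == (1 << m) - 2)
```
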

#### 5.1.3 MUX Module

Fig 5.5 presents the multiplexer module. "d_0" and "d_1" are two data inputs. When "ctrl" equals zero, "q" outputs the data of "d_0"; when "ctrl" is one, "q" outputs the data of "d_1".

#### 5.1.4 Digit-level Normal Basis Multiplier Module and Multiplication Algorithm

Fig 5.6 is the block diagram of the digit-level normal basis multiplier module. "a_in" and "b_in" are the two operands of the multiplication, and the "out" port outputs the product of the $GF(2^m)$ multiplication; "rdy" enables or disables the module, and "rst" is a reset signal: when "rst" equals logic one, the module is reset to its initial state. "clk" is the clock signal, and the output signal "clk1" is the drive signal of REG1 and REG2.

This module contains three sub-modules: the Input Reg module, the NB Multiplier module, and the Output Reg module; see Fig 5.7 for details. In each clock cycle, if the "rst" and "rdy" signals are both logic zero, the Input Reg module does one of two jobs. First, if the inner counter equals zero, the module sets "clk1" to logic one, reads the data from "a_in" and "b_in", and stores them in the register after circularly left-shifting both bit strings by 5 bits (since 163 is not divisible by 8, the sixth bit of the least significant digit must be the LSB of a_in×b_in; in that case, after right-shifting the bit string of the product a_in×b_in 21 times, 8 bits each time, we finally get the right answer). Second, in all other cases, the module right-shifts the data by 8 bits, and "clk1" is set to logic zero.

The NB Multiplier module contains only combinational circuits, with no drive signals or control signals. The two inputs are both 163 bits wide and the output is 8 bits wide, so the multiplier needs 21 clock cycles to compute all 163 bits of the product. The Output Reg module stores the result of the NB Multiplier module at each positive clock edge and right shifts by 8 bits.

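
The cycle count and the alignment shift quoted above follow from simple integer arithmetic. The C sketch below is illustrative only; the function names `digit_cycles` and `alignment_shift` are hypothetical and do not appear in the design.

```c
/* Number of clock cycles needed to produce m product bits when the
 * multiplier outputs d bits per cycle, i.e. ceil(m / d). d must be nonzero. */
unsigned digit_cycles(unsigned m, unsigned d)
{
    return (m + d - 1) / d;
}

/* Number of padding bits when m is not a multiple of d. For m = 163 and
 * d = 8 this equals the 5-bit shift applied when the operands are loaded. */
unsigned alignment_shift(unsigned m, unsigned d)
{
    return digit_cycles(m, d) * d - m;
}
```

For $m = 163$ and digit size 8, `digit_cycles` gives 21 and `alignment_shift` gives 5, matching the values used by the Input Reg module.
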
Following the structure of the multiplier module, an algorithm for normal basis multiplication is provided. Since the design is a normal basis multiplier over $GF(2^{163})$, according to [14] there should exist a type 4 ($T = 4$) Gaussian normal basis for $GF(2^{163})$. We first check the existence of this Gaussian normal basis for the given type $T = 4$. The algorithm is given below:

**Input:** an integer $m > 1$ not divisible by 8; a positive integer $T$.

**Output:** if a type $T$ Gaussian normal basis for $GF(2^m)$ exists, the message "True"; otherwise "False".

1. Set $p \leftarrow Tm + 1$.
2. If $p$ is not prime then output "False" and stop.
3. Compute the order $k$ of 2 modulo $p$.
4. Set $h \leftarrow Tm/k$.
5. Compute $d := \text{GCD}(h,m)$.
6. If $d = 1$ then output "True"; else output "False".
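
The six steps above can be sketched in C as follows. This is a minimal illustration, not the code used in the thesis: the function names are hypothetical, primality is tested by trial division (adequate only for the small values of $p$ considered here), and returning false for $m$ divisible by 8 is an assumption that treats the input precondition as a failed test.

```c
#include <stdbool.h>

/* Trial-division primality test; sufficient for small p = T*m + 1. */
bool is_prime(unsigned long n)
{
    unsigned long d;
    if (n < 2)
        return false;
    for (d = 2; d * d <= n; d++)
        if (n % d == 0)
            return false;
    return true;
}

/* Multiplicative order of 2 modulo an odd prime p. */
unsigned long order_of_two(unsigned long p)
{
    unsigned long k = 1, x = 2 % p;
    while (x != 1) {
        x = (2 * x) % p;
        k++;
    }
    return k;
}

/* Greatest common divisor by the Euclidean algorithm. */
unsigned long gcd_ul(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Returns true if a type T Gaussian normal basis exists for GF(2^m). */
bool gnb_exists(unsigned long m, unsigned long T)
{
    unsigned long p, k, h;

    if (m < 2 || m % 8 == 0 || T == 0)
        return false;               /* outside the stated input range */
    p = T * m + 1;                  /* step 1 */
    if (!is_prime(p))
        return false;               /* step 2 */
    k = order_of_two(p);            /* step 3 */
    h = T * m / k;                  /* step 4 */
    return gcd_ul(h, m) == 1;       /* steps 5 and 6 */
}
```

With these definitions, `gnb_exists(163, 4)` evaluates to true, while `gnb_exists(163, 2)` evaluates to false because $2 \cdot 163 + 1 = 327$ is not prime.
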

In this case, \( m = 163 \) and \( T = 4 \), so \( p = Tm + 1 = 653 \), which is a prime number. The order of 2 modulo 653 is 652, thus \( h = 1 \) and \( d = \text{GCD}(1, 163) = 1 \), so a Gaussian normal basis of type 4 does exist for \( GF(2^{163}) \). Therefore, we can generate the first coordinate of the product of two elements of \( GF(2^{163}) \) represented in the type 4 Gaussian normal basis. The algorithm is given below [14]:

**Input:** integers \( m > 1 \) and \( T \) for which there exists a type \( T \) Gaussian normal basis \( G \) for \( GF(2^m) \), \( A, B \in GF(2^m) \), \( A = (a_{m-1}a_{m-2}...a_1a_0) \), \( B = (b_{m-1}b_{m-2}...b_1b_0) \).

**Output:** an explicit formula for the first coordinate of the product of two elements with respect to \( G \).

1. Set \( p \leftarrow Tm + 1 \).
2. Generate an integer \( u \) having order \( T \) modulo \( p \).
3. Compute the sequence \( F(1), F(2), \ldots, F(p - 1) \) as follows:
   3.1. Set \( w \leftarrow 1 \).
   3.2. For \( j \) from 0 to \( T - 1 \) do
      - Set \( n \leftarrow w \)
      - For \( i \) from 0 to \( m - 1 \) do
         - Set \( F(n) \leftarrow i \)
         - Set \( n \leftarrow 2n \mod p \)
      - Set \( w \leftarrow uw \mod p \)
4. If \( T \) is even, then set \( J \leftarrow 0 \), else set \( J \leftarrow \sum_{k=1}^{m/2} (a_{k-1}b_{m/2+k-1} + a_{m/2+k-1}b_{k-1}) \)
5. Output the formula
   \[
   c_0 = J + \sum_{k=1}^{p-2} a_{F(k+1)}b_{F(p-k)}
   \]

For the \( T = 4 \) normal basis of \( GF(2^{163}) \), \( p \) is equal to 653, and \( u = 149 \) has order 4 modulo 653. Since \( T = 4 \) is even, we have \( J = 0 \). We then use a C program to generate all the values of $F(s)$, where $s = 1, 2, \ldots, p - 1$. Furthermore, the C program is used to generate the Verilog HDL code of the first coordinate $c_0$, which can be applied to the FPGA implementation; see Appendix A and Appendix B. The other coordinates of the product are obtained from the formula for $c_0$ by cycling the subscripts modulo $m$.
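
As a software cross-check of the formula for $c_0$, the following C sketch builds the table $F$ and evaluates a product coordinate directly on bit arrays. It is an illustration under stated assumptions rather than the generator of Appendix A: the names `mult_order`, `build_F` and `gnb_coord`, the bound `MAX_P` and the one-bit-per-element operand layout are all hypothetical, and only even $T$ (so $J = 0$) is handled.

```c
#include <stdint.h>

#define MAX_P 1024  /* assumed upper bound on p = T*m + 1 */

/* Multiplicative order of u modulo the prime p (u not divisible by p). */
unsigned mult_order(unsigned u, unsigned p)
{
    unsigned k = 1, x = u % p;
    while (x != 1) {
        x = (x * u) % p;
        k++;
    }
    return k;
}

/* Step 3 of the algorithm: fill F[1..p-1], where p = T*m + 1 and u has
 * order T modulo p. */
void build_F(unsigned m, unsigned T, unsigned u, unsigned F[MAX_P])
{
    unsigned p = T * m + 1, w = 1, n, i, j;

    for (j = 0; j < T; j++) {
        n = w;
        for (i = 0; i < m; i++) {
            F[n] = i;
            n = (2 * n) % p;
        }
        w = (u * w) % p;
    }
}

/* Coordinate c_s of the product for even T (J = 0): the formula for c_0
 * with every subscript advanced by s modulo m. a[] and b[] hold one bit
 * per element. */
unsigned gnb_coord(unsigned m, unsigned T, const unsigned F[MAX_P],
                   const uint8_t *a, const uint8_t *b, unsigned s)
{
    unsigned p = T * m + 1, k, c = 0;

    for (k = 1; k <= p - 2; k++)
        c ^= a[(F[k + 1] + s) % m] & b[(F[p - k] + s) % m];
    return c & 1u;
}
```

One simple sanity check under these assumptions: in a normal basis the field element 1 has all coordinates equal to one, so calling `gnb_coord` with an all-ones `b` should return `a[s]` for every `s`.
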

5.1.5 Top-Level

See Fig 5.2 for reference. The top-level module has three input ports: "clk" is the system clock signal, "rst" is the reset signal, and "alpha" is the normal basis element to be inverted. It has two output ports: "rdy" is an indicator signal showing that the final result is ready, and "inverse_alpha" is the value of the result. The top-level module is also the design entity of the normal basis inverse generator.

5.2 Simulation and Compilation

5.2.1 Simulation Results

Table 5.1 describes the operation in each clock cycle, together with the changes of all the signals and registers involved in the generator.

<table>
<thead>
<tr>
<th>clock cycle#</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Initialization</strong></td>
<td>reg1 and reg2 read the data from port "alpha"; reg1 stores the data into register R0, and register R1 is set to zero; MUX forwards the selected data into the multiplier as operand A; the reg2 module circularly left shifts the input data and forwards it to the multiplier as operand B; all counters (the counter of reg2 and the counter of nb_multiplier) are set to zero. rst &lt;= 1, rdy &lt;= 0, ctrl &lt;= 0, clk1 &lt;= 1; reg2/count &lt;= 0, nb_multiplier/count &lt;= 0; alpha &lt;= 163'h00000000000000000000000000000000000000001</td>
</tr>
<tr>
<td># 1</td>
<td>"rst" signal is set to zero;</td>
</tr>
<tr>
<td></td>
<td>input_reg circularly left shifts the two operands by 5 bits, then forwards them to the nb_multiplier;</td>
</tr>
<tr>
<td></td>
<td>nb_multiplier stops loading data from reg1 and reg2;</td>
</tr>
<tr>
<td></td>
<td>nb_multiplier computes the product and forwards it to output_reg;</td>
</tr>
<tr>
<td></td>
<td>output_reg stores the result in its most significant 8 bits and right shifts by 8 bits;</td>
</tr>
<tr>
<td></td>
<td>counter of nb_multiplier is increased by 1;</td>
</tr>
<tr>
<td></td>
<td>all the other registers remain unchanged.</td>
</tr>
<tr>
<td></td>
<td>rst &lt;= 0, rdy &lt;= 0, ctrl &lt;= 0, clk1 &lt;= 0</td>
</tr>
<tr>
<td></td>
<td>reg2/count &lt;= 0, nb_multiplier/count &lt;= 1</td>
</tr>
<tr>
<td></td>
<td>a_out &lt;= 163'h000000000000000000000000000000000000000020</td>
</tr>
<tr>
<td></td>
<td>b_out &lt;= 163'h000000000000000000000000000000000000000040</td>
</tr>
<tr>
<td></td>
<td>output_reg/reg_in &lt;= 8'h20</td>
</tr>
<tr>
<td></td>
<td>inverse_alpha &lt;= 163'h000000...00000</td>
</tr>
<tr>
<td># 2</td>
<td>input_reg circularly right shifts the two operands by 8 bits, then forwards them to the nb_multiplier;</td>
</tr>
<tr>
<td></td>
<td>output_reg stores the result in its most significant 8 bits and right shifts by 8 bits;</td>
</tr>
<tr>
<td></td>
<td>counter of nb_multiplier is increased by 1;</td>
</tr>
<tr>
<td></td>
<td>all the other registers remain unchanged.</td>
</tr>
<tr>
<td></td>
<td>reg2/count &lt;= 0, nb_multiplier/count &lt;= 2</td>
</tr>
<tr>
<td></td>
<td>a_out &lt;= 163'h10000000000000000000000000000000000000000</td>
</tr>
<tr>
<td></td>
<td>b_out &lt;= 163'h20000000000000000000000000000000000000000</td>
</tr>
<tr>
<td></td>
<td>output_reg/reg_in &lt;= 8'h00</td>
</tr>
<tr>
<td># 3 - # 21</td>
<td>same operation as clock cycle # 2.</td>
</tr>
<tr>
<td># 22</td>
<td>"clk1" signal is set to logic high;</td>
</tr>
<tr>
<td></td>
<td>nb_multiplier counter is set to zero;</td>
</tr>
<tr>
<td></td>
<td>reg2 counter is increased by 1;</td>
</tr>
<tr>
<td></td>
<td>reg1 and reg2 load the data from output_reg;</td>
</tr>
<tr>
<td></td>
<td>in reg1, R0 passes its previous value to R1;</td>
</tr>
<tr>
<td></td>
<td>"ctrl" signal toggles to its opposite value;</td>
</tr>
<tr>
<td></td>
<td>input_reg loads two new operands from MUX and reg2.</td>
</tr>
<tr>
<td></td>
<td>rst &lt;= 0, rdy &lt;= 0, ctrl &lt;= 1, clk1 &lt;= 1</td>
</tr>
<tr>
<td></td>
<td>reg2/count &lt;= 1, nb_multiplier/count &lt;= 0</td>
</tr>
<tr>
<td></td>
<td>reg_out0 &lt;= 163'h00000001000200000000000000000000000002001</td>
</tr>
<tr>
<td></td>
<td>reg_out1 &lt;= 163'h000000000000000000000000000000000000000001</td>
</tr>
<tr>
<td></td>
<td>a_in &lt;= 163'h000000000000000000000000000000000000000001</td>
</tr>
<tr>
<td></td>
<td>b_in &lt;= 163'h000000020004000000000000000000000000000001</td>
</tr>
<tr>
<td></td>
<td>inverse_alpha &lt;= 163'h00…000000</td>
</tr>
<tr>
<td># 23 - # 197</td>
<td>the system repeats the operations of clock cycles # 1 to # 22.</td>
</tr>
<tr>
<td># 198</td>
<td>final cycle of the last multiplication: the reg2 counter reaches 9, so, as described for the REG2 module, "rdy" is set to logic one and "inverse_alpha" outputs the final result.</td>
</tr>
</tbody>
</table>

See Fig 5.8 for the simulation results. The vector signal "alpha" is the 163-bit input data, and "inverse_alpha" is the output data of the generator. In this simulation, we use alpha <= 163'h000000…00000001.

5.2.2 Compilation Results

Fig 5.9 is the RTL view of the design, and Fig 5.10 is the technology map view of the design. Note that we have combined REG2 and the $2^x$-power module into one shift register. From Fig 5.9, we can see that the input normal basis element "alpha" is loaded into both REG1 and REG2. REG1 sends the data to the multiplexer while, at the same moment, REG2 performs a cyclic shift. The NB multiplier then takes its two operands from the multiplexer and REG2 and calculates the product digit by digit.

Table 5.2: Cells usage of compilation

<table>
<thead>
<tr>
<th>Item</th>
<th>Count</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total logic elements</td>
<td>3944</td>
</tr>
<tr>
<td>Total registers</td>
<td>1154</td>
</tr>
<tr>
<td>Total pins</td>
<td>329</td>
</tr>
<tr>
<td>≤2-input logic unit</td>
<td>184</td>
</tr>
<tr>
<td>3-input logic unit</td>
<td>507</td>
</tr>
<tr>
<td>4-input logic unit</td>
<td>2609</td>
</tr>
</tbody>
</table>

Table 5.3: Area cost of each module

<table>
<thead>
<tr>
<th>module</th>
<th>logic combinational functions</th>
<th>register</th>
</tr>
</thead>
<tbody>
<tr>
<td>reg1</td>
<td>164</td>
<td>327</td>
</tr>
<tr>
<td>reg2</td>
<td>663</td>
<td>331</td>
</tr>
<tr>
<td>input_reg</td>
<td>342</td>
<td>333</td>
</tr>
<tr>
<td>multiplier</td>
<td>2131</td>
<td>0</td>
</tr>
<tr>
<td>output_reg</td>
<td>0</td>
<td>163</td>
</tr>
<tr>
<td>top-level</td>
<td>3300</td>
<td>1154</td>
</tr>
</tbody>
</table>

Table 5.4: Operation delay of the design Inverse Generator over $GF(2^{163})$

<table>
<thead>
<tr>
<th>Item</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Clock setup</td>
<td>130.28 MHz (period = 7.676 ns)</td>
</tr>
<tr>
<td>Clock period</td>
<td>7.676 ns</td>
</tr>
<tr>
<td>Number of cycles for one inversion</td>
<td>198</td>
</tr>
<tr>
<td>Total time for one inversion</td>
<td>1519.848 ns</td>
</tr>
</tbody>
</table>

Table 5.2, Table 5.3, and Table 5.4 present the cell usage, the area cost of each module, and the operation delay of the designed normal basis inverse generator, respectively. Note that the Clock setup entry is the maximum clock frequency the system can reach.
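
The figures in Table 5.4 can be related to the cycle-level behaviour of Table 5.1 with a short calculation. The C sketch below is illustrative: the function names are hypothetical, and the count of nine multiplications is an inference from the REG2 counter running from 0 to 8, not a figure stated explicitly in the tables.

```c
/* Clock cycles for one inversion: each multiplication takes `digits`
 * compute cycles plus one load cycle (clock cycle # 22 in Table 5.1). */
unsigned inversion_cycles(unsigned multiplications, unsigned digits)
{
    return multiplications * (digits + 1);
}

/* Total latency in nanoseconds for a given clock period in nanoseconds. */
double inversion_time_ns(unsigned cycles, double period_ns)
{
    return cycles * period_ns;
}
```

With 9 multiplications of 21 digits each, `inversion_cycles` gives 198 cycles, and at a clock period of 7.676 ns this corresponds to the 1519.848 ns listed in Table 5.4.
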
Figure 5.8: Simulation result of the Inverse Generator
Figure 5.9: RTL of the design
Figure 5.10: Technology map viewer of the design
Cipher algorithms, especially those for public key systems, are required to offer short key sizes and fast processing speed together with a high security level, because they are widely applied on small portable electronic devices such as mobile phones, tablets, and embedded systems; the increasing security threat to personal privacy also plays a non-negligible role. For this reason, Elliptic Curve Cryptosystems have been studied extensively, since they appear to be the only suitable public key cryptosystem so far. Studies show that processing speed is one of the bottlenecks in implementing fast ECC encryption/decryption, and field multiplication and field inversion are the two basic operations involved in ECC. Consequently, speeding up finite field computations can efficiently speed up ECC algorithms.

In this thesis, a brief introduction to cryptography is provided in the first chapter. The mathematical background of finite fields, Montgomery multiplication, field inversion, and the concept of elliptic curve encryption/decryption is covered in Chapter 2. Chapter 3 briefly reviews the existing field multipliers, including bit-serial, bit-parallel, digit-level and systolic architectures. In Chapter 4 we report digit-serial MSD-first and LSD-first Montgomery multiplications, together with their architectures and FPGA implementations. In Chapter 5, we report an FPGA implementation of a finite field inverse generator using a normal basis.

For the proposed Montgomery multiplication, we have provided architectures for different values of the Montgomery factor: \( R(x) = x^u \) and \( R(x) = x^{sl} \). The main contribution to Montgomery multiplication is that we proposed two classes of finite fields \( GF(2^m) \) for multipliers with much reduced critical path delay. By applying these special fields, the time delay of the reduction operation can be reduced to one $T_X$. FPGA implementations of the proposed architectures are presented for the field $GF(2^{233})$ with digit size $d = 8$ to further verify their correctness.

In Chapter 5 of this thesis, we provide an FPGA implementation of a novel finite field $GF(2^{163})$ inversion algorithm using a normal basis. The architecture involves two registers, one multiplexer, and one normal basis multiplier core, and we used a digit-serial architecture to implement the multiplier core.

In future work, how to apply these fast finite field operation architectures to the higher-level computation of ECC, namely point scalar multiplication, remains a critical problem for fast ECC processing. How to take advantage of Montgomery reduction or Montgomery multiplication for the efficient implementation of point addition and point doubling operations will be the next goal of our work.
Appendix A

C-code of $F(s)$ and the First Coordinate $c_0$ Generation

```c
#include <stdio.h>

#define N 1000  /* upper bound on p = T*m + 1 */
#define M 300   /* upper bound on the field size m */

FILE *fp;     /* F(n) listed in increasing order of n */
FILE *fpt;    /* F(n) listed in the order it is generated */
FILE *fptfp;  /* generated Verilog HDL assignments */
int count = 0;

void write_result(int a, int b);
void write_result1(int array_FN[N], int p_1);
int finite_field_exp(int p_0, int Type);

int main(void)
{
    static int array_Fn[N];
    static int c_0[M][M];  /* c_0[i][j] = 1 if a[i] & b[j] is a term of c_0 */
    int w = 1;
    int i, j, o, n, Fn, T, p, m, u, k, rows;

    printf("input the field size m=");
    if (scanf("%d", &m) != 1)
        return 1;
    printf("input the GNB type T=");
    if (scanf("%d", &T) != 1)
        return 1;
    if (m < 2 || m > M || T < 1 || T * m + 1 >= N) {
        fprintf(stderr, "m or T out of range\n");
        return 1;
    }

    fptfp = fopen("c_0.txt", "w");
    fpt = fopen("F(n) value.txt", "w");
    fp = fopen("F(n).txt", "w");
    if (fptfp == NULL || fpt == NULL || fp == NULL) {
        fprintf(stderr, "cannot open output files\n");
        return 1;
    }

    p = T * m + 1;
    u = finite_field_exp(p, T);  /* element of order T modulo p */
    printf("%d\n", u);

    /* compute F(1), ..., F(p-1) */
    for (j = 0; j < T; j++) {
        n = w;
        for (i = 0; i < m; i++) {
            Fn = i;
            write_result(n, Fn);
            array_Fn[n] = Fn;
            n = 2 * n % p;
        }
        w = u * w % p;
    }
    write_result1(array_Fn, p);

    /* J generator: only needed when T is odd */
    if (T % 2 != 0) {
        for (k = 1; k <= m / 2; k++) {
            c_0[k - 1][m / 2 + k - 1] ^= 1;
            c_0[m / 2 + k - 1][k - 1] ^= 1;
        }
    }

    /* generate c_0: repeated terms cancel modulo 2 */
    for (k = 1; k <= p - 2; k++)
        c_0[array_Fn[k + 1]][array_Fn[p - k]] ^= 1;

    /* one assign statement per coordinate, cycling the subscripts modulo m */
    for (o = 0; o < m; o++) {
        fprintf(fptfp, "assign c[%d] = ", o);
        rows = 0;
        for (i = 0; i < m; i++) {
            k = 0;
            for (j = 0; j < m; j++) {
                if (c_0[i][j] != 0) {
                    if (k == 0)  /* open the group; skip rows without terms */
                        fprintf(fptfp, "%s(a[%d] & (b[%d]",
                                rows ? " ^ " : "", (i + o) % m, (j + o) % m);
                    else
                        fprintf(fptfp, " ^ b[%d]", (j + o) % m);
                    k++;
                }
            }
            if (k != 0) {
                fprintf(fptfp, "))");
                rows++;
            }
        }
        fprintf(fptfp, ";\n");
    }

    fclose(fp);
    fclose(fpt);
    fclose(fptfp);
    return 0;
}

/* Append one F(a) = b entry to the generation-order file, ten per line. */
void write_result(int a, int b)
{
    fprintf(fpt, "F(%3d)=%3d\t", a, b);
    if (count % 10 == 9)
        fprintf(fpt, "\n");
    count++;
}

/* Write F(1), ..., F(p_1 - 1) in increasing order, ten per line. */
void write_result1(int array_FN[N], int p_1)
{
    int i;
    int temp = 0;

    for (i = 1; i < p_1; i++) {
        fprintf(fp, "F(%3d)=%3d\t", i, array_FN[i]);
        if (temp % 10 == 9)
            fprintf(fp, "\n");
        temp++;
    }
}

/* Return an element of order Type modulo the prime p_0: find the smallest
 * g >= 2 whose order k is a multiple of Type, then return g^(k/Type). */
int finite_field_exp(int p_0, int Type)
{
    int i, n;
    int g = 2;
    int k = 1;

    while (k % Type != 0 && g < p_0) {
        k = 1;
        n = g;
        while (n > 1) {  /* k becomes the order of g modulo p_0 */
            n = n * g % p_0;
            k++;
        }
        g++;
    }
    g--;
    n = g;
    for (i = 1; i < k / Type; i++)
        n = n * g % p_0;
    return n;
}
```
Appendix B

Generated VerilogHDL-code of the First Coordinate $c_0$

```verilog
generate
  assign c_0 = (a[0] & (b[1]))
    ^ (a[4] & (b[40] ^ b[87] ^ b[99] ^ b[137]))
    // ...
    ^ (a[52] & (b[54] ^ b[66] ^ b[103] ^ b[130]))
    ^ (a[53] & (b[65] ^ b[144] ^ b[153] ^ b[155]))
    ^ (a[55] & (b[17] ^ b[121] ^ b[144] ^ b[158]))
    ^ (a[56] & (b[30] ^ b[75] ^ b[96] ^ b[133]))
    ^ (a[57] & (b[28] ^ b[42] ^ b[106] ^ b[114]))
    ^ (a[58] & (b[7] ^ b[63] ^ b[99] ^ b[137]))
    ^ (a[61] & (b[8] ^ b[22] ^ b[27] ^ b[37]))
    ^ (a[62] & (b[73] ^ b[76] ^ b[98] ^ b[146]))
    ^ (a[64] & (b[16] ^ b[68] ^ b[122] ^ b[154]))
    ^ (a[65] & (b[53] ^ b[88] ^ b[104] ^ b[127]))
    ^ (a[67] & (b[44] ^ b[85] ^ b[123] ^ b[153]))
    ^ (a[68] & (b[22] ^ b[37] ^ b[64] ^ b[160]))
    ^ (a[69] & (b[104] ^ b[119] ^ b[143] ^ b[150]))
    ^ (a[70] & (b[79] ^ b[88] ^ b[104] ^ b[150]))
    ^ (a[71] & (b[3] ^ b[41] ^ b[73] ^ b[146]))
    ^ (a[72] & (b[19] ^ b[40] ^ b[48] ^ b[97]))
    ^ (a[74] & (b[19] ^ b[29] ^ b[77] ^ b[94]))
    ^ (a[75] & (b[56] ^ b[92] ^ b[140] ^ b[145]))
    ^ (a[77] & (b[10] ^ b[50] ^ b[74] ^ b[107]))
    ^ (a[78] & (b[54] ^ b[103] ^ b[111] ^ b[145]))
    ^ (a[79] & (b[17] ^ b[38] ^ b[70] ^ b[105]))
    ^ (a[80] & (b[49] ^ b[73] ^ b[76] ^ b[93]))
    ^ (a[81] & (b[50] ^ b[94]))
    ^ (a[82] & (b[13] ^ b[132]))
    ^ (a[84] & (b[26] ^ b[101] ^ b[122] ^ b[154]))
    ^ (a[85] & (b[25] ^ b[33] ^ b[67] ^ b[139]))
    ^ (a[88] & (b[17] ^ b[65] ^ b[70] ^ b[144]))
    ^ (a[90] & (b[7] ^ b[99] ^ b[152] ^ b[161]))
    ^ (a[92] & (b[2] ^ b[75] ^ b[95] ^ b[133]))
    ^ (a[93] & (b[9] ^ b[18] ^ b[34] ^ b[80]))
    ^ (a[94] & (b[35] ^ b[50] ^ b[74] ^ b[81]))
    ^ (a[95] & (b[92] ^ b[117] ^ b[132] ^ b[159]))
    ^ (a[96] & (b[18] ^ b[56] ^ b[86] ^ b[140]))
    ^ (a[99] & (b[4] ^ b[58] ^ b[90] ^ b[115]))
    ^ (a[100] & (b[110] ^ b[121] ^ b[129] ^ b[158]))
    ^ (a[102] & (b[110] ^ b[124] ^ b[129] ^ b[139]))
    ^ (a[103] & (b[52] ^ b[78] ^ b[108] ^ b[118]))
    ^ (a[104] & (b[65] ^ b[69] ^ b[70] ^ b[120]))
    ^ (a[107] & (b[19] ^ b[40] ^ b[77] ^ b[137]))
    ^ (a[108] & (b[66] ^ b[89] ^ b[103] ^ b[125]))
    ^ (a[110] & (b[12] ^ b[91] ^ b[100] ^ b[102]))
    ^ (a[112] & (b[60] ^ b[66] ^ b[105] ^ b[125]))
    ^ (a[113] & (b[27] ^ b[31] ^ b[44] ^ b[148]))
    ^ (a[114] & (b[31] ^ b[57] ^ b[142] ^ b[148]))
    ^ (a[115] & (b[24] ^ b[87] ^ b[99] ^ b[161]))
    ^ (a[116] & (b[144] ^ b[155] ^ b[158] ^ b[162]))
    ^ (a[117] & (b[1] ^ b[2] ^ b[51] ^ b[95]))
    ^ (a[118] & (b[89] ^ b[103] ^ b[140] ^ b[145]))
    ^ (a[119] & (b[23] ^ b[69] ^ b[151] ^ b[152]))
    ^ (a[120] & (b[104] ^ b[109] ^ b[127] ^ b[143]))
    ^ (a[121] & (b[5] ^ b[15] ^ b[55] ^ b[100]))
    ^ (a[122] & (b[30] ^ b[64] ^ b[84] ^ b[160]))
    ^ (a[123] & (b[32] ^ b[67] ^ b[109] ^ b[127]))
    ^ (a[124] & (b[8] ^ b[37] ^ b[59] ^ b[102]))
    ^ (a[125] & (b[3] ^ b[41] ^ b[108] ^ b[112]))
    ^ (a[126] & (b[24] ^ b[31] ^ b[87] ^ b[142]))
    ^ (a[127] & (b[65] ^ b[120] ^ b[123] ^ b[153]))
    ^ (a[128] & (b[15] ^ b[59] ^ b[138] ^ b[157]))
    ^ (a[129] & (b[15] ^ b[59] ^ b[100] ^ b[102]))
    ^ (a[130] & (b[11] ^ b[52] ^ b[138] ^ b[157]))
    ^ (a[131] & (b[12] ^ b[91] ^ b[155] ^ b[162]))
    ^ (a[132] & (b[1] ^ b[82] ^ b[83] ^ b[95]))
    ^ (a[133] & (b[26] ^ b[56] ^ b[92] ^ b[159]))
    ^ (a[134] & (b[6] ^ b[34] ^ b[45] ^ b[106]))
    ^ (a[135] & (b[19] ^ b[21] ^ b[29] ^ b[48]))
    ^ (a[136] & (b[6] ^ b[18] ^ b[34] ^ b[86]))
    ^ (a[137] & (b[4] ^ b[10] ^ b[58] ^ b[107]))
    ^ (a[138] & (b[60] ^ b[66] ^ b[128] ^ b[130]))
    ^ (a[139] & (b[8] ^ b[85] ^ b[91] ^ b[102]))
    ^ (a[140] & (b[20] ^ b[75] ^ b[96] ^ b[118]))
    ^ (a[141] & (b[23] ^ b[39] ^ b[46] ^ b[147]))
    ^ (a[142] & (b[42] ^ b[114] ^ b[126] ^ b[149]))
    ^ (a[143] & (b[69] ^ b[120] ^ b[152] ^ b[161]))
    ^ (a[144] & (b[53] ^ b[55] ^ b[88] ^ b[116]))
    ^ (a[145] & (b[2] ^ b[75] ^ b[78] ^ b[118]))
    ^ (a[146] & (b[38] ^ b[62] ^ b[71] ^ b[151]))
    ^ (a[147] & (b[21] ^ b[43] ^ b[48] ^ b[141]))
    ^ (a[148] & (b[45] ^ b[106] ^ b[113] ^ b[114]))
    ^ (a[149] & (b[40] ^ b[87] ^ b[97] ^ b[142]))
    ^ (a[150] & (b[38] ^ b[69] ^ b[70] ^ b[151]))
    ^ (a[151] & (b[98] ^ b[119] ^ b[146] ^ b[150]))
    ^ (a[152] & (b[43] ^ b[90] ^ b[119] ^ b[143]))
    ^ (a[153] & (b[25] ^ b[53] ^ b[67] ^ b[127]))
    ^ (a[154] & (b[11] ^ b[64] ^ b[84] ^ b[157]))
    ^ (a[155] & (b[25] ^ b[53] ^ b[116] ^ b[131]))
    ^ (a[156] & (b[14] ^ b[36] ^ b[51] ^ b[83]))
    ^ (a[157] & (b[16] ^ b[128] ^ b[130] ^ b[154]))
    ^ (a[158] & (b[12] ^ b[55] ^ b[100] ^ b[116]))
    ^ (a[159] & (b[36] ^ b[83] ^ b[95] ^ b[133]))
    ^ (a[160] & (b[6] ^ b[68] ^ b[86] ^ b[122]))
    ^ (a[161] & (b[90] ^ b[109] ^ b[115] ^ b[143]))
    ^ (a[162] & (b[12] ^ b[116] ^ b[131] ^ b[162]));
endgenerate
```
Bibliography


[29] 0.18µm TSMC CMOS Technology, Standard Cell Library, September 1999, available through Canadian Microelectronics Corporation.

Vita Auctoris

NAME: Wangchen DAI
PLACE OF BIRTH: Handan, Hebei, P.R.China
YEAR OF BIRTH: 1988
EDUCATION: Beijing Institute of Technology, B.Sc., Beijing, P.R.China, 2010
University of Windsor, M.A.Sc., Windsor, ON, CANADA, 2013