A reconfigurable digital multiplier architecture.

Pedram Mokrian

University of Windsor

Recommended Citation
https://scholar.uwindsor.ca/etd/728

This online database contains the full-text of PhD dissertations and Masters’ theses of University of Windsor students from 1954 forward. These documents are made available for personal study and research purposes only, in accordance with the Canadian Copyright Act and the Creative Commons license CC BY-NC-ND (Attribution, Non-Commercial, No Derivative Works). Under this license, works must always be attributed to the copyright holder (original author), cannot be used for any commercial purposes, and may not be altered. Any other use would require the permission of the copyright holder. Students may inquire about withdrawing their dissertation and/or thesis from this database. For additional inquiries, please contact the repository administrator via email (scholarship@uwindsor.ca) or by telephone at 519-253-3000 ext. 3208.
A Reconfigurable Digital Multiplier Architecture

by

Pedram Mokrian

A Thesis
Submitted to the Faculty of Graduate Studies and Research through the
Department of Electrical and Computer Engineering in Partial Fulfillment
of the Requirements for the Degree of Master of Applied Science at
the University of Windsor

Windsor, Ontario, Canada
April, 2003
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author’s permission.

© 2003 Pedram Mokrian

All Rights Reserved. No part of this document may be reproduced, stored or otherwise retained in a retrieval system or transmitted in any form, on any medium or by any means without the prior written permission of the author.
Abstract

The recent growth in microprocessor performance has been a direct result of designers exploiting decreasing device feature sizes while, at the same time, increasing pipeline depth. As transistor sizes continue to shrink, the traditional gains associated with smaller feature sizes will be degraded by the adverse effects of wire scaling. The consequences of technology scaling on circuit performance have recently become a topic of significant importance, especially in arithmetic circuitry such as digital multipliers, which exhibit highly irregular interconnections.

A digital multiplier architecture will be introduced that alleviates some of the problems associated with interconnect scaling, in addition to allowing for simple variable precision reconfiguration. Regulated by a 2-bit control signal, the multiplier is capable of true double and single precision multiplication, as well as fault tolerant and dual throughput single precision execution. The architecture proposed in this thesis is centred on a recursive multiplication algorithm by Danysz and Swartzlander, in which a large multiplication is carried out using recursions of simpler base multiplier modules. This multiplication algorithm presents greater regularity in design than standard column compression multipliers, while avoiding the linear latency of array multipliers.

A separate investigation of the recursive multiplication scheme has led to favourable results for a design methodology recommended for future arithmetic architectures, which makes use of the proposed “locally optimized array” paradigm. Furthermore, a study of column compression techniques will be presented; this includes a novel suggestion for an optimized 4:2 compressor distribution in partial product reduction trees, and an overview of the transistor level configuration of arithmetic cells.
Acknowledgements

I would like to extend my sincere gratitude and appreciation to a number of people who have contributed to the completion of this thesis, including my colleagues, mentors and family.

First and foremost, I would like to thank my supervisor, Dr. Majid Ahmadi, for whom I have the utmost respect and admiration. His confidence in my capabilities, his enthusiasm towards my research and his guidance and assistance in all of my endeavours, have made a tremendous impact on me academically, socially and personally, and for this I am forever in his debt.

I am also grateful to Dr. Graham Jullien for his knowledgeable suggestions and insight on the field of computer arithmetic, and for granting me the opportunity to carry out research, and share my findings with students at the University of Calgary. I would also like to thank the faculty at the University of Windsor, including Dr. W.C. Miller for sharing his wisdom and his words of inspiration for a career in research, and my committee members, Dr. Arunita Jaekel and Dr. Xiang Chen for their patience and support.

Additionally, I would like to credit Mr. Till Kuendiger and Mr. Roberto Muscedere, two of the brightest people I have had the pleasure of working with, for their endless support in the lab environment, and for being instrumental in the completion of this thesis. Finally, I would like to acknowledge my colleagues at the University of Windsor, Messrs. Perta, Howard and Soltis, and my parents for their assistance, encouragement and motivation.
# Table of Contents

**ABSTRACT**  
**ACKNOWLEDGEMENTS**  
**LIST OF FIGURES**  
**LIST OF ABBREVIATIONS**

## Chapter 1 Introduction to Computer Arithmetic

1.1 History of Computer Arithmetic  
1.2 Current Trends in Computer Arithmetic  
   Low Power Design  
   High Throughput Systems And Pipelining  
   Technology Scaling Effects  
1.3 Thesis Overview  
   Thesis Highlights  
   Thesis Organization

## Chapter 2 Digital Multiplication Overview

2.1 Basics of Digital Multiplication  
2.2 Serial Multiplication Schemes  
   Shift-Add Multiplication  
   High Radix Multipliers  
2.3 Parallel Multipliers  
   Column Compression Multipliers  
   Array Multipliers  
   Latency Approximations  
2.4 Number Systems

## Chapter 3 Partial Product Reduction Techniques

3.1 Booth Recoding  
3.2 CSA Reduction Schemes  
   Wallace and Dadda Trees  
   Minimum Reduction Stage Requirement  
   Minimum Full Adder Requirement  
   Variations of CSA Trees  
3.3 High Order Counters And Compressors  
   Counters  
   Compressors  
3.4 4:2 Compressors  
   Structure Of 4:2 Compressors  
   Proposed Optimized Compressor Layout  
3.5 Low Power  
   Leakage And Short Circuit Power Dissipation  
   Dynamic Power Dissipation  
   Dynamic Power Management  
3.6 Threshold Logic

## Chapter 4 Arithmetic Circuitry

4.1 Logic Styles  
   Static CMOS  
   Transmission Gate Logic  
   Dynamic Logic Families  
   Differential And Dual Rail Logic Families  
4.2 Pass Transistor Logic  
4.3 Full Adder Circuits  
4.4 4:2 Compressor Circuits  
4.5 Overview of Arithmetic Circuitry  
   Logic Style Selection  
   Simulation Setup and Environment

## Chapter 5 Interconnect Effects

5.1 Projected Issues With Technology Scaling  
5.2 Physics of Wire Interconnects  
5.3 Interconnect Effects On Arithmetic Circuitry  
5.4 Locally Optimized Arrays

## Chapter 6 Recursive Multiplication

6.1 Overview of the Recursive Multiplication Algorithm  
   Background Information  
6.2 6:2 Reduction Circuitry  
6.3 Analysis Of The Base Multiplier

## Chapter 7 Reconfigurable Multiplier Architecture

7.1 Introduction to DSP Multiplication  
   Computation Parallelism  
   Variable Data Width  
   Fault Tolerance  
7.2 Reconfigurable Architectures  
7.3 Proposed Multiplier  
   Double Precision Mode  
   Single Precision Mode  
   Dual Single Precision Mode  
   Single Precision Fault-tolerant Mode

## Chapter 8 Modeling and Simulation

8.1 HDL Model  
8.2 Implementation and Layout  
8.3 Simulation Results  
8.4 Design Highlights

## Chapter 9 Conclusions

9.1 Summary of Contributions  
   Algorithmic Contributions  
   Architectural Contributions  
   Transistor Level Contributions  
9.2 Conclusions

REFERENCES  
Appendix A Complete 4:2 Compressor Analysis  
Appendix B Interconnect Analysis of Various Multiplier Sizes  
Appendix C Verilog HDL Code for the Reconfigurable Multiplier  
Appendix D Recursive Multiplier Base-Multiplier Analysis  
Appendix E Component Breakdown of the Reconfigurable Multiplier  
Appendix F Simulation Reports and Logs  
Vita Autoris

## List of Figures

| Figure | Description |
| --- | --- |
| Figure 1.1 | (a) Original system taking t time to complete (b) System divided into 5 time equivalent sections (c) Latched pipeline arrangement of subdivided system |
| Figure 1.2 | Relative frequency of Intel devices if fabricated on the same process |
| Figure 2.1 | Interrelation of arithmetic operations in modern computing devices |
| Figure 2.2 | A 4x4-bit multiplication leading to an 8-bit product |
| Figure 2.3 | A simple dot diagram of 16-bit partial product array |
| Figure 2.4 | Shift-add multiplier implementation |
| Figure 2.5 | Shift-add implementation of a basic radix-4 multiplier |
| Figure 2.6 | Schematic representation of the tree multiplier process [20] |
| Figure 2.7 | Carry Save Adder (CSA) Array |
| Figure 2.8 | Carry Save Adder (CSA) used for 3:2 compression |
| Figure 2.9 | 3:2 Compression scheme carried using a CSA array [8] |
| Figure 2.10 | One sided CSA tree forming an Array Multiplier |
| Figure 2.11 | A 5-bit array multiplier layout (the boxes represent full-adders) [8] |
| Figure 2.12 | IEEE floating point standard word widths for (a) single precision (b) double precision |
| Figure 2.13 | A floating-point Adder / Subtractor scheme [8] |
| Figure 2.14 | A floating-point multiplication scheme [8] |
| Figure 3.1 | Dot diagram for a Booth-2 16-bit Multiplication [20] |
| Figure 3.2 | The Carry Save Adder - (a) Dot diagram of individual full-adder cell (b) Several full-adders used to form one level of a CSA |
| Figure 3.3 | Variations of multiplier partial product reduction trees using CSAs [21] (a) Dadda implementation (b) Wallace implementation |
| Figure 3.4 | Actual expansion values and theoretical bounds of minimum full adder requirements in multi-operand addition |
| Figure 3.5 | Algebraic representation of minimum full adder requirements |
| Figure 3.6 | An 8x8 multiplication partial product matrix |
| Figure 3.7 | Array of arrays layout as outlined in [10] |
| Figure 3.8 | Logic diagram of a Full-Adder |
| Figure 3.9 | Examples of Dot Representations of Parallel Counters |
| Figure 3.10 | Examples of Dot Representations of Multi-column Counters |
| Figure 3.11 | Cascaded full adders composing a basic 4:2 compressor |
| Figure 3.12 | 4:2 compressor layout making use of fast input/output paths |
| Figure 3.13 | A minimized gate level representation of a 4:2 compressor |
| Figure 3.14 | Definition of a 4:2 compressor row |
| Figure 3.15 | Compressor layout for a 16x16 multiplication |
| Figure 3.16 | Compressor layout for a 24x24 multiplication |
| Figure 3.17 | Total Number of cells in a partial product reduction tree (NFA = Number of (3,2); NCT = Number of total cells for 4:2) |
| Figure 3.18 | Total Number of interconnects in a partial product reduction tree (IFA = Interconnects for (3,2); IC = Interconnects for 4:2) |
| Figure 3.19 | Total Delay (DFA = (3,2) scheme; DC = [4:2] scheme) |
| Figure 3.20 | Interchanging rows within an Array Multiplier structure |
| Figure 3.21 | Hybrid MSB first Array Multiplier structure |
| Figure 3.22 | Threshold Gate implementation of a 4-input AND gate [70] |
| Figure 4.1 | Static CMOS logic cell depicting the NFET and PFET networks |
| Figure 4.2 | (a) NFET threshold voltage loss (b) PFET threshold voltage gain |
| Figure 4.3 | Transmission Gates (a) Transistor level (b) Logic symbol |
| Figure 4.4 | Capacitive Nodes (a) Basic circuit (b) Storage Capacitor model |
| Figure 4.5 | Voltage drop due to leakage effects in an NFET |
| Figure 4.6 | Voltage rise due to leakage effects in a PFET |
| Figure 4.7 | Primitive representation of a Precharge-Evaluate Logic block [78] |
| Figure 4.8 | A DOMINO logic block [78] |
| Figure 4.9 | Single-Phase Logic Circuit Types [78] (a) Single Phase Network setup using latches (b) Single Phase Logic gates cascaded together |
| Figure 4.10 | Cascode Voltage Switch Logic (CVSL) (a) Static CVSL (b) Dynamic CVSL |
| Figure 4.11 | Basic Pass Transistor Logic Configurations |
| Figure 4.12 | AND gate implementations using pass logic |
| Figure 4.13 | CPL Implementation of an AND gate |
| Figure 4.14 | Gate level (a) and transistor level (b) implementations of 12 transistor full-adder cell proposed in [83] |
| Figure 4.15 | A DOMINO logic full adder cell [84] |
| Figure 4.16 | Various XOR/XNOR configurations (a) Transmission Gate (b) Transmission Gate with driving outputs (c) Inverter Based (d) Proposed XOR/XNOR Configuration [85] |
| Figure 4.17 | Transmission Gate based Full-Adder circuit |
| Figure 4.18 | Pass-transmission implementation of a Full Adder |
| Figure 4.19 | 10 Transistor Full Adders outlined in [89] |
| Figure 4.20 | Conventional 28 transistor CMOS full adder implementations (a) standard configuration (b) Mirror cell configuration [84] |
| Figure 4.21 | Low power 16 transistor full adder cell |
| Figure 4.22 | Low power 10 transistor full adder cell [91] |
| Figure 4.23 | LEAP full adder cell |
| Figure 4.24 | Complementary Pass-Transistor Logic (CPL) full adder cell |
| Figure 4.25 | 1.2 ns 4:2 compressor proposed in [100] |
| Figure 4.26 | Precharged pass logic compressor [56] |
| Figure 4.27 | Multiplexer cell based pass-logic 4:2 compressor [57] |
| Figure 4.28 | Gate level framework for DPL based compressor cell in [102] |
| Figure 4.29 | DPL compressor using Full swing & Non-Full Swing MUX cells [103] |
| Figure 4.30 | Pass-transmission implementation of the 4:2 compressor [98] |
| Figure 4.31 | Threshold voltage floor with decreasing supply voltage [104] |
| Figure 4.32 | Threshold voltage tolerance dependence on channel length [104] |
| Figure 4.33 | Typical test-bench used for Full-Adder simulations [91] |
| Figure 4.34 | 4:2 Compressor distribution in an 8x8 reduction matrix |
| Figure 4.35 | Full-adder distribution in an 8x8 reduction matrix |
| Figure 4.36 | Cadence testbench for 8x8 multipliers |
| Figure 4.37 | Power measurement circuits proposed in [109] having a current source and parallel RC circuit (a) current controlled source (b) voltage controlled source |
| Figure 5.1 | Contribution of interconnect effects to overall delay |
| Figure 5.2 | Coupling capacitances associated with metal interconnects |
| Figure 5.3 | Interconnect RC delay per unit length (ps / mm) |
| Figure 5.4 | Relative delay; binary tree to its no wire implementation for Booth 2, double precision multiplication [12] |
| Figure 5.5 | Interconnect effects with respect to multiplier width (a) Total chip interconnect length (b) Average chip interconnect length |
| Figure 6.1 | Dot diagram of a single level recursive n-bit multiplication |
| Figure 6.2 | A schematic of a single level recursive multiplier |
| Figure 6.3 | Standard 6:2 reduction macrocell composed of 3 stages of full adders |
| Figure 6.4 | 6:2 Macrocells capable of receiving a variety of input bits (a) 5 input (b) 4 input (c) 3 input (d) 2 input |
| Figure 6.5 | Novel 6:2 Reduction Block |
| Figure 6.6 | Delay associated with various base-multiplier sizes (a) Array base multipliers (b) Dadda base multipliers |
| Figure 6.7 | Relationship between recursive multiplier size and base-multiplier size |
| Figure 6.8 | Base multiplier comparisons for a 64-bit recursive multiplier (a) total cell count (b) total interconnect segment count |
| Figure 6.9 | Percentage increase in cell count and interconnect count for various sizes of Array and Dadda base multipliers |
| Figure 7.1 | Time redundant RETWV error correcting multiplier [144] |
| Figure 7.2 | Outline of the reconfigurable multiplier |
| Figure 7.3 | Default double precision mode |
| Figure 7.4 | Single precision mode |
| Figure 7.5 | Dual single precision mode |
| Figure 7.6 | Single precision - fault tolerant mode |
| Figure 8.1 | (a) Top level module of the HDL model (b) Top level module outlining the input output clocked-latching circuitry |
| Figure 8.2 | An illustrated representation of the HDL model of the multiplier |
| Figure 8.3 | Layout view of the Reconfigurable Multiplier |
| Figure 8.4 | Breakdown of the Reconfigurable Multiplier (a) Area (b) Power (c) Delay |

## List of Abbreviations

<table>
<thead>
<tr>
<th>Abbreviation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>ALU</td>
<td>Arithmetic Logic Unit</td>
</tr>
<tr>
<td>ASIC</td>
<td>Application Specific Integrated Circuit</td>
</tr>
<tr>
<td>BiCMOS</td>
<td>Bipolar CMOS</td>
</tr>
<tr>
<td>BSDC</td>
<td>Binary Stored Double Carry</td>
</tr>
<tr>
<td>CLA</td>
<td>Carry Look-Ahead Adder</td>
</tr>
<tr>
<td>CMF</td>
<td>Common Mode Failure</td>
</tr>
<tr>
<td>CMOS</td>
<td>Complementary Metal-Oxide Semiconductor</td>
</tr>
<tr>
<td>CORDIC</td>
<td>COordinate Rotation DIgital Computer</td>
</tr>
<tr>
<td>CPL</td>
<td>Complementary Pass-Transistor Logic</td>
</tr>
<tr>
<td>CPU</td>
<td>Central Processing Unit</td>
</tr>
<tr>
<td>CSA</td>
<td>Carry Save Adder</td>
</tr>
<tr>
<td>CTL</td>
<td>Capacitive Threshold Logic</td>
</tr>
<tr>
<td>CVSL</td>
<td>Cascode Voltage Switch Logic</td>
</tr>
<tr>
<td>DA</td>
<td>Distributed Arithmetic</td>
</tr>
<tr>
<td>DBNS</td>
<td>Double Base Number System</td>
</tr>
<tr>
<td>DC</td>
<td>Direct Current</td>
</tr>
<tr>
<td>DCSL</td>
<td>Differential Current Switch Logic</td>
</tr>
<tr>
<td>DCT</td>
<td>Discrete Cosine Transform</td>
</tr>
<tr>
<td>DIBL</td>
<td>Drain-Induced Barrier Lowering</td>
</tr>
<tr>
<td>BIDO</td>
<td>Bi-Directional Operation</td>
</tr>
<tr>
<td>DPL</td>
<td>Dual Pass-Transistor Logic</td>
</tr>
<tr>
<td>DSP</td>
<td>Digital Signal Processor</td>
</tr>
<tr>
<td>DVL</td>
<td>Dual Value Logic</td>
</tr>
<tr>
<td>ECDL</td>
<td>Enable-disable Cmos Differential Logic</td>
</tr>
<tr>
<td>EDVAC</td>
<td>Electronic Discrete Variable Automatic Computer</td>
</tr>
<tr>
<td>ENIAC</td>
<td>Electronic Numerical Integrator and Computer</td>
</tr>
<tr>
<td>FOX</td>
<td>Field Oxide</td>
</tr>
<tr>
<td>FPGA</td>
<td>Field Programmable Gate Array</td>
</tr>
<tr>
<td>GND</td>
<td>Ground Terminal or Voltage</td>
</tr>
<tr>
<td>HDL</td>
<td>Hardware Description Language</td>
</tr>
<tr>
<td>IC</td>
<td>Integrated Circuit</td>
</tr>
<tr>
<td>IEEE</td>
<td>Institute of Electrical and Electronics Engineers</td>
</tr>
<tr>
<td>KOA</td>
<td>Karatsuba Ofman Algorithm</td>
</tr>
<tr>
<td>LNS</td>
<td>Logarithmic Number System</td>
</tr>
<tr>
<td>LSB</td>
<td>Least Significant Bit</td>
</tr>
<tr>
<td>MBA</td>
<td>Modified Booth Algorithm</td>
</tr>
<tr>
<td>MIPS</td>
<td>Millions of Instructions Per Second</td>
</tr>
<tr>
<td>MOS</td>
<td>Metal-Oxide Semiconductor</td>
</tr>
<tr>
<td>MOSFET</td>
<td>Metal-Oxide Semiconductor Field Effect Transistor</td>
</tr>
<tr>
<td>MSB</td>
<td>Most Significant Bit</td>
</tr>
<tr>
<td>MSI</td>
<td>Medium Scale Integration</td>
</tr>
<tr>
<td>MUX</td>
<td>Multiplexer Cell (usually assumed as a 2:1 multiplexer)</td>
</tr>
<tr>
<td>NFET</td>
<td>N-type Field Effect Transistor</td>
</tr>
<tr>
<td>PDP</td>
<td>Power Delay Product</td>
</tr>
<tr>
<td>PFET</td>
<td>P-type Field Effect Transistor</td>
</tr>
<tr>
<td>PT</td>
<td>Pass-Transistor</td>
</tr>
<tr>
<td>RAP</td>
<td>Reconfigurable Arithmetic Processor</td>
</tr>
<tr>
<td>REDWC</td>
<td>REcomputing with Duplication With Comparison</td>
</tr>
<tr>
<td>REMOD</td>
<td>Reprocessing with MicrO Delays</td>
</tr>
<tr>
<td>RETWV</td>
<td>REcomputing with Triplication With Voting</td>
</tr>
<tr>
<td>RISC</td>
<td>Reduced Instruction Set Computer (Computing)</td>
</tr>
<tr>
<td>RMS</td>
<td>Root Mean Square</td>
</tr>
<tr>
<td>RNS</td>
<td>Residue Number System</td>
</tr>
<tr>
<td>SIA</td>
<td>Semiconductor Industry Association</td>
</tr>
<tr>
<td>SIMD</td>
<td>Single Instruction Multiple Data</td>
</tr>
<tr>
<td>SSDL</td>
<td>Sample-Set Differential Logic</td>
</tr>
<tr>
<td>SSI</td>
<td>Small Scale Integration</td>
</tr>
<tr>
<td>TDM</td>
<td>Three Dimensional Minimization</td>
</tr>
<tr>
<td>TG</td>
<td>Transmission Gate</td>
</tr>
<tr>
<td>THL</td>
<td>High-to-Low Transition Time</td>
</tr>
<tr>
<td>TL</td>
<td>Threshold Logic</td>
</tr>
<tr>
<td>TLH</td>
<td>Low-to-High Transition Time</td>
</tr>
<tr>
<td>TMR</td>
<td>Triple-Modular Redundancy</td>
</tr>
<tr>
<td>TOSHIRO</td>
<td>Fault Tolerance by Shifted and Rotated Operands</td>
</tr>
<tr>
<td>TSMC</td>
<td>Taiwan Semiconductor Manufacturing Company Inc.</td>
</tr>
<tr>
<td>VDD</td>
<td>Power Supply Terminal or Voltage</td>
</tr>
<tr>
<td>VLSI</td>
<td>Very Large Scale Integration</td>
</tr>
<tr>
<td>VSS</td>
<td>Lowest Chip Voltage (usually equivalent to GND)</td>
</tr>
<tr>
<td>VT</td>
<td>Threshold Voltage</td>
</tr>
</tbody>
</table>
Chapter 1

Introduction to Computer Arithmetic

The computer has permeated every facet of modern society, and has become an integral part of social order. There is a long history, dating back several centuries, of mathematicians and scientists developing computing machines to facilitate the manipulation of numbers. Out of this fascination with computing machines, the field of computer arithmetic was born. The history of computer arithmetic is closely connected with the development of the digital computer, and the field was one of the primary areas of research in early electronic computing.

Computer arithmetic is a sub-field of computer organization and architecture, dealing with the implementation of arithmetic algorithms in hardware and/or software, and is a fundamental aspect of computer processor architecture and the arithmetic logic unit (ALU). One of the principal focuses of this field is the development of hardware algorithms and circuitry to enhance the performance of numerical applications. The discussion on computer arithmetic begins with a look at the evolution of the computer, followed by a review of the advancements in arithmetic algorithms over the past century.
1.1 History of Computer Arithmetic

The computer was essentially developed as a means to facilitate numerical calculations, and its origins may be traced to the mechanical adding machine created by Blaise Pascal in 1642, and to the various modifications made to the original concept over the following decades. The English mathematics professor Charles Babbage drafted a proposal in 1822 for a steam-powered Difference Engine, capable of tabulating polynomial functions by the method of finite differences. A decade later he prepared plans for a second steam-powered machine, known as the Analytical Engine, which would be a real parallel decimal computer [1].

Babbage's inventions, although never fully realized, laid down the conceptual architecture for future computers. One of his intentions was to use punch cards to store the instructions for the machine, similar to those used on a Jacquard loom. Herman Hollerith, of the U.S. Census Bureau, explored this concept further, using punch cards to facilitate the tabulation of census results. Hollerith's tabulator became so successful that he started his own firm to market the device in 1896; his company, through a series of mergers, eventually became International Business Machines (IBM) in 1924 [2].

Alan Turing introduced the concept of the finite state machine in 1936. In his paper, Turing proposed a machine that would determine an output (or next) state based on its current state and an input value. The nineteenth century English mathematician and logician, George Boole, developed ways of expressing logical processes using algebraic symbols, creating a branch of mathematics known as symbolic logic. Boole's clarification of the binary system of algebra was the basis of the first binary computer, the Z3, developed by the German engineer Konrad Zuse in 1941. At the same time, spurred on by the strategic importance of computing machines during the Second World War, the British manufactured their own code-breaking computer known as the Colossus [2].

In the US, the maturity of the modern computer continued with the works of IBM and Harvard engineer Howard H. Aiken, on the first all-electronic calculator, the Harvard-IBM Automatic Sequence Controlled Calculator, or Mark I for short. Another computer development spurred by the war was the Electronic Numerical Integrator and Computer (ENIAC), produced by a partnership between the U.S. government and John W. Mauchly and John Presper Eckert at the University of Pennsylvania. ENIAC is generally acknowledged to be the first successful high-speed electronic digital computer (EDC) and was productively used from 1946 to 1955 [3].

One of the pioneers of modern computing, John von Neumann, initiated concepts in computer design that remained central to computer engineering for the next 40 years. Von Neumann designed the Electronic Discrete Variable Automatic Computer (EDVAC) in 1945, having memory to hold a stored program as well as data, and a central processing unit coordinating all of the computer's processes. This "stored memory" technique, as well as the "conditional control transfer", permitted the computer to be stopped at any point and then resumed, allowing for greater versatility in computer programming.

John von Neumann demonstrated that a computer could have a simple, fixed structure, yet be able to execute any kind of computation given properly programmed control, without the need for hardware modification. Von Neumann's contributions resulted in a new understanding of the organization of practical, fast computer architectures. As a result of these techniques and several others, computing and programming became faster, more flexible, and more efficient, with the instructions in subroutines performing far more computational work. These ideas, often referred to as the stored-program technique, became the universally adopted rudiments for future generations of high-speed digital computers.

The advent of the transistor in 1948 proved to be the next breakthrough in computer technology. By 1956, the transistor was employed in computers and, coupled with magnetic core memory, enabled a new generation of faster, smaller and more efficient computing machines. These machines were predominantly used for numerical applications by the government, academia and large businesses.

Jack Kilby, an engineer with Texas Instruments, introduced the concept of the integrated circuit (IC) in 1958, and over the next few years, others were able to integrate more and more components onto a single semiconductor device. In the early seventies, Intel incorporated many facets of the computer onto a single IC, and named their revolutionary design the microprocessor. Thus a new physical archetype for the computer was in place, and although the device dimensions continue to diminish, the model has remained relatively unchanged [3].

The computer is fundamentally a machine for performing numerical calculations efficiently. With this in mind, the manner in which those calculations are carried out becomes a significant factor in the performance of the machine, thus fostering the need for devoted research in the area of computer arithmetic. Most of the essential methodologies, number systems, and algorithms for the various arithmetic operations were developed in the mid-twentieth century. In this era the computer's purpose was restricted to numerical applications; consequently, a great deal of emphasis was placed on the advancement of the arithmetic logic of computers.

The historical perspective of computer arithmetic begins with the notions of addition and multiplication set forth by Charles Babbage. Modern hardware algorithms, such as complement representation for subtraction and shift-add multiplication and division, were fine-tuned in the 1940's. In this time period, the focus of researchers was to prove the feasibility of the computer, and once this was demonstrated, the onus was placed on speed-up techniques. By the end of the 1950's, practically all of the significant modern fast addition algorithms had been published. In addition, concepts of residue arithmetic and CORDIC algorithms were presented.

The 1960's saw the development of redundant, floating point and high radix arithmetic concepts. Significant advancements in digital multiplier algorithms were also carried out in this time period. Specifically, Wallace [4] and Dadda [5] published their work on tree multipliers, and partial product reduction concepts. Word lengths for floating point numbers were also standardized [6] for greater arithmetic functionality across all platforms. The advent of vector computers and microprocessors in the 1970's forced the requirement for higher throughput from arithmetic hardware. Shortly thereafter, VLSI technology enabled low cost, high performance embedded arithmetic circuitry. This in turn forced the enhancement of pipelining techniques in order to meet the processing demands of the new breed of computers.

1.2 Current Trends in Computer Arithmetic

The evolution of the microprocessor, and the current employment of deep-submicron technology, has forced the reconsideration of all arithmetic design. Pin and interconnect limitations are now of primary concern; consequently, refinement of arithmetic algorithms and hybrid designs have emerged as the focal points of modern arithmetic development. The shift in technology from SSI (Small Scale Integration) to VLSI (Very Large Scale Integration) has shifted emphasis from algorithm development to hardware level optimization of existing techniques.

As the function, capability and applications of the computer evolve, so too must the attitude towards the design of the modern computer. For the most part, algorithm development has been overshadowed by the significant improvements achievable by adapting existing methodologies to exploit contemporary VLSI technology. Reduction of gates in circuit implementations has taken a back seat to the study of interconnect effects, and the implementation of optimal circuitry on a transistor level.

Furthermore, with the increasing impact of portable devices, low power design has emerged as a critical area of exploration in the field of computer arithmetic, and hardware development. With new technologies and application requirements, the continued emergence of design challenges in the field of computer arithmetic is inevitable. This section focuses on the current trends in computer arithmetic and architecture, and their implications for digital multiplier designs.
1.2.1 Low Power Design

Low power techniques have emerged as the foremost area of interest in modern VLSI design. The recent awareness of this critical field of research is due to concerns over excessive heat dissipation, and the growing demand for portable devices. Intuitively, low power devices generate less heat and place a smaller burden on portable batteries. Area efficiency and gate count have passed the torch of dominant design criteria to low power and low cost strategies.

Of the significant areas of research in low power design policy, those most relevant to arithmetic design include low power logic circuit families, *Dynamic Power Management*, and reduced switching activity configurations. Asynchronous circuitry eliminates the need for a complex clock tree, removing a considerable source of power dissipation, estimated at 15 to 45% of total power [7]. Though some may argue that the added overhead for handshaking circuitry may be justified by the savings in power and potentially delay [8], this technology has limited applications in a pure digital multiplier.

Circuit packaging and battery life now pose severe restrictions on many designers, including those engaged in computer architecture. It has been predicted that power dissipation in high performance processors will exceed packaging limits by 25 times in a matter of 15 years [9]. With the increased demand in performance from mobile computing instruments, power conscious strategies must permeate down to processor, and arithmetic unit design.

1.2.2 High Throughput Systems And Pipelining

To speed up arithmetic operations, the input-to-output latency, or simply the time between the application of input signals and the arrival of outputs, must be reduced. In order to sustain high throughput (number of operations performed per unit time) and consequently high clock rates in systems, pipelining strategies are employed, allowing for concurrent operations. Although this technique has been in existence for a few decades, its impact on computer architecture continues to mount with every new major processor device.

Current designs are incapable of satisfying the clocking demands of most high-speed processors; for this reason concurrency, or one hardware unit performing multiple overlapped processes at once, is employed. Pipelining allows a given piece of hardware to be sub-divided into stages. The stages are separated by clocked (or asynchronous) registers; this all amounts to what is often referred to as the pipeline hardware overhead. By separating the large block into stages, the portions of the circuit that would typically be idle are capable of processing the next set of signals, and so forth. This in effect substantially increases circuit efficiency and throughput. Thus, although an operation will have a higher latency for a single computation due to the added hardware overhead, the overall system will have higher total throughput.

The following is a crude example of the basic principles of pipelined circuitry. Consider a function that takes time $t$ to complete from input to output as in Figure 1.1 (a). Subdividing the function into 5 subsections each of equivalent delay results in Figure 1.1 (b). Placing latches allows the capture of the output of each subsection, thus a new input may be applied at $t/5$ time intervals (Figure 1.1 (c)). This produces an increase in throughput by a factor of 5.

![Figure 1.1](image)

**Figure 1.1**  (a) Original system taking $t$ time to complete  
(b) System divided into 5 time equivalent sections  
(c) Latched pipeline arrangement of subdivided system
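
The arithmetic behind Figure 1.1 can be sketched with a rough timing model. The function name `pipeline_timing` and the per-stage latch overhead `t_latch` are assumptions introduced here for illustration only; with `t_latch = 0` the model reproduces the idealized factor-of-5 gain described above, and a non-zero value shows why the latency of a single operation grows slightly.

```python
def pipeline_timing(t_total, stages, n_ops, t_latch=0.0):
    """Idealized timing of a function split into equal-delay pipeline stages.

    t_total : input-to-output delay of the unpipelined function
    stages  : number of pipeline stages (1 = no pipelining)
    n_ops   : number of back-to-back operations applied
    t_latch : assumed latch overhead added to every stage (0 = ideal)
    Returns (latency of one operation, time for all operations, throughput).
    """
    cycle = t_total / stages + t_latch           # clock period is set by one stage
    latency = stages * cycle                     # one operation crosses every stage
    total_time = (stages + n_ops - 1) * cycle    # fill the pipe, then one result per cycle
    throughput = n_ops / total_time
    return latency, total_time, throughput


if __name__ == "__main__":
    for depth in (1, 5):
        latency, total, rate = pipeline_timing(10.0, depth, 1000)
        print(f"stages={depth}: latency={latency}, total={total}, throughput={rate:.3f}")
```

For a long stream of operations the throughput ratio between the 5-stage and the unpipelined case approaches 5, while the latency of an individual operation never improves.
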
Pipelining techniques are employed in the majority of computing devices, including all of the major processor families. The degree to which a device is pipelined, known as the pipelining depth, is based on the target frequency and the number of gates that may be included per pipeline stage. The deeper the pipeline (that is, the more stages it contains), the higher the attainable clock speed. However, this comes at the expense of increased system complexity, and added buffering to accommodate the longer pipelines.

Figure 1.2 depicts a normalized clock frequency graph of Intel's last six processor cores if they were all based on the same technology [10]. With the effects of technology scaling aside, it becomes apparent that as the depth of the pipelining increases (Pentium III and Pentium IV), so too does the relative clock frequency.

![Relative Clock Frequency Graph](image)

**Figure 1.2** Relative frequency of Intel devices if fabricated on the same process

For a highly efficient pipeline, wave-pipelining techniques may be explored [8]. Wave pipelining makes use of the inherent delay through a pipelined segment as a temporary storage mechanism, allowing for 'waves' of unlatched data to be transmitted. Typical pipelining techniques employ temporal separation between consecutive signals, as a result achieving spatial separation as well. Wave pipelining, on the other hand, draws on spatial separation of consecutive signals that have no set temporal separation [8]. This method alleviates the requirement for pipelining overhead, yet demands highly accurate delay predictions within the circuitry and the clock network enabling the inputs and outputs.

In the forthcoming generations of digital signal processors and microprocessors, the dependence on pipelining methodologies will continue to intensify. Fully distributed micro-pipelines, systolic arrays, and asynchronous pipeline control will have a significant impact on the ability of industry to meet performance demands.

1.2.3 Technology Scaling Effects

Researchers have turned towards advancements in process technology in order to satisfy the ever-increasing demand for high-speed processors, and computational systems. As current devices delve into deep sub-micron technology, not only does the device geometry decrease, but switching times, and operating voltages also scale down. These gains come at the expense of increased layout complexity, and a greater susceptibility to parasitic effects in the interconnections.

We are rapidly approaching the era where interconnect effects due to poor wire scaling are forcing the reconsideration of conventional circuit topologies. According to the International Technology Roadmap for Semiconductors (ITRS 2001), we are beginning to reach the fundamental limits of the materials in the planar CMOS process [9]. As transistor sizes continue to shrink, the traditional gains associated with smaller feature sizes will be degraded due to the adverse effects of wire scaling. There has been recent awareness of the drastic effects of interconnect delay in VLSI implementations [11-16], and several investigations focused on this problem have been linked directly to multiplier structures.

It has been projected that over a span of six technology generations, the delay of scalable wires will become worse by a factor of four relative to gate delays. This problem escalates for non-scalable, global wires, where the wire delay relative to gate delay will double for each generation [15]. For the most part, conventional analysis of partial product reduction trees has been simplified through the use of compressor or gate delays. Such models cannot accurately capture the performance of the structure, since the associated interconnect delay and capacitive coupling effects are ignored. Chapter 5 will provide an in-depth survey of interconnect scaling effects.
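
To put the two projections on a common footing, the short calculation below compounds the per-generation figures over six generations. It is only a back-of-the-envelope sketch: the helper name `relative_wire_delay` is hypothetical, and treating the factor of four for scalable wires as a uniform per-generation growth rate is an assumption made here, not a statement from [15].

```python
def relative_wire_delay(generations, per_generation_factor):
    """Wire delay relative to gate delay after compounding over several generations."""
    return per_generation_factor ** generations


GENERATIONS = 6

# Global (non-scalable) wires: relative delay doubles each generation.
global_growth = relative_wire_delay(GENERATIONS, 2.0)

# Scalable wires: a factor of four over six generations, assumed here to
# accumulate uniformly, corresponds to roughly 1.26x per generation.
local_per_generation = 4.0 ** (1.0 / GENERATIONS)
local_growth = relative_wire_delay(GENERATIONS, local_per_generation)

if __name__ == "__main__":
    print(f"global wires: {global_growth:.0f}x worse relative to gate delay")
    print(f"scalable wires: {local_growth:.1f}x worse ({local_per_generation:.2f}x per generation)")
```

Under these assumptions the global wires end up 64 times worse relative to gate delay, against 4 times for the scalable wires.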

The implications of technology on digital design may no longer be brushed off as being of secondary importance to algorithm design. The era of elegant algorithm development by simply observing gate delays and gate counts has passed. Future experts in the field of computer architecture must also have a concrete understanding of the fundamentals of technology effects, and the trends in design. There have been significant advances in micro-architectures in order to cope with evolving technology issues; a comprehensive survey of these is provided in [17]. Such microprocessor level advancements in instruction coding, memory allocation, and system level modifications are beyond the scope of this thesis.

1.3  Thesis Overview

1.3.1  Thesis Highlights

This thesis will present a general investigation of digital multiplication from various levels of abstraction, and will highlight a novel reconfigurable multiplication architecture. The proposed design utilizes a variation of the “divide and conquer” or “recursive” multiplication scheme presented by Swartzlander et al. [18]. The principal advantage of this scheme lies in its multi-mode reconfiguration ability, allowing the user to select between four modes of operation:

- Double Precision (64-bit) Multiplication (default)
- Single Precision (32-bit) Multiplication
- Dual Single Precision Multiplication (double throughput)
- Single Precision Fault Tolerant Multiplication through a Majority Voter

The proposed scheme combines many desirable design characteristics, such as low power dissipation, high throughput capabilities, fault tolerance, and increased regularity reducing interconnect congestion. Moreover, a 64-bit reconfigurable multiplier, with potential applications in Digital Signal Processor (DSP) devices, has been implemented using the TSMC 0.18 μm technology. This design has been contrasted against a standard high-performance architecture of equivalent size, and has demonstrated promising results, which will be presented in chapter 8.

In addition, other investigations into partial product reduction methodologies using counters and compressors will be addressed. In particular, an optimized 4:2 compressor distribution methodology for partial product reduction arrays will be presented. Furthermore, a design framework using the novel "locally optimized array" formalism will be introduced as a potential solution to the interconnect dilemma in future high performance architectures.

1.3.2 Thesis Organization

The thesis will begin with a general overview of the concept of digital multiplication, and the various multiplication algorithms in chapter 2. The various partial product reduction schemes used in modern multipliers will be the focus of chapter 3. This is a topic of particular importance since it is the partial product reduction scheme that distinguishes the individual multiplication schemes and their inherent characteristics. Chapter 4 will present the arithmetic sub-cells, and the numerous logic styles that are used in the makeup of the partial product reduction strategies.

One of the topics of emerging importance in digital design, namely the wire interconnect, will be discussed in considerable detail in chapter 5. The physics behind these thin slices of metal, the trends with diminishing device sizes, and their effects on arithmetic performance will be presented. Chapters 6, 7 and 8 will focus on the introduction of the reconfigurable architecture, beginning with the outline of the recursive multiplication algorithm in chapter 6, followed by implementation and simulation results in chapters 7 and 8 respectively. The thesis will end with some closing remarks in chapter 9.
Chapter 2

Digital Multiplication Overview

In modern digital systems, the component responsible for handling the arithmetic operations is known as the Arithmetic Logic Unit (ALU). Arithmetic units, for the most part, lie in the critical data path of the core data processing elements. These include microprocessors (CPU), digital signal processors (DSP), in addition to application specific (ASIC) and programmable (FPGA) processing and addressing integrated circuits. Naturally the performance of the system, in regards to numerical applications, is directly related to the structure and design of the ALU.

The numerical operations carried out by the arithmetic unit may include, but are not limited to:

- addition/subtraction
- shift/extension
- comparison
- increment/decrement
- complement
- trigonometric functions
- multiplication
- division
- square root extraction
- exponential function
- logarithm function
- hyperbolic functions

As depicted in Figure 2.1 on page 13 [19], one of the critical functions carried out by the ALU is multiplication. Although not the most fundamentally complex operation, digital multiplication is one of the most extensively used operations in signal processing, and other scientific applications. For this reason, it is one of the most widely studied areas of the field of computer arithmetic.

![Image of interrelation of arithmetic operations in modern computing devices]

**Figure 2.1** Interrelation of arithmetic operations in modern computing devices

### 2.1 Basics of Digital Multiplication

Prior to exploring the various multiplication algorithms, and the applications of each, it is imperative to present the essence of digital multiplication, and the standard nomenclature. Just as in the paper and pencil methodology of carrying out a multiplication of two values, digital multiplication entails a sequence of additions carried out on partial products. The means by which this partial product array is summed to yield the final product is the key distinguishing factor amongst multiplication schemes.

In general, the partial product array for an \( m \times n \)-bit multiplication is formed by the bitwise logical AND of the \( n \)-bit \textbf{multiplicand} \( A \) and the \( m \)-bit \textbf{multiplier} \( X \), where:

\[
X = \{x_{m-1}, x_{m-2}, \ldots, x_2, x_1, x_0\}
\]

\[
A = \{a_{n-1}, a_{n-2}, \ldots, a_2, a_1, a_0\}
\]

The summation of the partial products will yield an \( (n+m) \)-bit product, \( P \), where:

\[
P = \{\sigma_{n+m-1}, \sigma_{n+m-2}, \ldots, \sigma_2, \sigma_1, \sigma_0\}
\]

The partial product array will have \( n \times m \) bits, arranged in \( m \) rows of \( n \)-bit values. The array is in essence composed of a sequence of rows that are either shifted versions of the \textbf{multiplicand}, \( A \), or zeros, according to the bits of the \textbf{multiplier}, \( X \). The multiplication of two 4-bit values is illustrated in Figure 2.2.

\[
\begin{array}{cccccccc}
 & & & & a_3 & a_2 & a_1 & a_0 \\
 & & & \times & x_3 & x_2 & x_1 & x_0 \\
\hline
 & & & & x_0a_3 & x_0a_2 & x_0a_1 & x_0a_0 \\
 & & & x_1a_3 & x_1a_2 & x_1a_1 & x_1a_0 & \\
 & & x_2a_3 & x_2a_2 & x_2a_1 & x_2a_0 & & \\
 & x_3a_3 & x_3a_2 & x_3a_1 & x_3a_0 & & & \\
\hline
S_7 & S_6 & S_5 & S_4 & S_3 & S_2 & S_1 & S_0
\end{array}
\]

\textbf{Figure 2.2} A 4x4-bit multiplication leading to an 8-bit product
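
The formation and summation of the partial product array can be sketched in a few lines of Python. The function names `partial_products` and `sum_partial_products` are illustrative choices, not part of the thesis; the sketch assumes unsigned operands and stores each row least significant bit first.

```python
def partial_products(a, x, n, m):
    """Partial product array for an n-bit multiplicand a and an m-bit multiplier x.

    Row i holds the bits x_i AND a_j for j = 0 .. n-1 (least significant first),
    so each row is either a copy of the multiplicand or all zeros.
    """
    rows = []
    for i in range(m):
        x_i = (x >> i) & 1
        rows.append([x_i & ((a >> j) & 1) for j in range(n)])
    return rows


def sum_partial_products(rows):
    """Sum the array, weighting the bit in row i, column j by 2**(i + j)."""
    total = 0
    for i, row in enumerate(rows):
        for j, bit in enumerate(row):
            total += bit << (i + j)
    return total


if __name__ == "__main__":
    rows = partial_products(0b1101, 0b1011, 4, 4)   # 13 x 11
    for i, row in enumerate(rows):
        print(f"row {i}: {row[::-1]}")              # printed most significant bit first
    print("product:", sum_partial_products(rows))
```

The shift by `i + j` plays the role of the diagonal offset of each row in Figure 2.2.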

To better visualize the partial product reduction process, the concept of dot diagrams shall be introduced [20][21]. A dot diagram is a visual representation of the bits in an algorithm, where in this particular application the dots represent individual partial product bits. The nature of the dot diagram is to depict the bits using the relative position of individual bits, and the manner in which they are manipulated, irrespective of the actual value of each bit.

Figure 2.3 on page 15 shows the partial product array for a 16x16-bit multiplication [20]. The partial products are shifted to account for the differing arithmetic weight of the bits in the multiplier, where dots of the same arithmetic weight are aligned vertically. The final product, represented by the double length row of dots at the bottom, is obtained via the summation of the dots in each column.

![Partial Product Selection Table](image)

**Figure 2.3** A simple dot diagram of 16-bit partial product array

### 2.2 Serial Multiplication Schemes

#### 2.2.1 Shift-Add Multiplication

In its most basic form, digital multiplication may be carried out through a sequence of shifts and additions of the *multiplicand* to the *partial product* register, governed by the individual bits of the *multiplier* (Figure 2.4 on page 16). This primitive form of multiplication, known as shift-add or iterative multiplication, although very simple in implementation, is very slow. The number of iterations, or cycles of addition, that are required grows linearly with the size of the *multiplier*, with each cycle having a delay of the required fast adder.
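
As a minimal behavioural sketch of the iterative scheme, the following Python function models a right-shifting partial product register. It assumes unsigned operands of equal width; the name `shift_add_multiply` and the register arrangement are illustrative assumptions rather than a description of the hardware in Figure 2.4.

```python
def shift_add_multiply(a, x, bits):
    """Unsigned shift-add multiplication of two `bits`-wide operands.

    Each cycle examines one multiplier bit: if it is 1, the multiplicand is
    added into the upper half of the partial product register, after which the
    register is shifted right by one position.
    Returns (product, number of add/shift cycles).
    """
    acc = 0                      # partial product register
    cycles = 0
    for _ in range(bits):
        if x & 1:
            acc += a << bits     # add multiplicand to the upper half
        acc >>= 1                # shift the partial product right
        x >>= 1                  # move on to the next multiplier bit
        cycles += 1
    return acc, cycles


if __name__ == "__main__":
    print(shift_add_multiply(13, 11, 4))
```

The cycle count equals the multiplier width, which is the linear growth noted above.
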
2.2.2 High Radix Multipliers

A variation of this rudimentary form of digital multiplication is the high-radix multiplication algorithm. Though fundamentally identical to the shift and add algorithms, these multipliers accept more than one bit of the multiplier on each clock cycle. This process reduces the number of clock cycles required to carry out a multiplication, at the added expense of the requirement for the immediate availability of fixed multiples of the multiplicand.

Figure 2.5 on page 17 provides an outline for a radix-4 multiplication scheme, where each clock cycle now utilizes two bits of the multiplier, effectively doubling the throughput over a conventional radix-2 binary multiplier [8]. Note that in this scheme, a separate register is required to store the precomputed value of 3A. The higher the radix of the multiplier, the more stored values that will be required. Higher radix multipliers (radix-8, radix-16, etc.) achieve greater computation speeds; however, this comes at the expense of increased overhead in terms of shift circuitry, and storage registers for all of the required multiples of the multiplicand.

![Diagram of a basic radix-4 multiplier](image)

**Figure 2.5** Shift-add implementation of a basic radix-4 multiplier
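
The radix-4 idea can be sketched in the same behavioural style. The function name `radix4_multiply` is an assumption, the operands are taken to be unsigned with an even multiplier width, and the list `multiples` stands in for the stored multiples 0, A, 2A and 3A.

```python
def radix4_multiply(a, x, bits):
    """Unsigned radix-4 shift-add multiplication (bits must be even).

    Two multiplier bits are consumed per cycle, selecting one of the
    precomputed multiples 0, A, 2A or 3A of the multiplicand.
    Returns (product, number of cycles).
    """
    if bits % 2:
        raise ValueError("multiplier width must be even for radix-4")
    multiples = [0, a, 2 * a, 3 * a]     # 3A is computed once and held in a register
    product = 0
    cycles = 0
    for i in range(0, bits, 2):
        digit = (x >> i) & 0b11          # next radix-4 digit of the multiplier
        product += multiples[digit] << i
        cycles += 1
    return product, cycles


if __name__ == "__main__":
    print(radix4_multiply(13, 11, 4))
```

Half as many cycles are needed as in the radix-2 case, at the cost of forming and storing 3A beforehand.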

### 2.3 Parallel Multipliers

Serial multipliers, and the concept of shift and add algorithms, are a class of primitive multiplication schemes that take advantage of simple implementation techniques. Such methods are employed where hardware overhead is an issue, or if there is a lack of a dedicated hardware multiplier. Modern high performance machines call for more sophisticated algorithms, in order to limit the computation latency.

Parallel multipliers in general may be classified into two distinct categories: linear parallel multipliers, and column compression multipliers. As opposed to the serial multiplier, parallel multipliers generate all of the partial products simultaneously. In addition, parallel multipliers limit the latency associated with carry propagation to one final fast adder.
2.3.1 Column Compression Multipliers

The foundation for the modern column compression multiplier was set forth in the 1960's by the works of C.S. Wallace, Luigi Dadda, and the Russian mathematician Yu Ofman [4][5][22]. The tree multiplier offers the potential for a logarithmic increase in delay relative to operand size. Once formed, the bits in the partial product array are passed onto a reduction network, which performs a column-wise compression of the bits, forming two final partial products. A final stage fast adder is used to sum the two resulting partial products. A schematic representation of this process is depicted in Figure 2.6 [20].

![Diagram of Column Compression Multiplier]

**Figure 2.6** Schematic representation of the tree multiplier process [20]
The methodology initially proposed by Wallace [4] makes use of Carry-Save Adder (CSA) arrays in order to carry out the column-wise compression of the partial product bits. The CSA is the most commonly used form of multi-operand adder, and it is simply composed of a series of non-interlinked Full-Adder blocks, as shown in Figure 2.7. The dot diagrams in Figure 2.8 and Figure 2.9 on page 20 depict a simple carry save adder, and the manner in which the CSA is used as a 3:2 compression scheme on a bit array. Through the use of such compression techniques, the carry propagation is postponed until the final stage, where the resulting partial products are summed.

![Carry-Save Diagram](image)

Reduce three numbers to two numbers

**Figure 2.7** Carry Save Adder (CSA) Array

![Carry Save Adder (CSA) Diagram](image)

**Figure 2.8** Carry Save Adder (CSA) used for 3:2 compression
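
A word-level sketch of 3:2 compression is given below. It treats each row of the array as a non-negative Python integer; the names `carry_save_add` and `reduce_to_two`, and the simple left-to-right grouping of rows into threes, are illustrative assumptions and not the exact wiring of a Wallace or Dadda tree.

```python
def carry_save_add(x, y, z):
    """3:2 compression: three operands in, a sum word and a carry word out.

    Every bit position is an independent full adder, so no carry ripples
    between positions; the carries are simply shifted one place to the left.
    """
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def reduce_to_two(operands):
    """Apply CSA levels until only two rows remain.

    Rows are grouped in threes at each level; leftover rows pass through.
    Returns (remaining rows, number of CSA levels used).
    """
    rows = list(operands)
    levels = 0
    while len(rows) > 2:
        next_rows = []
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
        next_rows.extend(rows[len(rows) - len(rows) % 3:])   # rows not in a full group
        rows = next_rows
        levels += 1
    return rows, levels


if __name__ == "__main__":
    final_rows, levels = reduce_to_two([5, 9, 12, 7, 3])
    print(final_rows, levels, sum(final_rows))   # one carry-propagate add remains
```

Only the last addition of the two surviving rows requires carry propagation, which is the role of the final stage fast adder.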

Luigi Dadda proposed a systematic methodology for laying out the CSA reduction tree such that the minimum number of counters are used [5]. In his investigation, Dadda deduced that by determining the minimum number of stages required for the partial product reduction process, 3:2 or even higher order counters may be placed in such a manner as to minimize the hardware requirement. Since its inception, Dadda’s minimum circuitry paradigm has been critically analyzed and confirmed [21][23], and further explored for high order and heterogeneous counter arrays [24].

Although fundamentally demonstrated to be the most expeditious multiplication scheme, the column compression process suffers from several drawbacks. The highly irregular architecture leads to an arduous and inefficient VLSI layout. In addition, large area requirements and highly irregular interconnections create the potential for signal skew effects. These topics will be elaborated upon in the forthcoming sections.

![Diagram of 3:2 Compression scheme carried using a CSA array](image)

**Figure 2.9** 3:2 Compression scheme carried out using a CSA array [8]
2.3.2 Array Multipliers

Linear parallel multipliers, often referred to as array multipliers, obtain their name from the linear relationship between their latency and operand size. The array multiplier may be regarded as a one sided CSA tree (Figure 2.10), where the reduction process occurs in ordered stages.

The highly regular layout of the array structure is depicted in the 5-bit multiplier in Figure 2.11 on page 22. The systematic arrangement of the cells makes this design ideal for automated layout techniques, where the bits of the two operands are broadcast across the arrangement of full adder cells. In this scheme, the outputs of the adders trickle horizontally and vertically until they reach the perimeter of the structure, where the product bits are attained. The drawback of this scheme is that the partial products are introduced and reduced one row at a time, not in parallel as in tree multipliers. This leads to higher gate count, and slower performance.

![One sided CSA tree forming an Array Multiplier](image)

**Figure 2.10** One sided CSA tree forming an Array Multiplier
2.3.3 Latency Approximations

One of the key characteristics that differentiates array and tree multipliers is their relative computation delay. Although the forthcoming analysis is by and large accepted as the standard measure of performance amongst multiplication algorithms, it should be noted that it is merely an approximation. In reality, the actual performance calculations encompass several factors that have been neglected for simplicity. This important topic will be discussed in further detail in the subsequent chapters.

![Diagram](image)

**Figure 2.11** A 5-bit array multiplier layout (the boxes represent full-adders) [8]

The standard timing performance analysis for multiplication algorithms is the gate, or full adder delay. A $k$-bit array multiplier is first examined, using the full adder as the standard measure of delay per stage of reduction, since it is the associated delay of one CSA. Referring to Figure 2.10 on page 21, the first stage of the array multiplier absorbs two rows of partial products with $(k - 2)$ rows remaining. Thus a total of $(k - 2) + 1$, or $(k - 1)$, reduction stages are required. The delay of the array multiplier has a linear relationship, \( O(k) \), with operand size.

The tree multiplier latency calculation is somewhat more involved, since it is necessary to determine the number of stages of reduction required for a given operand size. By considering the reduction process as a column-wise compression of the partial products, where each stage reduces the column height by a factor of 2/3, we can deduce the relation:

\[
\begin{aligned}
n(h) &= \left\lfloor \frac{3}{2}\, n(h - 1) \right\rfloor \\
n(0) &= 2
\end{aligned}
\]

where \( n(h) \) is the column height associated with an \( h \)-level tree [5][8][21][25]. Table 2.1 provides an expansion of the series that is generated through this representation, depicting the maximum column height that may be compressed down to 2 for a given number of stages. As always, symmetric multiplication (equal operand sizes) is assumed.

<table>
<thead>
<tr><th>\( h \)</th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th></tr>
</thead>
<tbody>
<tr><td>\( n(h) \)</td><td>2</td><td>3</td><td>4</td><td>6</td><td>9</td><td>13</td><td>19</td><td>28</td><td>42</td><td>63</td><td>94</td></tr>
</tbody>
</table>
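
The series in the table, and the resulting stage counts for the two multiplier classes, can be reproduced with the short sketch below. The function names are illustrative assumptions; the array count follows the $(k - 1)$ estimate derived earlier, and both figures remain full adder delay approximations that ignore the final fast adder and all wiring effects.

```python
def max_heights(levels):
    """n(h) for h = 0 .. levels, using n(h) = floor(3/2 * n(h - 1)) and n(0) = 2."""
    heights = [2]
    for _ in range(levels):
        heights.append(3 * heights[-1] // 2)
    return heights


def tree_levels(k):
    """Smallest h such that n(h) >= k, i.e. 3:2 levels needed for k partial product rows."""
    h, n = 0, 2
    while n < k:
        n = 3 * n // 2
        h += 1
    return h


def array_levels(k):
    """Reduction stages of a k-bit array multiplier, per the (k - 1) estimate."""
    return k - 1


if __name__ == "__main__":
    print(max_heights(10))
    for k in (8, 16, 32, 64):
        print(f"k={k}: tree={tree_levels(k)} levels, array={array_levels(k)} levels")
```

For a 64-bit symmetric multiplication this gives 10 levels for the tree against 63 for the array.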

The above analysis may be summarized by defining the delay of a column compression multiplier as logarithmic, \( O(\log k) \), with operand size. For this reason, tree multipliers are the design of choice for large, high-speed applications, where the penalties in irregularity and size are more than compensated for by performance gains.

### 2.4 Number Systems

One of the underlying considerations that must be made when carrying out digital arithmetic is the number representation. Thus far, the discussion has been centered on the binary number system, and for the most part it is the predominant number representation of choice in most digital systems. However, it is not the only number system used in computer arithmetic applications. There are several significant classes of unconventional number systems, each having a unique advantage in a given system, such as operation speed-up or increased accuracy.

The selection of a particular number representation may radically alter the fundamental manner in which an operation is carried out. In order to select a particular number system, the relative cost-performance benefit of the various alternatives becomes a critical factor. This section will explore the shortfalls and advantages of several number systems, beginning with fixed-point numbers, followed by real number systems.

### 2.4.1 Signed Fixed Point Numbers

The natural number system (also known as unsigned integer), although simple to put into practice, limits the potential of the overall design. To realize sophisticated arithmetic computation, the number system of choice must be capable of signed representation. There are several means by which signed representation may be achieved.

The simplest form of signed number representation is known as signed magnitude (or sign-and-magnitude), where a sign bit is included as part of the value. As in the handwritten representation of negative numbers, a number has a numeric magnitude and a leading sign bit indicating negation. Thus a $k$-bit system has a $(k - 1)$-bit magnitude. Although conceptually simple, the signed magnitude format makes addition difficult: oppositely signed operands require supplementary circuitry (such as a magnitude comparator or a subtractor) to be added correctly.

An encoding scheme may also be used to eliminate negative values during computation. Biased representations, for example, convert all values to non-negative numbers by adding a fixed bias. Also referred to as excess-bias encoding, this type of representation appears in the exponent field of floating-point numbers. The drawbacks of this representation are the difficulty of multiplication and division, and the additional computation required to unbias values.

Complement formats are the third major representation of signed numbers. A large complementation constant is added to all negative values, chosen such that there is no overlap with the representation of the positive values. In binary systems, negative values are described using the 2's complement representation, obtained by taking the ones complement (bit-wise negation) and adding 1. The ease of negation and computation using complement formats makes them attractive as a signed number representation.
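As a small illustration of the three encodings discussed above, the following Python sketch encodes the same value in each format. The function names, the 8-bit word size and the bias of 127 are assumptions chosen for the example, not values taken from the thesis.

```python
def sign_magnitude(x, k):
    """k-bit sign-and-magnitude code: sign bit followed by a (k-1)-bit magnitude."""
    if abs(x) >= 1 << (k - 1):
        raise ValueError("magnitude does not fit in k - 1 bits")
    sign_bit = (1 << (k - 1)) if x < 0 else 0
    return sign_bit | abs(x)


def biased(x, k, bias):
    """Excess-bias code: the stored value is x + bias and must be non-negative."""
    code = x + bias
    if not 0 <= code < (1 << k):
        raise ValueError("value out of range for this bias")
    return code


def twos_complement(x, k):
    """k-bit 2's complement code: invert the magnitude bits and add 1 if negative."""
    if not -(1 << (k - 1)) <= x < (1 << (k - 1)):
        raise ValueError("value out of range")
    mask = (1 << k) - 1
    if x >= 0:
        return x
    return ((abs(x) ^ mask) + 1) & mask    # ones complement, then add 1


if __name__ == "__main__":
    for name, code in (("sign-magnitude", sign_magnitude(-5, 8)),
                       ("biased (127)", biased(-5, 8, 127)),
                       ("2's complement", twos_complement(-5, 8))):
        print(f"{name:15s} {code:08b}")
```

With 2's complement operands, addition needs no sign handling: adding the codes of \(-5\) and \(7\) modulo \(2^8\) gives the code of \(2\) directly.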

### 2.4.2 Redundant Number Systems

A number system with radix-$R$ may be fully described using $R$ distinct digits. For example, the binary system (radix-2) can define any value using the two-digit set $\{0, 1\}$. In general, a positional radix-$R$ number system represents a $k$-digit value $x$ as a string of digits:

$$ (d_{k-1}, d_{k-2}, \ldots, d_0) $$

where:

$$ x = \sum_{i=0}^{k-1} d_i R^i $$

A system is referred to as redundant if more than $R$ digits are used to define a radix-$R$ representation. Redundant number systems are primarily used in digital systems for arithmetic speed-up techniques. By over-defining a system using redundant values, the cost of certain operations may be appreciably reduced. Addition is one such operation: a redundant representation allows constant-time addition, since the value of the carry bit may be obtained by examining a fixed number of previous bits [26].

A simple case of a redundant number system that has already been presented is the carry-save representation. By describing a $k$-bit value using two $k$-bit numbers in carry-save format, with digit set $\{0, 1, 2\}$, the need for carry propagation is removed. Describing a value with what are essentially twice as many bits is clearly justified when considering the array multiplier. By maintaining the partial product summation in carry-save form, the latency of each stage is reduced to only one full adder delay, since carry propagation is postponed until the final stage.
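The carry-save idea can be sketched in a few lines of Python, using integers as bit vectors. This is a behavioral model under stated assumptions, not a description of any particular circuit; the names `carry_save_add` and `multiply_carry_save` are introduced here for illustration.

```python
def carry_save_add(a, b, c):
    """One carry-save level: three operands in, a (sum, carry) pair out.

    Every bit position is handled independently, so no carry ripples
    across the word; the invariant is a + b + c == sum + carry.
    """
    s = a ^ b ^ c                                  # per-bit sum output
    carry = ((a & b) | (a & c) | (b & c)) << 1     # per-bit majority, weight 2
    return s, carry


def multiply_carry_save(a, b, k):
    """Accumulate the k partial products of a * b in carry-save form."""
    s, c = 0, 0
    for i in range(k):
        partial = (a << i) if (b >> i) & 1 else 0  # bit-wise AND row, shifted
        s, c = carry_save_add(s, c, partial)
    return s + c        # the only carry-propagate addition, done once at the end


if __name__ == "__main__":
    print(multiply_carry_save(13, 11, 4))
```

The running result is held as two words throughout the loop, which corresponds to the redundant digit set $\{0, 1, 2\}$ described above.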

Other redundant number systems have also proven to be favorable alternatives to conventional systems. The signed digit number system, first classified in 1961, has demonstrated 33% savings in the adders required for multiplication over standard binary notation [8]. For operand lengths that are halfway between powers of 2 (12, 24, 48, etc.), the binary stored-double-carry (BSDC) number system, using the digit set [0, 3], has been proposed as a high-efficiency representation [27].

The disadvantage of using a redundant number representation is the need for conversion back into conventional notation. A proposed solution is to devise an arrangement of units capable of accepting and forwarding redundant values. In their proposal for redundant arithmetic, Ferguson and Ercegovac introduce a multiplier that accepts two redundant operands [28]. By taking advantage of the fact that multiplication may be executed using a redundant multiplier and a converted multiplicand, their design achieves gains of 20% over conventional forms. However, this increase in speed comes at the expense of a 30% penalty in both area and power consumption.

### 2.4.3 Residue Number Systems

In a residue number system (RNS) representation, a number \( x \) is represented by the set of its residues \( x_i \) with respect to a set of moduli \( m_i \):

\[
x_i = x \bmod m_i = \langle x \rangle_{m_i}
\]

Since a value is uniquely represented using smaller residues, the mathematical operations carried out on it can be fast and simple. Addition, subtraction and multiplication are the primary strengths of residue number systems, because these functions may be carried out by directly performing the given operation on the smaller residues, independently for each modulus. Freking and Parhi [29] present an application of RNS arithmetic in public-key cryptography schemes, which inherently use modular exponentiation and multiplication. Similar papers propose the use of RNS arithmetic for applications that fundamentally rely on modular mathematics, such as cryptography and encryption.
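A minimal Python sketch of residue arithmetic is given below. The moduli set (7, 8, 9) is a hypothetical choice made for the example (pairwise coprime, dynamic range 504), and the function names are likewise assumptions. Conversion back to binary uses the Chinese remainder theorem, which is where much of the cost of an RNS typically lies.

```python
from math import prod

MODULI = (7, 8, 9)   # hypothetical pairwise-coprime moduli; range is 7 * 8 * 9 = 504


def to_rns(x, moduli=MODULI):
    """Forward conversion: one small residue per modulus."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Digit-wise addition; no information crosses between residues."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Digit-wise multiplication, again independent per modulus."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Reverse conversion by the Chinese remainder theorem (Python 3.8+)."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)   # pow(..., -1, m): modular inverse
    return total % big_m


if __name__ == "__main__":
    print(from_rns(rns_mul(to_rns(17), to_rns(23))))
```

Results are only meaningful while they stay within the dynamic range of the chosen moduli; comparison, sign detection and division have no such digit-wise form.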

The drawback of such number systems is twofold. First, the representation efficiency of residue number systems is greatly reduced relative to binary notation. A $k$-bit representation, yielding $2^k$ unique values in binary, may only produce half as many in RNS format. Second, any gains in performance achieved in addition, subtraction and multiplication may be eclipsed by the severe complexity of other mathematical and logical operations.

### 2.4.4 Logarithmic Number Systems

In pure mathematics, multiplication and division are easily performed by adding and subtracting the logarithms of the operands, respectively. Since the hardware implementation of addition/subtraction circuits is substantially more straightforward than that of multipliers and dividers, logarithmic number systems (LNS) may be employed to carry out these operations. As with other non-conventional representations, LNS is directed only towards the enhancement of certain operations, and presents severe restrictions on most other standard numeric tasks, such as addition and subtraction.
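The trade-off can be illustrated with a short Python sketch that stores a positive value as its base-2 logarithm. This is a floating-point model for illustration only; the function names are assumptions, and sign and zero handling are omitted.

```python
import math


def to_lns(x):
    """Represent a positive real by its base-2 logarithm."""
    if x <= 0:
        raise ValueError("this sketch handles positive values only")
    return math.log2(x)


def from_lns(lx):
    return 2.0 ** lx


def lns_mul(lx, ly):
    return lx + ly            # multiplication becomes a single addition


def lns_div(lx, ly):
    return lx - ly            # division becomes a single subtraction


def lns_add(lx, ly):
    """Addition needs the non-linear term log2(1 + 2**d), d <= 0."""
    hi, lo = max(lx, ly), min(lx, ly)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


if __name__ == "__main__":
    a, b = to_lns(12.0), to_lns(3.5)
    print(from_lns(lns_mul(a, b)), from_lns(lns_div(a, b)), from_lns(lns_add(a, b)))
```

The correction term in `lns_add` is the kind of function that hardware designs usually approximate with look-up tables, which is why addition and subtraction are the expensive operations in this representation.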

Recently, the Double Base Number System (DBNS) has been proposed as yet another class of redundant number representation [30]. Simple arithmetic operations are made possible by a simple geometric interpretation of the orthogonal bases. DBNS provides logarithmic-like computation with reduced look-up table dimensions. This representation provides yet another alternative for application-specific computation enhancement.

### 2.4.5 Floating Point Number System

In order to achieve the levels of precision demanded by modern systems, it becomes imperative to have a number representation capable of describing real numbers. The limited range and/or precision of fixed-point values is alleviated through the use of the floating-point number system. Unlike fixed-point representations, where the location of the radix point is predefined, floating-point values allow extremely large or small numbers to be represented with the same relative precision by giving the representation a dynamic range.

As defined in the IEEE standard for binary floating-point arithmetic [6], a floating-point value is defined as:

\[ x = \pm f \times b^e \]

where \( x \) is the floating-point value, \( f \) is the fraction or mantissa, \( b \) is the base (fixed at 2 in the binary format) and \( e \) is the exponent.

Floating-point numbers have two distinct representations according to the standard, depending on word size. Figure 2.12 outlines the difference in word structure between the 32-bit single precision and the 64-bit double precision formats. Each word is formed from the sign (\( s \)), exponent (\( e \)) and fraction or mantissa (\( f \)) fields. The mantissa is normalized to lie within the interval \([1, 2)\), such that the MSB is a 1. The leading 1 is therefore removed and understood; this is referred to as the “hidden one”, and saves one bit in the representation. The signed integer exponent is biased such that the stored value is always a non-negative number; the bias is 127 for the single and 1023 for the double precision format.

Figure 2.12  IEEE floating point standard word widths for (a) single precision  (b) double precision
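The single precision word layout can be inspected with Python's standard `struct` module. The sketch below assumes the IEEE 754 field widths of 1 sign bit, 8 exponent bits and 23 fraction bits; the function names are introduced here for illustration, and only normalized values are reconstructed.

```python
import struct


def decode_single(x):
    """Split a value, rounded to single precision, into (sign, exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))   # raw 32-bit pattern
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF        # biased exponent field, 8 bits
    fraction = bits & 0x7FFFFF            # 23 stored bits; the hidden one is absent
    return sign, exponent, fraction


def value_of_normal(sign, exponent, fraction):
    """Rebuild a normalized value (0 < exponent < 255) from its fields."""
    significand = 1 + fraction / 2 ** 23  # restore the hidden one
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)


if __name__ == "__main__":
    fields = decode_single(-6.5)
    print(fields, value_of_normal(*fields))
```

For \(-6.5 = -1.625 \times 2^2\), the stored exponent is \(2 + 127 = 129\) and the fraction field holds \(0.625 \times 2^{23}\).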

The non-trivial implementations of floating-point addition and multiplication are outlined in Figure 2.13 and Figure 2.14 on page 30 respectively [8]. Floating-point numbers are essentially composed of a biased non-negative integer exponent and a fixed-point fractional representation of the mantissa. Consequently, the mathematical operations that are carried out on floating-point numbers use fixed-point arithmetic units, with additional control and rounding circuitry to accommodate the dynamic range. For this reason, the focus of design and hardware implementation of mathematical operations has been on fixed-point, integer units. Conversion to floating point is made possible through additional circuitry.

![Adder & Subtractor](image)

**Figure 2.13** A floating-point Adder / Subtractor scheme [8]

### 2.5 Summary

In this chapter the fundamental concepts of digital multiplication have been presented. In addition, the basic framework for the various parallel digital multiplication algorithms has been discussed. It has been demonstrated that column compression techniques outperform array architectures in speed. In general, the trade-off between performance and complexity is the driving force behind variations in multipliers. Designers have exploited the fundamental design characteristics of the individual multipliers for specific applications. For high-speed computation, the tree multiplier has been the design of choice, whereas low power, low area designs tend to take advantage of the regular layout of the array type multipliers. As always, where only the most basic algorithm is needed, shift-add schemes are drawn on.

A brief introduction to the applications of unconventional number systems in arithmetic systems has also been given. Although the use of such systems presents clear advantages in many specific circumstances, a complete system for generic applications has yet to be formed. While a full DBNS processor for DSP applications has been proposed [31], non-conventional number systems are for the most part targeted at systems capable of exploiting a particular aspect of a representation.

Furthermore, it has been established that although floating-point representations differ in nature from fixed-point or integer representations, the basics of fixed-point arithmetic form the framework for floating-point calculations. For the reasons presented, the remainder of the thesis will target integer implementations of arithmetic circuits in conventional binary format.
Chapter 3

Partial Product Reduction Techniques

The essential multiplication algorithms have been presented thus far. Such schemes form the basis for the multiplication schemes that are employed in modern devices. This chapter will focus on the modifications to the standard column compression multiplier that are made possible by the variations of partial product reduction techniques.

The predominant distinguishing factor amongst tree multiplier schemes lies in the manner in which the column-wise compression of the partial products occurs. A desired aspect of system behavior, such as speed, area, layout or power, may be optimized by the proper selection of a particular approach. In the subsequent sections, methodologies that concentrate on each of the major design criteria will be presented.

### 3.1 Booth Recoding

Prior to delving into partial product reduction schemes, the means by which the partial product array is obtained should be explored. The most primitive formation of the array is obtained by a bit-wise logical AND of the operands. This is the method that has been assumed in the discussion thus far.

It should be noted that recoding of the multiplier operand leads to significant reductions in the partial product array depth. The recoding formalism that has been used extensively in digital multiplier algorithms is known as Booth recoding [32]. Several variations of this recoding style have been presented [33][34], and their implementation and performance have been explored [20][35]. Some researchers believe that the recoding overhead required for the Booth algorithm counteracts the proposed gains [26][36][37]. However, this methodology has been widely applied in multipliers, and so its principles will be highlighted.

In the high radix multiplication algorithm for serial bit-at-a-time multipliers, the effective clock rate was increased by reducing the number of partial products, using multiple multiplier bits at a time. However, the multiples of the multiplicand that were to be used in the scheme had to be calculated separately. Recall that, in the case of the radix-4 multiplier, the bits of the multiplier were grouped into pairs, and the partial products were selected from the set \{0, A, 2A, 3A\}, where A is the multiplicand. In this case a carry propagate addition is required to generate the 3A multiple. Booth's Algorithm reduces the number of partial products without the necessity of any pre-addition to produce the partial products. If the multiples are selected from the set \{-A, 0, A, 2A, 4A\}, they are easily obtained by basic shifting and complementing. The methodology suggests the use of 4A - A, as opposed to 3A.

Figure 3.1 on page 34 [20] outlines a 16 x 16 multiply using the 2-bit version of the Modified Booth Algorithm (MBA), known as Booth-2, MBA-2, or Radix-4 Booth. The name is based on the fact that two bits are recoded at a time, thus halving the number of partial products, or equivalently on regarding the multiplier as a radix-4 operand. The multiplier is subdivided into overlapping 3-bit wide groups, where each group selects a partial product according to the given selection table. To avoid the selection of 3A, depending on the group value, either 4A is pushed into the next most significant group or -A is pushed into the next least significant group. Negation is achieved by a bit-wise inversion of the value (ones complement), and a single bit (marked S in the figure) is added to the least significant bit of the partial product.

Figure 3.1 Dot diagram for a Booth-2 16-bit Multiplication [20]
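The recoding step can be sketched in Python as follows. The sketch uses the common textbook form of the radix-4 selection rule, in which each overlapping 3-bit group yields a digit in \{-2, -1, 0, 1, 2\}; this may be presented differently from the selection table of Figure 3.1. The function names are assumptions, and the multiplier is treated as unsigned.

```python
def booth2_digits(multiplier, k):
    """Radix-4 Booth digits, least significant first, of an unsigned k-bit multiplier."""
    if not 0 <= multiplier < (1 << k):
        raise ValueError("multiplier must fit in k bits")
    padded = multiplier << 1              # implicit 0 below the least significant bit
    digits = []
    for g in range(k // 2 + 1):           # the extra top group covers an unsigned MSB
        group = (padded >> (2 * g)) & 0b111          # overlapping 3-bit window
        hi, mid, lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digits.append(-2 * hi + mid + lo)            # digit in {-2, -1, 0, 1, 2}
    return digits


def booth2_multiply(a, b, k):
    """Sum the recoded partial products d * a * 4**g (behavioral model only)."""
    return sum((d * a) * 4 ** g for g, d in enumerate(booth2_digits(b, k)))


if __name__ == "__main__":
    digits = booth2_digits(0xFFFF, 16)
    print(len(digits), digits)
    print(booth2_multiply(12345, 54321, 16) == 12345 * 54321)
```

For a 16-bit unsigned multiplier the sketch produces 9 digits, matching the count of partial products quoted for the 16-bit example, and every multiple it selects is obtainable from A by shifting and negation alone.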

Although the number of partial products has been reduced from 16 to 9 in the 16-bit example, it is not a true reflection of the overall savings. Further reduction in the number of partial products may be achieved by using larger group sizes for the determination of the partial product; such recoding schemes are known as Booth-3, Booth-4 or higher. It has been demonstrated that, due to circuit complexity, there is little advantage to Booth recoding beyond Booth-3 [20].

As presented earlier, the partial product generation circuitry required for Booth recoding may overwhelm any achievable gains. The advantage of Booth recoding is the reduction of the number of bits per column to be compressed, especially for multipliers having operand lengths greater than single precision. The SYNOPSYS Foundation Design Libraries, for example, offer implementations of fast multipliers using both Booth-Recoded (wall) and non-Booth-Recoded Wallace Tree (nbw) partial product reduction arrays, the latter being used for smaller designs, while the Booth-Recoded version is used for high-speed multiplier designs having larger operand widths [38].

The primary disadvantage of the recoding scheme is its call for 2's complement notation to represent the negative partial products that are generated in the recoding process. Floating-point representation uses signed magnitude notation, thus permitting all multiplication to be carried out using positive values, with a simple sign comparison conducted to determine the sign of the product.

With this in mind, Oklobdzija et al. [36][37] claim that a single row of high order compressors, or properly allocated full adders as described in [39], will achieve the same outcome at a higher level of performance. The use of compressors and high order counters achieves at worst the same level of partial product reduction, in less time [37]. Section 3.3 and Section 3.4 will present the use of high order counters and 4:2 compressors in high-speed multiplication algorithms.

Although Booth recoding does reduce the number of partial product rows, its employment does not affect the general structure of the partial product tree. For this reason, discussions regarding partial product reduction networks for the remainder of this chapter will be limited to non-Booth schemes for clarity; most of the formalisms presented may be extended to have encoding schemes as part of the initial partial product generation strategy.

### 3.2 CSA Reduction Schemes

In chapter 2, the parallel tree multiplier architecture using carry save adder (CSA) arrays was introduced. This scheme has formed the fundamental framework for the design of high-speed parallel multipliers over the past four decades. In this section, the dissimilarities between the Wallace and Dadda techniques will be presented. In addition, there have been various deviations from the core principles of the CSA tree layouts, each developed to introduce regularity, higher speeds, or greater efficiency; these schemes will also be reviewed.

### 3.2.1 Wallace and Dadda Trees

The course of action applied in these two pioneering schemes for parallel multiplication is the three phase procedure outlined in Chapter 2, and reiterated below:

- For a $k \times k$ bit multiplication, a partial product matrix is initially formed. Composed of shifted versions of the *multiplicand*, the matrix is $k$ bits high and $(2k - 1)$ bits wide.

- The matrix is reduced by a set of full adders, also referred to as (3,2) counters or carry save adders (Figure 3.2 on page 37). The matrix is further reduced according to the number of reduction stages required as per Table 2.1 on page 23, until only two rows of partial products remain.

- A final fast adder, such as a carry look-ahead adder (CLA) is used to sum the two remaining partial products.

The total number of stages required for each scheme is the same, whereas the layout and number of required adder cells is not. The Wallace tree multiplier [4] combines the partial product bits at the earliest opportunity, where rows are put together in groups of three and reduced using (3,2) counters. This design was ameliorated by Dadda in his proposal [5], where he suggested combining partial product bits as late as possible, while keeping the critical path length (number of levels) of the tree minimal. This methodology, as confirmed by Habibi and Wintz [29], utilizes the minimum number of counters, leading to simpler CSA tree structures, but requiring a wider final fast adder.

These two schemes were more recently analyzed by Bickerstaff et al. [21] further confirming earlier results. The dot diagram representation of the two schemes is borrowed from this paper, and provided in Figure 3.3 on page 37 for clarification of the above discussion. It is apparent that the Dadda tree does in fact utilize fewer adders during the reduction process, while the Wallace tree tends to insert adders at the earliest opportunity. Although slightly more irregular, the Dadda scheme presents a more efficient design. An analytic discussion on the minimization of adders and reduction stages will be presented in the upcoming sections.

Figure 3.2 The Carry Save Adder. (a) Dot diagram of an individual full-adder cell. (b) Several full-adders used to form one level of a CSA

Figure 3.3 Variations of multiplier partial product reduction trees using CSAs [21]. (a) Dadda implementation. (b) Wallace implementation

### 3.2.2 Minimum Reduction Stage Requirement

As outlined in Chapter 2, considering the reduction process as a column-wise compression of the partial products, where each stage reduces the column height by a factor of 2/3, a simple relationship between maximum column height and number of required stages for compression may be obtained. This relation is outlined in Table 2.1 on page 23. The series is formed using an expansion involving rounded representations of the preceding values, making an exact algebraic interpretation difficult, if not impossible, to formulate. There have been many attempts at providing a simple formula for the exact number of reduction stages \( h \) required for an arbitrary operand size \( k \). Danysh and Swartzlander use the following simplification in their analysis [18]:

\[
h = 2 \left( \log_2 k - 1 \right)
\]

This may be rearranged to represent the maximum column height \( n(h) \) for a given number of reduction stages \( h \) as:

\[
 n(h) = 2^{\left(\frac{h}{2} + 1\right)}
\]

This relation holds true only for the most typical operand sizes, and fails in most other cases. By returning to the formula presented above for the iterative procedure used in calculating the values in Table 2.1, some interesting bounds may be obtained. Parhami [8] has shown that, by ignoring the floor operation, the theoretical upper and lower bounds of

\[
 n(h) = \left\lfloor \frac{3}{2} n(h - 1) \right\rfloor
\]

may be established as:

\[
 n(h) \leq n_H(h) = 2 \cdot \left(\frac{3}{2}\right)^h
\]

\[
 n(h) > n_L(h) = 2 \cdot \left(\frac{3}{2}\right)^{h-1}
\]

The original expansion, denoted \( n(h) \), and the two mathematical boundaries are shown graphically in Figure 3.4.

Figure 3.4  Actual expansion values and theoretical bounds of minimum full adder requirements in multi-operand addition

By assuming that the actual value lies midway between the two boundaries, a novel algebraic representation may be developed using the average of the extremities:

\[
\begin{aligned}
n(h) &= \frac{n_L(h) + n_H(h)}{2} \\
&= \frac{2 \cdot \left( \frac{3}{2} \right)^{h-1} + 2 \cdot \left( \frac{3}{2} \right)^h}{2} \\
&= \left( \frac{3}{2} \right)^h + \left( \frac{3}{2} \right)^{h-1} \\
&= \left( 1 + \frac{3}{2} \right) \cdot \left( \frac{3}{2} \right)^{h-1} \\
&= \frac{5}{2} \cdot \left( \frac{3}{2} \right)^{h-1} \\
\therefore \quad n(h) &= \frac{5}{3} \cdot \left( \frac{3}{2} \right)^h
\end{aligned}
\]

This new representation has been plotted, along with the boundary curves, in Figure 3.4. It is apparent that this simplification may be used as an accurate model of the relationship between the number of stages of reduction required and a given operand size, since it almost completely overlaps the original expression derived using the expanded values. As a demonstration of the superiority of this new scheme, a comparison between it and three other logarithmic simplifications is graphically presented in Figure 3.5. The new averaging approximation and the other methods, labeled “Log (Danysh)” as used in [18], “Log (Davio)” as outlined in [40], and “Log (Song)” as described in [41], are defined below:

\[
\text{New Averaging Approximation} \quad n(h) = \frac{5}{3} \cdot \left(\frac{3}{2}\right)^h
\]

\[
\text{Log (Danysh)} \quad n(h) = 2^{\left(\frac{h}{2} + 1\right)}
\]

\[
\text{Log (Davio)} \quad n(h) = 1 + \left(\frac{3}{2}\right)^h
\]

\[
\text{Log (Song)} \quad n(h) = 2 \cdot \left(\frac{3}{2}\right)^h
\]
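The four closed forms can be tabulated against the exact recurrence with a short Python sketch. The function names are assumptions made for this example; the Danysh expression is taken in its exponential form \( 2^{(h/2 + 1)} \), as rearranged earlier in this section.

```python
def exact(h):
    """Exact series: n(0) = 2, n(h) = floor(3/2 * n(h - 1))."""
    n = 2
    for _ in range(h):
        n = (3 * n) // 2
    return n


def averaging(h):
    return (5 / 3) * 1.5 ** h        # mean of the lower and upper bounds


def danysh(h):
    return 2 ** (h / 2 + 1)


def davio(h):
    return 1 + 1.5 ** h


def song(h):
    return 2 * 1.5 ** h              # same expression as the upper bound n_H(h)


def lower_bound(h):
    return 2 * 1.5 ** (h - 1)        # n_L(h)


if __name__ == "__main__":
    print(" h  exact  averaging  danysh   davio    song")
    for h in range(11):
        print(f"{h:2d}  {exact(h):5d}  {averaging(h):9.2f}  {danysh(h):6.2f}"
              f"  {davio(h):6.2f}  {song(h):6.2f}")
```

The tabulation makes it easy to see, stage by stage, how far each simplification drifts from the exact column heights as \( h \) grows.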

### 3.2.3 Minimum Full Adder Requirement

The previous section discussed the development of an algebraic approach to calculating the minimum number of CSA stages with respect to operand size. In this section, the focus will be placed on the calculation of the minimum number of adder cells required in a partial product reduction matrix. The foundation for the following analysis was set forth in [5], with a detailed elaboration presented in [25].

Considering a symmetric \(k \times k\) multiplication, a partial product matrix of \(k\) rows by \((2k - 1)\) columns is formed. Ascending integers are used to identify the columns, beginning with 0 and working to the right until \(2k - 2\), as in Figure 3.6.

Figure 3.5  Algebraic representation of minimum full adder requirements

Figure 3.6  An 8x8 multiplication partial product matrix

We now introduce \( p(j) \) to denote the number of “partial products” in the \( j^{th} \) column. From the symmetry in the partial product matrix outlined in Figure 3.6, the following generalization may be made:

\[
p(j) = p(2k - 2 - j) = j + 1 \quad \text{for} \quad 0 \leq j \leq k - 1
\]

To account for the carry bits that are produced throughout the reduction process, \( e(j) \) is introduced. This “expanded column size” is the sum of the partial product bits in the \( j^{th} \) column and all carry bits that flow in from the \( (j - 1)^{th} \) column. Let \( q(j) \) be the number of adders needed to reduce \( e(j) \) to one. A full adder absorbs three bits from a column and generates two bits, one in the same column and one in the next (more significant) column. In effect, a full adder eliminates 2 bits from its column, disregarding the carry bits. Thus, the following relations must hold true:

\[
q(j) = \frac{e(j) - 1}{2} \quad \text{for odd } e(j)
\]

\[
q(j) = \frac{e(j)}{2} \quad \text{for even } e(j)
\]

\[
(1) \quad q(j) = \left\lfloor \frac{e(j)}{2} \right\rfloor \quad \text{in general}
\]

Each adder in column \( j - 1 \), whether a full adder or a half adder, will contribute one carry bit to column \( j \). Using this, the general formula for the expanded column size may be stated as:

\[
(2) \quad e(j) = p(j) + q(j - 1)
\]

Combining equations (1) and (2), and expanding the recursive relation using the starting condition that the first column does not need an adder \( (q(0) = 0) \), we obtain:

\[
q(j) = q(2k - 1 - j) = j \quad \text{for} \quad 1 \leq j \leq k - 1
\]

Thus, the number of required adders is:

\[
q_T = 2 \cdot \sum_{j=1}^{k-1} q(j) = 2 \cdot \sum_{j=1}^{k-1} j = k(k - 1)
\]

where the expression was solved using the property:

\[ \sum_{j=1}^{N} j = \frac{1}{2} N(N + 1) \]

As mentioned previously, the statements have assumed reduction to a single row; however, this is not necessary for partial product reduction. Digital multipliers use a final fast adder, due to its superior carry management, for the summation of the final two rows of the reduced partial product matrix. To account for this, the redundant adders (\( 2k - 2 \) adders, to be exact) are removed from the adder count, resulting in a total count of:

\[ q_T = (k - 2)(k - 1) \]

\( (k - 1) \) of which are half adders.
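The column recurrences (1) and (2) can be evaluated numerically to confirm the closed-form totals. The Python sketch below is a check of the counting argument only, not a model of a reduction tree; the function name `adder_counts` is an assumption, and the \( 2k - 2 \) adders removed for the two-row case are taken from the text rather than derived.

```python
def adder_counts(k):
    """Return (adders to reach one row, adders to reach two rows) for k x k."""
    if k < 2:
        raise ValueError("operand size must be at least 2")
    total = 0
    q_prev = 0                                    # no carries enter column 0
    for j in range(2 * k):                        # columns 0 .. 2k-1
        # p(j) = p(2k - 2 - j) = j + 1 for 0 <= j <= k - 1; the last column is empty
        p = min(j, 2 * k - 2 - j) + 1 if j <= 2 * k - 2 else 0
        e = p + q_prev                            # equation (2): expanded column size
        q = e // 2                                # equation (1): adders in this column
        total += q
        q_prev = q
    return total, total - (2 * k - 2)


if __name__ == "__main__":
    for k in (4, 8, 16, 32):
        one_row, two_rows = adder_counts(k)
        print(f"k = {k:2d}: {one_row} adders to one row, {two_rows} to two rows")
```

For an 8 x 8 multiplication this gives 56 adders for reduction to a single row and 42 when a final fast adder sums the last two rows.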

### 3.2.4 Variations of CSA Trees

The inevitable evolution of computer arithmetic has brought with it the development of the basic concepts of CSA tree multipliers. Over time, many researchers have contributed to the enhancement of the CSA partial product reduction scheme, targeting area, delay, regularity, and power. Naturally, the advancement of circuit technology created new and more efficient implementations of the CSA on a circuit level; however, the focus here will be on algorithmic and architectural modifications.

The most straightforward modification of the Dadda/Wallace tree multiplier is the incorporation of Booth's algorithm for an overall cutback of the total number of partial products fed into the reduction network. The Booth Encoded Wallace tree multiplier has been the pinnacle high-speed multiplication algorithm used in many digital design suites, including Synopsys. Millar et al. [34] have critically analyzed this technique in terms of area overhead and latency. Although crude estimations using gate count and gate delay were used as the test metrics, their results have shown that potential exists for a 16% decrease in delay, at a cost of 28% increased complexity, for a 32x32 multiplier.

A different approach to the standard tree has been explored by Hekstra and Nouta [10]. In their attempt to increase the regularity of the parallel tree multiplier, they propose an "array of arrays" based structure. By taking advantage of the simple layout of array multipliers, they propose the use of sub arrays to sum portions of the partial product matrix. The main array then sums the partial sums formed by the sub arrays to form the product (see Figure 3.7). The choice of the number of stages and sub array height has a significant impact on the performance of the multiplier. It has been demonstrated that although this scheme promotes circuit regularity, it can at best achieve a multiplication time of $O(\sqrt{k})$.

![Figure 3.7 Array of arrays layout as outlined in [10]](image URL)

The "Windsor Multiplier" [25] demonstrates that gains in terms of area and interconnect length may be achieved through the reallocation of adder blocks within a Dadda tree. In this paper, Wang et al. verify that for a $k$-bit symmetric multiplication (both operands being $k$ bits), the total number of adders will be $N = (k - 1)(k - 2)$. Furthermore, by referring to Section 2.3.3 on page 22, the number of stages of reduction required for a $k$-bit multiplication will equal $h$, so long as:

$$ n(h - 1) < k \leq n(h) $$

The distribution of the $N$ adders to the $h$ stages is the focus of attention of this paper. It has been confirmed in [25] that many arrangements may exist which yield the same overall result using the same number of adders and stages. A procedure for allocating adders to a CSA reduction tree has been presented that enables maximum area efficiency in terms of cell distribution across a given region of silicon. In addition, this technique may be modified to pursue minimum inter-stage interconnect lengths. The significance of this work is that it is one of the pioneering papers in addressing interconnect configuration as a vital element in high-speed design.

The final amendment to the standard scheme that will be covered in this section deals with the physical interconnection of the adder cells once they have been placed. The notion of the exploitation of fast input and fast outputs in arithmetic sub-cells has been introduced by Oklobdzija et al. [39]. An algorithmic approach to speed optimization in partial product reduction trees has been proposed that draws on the inherent timing characteristics of adder cells.

Figure 3.8 depicts a typical logic level representation of a full adder. With the aid of this diagram, it becomes apparent that the delay from the different inputs to the outputs is not constant. By connecting the fast outputs to the fast inputs, a minimal path length is formed which may be applied to the critical path. The method presented in the paper suggests that through careful modeling of the relationship between input and output delays, followed by global optimization of the interconnections, consistent delays may be achieved through each path.

![Logic diagram of a Full-Adder](image-url)

The proposed Three Dimensional Minimization (TDM) methodology, though a different approach from Dadda's reduction schemes, maintains the same number of cells and levels as the previous method, while concurrently making use of fast data paths. Though advantageous over a standard Dadda arrangement, use of the TDM leads to more complex interconnections and layout. The analysis in the two papers discussing this approach [39][42] neglects the adverse effects of complicated interconnect and layout plans.

3.3 High Order Counters And Compressors

3.3.1 Counters

Although the work of Dadda has been directly linked to CSA reduction schemes, his manuscript [5] had a much broader focus, encompassing the applications of parallel counters for partial product reduction. The full adder, or carry save adder, is a particular member of the class of parallel counters. A parallel \( (N, M) \) counter is defined as a combinational network having \( M \) outputs and \( N \leq 2^M - 1 \) inputs of equal weight. The \( M \) outputs encode the number of logic 'ones' that appear at the \( N \) inputs. Any size counter may be constructed, so long as the \( M \) output bits are sufficient to represent all possible sums of the \( N \) inputs. Examples of typical counters include (3,2), (7,3) and (15,4), a few of which are depicted in Figure 3.9.
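As a behavioural illustration of this definition, the following minimal Python sketch models an \( (N, M) \) counter. The function name `parallel_counter` and the LSB-first output ordering are assumptions made for this example only; the size check simply enforces that the \( M \) output bits can represent every possible count from 0 to \( N \).

```python
def parallel_counter(bits, m):
    """Behavioural model of an (N, M) parallel counter.

    bits: list of N equally weighted input bits (0 or 1).
    m:    number of output bits.
    Returns the count of ones as a list of m bits, least significant first.
    """
    n = len(bits)
    # The m outputs must be able to represent every count from 0 to n.
    if n > 2 ** m - 1:
        raise ValueError(f"{m} output bits cannot represent a count of {n}")
    count = sum(bits)
    return [(count >> i) & 1 for i in range(m)]


# (3,2) counter, i.e. a full adder: two ones give binary 10.
print(parallel_counter([1, 0, 1], 2))              # [0, 1]
# (7,3) counter: five ones give binary 101.
print(parallel_counter([1, 1, 0, 1, 1, 0, 1], 3))  # [1, 0, 1]
```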

Stenzel et al. [24] expanded the notion of the parallel counter by introducing counters which can receive several successively weighted input columns. Counters of this type are denoted as \( (C_{k-1}, C_{k-2}, ..., C_0, d) \), where \( k \) is the number of input columns, \( C_i \) is the number of input bits in the column of weight \( 2^i \), and \( d \) is the number of bits in the output word. Several examples of such multi-column input counters are provided in Figure 3.10.
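The notation can be made concrete with a short behavioural sketch. The function `multicolumn_counter` below is a hypothetical helper written for this illustration; it takes the columns least significant first, which is the reverse of the order used in the \( (C_{k-1}, ..., C_0, d) \) notation.

```python
def multicolumn_counter(columns, d):
    """Behavioural model of a multi-column counter.

    columns: list of bit lists; columns[i] holds the C_i inputs of weight 2**i.
    d:       number of output bits.
    Returns the weighted sum as d bits, least significant first.
    """
    # Largest possible weighted sum: every input bit set.
    max_sum = sum(len(col) << i for i, col in enumerate(columns))
    if max_sum > 2 ** d - 1:
        raise ValueError(f"{d} output bits cannot represent a sum of {max_sum}")
    total = sum(sum(col) << i for i, col in enumerate(columns))
    return [(total >> b) & 1 for b in range(d)]


# A (5,5,4) counter: five bits of weight 2, five bits of weight 1, 4 outputs.
# Maximum sum is 5*2 + 5*1 = 15, which just fits in 4 bits.
weight_1 = [1, 1, 1, 0, 0]   # three ones of weight 1
weight_2 = [1, 1, 0, 0, 0]   # two ones of weight 2
print(multicolumn_counter([weight_1, weight_2], 4))  # 3 + 4 = 7 -> [1, 1, 1, 0]
```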

Counters accepting inputs from adjacent columns in many cases introduce the requirement for carry propagation, thus making them inferior to standard full-adder arrays. By deviating from the normal layout and construction of binary counters, several new alternatives may be formed. Highly efficient counters employing parallelism in handling large numbers of input bits have been proposed for high-speed arithmetic circuits [43][44]. These counters employ speed-up techniques used in fast adders, such as carry-select or carry-skip. As a means of overcoming the carry propagation dilemma, the use of redundant number representations for counters has also been presented. The gains offered by these proposals may be questioned, since a redundant format will need conversion circuitry to binary format, and sequential counters will necessitate complex clock trees throughout the design.

![Figure 3.9 Examples of Dot Representations of Parallel Counters](image)

![Figure 3.10 Examples of Dot Representations of Multi-column Counters](image)
Using multi-column and high order counters, a larger portion of the partial product matrix may be reduced in one cell. This comes at the expense of larger input and node capacitances, and longer pull-down paths, because the number of input transistors increases quadratically with the number of inputs [41]. Furthermore, the inherent irregularity of the manner in which such counters reduce the matrix forces various counter structures to be used within the same reduction process.

High order counters for partial product reduction have been explored and implemented in numerous proposals [8,23,24,32,41,43-49]. The schemes demonstrating the most promise for general multiplier architectures having arbitrary operand sizes include low order counter classes based on full adders, and 4:2 compressors. An interesting variation, in terms of the fundamental circuitry involved, is the class of high order counters based on threshold logic (TL) [50-52]. The use of threshold logic, more specifically capacitive threshold logic (CTL), has been suggested for arithmetic circuitry [53][54], and will be subsequently discussed in Section 3.6.

### 3.3.2 Compressors

Similar to counter structures, digital compressors are used to reduce a given set of inputs to a vector output. The primary distinction between counter and compressor circuits is that compressors do not necessarily follow the standard pattern of $M$ outputs drawn from at most $2^M - 1$ inputs. An $[N : M]$ compressor in essence is a variation of a counter that employs a separate path between compressor units in order to generate $M$ final outputs using $N > 2^M - 1$ input bits. Compressor configurations are generally formed using arrays of horizontally interconnected compressor units. In this manner, a horizontal carry signal may propagate laterally across the row of compressor units in order to account for the excess bits formed in the reduction process.

Higher order classes of compressors may also be formed using variations of large counters with horizontal interconnections; however, these circuits suffer the same drawbacks of high capacitance, large circuitry, and problematic matrix positioning as large counters. Song and DeMichelli [41] have examined the implementation of higher order compressors. Labeled as the 9:2 family of compressors, these structures are formed using 4:2 compressors and (3,2) counters. An analysis of counters against compressors has been carried out by Mehta et al. [45]. In their research, the use of (7,3) counters against (7:3) compressors, amongst many others, has been evaluated. Their findings showed no major delay advantage in the use of large compressors over large counters, while the inter-cell wiring of the compressors introduced greater interconnect complexity.

The most widely used style of digital compressor, which displays several promising characteristics for multiplication applications, is the 4:2 compressor. This special class of compressors warrants a section of its own.

### 3.4 4:2 Compressors

Since its introduction by Weinberger in 1981 [55], the 4:2 compressor has soared in popularity in many digital multiplication and multi-operand addition schemes. The application of 4:2 compressors has also been the focus of several studies promoting its use over Booth recoding schemes [26][36][37]. This section provides an in-depth look at the various configurations of 4:2 compressors, and a novel layout scheme for optimal placement.

#### 3.4.1 Structure Of 4:2 Compressors

The 4:2 compressor transformed the standard frame of mind of counter based partial product reduction schemes by introducing the notion of horizontal data paths within stages of reduction. Though not technically a counter, since it is impossible to use 2 output bits to represent the count of 4 binary input bits, the 4:2 compressor is based on a (5,3) counter structure. The *vertical*, or cross-stage, data path forwards the outputs on to the next stage, while the *horizontal*, or intra-stage, interconnections allow for the propagation of the generated carry bits. The horizontal carry transmission is limited to only one cell, due to the offset nature of the internal structure of the circuit.

Figure 3.11 Cascaded full adders composing a basic 4:2 compressor

The most primitive representation of the 4:2 compressor is a pair of cascaded full-adders (Figure 3.11). This configuration does not reduce the overall delay of the column compression process; in fact, it may increase the overall delay depending on the column heights [39]. Compressors of this form do, however, allow for increased regularity of design.
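The cascaded arrangement can be checked exhaustively with a small behavioural model. The sketch below assumes that the first full adder receives three of the four column inputs and the second receives its sum together with the fourth input and the horizontal carry in; the exact pin assignment of Figure 3.11 may differ, and the function names are chosen for this example only.

```python
from itertools import product


def full_adder(a, b, c):
    """(3,2) counter: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor built from two cascaded full adders.

    Returns (sum, carry, cout): sum stays in the same column, while carry
    (to the next stage) and cout (horizontal) both have the next higher weight.
    """
    s1, cout = full_adder(x1, x2, x3)    # first adder produces the horizontal carry
    s, carry = full_adder(s1, x4, cin)   # second adder absorbs the neighbour's carry
    return s, carry, cout


for x1, x2, x3, x4, cin in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(x1, x2, x3, x4, cin)
    # The five input bits are preserved as one sum bit and two carries.
    assert x1 + x2 + x3 + x4 + cin == s + 2 * (carry + cout)
    # cout does not depend on cin, so the horizontal carry cannot ripple.
    assert cout == compressor_4_2(x1, x2, x3, x4, 1 - cin)[2]

print("all 32 input combinations verified")
```

The second assertion reflects why the horizontal carry travels only one position: the carry out is produced entirely by the first adder, before the carry in is consumed.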

A step towards the reduction of the delay of the compressor may be achieved by applying the concept of fast inputs and outputs. Oklobdzija et al. [39] present a modified 4:2 compressor composed of properly stacked full adders with a minimized critical path delay. Figure 3.12 outlines this methodology. The optimal 4:2 compressor structure, in terms of gate delay, arises when the entire structure is regarded as one entity, as opposed to a composition of two full adders. This enables further optimization of the overall scheme, as depicted in Figure 3.13 [36]. In this design, the critical path of the compressor has been reduced from 4 XOR gates, as in the full adder arrangements, to 3.
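To illustrate how treating the cell as a single entity shortens the XOR chain, the sketch below uses one commonly cited XOR and multiplexer formulation of the 4:2 compressor. It is offered as an assumption for illustration and is not necessarily the exact circuit of Figure 3.13; the function name is hypothetical. The sum output passes through three XOR levels (`t1` or `t2`, then `t3`, then `s`).

```python
from itertools import product


def compressor_4_2_mux(x1, x2, x3, x4, cin):
    """4:2 compressor in an XOR/multiplexer form (illustrative sketch)."""
    t1 = x1 ^ x2               # XOR level 1
    t2 = x3 ^ x4               # XOR level 1 (in parallel with t1)
    t3 = t1 ^ t2               # XOR level 2
    s = t3 ^ cin               # XOR level 3: sum output
    cout = x3 if t1 else x1    # multiplexer: majority of x1, x2, x3
    carry = cin if t3 else x4  # multiplexer: carry to the next stage
    return s, carry, cout


for x1, x2, x3, x4, cin in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2_mux(x1, x2, x3, x4, cin)
    # Same arithmetic behaviour as two cascaded full adders.
    assert x1 + x2 + x3 + x4 + cin == s + 2 * (carry + cout)

print("XOR/mux formulation verified for all 32 input combinations")
```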

The circuitry of the 4:2 compressor may be further enhanced if the actual design of the cell is taken down from a gate level description to transistor level. Although it is a valid basis for estimating delay, the gate level description of any arithmetic circuit provides only an approximation of performance. Just as the compressor block was further optimized by decomposing the internal arrangement from two full adders to a series of gates, so too can the gate level design. By dismantling the gates down to the transistor level, further optimization may be carried out, reducing both transistor counts and path delays. This topic, of critical importance to modern arithmetic design, will be further explored in chapter 4.

Figure 3.12  4:2 compressor layout making use of fast input/output paths

Figure 3.13  A minimized gate level representation of a 4:2 compressor

3.4.2 Proposed Optimized Compressor Layout

An arbitrary distribution of 4:2 compressors, though effective, may not be entirely efficient. Just as Dadda's scheme takes into account the minimization of the number of counters required per stage, a similar plan is required for 4:2 compressor arrays. Unlike standard counters, the layout of the 4:2 array presents some unique challenges. In a given row of compressors, the middle cells may all be regarded as devices that convert 4 input bits in a column to 2 output bits of differing weight. However, the two compressors at the opposite ends of the row must be regarded as (5, 3) counters. The first compressor takes in 5 inputs from a given weight, while the last compressor generates 3 output bits, two of which are in the next higher weight.

The process for delegating compressors must follow the same principles that have been established by Dadda [5], and later used by Stenzel et al. in their analysis of variable length counters for partial product reduction [24]. It is necessary to first determine the number of stages required for a given partial product matrix size. Similar to the discussion presented earlier regarding the minimum number of full adders, we now examine the 4:2 compressor distribution. Consider a column within the matrix having height \( n(h) \), where:

\[
\begin{aligned}
    n(h) &= \lfloor 2\,n(h - 1) \rfloor \\
    n(h - 1) &= \left\lfloor \frac{n(h)}{2} \right\rfloor
\end{aligned}
\]

This expression is a straightforward extension of the fact that the 4:2 compressor effectively cuts the number of bits in a column by half. Expanding the recurrence yields the correspondence between column height and the number of required stages shown in Table 3.1.

### Table 3.1 Max column height per stage of a 4:2 column compression tree

<table>
<thead>
<tr>
<th>\( h \)</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
</tr>
</thead>
<tbody>
<tr>
<td>\( n(h) \)</td>
<td>3</td>
<td>4</td>
<td>8</td>
<td>16</td>
<td>32</td>
<td>64</td>
<td>128</td>
<td>256</td>
</tr>
</tbody>
</table>

Unlike the (3,2) counter case, however, there exists an exact algebraic formula relating the column height to the number of required compressor stages:

\[
h = \left\lceil \log_2 n(h) - 1 \right\rceil
\]

where once again, \( n(h) \) represents column height, and \( h \) represents the number of stages.

With the development of a general relationship between column height and the number of compression stages, an optimization strategy can be formed. As with the previous methods for counter distribution [5][24], the minimized allocation of 4:2 compressors will be one that reduces the column height just sufficiently to meet the maximum column height for the next level.

A novel mathematical analysis has been carried out on the minimum 4:2 compressor distribution, borrowing from the principles used in [25] in obtaining the theoretical bounds of full adders in a partial product matrix. Re-introducing the variables used in Section 3.2.3, we denote the number of “partial products” in the \(j^{th}\) column as \(p(j)\), where due to the symmetry of the matrix,

\[
p(j) = p(2k - 2 - j) = j + 1 \quad \text{for} \quad 0 \leq j \leq k - 1
\]

The “expanded column size”, \(e(j)\), for each column is the summation of the partial product bits in the \(j^{th}\) column, and all carry bits that flow in from the \((j - 1)^{th}\) column. Finally, \(q(j)\) is the number of compressors necessary to reduce \(e(j)\) to two, and their relation may be expressed algebraically as:

\[
q(j) = \left\lceil \frac{e(j)-2}{4} \right\rceil
\]

A 4:2 compressor takes in 5 bits from one column (4 partial products and one carry-in), and generates 3 bits, two carry bits to the next column and a sum bit in the same column. As a whole, a 4:2 compressor will eliminate 4 bits from column \(j\), while adding two bits to column \(j + 1\). With this in mind, \(e(j)\) may be expressed as:

\[
e(j) = p(j) + 2 \cdot q(j - 1)
\]

Combining the expressions for \(q(j)\) and \(e(j)\), and expanding the recursive equation, we obtain:

\[
q(2k - j) = q(j) = \left\lfloor \frac{j}{2} \right\rfloor \quad \text{for} \quad 2 \leq j \leq k - 1
\]

\[
q(j) = \left\lfloor \frac{k - 1}{2} \right\rfloor \quad \text{for} \quad j = k
\]
From these relations, we can calculate the total number of 4:2 compressors as:

\[ N_C = 2 \cdot \sum_{j=2}^{k-1} q(j) + q(k) \]

For even operand sizes, the number of compressors is:

\[ N_C = 2 \cdot \left( 2 \cdot \sum_{j=1}^{\frac{k-2}{2}} j \right) + \left( \frac{k-2}{2} \right) \]

\[ = 4 \cdot \frac{1}{2} \left( \frac{k-2}{2} \right) \left( \frac{k}{2} \right) + \left( \frac{k-2}{2} \right) \]

\[ = \left( \frac{k-2}{2} \right) (k + 1) \]

Whereas for odd operand sizes, the number of compressors becomes:

\[ N_C = 2 \cdot \left( 2 \cdot \sum_{j=1}^{\frac{k-1}{2}} j \right) - \left( \frac{k-1}{2} \right) \]

\[ = 4 \cdot \frac{1}{2} \left( \frac{k-1}{2} \right) \left( \frac{k+1}{2} \right) - \left( \frac{k-1}{2} \right) \]

\[ = k \left( \frac{k-1}{2} \right) \]
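The closed-form totals can be cross-checked by evaluating the column recursion directly. The sketch below is a numerical check only, with hypothetical function names; it applies \( e(j) = p(j) + 2 \cdot q(j-1) \) and \( q(j) = \lceil (e(j) - 2)/4 \rceil \) column by column and compares the total with the even and odd formulas above.

```python
def partial_products(j, k):
    """p(j): number of partial product bits in column j of a k x k matrix."""
    if 0 <= j <= k - 1:
        return j + 1
    if k <= j <= 2 * k - 2:
        return 2 * k - 1 - j   # mirror image: p(j) = p(2k - 2 - j)
    return 0


def compressors_per_column(k):
    """q(j) for every column, obtained from the recursion on e(j)."""
    q = []
    carries_in = 0
    for j in range(2 * k):
        e = partial_products(j, k) + carries_in   # expanded column size e(j)
        q_j = -(-(e - 2) // 4)                     # ceil((e - 2) / 4)
        q.append(q_j)
        carries_in = 2 * q_j                       # two carries into column j + 1
    return q


def closed_form_total(k):
    """N_C from the closed-form expressions for even and odd operand sizes."""
    if k % 2 == 0:
        return (k - 2) // 2 * (k + 1)
    return k * (k - 1) // 2


for k in range(2, 33):
    assert sum(compressors_per_column(k)) == closed_form_total(k)

print(compressors_per_column(8))  # [0, 0, 1, 1, 2, 2, 3, 3, 3, 3, 3, 2, 2, 1, 1, 0]
print(closed_form_total(8))       # 27
```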

The mathematical analysis provides a means by which the minimum number of 4:2 compressors required in a reduction matrix may be estimated for an arbitrary operand size. A further optimized approach has been developed, enabling a more efficient distribution of the compressors in a reduction matrix. Prior to introducing the iterative procedure developed for minimum 4:2 compressor placement, a set of preliminary definitions and explanations will be presented for clarity:

a) The number of required stages, \( h \), and the targeted maximum column height, \( n(h) \), for the first reduction stage are solved using:

\[
    h = \left\lceil \log_2 n(h) - 1 \right\rceil
\]

\[
    n(h) = 2^{h+1}
\]

The maximum column height for each subsequent stage is determined according to:

\[
    n(h-1) = \left\lceil \frac{n(h)}{2} \right\rceil
\]

where \( n(h) \) is the current maximum column height, and \( n(h-1) \) is the maximum column height for the next stage. For any given maximum column height, the maximum column height for the next stage may be calculated according to:

\[
    n(h-1) = 2^{\left\lceil \log_2 n(h) - 1 \right\rceil}
\]

b) The previously defined ascending integer index will be used, assigning 0 to the least significant column and 2\(k\)-2 to the most significant.

c) A 4:2 compressor row will be defined as a succession of linked 4:2 compressor cells beginning with one half-adder, and ending with one full adder cell. The connection of the cells is outlined in Figure 3.14.

d) A row of 4:2 compressors will be added for every 2 bits by which a given column height must be reduced for the next stage. It should be pointed out that without the half-adder commencing the row, the first compressor in each row would in effect behave as a [5:1] compressor. This becomes apparent when one considers that the compressor cell takes 5 bits from a column, including the CARRY IN, which is of equivalent weight and so may be taken from the same column. Furthermore, the output for this compressor would consist of two bits of the next higher weight: a CARRY bit for the next stage, and a CARRY OUT signal that propagates horizontally. Only one bit will be passed on to the equivalent weighted column of the next stage, that being the SUM. For increased regularity and efficiency, the half-adder replaces the initial compressor.

e) The carry out of the final compressor for each row must be accounted for in the allocation of compressors, since the CARRY OUT signal is not being absorbed within the same stage. Thus, for the following stage, the column of the next weight will receive two carry bits as a result of the row of compressors. The row of compressors may need to be extended if the two bits exceed the maximum column height for the next stage. The final full adder ending a given row within a stage will account for the additional bit generated.

![Diagram](image_url)

**Figure 3.14 Definition of a 4:2 compressor row**
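Definition (a) above can be sketched numerically as follows. The function names are hypothetical, and the expression `(n - 1).bit_length()` is an implementation choice used here to evaluate \( \lceil \log_2 n \rceil \) in exact integer arithmetic rather than with floating point logarithms.

```python
def stages_required(n):
    """h = ceil(log2(n) - 1) for a column height n >= 2."""
    return (n - 1).bit_length() - 1


def next_max_height(n):
    """2 ** ceil(log2(n) - 1): the maximum column height for the next stage."""
    return 2 ** stages_required(n)


def height_schedule(n):
    """Maximum column height at every stage, from the initial height down to 2."""
    heights = [n]
    while heights[-1] > 2:
        heights.append(next_max_height(heights[-1]))
    return heights


print(stages_required(16), height_schedule(16))  # 3 [16, 8, 4, 2]
print(stages_required(24), height_schedule(24))  # 4 [24, 16, 8, 4, 2]
```

For a height that is not a power of two, such as 24, the first stage only has to bring the tallest columns down to the next lower power of two.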

With the necessary background information in place, the proposed iterative procedure for the compressor layout scheme for symmetric multiplication will be presented:

1) Determine the number of compressor rows \( N_R \) required for the given stage according to the equation:

\[
N_R = \left\lfloor \frac{n(h) - 2^{\left\lceil \log_2 n(h) - 1 \right\rceil}}{2} \right\rfloor
\]

where the expression, \( 2^{\left\lceil \log_2 n(h) - 1 \right\rceil} \), refers to the maximum column height for the next stage.
2) \( N_R \) rows of 4:2 compressors, as outlined in Figure 3.14, are placed in the partial product reduction tree. The first row (\( i = 0 \)) will begin at column:

\[ j_{0F} = 2^{\left\lceil \log_2 n(h) - 1 \right\rceil} \]

and end at column:

\[ j_{0L} = 2k - 1 - 2^{\left\lceil \log_2 n(h) - 1 \right\rceil} \]

Every subsequent row will begin at column:

\[ j_{iF} = 2^{\left\lceil \log_2 n(h) - 1 \right\rceil} + 2i \]
\[ j_{iF} = j_{(i-1)F} + 2 \]

and end at column:

\[ j_{iL} = \left( 2k - 1 - 2^{\left\lceil \log_2 n(h) - 1 \right\rceil} \right) - 2i \]
\[ j_{iL} = j_{(i-1)L} - 2 \]

where \( i \) represents the row number within each stage, counted from 0 for the first row, up to \( N_R \) rows.

3) Repeat steps (1) and (2) until only two rows remain within the partial product matrix, at which point a final fast adder will be used.
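The three steps can be sketched as a short routine. This is an illustrative reading of the procedure, not a definitive implementation: it assumes that the row boundaries use the same next-stage height \( 2^{\lceil \log_2 n(h) - 1 \rceil} \) as step (1), that rows within a stage are indexed from \( i = 0 \), and that the initial column height equals the operand size \( k \). It has been exercised here only for the 16x16 and 24x24 cases, and the function names are hypothetical.

```python
def next_max_height(n):
    """2 ** ceil(log2(n) - 1), evaluated with integer arithmetic."""
    return 2 ** ((n - 1).bit_length() - 1)


def compressor_rows(k):
    """Return (stage_height, first_column, last_column) for each compressor row."""
    rows = []
    n = k                                   # tallest column of a k x k matrix
    while n > 2:                            # step 3: repeat until two rows remain
        target = next_max_height(n)
        n_rows = (n - target) // 2          # step 1: N_R rows for this stage
        for i in range(n_rows):             # step 2: place the rows
            first = target + 2 * i
            last = (2 * k - 1 - target) - 2 * i
            rows.append((n, first, last))
        n = target
    return rows


for stage_height, first, last in compressor_rows(16):
    print(stage_height, first, last, last - first + 1)
# The final row printed is "4 2 29 28": columns 2 to 2k - 3, i.e. 2k - 4 cells.
```

Sorting the row lengths for \( k = 16 \) gives 28, 24, 20, 16, 12, 8 and 4 cells, which is a sequence of the form \( 2k - 4 - 4i \).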

This process is better explained with the aid of a graphic example. Figure 3.15 and Figure 3.16 provide two examples of the minimized 4:2 compressor cell distribution for column compression. The symmetric layout of the compressor rows, in addition to the general configuration of each row (beginning with a half-adder and ending with a full-adder), is now evident.

Using the minimized 4:2 compressor distribution presented, we can obtain the bound on the number of compressor cells required. Observe that the longest compressor row for \(k \times k\) multiplication begins at column \(j_F = 2\) and ends at column \(j_L = 2k - 3\), and so is composed of \(2k - 4\) compressor cells. Through a simple observation, which may be proved algebraically by examining the column height equations, the number of columns (and therefore compressor cells) in each subsequent row will be \(2k - 4 - 4i\), where \(i\) represents the compressor row number. The total number of rows \(N_{RT}\) required is:

\[
N_{RT} = \left\lceil \frac{k - 2}{2} \right\rceil
\]

The total number of compressor cells \(N_C\) will be the summation of the cells within each row over all of the rows required. This expression may be represented algebraically as:

\[
N_C = \sum_{i=0}^{\left\lceil \frac{k-2}{2} \right\rceil - 1} 2 \left( k - 2 - 2i \right)
\]

Of this total, the number of cells which are half or full adders will be:

\[
N_A = 2 \cdot \left\lceil \frac{k - 2}{2} \right\rceil
\]

![Diagram of compressor layout for a 16x16 multiplication](image)

**Figure 3.15**  Compressor layout for a 16x16 multiplication
Figure 3.16  Compressor layout for a 24x24 multiplication

Thus the number of 4:2 compressors alone \((N_{C^*})\) will be reduced to:

\[
N_{C^*} = N_C - N_A
\]

\[
N_{C^*} = \sum_{i=0}^{\left\lceil \frac{k-2}{2} \right\rceil - 1} 2 \left( k - 2 - 2i \right) - \left( 2 \cdot \left\lceil \frac{k-2}{2} \right\rceil \right)
\]
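As a quick numerical illustration, the expressions for \( N_{RT} \), \( N_C \), \( N_A \) and \( N_{C^*} \) can be evaluated directly. The sketch below uses the summation limit of the \( N_C \) expression, \( \lceil (k-2)/2 \rceil - 1 \), for both totals; the function name is hypothetical.

```python
def cell_counts(k):
    """Return (N_RT, N_C, N_A, N_C*) for a k x k multiplication."""
    n_rt = -(-(k - 2) // 2)                               # ceil((k - 2) / 2) rows
    n_c = sum(2 * (k - 2 - 2 * i) for i in range(n_rt))   # cells over all rows
    n_a = 2 * n_rt                                        # one half and one full adder per row
    return n_rt, n_c, n_a, n_c - n_a


print(cell_counts(16))  # (7, 112, 14, 98)
print(cell_counts(24))  # (11, 264, 22, 242)
```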

This scheme presents an efficient, and very regular, distribution of compressor cells. Through the use of half and full adders for beginning and terminating each row respectively, redundant reduction is all but eliminated, and the cells, for the most part, have complete occupancy at their inputs. This new scheme reduces the number of required 4:2 compressors relative to the theoretical bounds calculated previously. There is an average reduction of 2.724% in the total number of cells, and an average decrease of 8.171% in 4:2 reduction cells, if this new scheme is employed as opposed to a standard minimum distribution of 4:2 compressors. Appendix A provides the calculations for the given values.

The previous analysis makes it possible to forecast the inherent characteristics of partial product reduction schemes involving (3,2) counters and 4:2 compressors. The most significant algebraic calculations include cell count, maximum gate delay and interconnect count. The particular cells are naturally technology and layout dependent; however, the number of cells required under each scheme can offer implicit estimations of power and area.

A complete summary of the comparison between the proposed optimized 4:2 distribution and the minimized (3,2) Dadda tree is given in Appendix A. Some of the findings are plotted in Figure 3.17, Figure 3.18 and Figure 3.19. The calculation for the total cell count has been provided in this chapter for both full adder and compressor reduction schemes. As expected, the number of cells in the 4:2 scheme is significantly lower than that of the full adder reduction scheme; however, the individual cells are in fact much larger.

The presented methodology for 4:2 compressor distribution is by no means the only strategy that may be employed in partial product reduction arrays. It merely provides an algebraic and systematic method for the development of the reduction array having the least number of cells. Other distribution strategies involve the use of the maximum number of compressors as early as possible in order to minimize the final fast adder length [56][57][58]. The strategy employed in these 54-bit multiplier designs is similar to a Wallace tree layout using full adders, whereas the optimized distribution scheme is more closely linked to an optimized Dadda tree layout.

Figure 3.17 Total Number of cells in a partial product reduction tree
(NFA = Number of (3,2)  NCT = Number of total cells for 4:2)

Figure 3.18 Total Number of interconnects in a partial product reduction tree
(IFA = Interconnects for (3,2)  IC = Interconnects for 4:2)

The cell count also leads to the calculation of the total number of interconnects within the reduction tree. The number of wires is taken as a function of the number of cells and their outputs. For the (3,2) cells, which have 2 outputs, the interconnect count, $I_{FA}$, is simply the sum of the outputs of the cells and the square of the operand size, the latter accounting for the original partial product matrix formation:

$$I_{FA} = 2 \cdot N_{FA} + k^2$$

The 4:2 compressor interconnect count is calculated in a similar fashion, using a count of 3 for each compressor and 2 for each adder:

$$I_C = 3 \cdot N_{C^*} + 2 \cdot N_A + k^2$$

The interconnect calculations show the advantage of the 4:2 compressor scheme; however, there is a further, hidden benefit in this scheme. Approximately one third of the interconnects in the 4:2 scheme are the local, horizontal intra-stage carry paths. Thus, the advantages are twofold: fewer and shorter interconnects for the compressor reduction process. The delay for the two schemes is simply the product of the number of reduction stages and the approximate delay per stage (2 XOR for the (3,2) and 3 XOR for the 4:2).
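The delay and interconnect estimates can be tabulated with a few lines of code. This sketch rests on two assumptions: the number of (3,2) stages is taken from the standard Dadda height sequence 2, 3, 4, 6, 9, 13, 19, 28, and the number of 4:2 stages from \( \lceil \log_2 k - 1 \rceil \). The function names are hypothetical, and the full adder count \( N_{FA} \) must be supplied by the caller.

```python
def dadda_stages(k):
    """Number of (3,2) counter stages for a maximum column height k."""
    limit, stages = 2, 0
    while limit < k:
        limit = limit * 3 // 2   # next height in the sequence 2, 3, 4, 6, 9, ...
        stages += 1
    return stages


def compressor_stages(k):
    """Number of 4:2 compressor stages: ceil(log2(k) - 1)."""
    return (k - 1).bit_length() - 1


def interconnects_fa(n_fa, k):
    """I_FA = 2 * N_FA + k^2."""
    return 2 * n_fa + k * k


def interconnects_42(n_compressors, n_adders, k):
    """I_C = 3 * (4:2 compressors) + 2 * (half and full adders) + k^2."""
    return 3 * n_compressors + 2 * n_adders + k * k


for k in (16, 24):
    delay_fa = 2 * dadda_stages(k)        # 2 XOR delays per (3,2) stage
    delay_42 = 3 * compressor_stages(k)   # 3 XOR delays per 4:2 stage
    print(k, delay_fa, delay_42)
# 16 12 9
# 24 14 12

# 16x16 with 98 compressors and 14 adders in the 4:2 tree:
print(interconnects_42(98, 14, 16))  # 578
```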

![Figure 3.19](image.png)

3.5 Low Power

As discussed in chapter 1, low power design techniques have emerged as one of the dominant fields of VLSI research. The expansive applications of mobile electronics, more specifically portable processing units, have created new challenges for digital circuit designers in their attempt to minimize power dissipation. Arithmetic circuitry is by no means immune to these new trends. In fact, low power design methodologies factor significantly in new ALU architectures.

The current dichotomy in digital system design is the need for devices that are both high performance and low power. As with any other design process, this issue is one of compromise, where in this situation it is the trade-off between speed, power and area. In terms of power, there are four predominant concerns related to power dissipation in VLSI designs:

- Portable devices, relying on batteries as an energy source must make use of low power techniques in order to prolong battery life and operating time.
- The energy consumed by devices is for the most part dissipated as heat. Current techniques for dispersing the heat away from the device packaging will not be able to cope with the increased densities. It has been predicted that in 15 years the amount of heat dissipated will exceed packaging limitations by 25 times [9].
- Overheating as a result of high current densities leads to effects such as electromigration, and other forms of breakdown, resulting in shortened life expectancies.
- High current densities and voltages induce greater coupling effects, leading to an increased susceptibility to erroneous computation and data transfer.

For true 'power conscious' design methodologies, all sources of dissipation, and all opportunities for minimization must be understood and taken into account at every level of the design process [59]. Low power circuitry, including circuit implementations and analysis of low power adders and compressors, will be discussed in chapter 4. This discussion will begin with an overview of sources of power dissipation in modern circuitry, followed by techniques used to alleviate some of the power losses.

3.5.1 Leakage And Short Circuit Power Dissipation

In a standard CMOS process, there are three sources of power dissipation, namely dynamic, short circuit and leakage power [59][60]. Leakage power dissipation refers to any source of power loss that may occur under steady state conditions as a result of small leakage currents in the devices. This type of power consumption is a direct consequence of the chosen logic style, and topology.

Short circuit power dissipation is a form of dynamic power consumption that occurs due to a momentary dc path established between the supply and ground rails during circuit switching. Standard CMOS logic is prone to such power losses, especially in the case of a digital inverter, where only two transistors exist between the power and ground rails. An accurate estimation of short circuit power dissipation is provided in [61]. The combined effects of short circuit and leakage power may be limited to only 15% of the total power consumption, with the remainder caused by dynamic power dissipation [60].

3.5.2 Dynamic Power Dissipation

Dynamic power consumption is the primary source of power dissipation in modern CMOS digital design. Dynamic power loss arises as a result of the charging and discharging of capacitances (some parasitic) due to signal switching, or unintentional transitions as a result of timing effects. The equation for dynamic power dissipation is defined in [7] as:

\[
\begin{aligned}
P_{\text{Dynamic}} &= \sum_i \left( C_{i_{\text{load}}} \cdot V_{i_{\text{swing}}} \cdot \alpha_i \right) \cdot f_{\text{clk}} \cdot V_{DD} \\
&\quad + \sum_i \left( K_i \cdot \alpha_i \right) (V_{DD} - 2V_T)^3 \cdot f_{\text{clk}}
\end{aligned}
\]

where:

\[ K_i = \frac{\beta_t}{12} \]

and the summation occurs over all nodes \( i \) of the circuit.

The individual components in the expression are defined as:

\[
\begin{aligned}
C_{i_{\text{load}}} & \quad \text{load capacitance at node } i \\
V_{i_{\text{swing}}} & \quad \text{voltage swing} \\
\alpha_i & \quad \text{switching activity factor} \\
f_{\text{clk}} & \quad \text{system clock frequency} \\
V_{DD} & \quad \text{power supply voltage} \\
V_T & \quad \text{transistor threshold voltage} \\
\beta & \quad \text{transistor gain factor}
\end{aligned}
\]

At the algorithm or architecture level of description, the key algebraic representation for power may be simplified to:

\[
P_{\text{Dynamic}} = \alpha \ C_L V_{DD}^2 f
\]

In this representation, \( C_L \) is the physical load capacitance, \( V_{DD} \) is the supply voltage, \( f \) is the operating frequency and \( \alpha \) is the switching activity. This expression is the dynamic power dissipation for one node; naturally the dynamic power of the overall system would be the summation of every node in the circuit. Assuming ideal input conditions are utilized (zero fall and rise times during transitions), then the total power consumption is independent of the transistor characteristics, and the generalized formula may be used [62].
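A brief numerical sketch of the simplified expression shows the quadratic dependence on supply voltage. All component values below are hypothetical and chosen only for illustration; the function name is likewise an assumption of this example.

```python
def dynamic_power(alpha, c_load, v_dd, f):
    """Simplified dynamic power of one node: alpha * C_L * V_DD^2 * f (watts)."""
    return alpha * c_load * v_dd ** 2 * f


# Hypothetical node: 10 fF load, 100 MHz clock, switching activity of 0.25.
p_full = dynamic_power(alpha=0.25, c_load=10e-15, v_dd=2.0, f=100e6)
p_half = dynamic_power(alpha=0.25, c_load=10e-15, v_dd=1.0, f=100e6)

print(p_full)            # about 1e-06 W
print(p_half / p_full)   # 0.25: halving V_DD quarters the dynamic power


def total_dynamic_power(nodes, v_dd, f):
    """Sum over (alpha, C_L) pairs, one per node of the circuit."""
    return sum(dynamic_power(alpha, c_load, v_dd, f) for alpha, c_load in nodes)
```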

The switching activity is the average rate of switching, or number of edge transitions (0 to 1, or 1 to 0) per clock cycle. The switching activity factor is introduced since the generalized power formula assumes a transition from 0 to \( V_{DD} \) occurs during every clock cycle, which is not necessarily the case in real circuits. The power dissipated in a transition is in fact the dissipation of heat through a PFET as the output node charges to \( V_{DD} \), and the dissipation of heat through the NFET as the output node discharges the stored charge. It is said that one half of the energy drawn from the supply is dissipated by the PFET [62], and only half is used to store charge on the output node. No energy is drawn from the supply during the discharge phase.
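The definition of switching activity as edge transitions per clock cycle can be estimated from a sampled logic trace. The sketch below is an illustrative measurement helper with a hypothetical name; it assumes one sample per clock cycle and counts both rising and falling edges, as in the definition above.

```python
def switching_activity(trace):
    """Average number of edge transitions per clock cycle for one node.

    trace: sequence of logic values (0 or 1), one sample per clock cycle.
    """
    if len(trace) < 2:
        raise ValueError("at least two samples are needed")
    transitions = sum(a != b for a, b in zip(trace, trace[1:]))
    return transitions / (len(trace) - 1)


# Three edges (0->1, 1->0, 0->1) over eight clock cycles.
print(switching_activity([0, 1, 1, 0, 0, 0, 1, 1, 1]))  # 0.375
# A node that toggles every cycle has an activity of 1.0.
print(switching_activity([0, 1, 0, 1, 0]))              # 1.0
```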

By examining the given expression, it is clear that a reduction in any of the given parameters will lead to a reduction in power. Lowering the operating frequency is one means of reducing the power dissipation; however, it is in direct conflict with performance requirements, and so is usually considered as a last resort. Having a quadratic relationship with power, the operating voltage is the most critical factor in reducing power. There are a number of methods for reducing voltages in circuitry, such as dynamic voltage scaling [63], multiple on chip voltage levels, and simply reducing the supply voltage due to process advancements. Although theoretically effective, the first two techniques are still in their infancy, and the last technique is process and technology dependent, and is not influenced by the design.

Reduction in parasitic capacitances would result in the charging and discharging of smaller capacitive nodes. This has the benefit of decreasing not only the power dissipation, but the switching time also. The push towards deep sub-micron technology has led to smaller device dimensions, requiring shorter interconnects; both are critical in limiting the parasitic capacitance. The only issue that comes into play on an architectural level is the effect of parasitic and coupling capacitance, which is aggravated by technology scaling. This is a topic of special interest in this thesis, and will be further explored in chapter 5.

Finally, one of the most significant parameters of dynamic power dissipation is the
switching activity, or the average rate of switching on a circuit's nodes. Limiting the
switching activity, regardless of technology and implementation details, may drastically
reduce dynamic power consumption in arithmetic circuitry. By analyzing the switching
characteristics of arithmetic algorithms, optimal circuit configurations may be developed
for low power applications.

3.5.3 Dynamic Power Management

As a whole, the most obvious means of reducing power is by shutting down a circuit completely. *Dynamic power management* refers to the selective shut down or slow-down of a system's components that may be idle or underutilized [64]. Computational circuits are on the whole event-driven; for this reason there will be periods of time during which the system will be inoperative. By removing power from these dormant regions, dynamic power dissipation within these regions will be eliminated.

One of the most efficient means of reducing power dissipation in digital circuits is clock gating. As described in [65-68], clock gating is used to cut off the clock signal from portions of the clock tree in synchronous systems. In this manner, the affected sections of the system will be forced into an idle state, preventing wasteful switching of the internal nodes.

For purely combinational systems, such as most partial product reduction trees, a technique known as *guarded evaluation* may be employed to shut down circuits during periods of inactivity. In this approach the entry of inputs to the main circuit, or portions of the system, is obstructed. Without a change in inputs, the circuit will remain at a halt, drastically reducing power dissipation.

*Precomputation networks* fall in line with the concept of guarded evaluation, since they are used to block signals from entering and manipulating all or part of a system. The precomputation network is used to perform a preliminary calculation to determine the nature and size of the input vector entering the main circuitry. By taking advantage of a simple circuit up front, considerable savings in power may be realized in the overall system. For example, such a circuit may be put in place to determine the correlation between consecutive inputs to a multiplier. Based on its findings, a multiplication process may be carried out; otherwise the previous output may simply be reused and the new inputs discarded.
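
The multiplier example can be sketched behaviourally. This is a minimal model that assumes the simplest possible precomputation test, an equality check against the previous operands; the class name `GuardedMultiplier` and the `evaluations` counter are hypothetical and serve only to show how often the main circuit would be exercised.

```python
class GuardedMultiplier:
    """Behavioural model: skip the multiplication when the operands repeat."""

    def __init__(self):
        self._last_inputs = None
        self._last_product = None
        self.evaluations = 0  # number of times the main multiplier is exercised

    def multiply(self, a, b):
        # Precomputation step: a cheap comparison with the previous operands.
        if (a, b) == self._last_inputs:
            return self._last_product  # reuse the old output; main circuit idle
        self.evaluations += 1
        self._last_inputs = (a, b)
        self._last_product = a * b
        return self._last_product


mult = GuardedMultiplier()
stream = [(3, 5), (3, 5), (3, 5), (4, 5), (4, 5)]
results = [mult.multiply(a, b) for a, b in stream]
print(results, mult.evaluations)
```

A real precomputation network would be a small logic block in front of the multiplier inputs, and its own power cost has to be weighed against the switching it prevents.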
Algorithm inherent activity is the switching activity that has to occur in any realization of the design, independent of the implementation style chosen [59]. This particular topic is of great interest in architectural design, since it is technology and process independent. A designer has the ability of modifying an algorithm in order to prevent unnecessary transitions in logic states through a system.

The work of Muhammad et al. [69] provides an excellent example of limiting algorithm inherent activity with specific implications on digital multipliers. In this study, various configurations of array multipliers were evaluated against 4:2 tree multipliers for DSP applications. The research examined the effects of signal strength of the DSP input signals, and consequently the location of zeros in the operands of the multipliers, on switching activity. This problem may be more clearly defined by way of an example.

Assume that an array multiplication structure is used, having a multiplier $A$ and a multiplicand $B$. Any input having $a_i = 0$ will pass a row of zeros, and those where $a_i = 1$ will pass on the bits of $B$. In Figure 3.20, the authors depict a scenario where bits $a_3$ and $a_7$ of the multiplier $A$ are interchanged in the array structure, to demonstrate the independence of the row position to the overall structure. Such structures are possible since array architectures are concerned with adding partial products from cells occupying the same column; the order in which the partial products appear is insignificant.
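
The row-order independence described above can be checked with a short sketch. It models each row as either zero or a shifted copy of $B$ and sums the rows in two different orders; the operand values and helper names are illustrative assumptions, and the model says nothing about the switching activity of a physical array.

```python
def partial_products(a, b, width):
    """One row per multiplier bit a_i: zero if a_i = 0, else B shifted left by i."""
    return [((a >> i) & 1) * (b << i) for i in range(width)]


def accumulate(rows, order):
    """Sum the rows in the given order, as successive array rows would."""
    total = 0
    for i in order:
        total += rows[i]
    return total


a, b, width = 0b10001000, 0b1011, 8   # only a_3 and a_7 are set
rows = partial_products(a, b, width)

natural = list(range(width))
swapped = natural[:]
swapped[3], swapped[7] = swapped[7], swapped[3]  # interchange rows 3 and 7

zero_rows = sum(1 for r in rows if r == 0)
print(accumulate(rows, natural), accumulate(rows, swapped), zero_rows)
```

Both orders yield the same product, so the designer is free to choose the order that keeps the rows of zeros where they suppress the most switching.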

In the case where the least significant bits (LSB) of $A$ are 1's and the most significant bits (MSB) are 0's, then it is clear that the output will have to propagate down through the remainder of the bits until the final bit of $A$ is applied, generating undue switching activity. If for the same example, a most-significant-bit first strategy is employed, the switching activity will be limited to the final portion of the array structure, and the top portion will remain idle due to the rows of zeros (Figure 3.21). Furthermore, the authors present hybrid array structures, where based on a priori signal probabilities and statistics, a certain partial product order is to be selected that minimizes the switching activity.
Figure 3.20  Interchanging rows within an Array Multiplier structure

Figure 3.21  Hybrid MSB first Array Multiplier structure
In the domain of general application circuitry, the research in [69] found that due to the balanced nature of tree multipliers, strong correlation in signals reduces switching activity automatically. These results are promising for general DSP architectures employing tree multipliers, since data signals are for the most part correlated, and rapid changes are seldom processed. The tree multiplication scheme lends itself to general applications, without the need for customized inner structures to meet predetermined statistical values.

3.6 Threshold Logic

Threshold logic offers a unique alternative to the traditional Boolean methodology of switching logic, restricted to only two possible states within a system, namely on (1) or off (0). Introduced in the early 1960's, threshold logic gates, otherwise known as majority gates, can compute any linearly separable Boolean function, where the result is based on the weighted sum of its inputs relative to a specified threshold value [70]. Threshold logic is closely related to artificial neural networks, in that each gate is formed by a large fan-in of input signals having an associated weight, and the gate (or node) must determine the output state based on the total input voltage level.

A primitive example of a threshold logic gate is depicted in Figure 3.22 on page 70. In this example the four inputs each have an associated weight of 1, and the gate has a threshold value of 4. Thus the output is active if the combined weight of the active inputs is greater than or equal to four. This only occurs in the case where all input signals are high, and so a four-input AND gate is formed.
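
The gate described above can be modelled in a few lines. The sketch below is a purely logical model that ignores the analog behaviour of a real threshold gate; the function names are hypothetical, and the three-input majority example is an added illustration rather than a circuit taken from [70].

```python
from itertools import product


def threshold_gate(inputs, weights, threshold):
    """Return 1 when the weighted sum of the inputs reaches the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0


def and4(bits):
    # Four unit weights and a threshold of 4, as in the example above.
    return threshold_gate(bits, (1, 1, 1, 1), 4)


def majority3(bits):
    # Added illustration: unit weights with a threshold of 2 give a majority gate.
    return threshold_gate(bits, (1, 1, 1), 2)


for bits in product((0, 1), repeat=4):
    print(bits, and4(bits))
```

Changing only the threshold changes the function: with three unit-weight inputs and a threshold of 2, the same node computes the majority function, which is the carry output of a full adder.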

Figure 3.22 Threshold Gate implementation of a 4-input AND gate [70]
Threshold logic gates form the most theoretically efficient counter structures, since a group of nodes, one for each output, are all that is required in the formation of a high order counter. This offers a huge advantage over the gate level implementation of the high order counter structures. The challenge with such arithmetic implementations lies in the creation of the gate structure itself; the inherently analog threshold gate presents a great hindrance to the digital designer.

The formation of threshold gates based on the use of capacitors to hold the charge and trigger the output of the gate has been one solution [70][71][72]. This capacitive threshold logic (CTL) methodology suffers from noise sensitivity and obvious technology scaling issues, since each CTL gate will have to be completely custom designed for each fabrication process, and for each technology generation of a fabrication process. Variations of the threshold logic gate involving implementations using MOSFET circuits have been proposed [73], along with several other derivatives focusing on noise immune [74], low power [75], and latch implementations [76].

Without delving too deeply into the precise silicon layout of such devices, it is intuitive to conclude that the implementation of arithmetic counters using such unconventional design methods will have to be based on custom and technology specific applications. The analog nature of the threshold logic gate limits its employment in general digital architectures, and although they have demonstrated superior performance results in many cases, their restricted physical use inhibits their advancement.
Chapter 4

Arithmetic Circuitry

Until now, the bulk of the discussion has been devoted to the analysis of algorithms and architectural techniques employed in computer arithmetic. Another facet of this broad topic that must also be considered is the low level implementation of arithmetic techniques. This chapter deals with the details and the techniques involved with the physical design and layout of the aforementioned high level algorithms.

The traditional philosophy towards computer arithmetic has been the development of algorithms, and system level architectures that may be employed for the enhancement of numerical operations. Algorithms would be developed by computer scientists or mathematicians, and would then be implemented by a hardware engineer. It becomes apparent, however, that this approach will lead to sub-optimal solutions. Modern researchers, for the most part, have an outstanding comprehension of both VLSI design methodologies, as well as numerical techniques. This is only natural, since the thrust of present-day research deals with the adaptation of existing techniques to suit the latest integrated circuit technologies.
This chapter will begin with an overview of logic styles for arithmetic circuitry, followed by an in depth analysis of several newly proposed logic styles. The use of novel "pass-logic" design techniques for arithmetic sub-cells, with focus on full adders and 4:2 compressor layouts, will be presented next. Finally, an overall analysis of the arithmetic circuitry presented throughout the chapter will be presented, along with some comments and considerations for the development and simulation of novel circuitry.

4.1 Logic Styles

Digital design encompasses a wide variety of logic implementations, which arise, for all intents and purposes, as a result of the transistor configurations composing the individual logic elements. The synthesis of the particular digital system (or sub-system) will dictate the nature of the particular logic family chosen. In a survey of logic styles, Zimmermann and Fichtner [77], outline the various characteristics of the final digital system that are dictated by the initial selection of the logic style chosen for implementation. The factors include:

- **Circuit delay:** a function of the number of inversion levels, the number of transistors in series, transistor sizes (i.e., channel widths), and intra- and inter-cell wiring capacitances.

- **Circuit size:** depends on the number of transistors and their sizes and on the wiring complexity.

- **Power dissipation:** determined by the switching activity and the node capacitances (made up of gate, diffusion, and wire capacitances), the latter of which in turn is a function of the same parameters that also control circuit size.

- **Wiring complexity:** the number of connections and their lengths in addition to the choice of single-rail or dual-rail logic

- **Generality:** ease-of-use of logic gates in standard cell design techniques and logic synthesis

- **Robustness:** determined by the resilience to voltage and transistor scaling as well as varying process and working conditions

- **Compatibility:** ability to seamlessly integrate with the surrounding circuitries
All of these characteristics may vary considerably from one logic style to another and thus make the proper choice of logic style crucial for circuit performance. Several logic styles, and logic families will be presented in each section, along with their potential vantage points and applications in arithmetic circuitry.

4.1.1 Static CMOS

Static CMOS logic, otherwise known as standard CMOS logic, is the logic style of choice for most implementations, and is most often used in the development of standard cell libraries for automated digital synthesis. The principle behind a static logic cell is that it exhibits a well-defined output once the inputs are stabilized and the switching transients have decayed away. The cell is composed of complementary NFET and PFET networks, where the input voltages control the conductance of the networks. The switching network is designed such that only one network is a closed switch for any input combination, thus determining whether $V_{DD}$ or GND is connected to the output. Figure 4.1 on page 75 outlines a typical static CMOS cell configuration.

The reason behind the use of two separate networks is due to the physical nature of the MOSFET structures. The notion of threshold voltage loss inhibits the arbitrary transmission of signals through any MOS device. An NFET is designed such that a strong logic '0' is passed, while a logic '1' is passed with a threshold voltage loss according to:

$$V_{out} = V_{DD} - V_{TN}$$

Conversely, a PFET is designed to pass a strong '1', and a logic '0' with a threshold voltage gain according to:

$$V_{out} = |V_{TP}|$$

Figure 4.2 on page 75 provides graphs of this phenomenon in the two types of devices.
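
As a worked illustration of the two expressions above, the sketch below clamps the passed voltage at the corresponding limit. It assumes an NFET with its gate held at $V_{DD}$ and a PFET with its gate held at ground, ignores the body effect, and uses illustrative values ($V_{DD} = 3.3$ V, $V_{TN} = 0.7$ V, $|V_{TP}| = 0.8$ V) that do not come from any particular process.

```python
def nfet_pass(v_in, v_dd, v_tn):
    """Voltage passed by an NFET with gate at V_DD: limited to V_DD - V_TN."""
    return min(v_in, v_dd - v_tn)


def pfet_pass(v_in, v_tp_abs):
    """Voltage passed by a PFET with gate at ground: cannot fall below |V_TP|."""
    return max(v_in, v_tp_abs)


V_DD, V_TN, V_TP_ABS = 3.3, 0.7, 0.8  # assumed, illustrative values

print("NFET passing 0:", nfet_pass(0.0, V_DD, V_TN))   # strong '0'
print("NFET passing 1:", nfet_pass(V_DD, V_DD, V_TN))  # degraded '1'
print("PFET passing 1:", pfet_pass(V_DD, V_TP_ABS))    # strong '1'
print("PFET passing 0:", pfet_pass(0.0, V_TP_ABS))     # degraded '0'
```

Each device passes one logic level cleanly and degrades the other, which is why the complementary networks use PFETs to pull up and NFETs to pull down.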

As mentioned previously, the individual networks of a logic cell are composed of an interacting group of transistors. The layout of the transistors ensures the proper functioning of the circuit, whereas the sizing of the transistors governs the DC switching voltages and transient switching times. The charge/discharge times of the circuit ($T_{LH}$, $T_{HL}$) are determined by the transistor aspect ratios. The aspect ratio, the ratio between channel width and length, is very important in the device geometry as it determines the current flow, and is the easiest parameter to control since it is defined by the device layout.

**Figure 4.1** Static CMOS logic cell depicting the NFET and PFET networks

**Figure 4.2** (a) NFET threshold voltage loss (b) PFET threshold voltage gain; both panels plot $V$ vs. $t$
The advantage of such logic families is in the simplicity of developing a circuit that will perform a given function, however complex, while providing robust performance measures. Static logic, for the most part, demonstrates excellent noise immunity, and is less susceptible to process variation since the sizing of individual transistors does not vastly alter the circuit's functionality. One disadvantage of this type of logic is the use of a large number of PFET devices, which are both slow and large in comparison to NFETs. Furthermore, the longest chain within each network will determine the worst-case scenario for charge/discharge delay, forcing more complex systems to carry out potentially sluggish execution.

### 4.1.2 Transmission Gate Logic

The CMOS transmission gate (TG) is designed to act as a very efficient voltage-controlled switch, and was one of the fundamental building blocks in SSI and MSI technologies. It is formed by a parallel combination of one NFET and one PFET device, as depicted in Figure 4.3, set up in such a manner as to allow a full-voltage swing output based on the control signal. The use of transmission gates to form logic cells simplifies the design of many involved circuits, by allowing signals to determine the conduction path of other signals. The formation of multiplexor cells using TG logic is one straightforward application of this type of logic.

**Figure 4.3 Transmission Gates**  (a) Transistor level  (b) Logic symbol

The downfall of TG logic lies in its requirement of a control signal and its complement, thus increasing interconnect and signal requirements. In addition, the output node does not
receive voltage support since there is no pure path to either the supply voltage or to ground. For this reason the input signal must be able to drive the output capacitance, leading to potential difficulties in high-fanout applications.

### 4.1.3 Dynamic Logic Families

Standard static CMOS logic maintains a valid output voltage so long as the inputs are well defined and continuous. Dynamic CMOS logic, on the other hand, makes use of capacitive nodes to store electrical charge, and so is capable of sustaining a valid output only for a short period of time. The advantage of such logic families lies in their ability to transfer charge quickly; they in turn have a tremendous performance advantage over static CMOS, and are common in high speed applications. Dynamic circuits differ from static circuits in that, instead of fighting the constant limits due to parasitic RC elements, capacitances are used as integral components of the circuits.

Though there are several distinct logic families that fall under the dynamic CMOS classification, there is a common underlying principle behind their operation. In general, charge is supplied by the supply voltage to a few select capacitive nodes during pre-specified clock times. The stored charge is then used to control the movement of other charges. A basic capacitive node is formed when a transistor is in the cutoff region; the isolated node at its gate may be modeled as a storage capacitor ($C_s$). Figure 4.4 [78] provides a circuit diagram of this simple model.

**Figure 4.4 Capacitive Nodes** (a) Basic circuit (b) Storage Capacitor model
The name 'dynamic logic' is in direct reference to the operating characteristics of such circuits. Unlike static designs that maintain their output values indefinitely, the outputs of dynamic circuits are constantly changing, and have only a short window in time within which they are valid. By using capacitive nodes to hold a charge, the circuit is prone to leakage current effects given by:

$$I_{\text{leak}} = -\frac{dQ_S}{dt} = -C_S \frac{dV_S}{dt}$$

Maintaining $I_{\text{leak}}$ and $C_S$ constant and integrating we obtain:

$$V_S(t) = V_{\text{max}} - \frac{I_{\text{leak}}}{C_S} t$$

In the case of an NFET driving the capacitive node, the transistor will slowly remove the charge from the node. Figure 4.5 on page 79 graphically illustrates the voltage drop at a capacitive node due to leakage effects. From the previous expression we can define the maximum logic 1 hold time as:

$$t_H \equiv \frac{C_S}{I_{\text{leak}}} (V_{\text{max}} - V_1)$$

When a PFET drives the capacitive node (Figure 4.6), it in actuality adds charge to the node, thus forcing a maximum logic 0 hold time:

$$t_H \equiv \frac{C_S}{I_{\text{leak}}} (V_0 - V_{\text{min}})$$
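
The hold-time expressions can be evaluated numerically to get a feel for the magnitudes involved. The sketch below assumes a constant leakage current, as in the derivation above, and uses illustrative values ($C_S = 50$ fF, $I_{\text{leak}} = 1$ pA, $V_{\text{max}} = 2.6$ V, $V_1 = 1.5$ V, $V_0 = 1.0$ V, $V_{\text{min}} = 0$ V) that are not taken from any particular process.

```python
def node_voltage(v_max, i_leak, c_s, t):
    """Linear decay V_S(t) = V_max - (I_leak / C_S) * t for constant leakage."""
    return v_max - (i_leak / c_s) * t


def hold_time(c_s, i_leak, v_start, v_limit):
    """Time for the node to drift from v_start to v_limit at constant leakage."""
    return c_s * abs(v_start - v_limit) / i_leak


C_S, I_LEAK = 50e-15, 1e-12          # assumed: 50 fF node, 1 pA leakage
V_MAX, V_1 = 2.6, 1.5                # assumed logic 1 start level and limit
V_MIN, V_0 = 0.0, 1.0                # assumed logic 0 start level and limit

t_h1 = hold_time(C_S, I_LEAK, V_MAX, V_1)   # NFET case: charge is removed
t_h0 = hold_time(C_S, I_LEAK, V_0, V_MIN)   # PFET case: charge is added

print("logic 1 hold time (s):", t_h1)
print("logic 0 hold time (s):", t_h0)
print("V_S at t_H (V):", node_voltage(V_MAX, I_LEAK, C_S, t_h1))
```

Under these assumed values the hold times are in the tens of milliseconds; a larger leakage current or a smaller storage capacitance shortens them proportionally, which sets a lower bound on the clock frequency.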

Dynamic logic design must be regarded as a system level design methodology, since it is almost impossible, and utterly imprudent to incorporate dynamic and static cells within a digital module. In fact, this notion may be extended further as to say that the combination of differing dynamic styles within a module should be avoided. The nature of the control and distribution of the signals through charged nodes is for the most part incompatible between the logic styles. To further exemplify this, a few of the prominent dynamic logic families will be presented.
The most basic type of dynamic logic is known as precharge-evaluate logic. This is a two-state logic style where a clock signal controls a pair of complementary FETs managing the operation of the logic gate (Figure 4.7 on page 80). The two stages of operation are known as the precharge and evaluate stages. During precharge, the output node is charged via the precharge PFET while the evaluate NFET is cut off; this is known as "pre-conditioning" the node. During this phase of the clock, the output and all of the inputs are invalid. During the evaluate stage, the evaluation NFET conducts, while the precharge PFET is cut off. The inputs to the logic array are now valid; if the logic array produces a value of '0', there will be a conduction path for the output charge to ground, otherwise the charge will be maintained at the output, giving a result of logic '1'. The charge on the output node may only be held for a limited duration before being corrupted by charge leakage, and so timing is critical.
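
The two-stage operation can be captured in a small behavioural model. The sketch assumes that precharge occurs while the clock is low and evaluation while it is high, and it ignores leakage and charge sharing; the class name and the two-input pull-down function are hypothetical.

```python
class PrechargeEvaluateGate:
    """Behavioural model of a precharge-evaluate gate (no leakage modelled)."""

    def __init__(self, pulldown):
        # pulldown(*inputs) returns True when the NFET array conducts.
        self.pulldown = pulldown
        self.out = None  # undefined until the first precharge

    def clock(self, phi, *inputs):
        if phi == 0:
            # Precharge: PFET on, evaluate NFET off; inputs are ignored.
            self.out = 1
        elif self.pulldown(*inputs):
            # Evaluate: a conducting path to ground discharges the output.
            self.out = 0
        # Otherwise the stored charge is kept and the output is unchanged.
        return self.out


# Series NFETs on inputs a and b: the output behaves as NAND(a, b).
gate = PrechargeEvaluateGate(lambda a, b: bool(a and b))

trace = [
    gate.clock(0, 0, 0),  # precharge
    gate.clock(1, 1, 1),  # evaluate: both NFETs conduct, node discharges
    gate.clock(1, 0, 0),  # still evaluating: node cannot recharge
    gate.clock(0, 0, 0),  # precharge again
    gate.clock(1, 1, 0),  # evaluate: no path to ground, charge is held
]
print(trace)
```

The third step shows why input timing matters: once the node has discharged during an evaluate phase, it stays low until the next precharge, even if the inputs change.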
A dynamic system may be formed through the simple cascading of the individual cells. This leads to the formation of a variation of the precharge-evaluate logic known as DOMINO logic (Figure 4.8). The cascading of NFET logic arrays for the standard precharge-evaluate logic poses a potential glitch problem, which is overcome in DOMINO logic through the inversion of the output signal between cascaded cells. The name DOMINO logic arises from the rippling effect of the charge passing through successive stages of the cascade, where it is necessary for one stage to discharge prior to the proper functioning of the next stage.
DOMINO logic is of particular interest in high-speed arithmetic system design since it is employed in the design of the double-pumped arithmetic logic unit found in the Intel Pentium 4 architecture [79]. Though this ALU does not feature a built-in multiplier module, it does carry out the pertinent delay-critical arithmetic operations required by the microprocessor.

One final class of dynamic logic that will be introduced in this section is single-phase logic. Up to this point, single-clock, dual phase circuits have been discussed, where one clock, $\phi$, is used but both $\phi$ and $\bar{\phi}$ are used for timing. Single-phase logic uses one clock and one phase only, thus simplifying clock distribution and generation. The clock is applied to either a single NFET or PFET clock transistor as in Figure 4.9, where the NFET is active when $\phi = 1$, and the PFET is active when $\phi = 0$; thus, a complementary clock signal is not required.

**Figure 4.9** Single-Phase Logic Circuit Types [78] (a) Single Phase Network setup using latches (b) Single Phase Logic gates cascaded together
In general, dynamic logic gates exhibit superior timing performance over static logic, at the expense of several critical issues. The use of a clock signal for synchronization leads to increased interconnect requirements, a more complex clock tree, and increased switching activity resulting in high power dissipation. In addition, the use of isolated charge storage nodes raises the concern of charge leakage and charge sharing, which must now be dealt with. For these reasons, it is the author's opinion that the use of dynamic logic techniques for digital multiplication schemes presents an unwarranted increase in architectural complexity and synchronization, while a combinational logic system implemented using static techniques will suffice for most applications.

4.1.4 Differential And Dual Rail Logic Families

The logic families covered in this section provide shortcuts to the development of switching arrays through the use of both the input signals, and their respective complements. Single rail logic makes use of a set of input signals to generate one or more valid outputs. Dual rail logic, on the other hand, requires an input pair for each signal, and in turn generates one or more output pairs. Differential logic refers to the use of the voltage difference between a signal and its complement for the generation of the output. The advantage in this arrangement is the doubling of the slew rate (the rate of change of the output), at the expense of an effective doubling of the interconnection requirement.

One of the more notable classes of dual rail logic families is known as Cascode Voltage Switch Logic (CVSL). The standard configuration of a CVSL gate, depicting the two major sections of the gate, is provided in Figure 4.10 (a). The cross-coupled PFET pair forms a simple latch for the outputs, while the NFET network, formed by two complementary switching blocks, dictates the logic function. A clock signal applied to the cross-coupled transistor gates, and the addition of an evaluate NFET as in Figure 4.10 (b), will transpose this circuit into a dynamic logic gate, unsurprisingly named DYNAMIC CVSL. There are several other variations of CVSL, such as Sample-Set Differential Logic (SSDL), Enable-disable CMOS Differential Logic (ECDL), and Differential Current Switch Logic (DCSL); the elaborate details of such logic types are beyond the scope of this thesis.
4.2 Pass Transistor Logic

A family of dual rail logic that has been regarded as having significant promise in the creation of arithmetic sub-cells has been pass transistor (PT) logic. Simultaneously developed by Hitachi, Complementary Pass-Transistor Logic (CPL) and Dual Pass-Transistor Logic (DPL) are CMOS logic families targeted at low power and high performance architectures [80]. Pass transistor logic is the most prominent of alternate logic families amongst arithmetic circuit designers. This section will provide a brief overview of this variety of CMOS logic, prior to delving into the details of transistor level arithmetic circuitry in Section 4.3.

A pass transistor is simply a MOSFET with the input signal fed to the source and the output taken from the drain, with a control signal connected to the gate governing the output. A pass network is an interconnection of a number of PT's to achieve a function. Figure 4.11 provides the basic Pass Transistor configurations using NMOS and PMOS transistors. In both cases, X is the control variable, and Y is the pass variable, and the notation is $X \rightarrow Y$ and is read as "X passing Y".
Figure 4.11 Basic Pass Transistor Logic Configurations

One form of pass transistor logic, referred to as Complementary Pass-Transistor Logic (CPL), is based on the use of multiplexers to construct logic functions. The exclusive use of NFETs in the data path simplifies circuit layout. Figure 4.12 (a) shows a single NFET used to yield the AND function. It should be evident that this minimal layout does not account for all possible input combinations; to overcome this dilemma, a second transistor is required (Figure 4.12 (b)).

Figure 4.12 AND gate implementations using pass logic

There are downfalls to using strictly NFET arrays for the logic. The transfer of a logic one through an NFET is slow and exhibits threshold voltage losses. To overcome the non-full swing output, an inverter is used at the output nodes. This "restoring" inverter is used to buffer the output by compensating for the threshold losses, as well as increasing the drive strength of the output signal. Finally, since CPL uses both the original input signal and its complement, the output of a CPL gate must also provide the complement of the signal for the succeeding CPL cell. Consequently the CPL AND gate that was initially presented must be augmented to resemble the circuit in Figure 4.13 on page 85.
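
The multiplexer view of a CPL gate can be sketched at the logic level. The arrangement below, in which $B$ and its complement select between pass variables, is one common way of forming the AND/NAND pair and is assumed here for illustration; the circuit in Figure 4.13 may differ in detail, and threshold losses and the restoring inverters are not modelled.

```python
from itertools import product


def mux(select, when_high, when_low):
    """Two pass transistors: 'select' passes when_high, its complement passes when_low."""
    return when_high if select else when_low


def cpl_and(a, b):
    """Return the (AND, NAND) output pair from dual-rail inputs."""
    a_n, b_n = 1 - a, 1 - b      # complements, available in dual rail logic
    and_out = mux(b, a, b)       # B passes A; B' passes B, which is 0 here
    nand_out = mux(b, a_n, b_n)  # B passes A'; B' passes B', which is 1 here
    return and_out, nand_out


for a, b in product((0, 1), repeat=2):
    print(a, b, cpl_and(a, b))
```

The second pass transistor in each branch is what covers the $B = 0$ case, and producing both outputs from the same selection signals is what supplies the complement needed by the next cell.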
The use of PFETs within the logic array of a complementary pass transistor logic cell will do away with the problems of passing logic '1' through an NFET. This type of logic is referred to as Dual Pass-Transistor Logic (DPL). Although more transistors are required in DPL, each input variable to the basic gates is only used once, thus the driving gates have equal loads. Furthermore, the inverting buffer is not necessarily required at the output of each cell.

The use of DPL is limited by its excessive use of PFET devices and redundant logic branches. A variation of DPL, coined Dual Value Logic (DVL), has been presented by Oklobdzija et al. [80]. The advantages of DVL lie in its elimination of the redundant logic branches and minimization of PFETs; these gains do come at the expense of larger transistors. The use of DVL has been limited and so will not be elaborated on any further in this section.

4.3 Full Adder Circuits

The full adder is one of the fundamental building blocks of arithmetic circuitry and, in particular, is one of the key enabling blocks of partial product reduction arrays in digital multipliers. For this reason, the nature of the full adder circuitry figures prominently in the overall characteristics of the multiplier as a whole. The significance of this topic has been
generally recognized and so there has been considerable focus on the development of full adder circuits [77,80-97]. This section will outline several recently proposed architectures.

The logic families discussed in Section 4.1 serve as the basis for the analysis of the various formations of the full adder cell. Simply stated, for every logic family there exists one or more CMOS implementations of the full adder circuit, each offering a particular vantage point. The logic styles of primary significance include:

- Standard CMOS
- Transmission Gate Logic
- Pass Transistor Logic
- Dynamic Logic

According to the previous analysis of the logic families, we may neglect the dynamic logic families from any further analysis due to the clocking and interconnect overheads that would be required in the highly irregular topology of multiplier circuits. The only dynamic logic family that has been employed on a large scale on a commercially available microprocessor has been DOMINO logic [79]; in spite of this, it has been determined by Alioto and Palumbo [84] that this type of logic consumes up to 700% more power and 150% more area than an equivalent standard CMOS cell. A DOMINO logic full adder cell is provided in Figure 4.15 on page 87 for reference, however the analysis for viable alternatives for CMOS full adders will focus strictly on the first three methods.

Pass transistor logic has been generally accepted as a low power, high performance logic style, since it allows for reduced signal transitions [81], reduced input capacitance, and smaller transistors [82]. One of the more significant incentives for the use of pass logic in full adder designs is the ease with which XOR/XNOR gates may be constructed using the multiplexer nature of pass-logic [77]. Figure 4.14 on page 87 provides transistor level and logic level depictions of the full adder cell, and shows the role of the XOR cell within its composition. By examining the expansion of the logic equations for the SUM and CARRY outputs of a full adder, the significance of the XOR gate becomes quite apparent:
\[ \text{Sum} = (X \oplus Y) \oplus C_{in} \]

\[ \text{Carry} = (X \oplus Y) \cdot C_{in} + XY \]

\[ = (X \oplus Y) \cdot C_{in} + \overline{(X \oplus Y)} \cdot X \]
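
The multiplexer form of the carry can be checked exhaustively. In the sketch below the carry is selected by $X \oplus Y$: it equals $C_{in}$ when the XOR is 1, and $X$ otherwise (since $X = Y$ in that case, so $XY = X$). The function name is hypothetical, and the model is purely logical.

```python
from itertools import product


def full_adder(x, y, c_in):
    """XOR-based full adder: the XOR of X and Y drives both outputs."""
    p = x ^ y                    # shared XOR term
    s = p ^ c_in                 # Sum = (X xor Y) xor C_in
    carry = c_in if p else x     # Carry = P.C_in + P'.X, a 2:1 multiplexer
    return s, carry


for x, y, c_in in product((0, 1), repeat=3):
    s, carry = full_adder(x, y, c_in)
    print(x, y, c_in, "->", carry, s)
```

For every input combination the pair (carry, sum) is the two-bit binary count of the ones among the inputs, so a single XOR/XNOR stage followed by two multiplexer-like stages is enough to build the cell.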

Figure 4.14 Gate level (a) and transistor level (b) implementations of 12 transistor full-adder cell proposed in [83]

Figure 4.15 A DOMINO logic full adder cell [84]
Non-monotonic gates, such as XOR and multiplexers, are regarded as the most difficult logic gates to lay out in standard CMOS logic. Since such gates form the basis of full adders, and have inefficient standard CMOS layouts, pass logic techniques have emerged as the leading alternate logic style. One of the most cited papers on this topic was written by Wang et al. in the early 90's [85]. In their analysis, the authors develop a novel approach to the pass logic XOR gate, and at the same time provide an overview of the existing reduced transistor XOR gates.

Figure 4.16 Various XOR/XNOR configurations

(a) Transmission Gate  
(b) Transmission Gate with driving outputs  
(c) Inverter Based  
(d) Proposed XOR/XNOR Configuration [85]
Figure 4.16 on page 88 provides several of the reduced transistor count XOR/XNOR cells provided in [85]. A variation of the transmission gate XOR cell has been employed by Zhuang and Wu [86] in the development of transmission gate based full adder circuits (Figure 4.17). The principal advantage of such designs is the strong transfer of signals without threshold loss; however, this comes at the expense of weakly driven outputs, and the necessity for the XOR/XNOR signal pair for the transmission gates.

**Figure 4.17 Transmission Gate based Full-Adder circuit**

One application of transmission gates is in pass-transmission circuits, where the overall flavour of the design is pass transistor logic with transmission gates applied only where needed for full swing capabilities. Damu Radhakrishnan is a strong advocate of the use of pass logic in arithmetic circuitry, and has proposed several circuits based on the hybrid pass-transmission gate logic style for full adders and 4:2 compressors [81][87][98]. He has proposed the use of a combination XOR/XNOR cell along with transmission gates to form a full swing full adder (Figure 4.18 [87]). The advantage of his design lies in the formation of the XOR signal and its complement simultaneously for controlling the operation of the output transmission gates.

There has been a considerable amount of literature surveying and providing novel circuitry for full-adders based on pass logic principles. These circuits vary in size from 20 transistors down to 10 as the current minimum, with every other combination of transistor counts present in between. The driving force behind the development of such circuits lies
in the realm of low power electronics. The notion of *POWERLESS*, and *GROUNDLESS* logic gates has been presented in [89], where the authors eliminate all connections to the power supply and all paths to ground respectively. This in effect limits the short circuit power dissipation, which relies on conduction paths between the power supply and ground.

Figure 4.18 Pass-transmission implementation of a Full Adder

Figure 4.19 10 Transistor Full Adders outlined in [89]
The authors in [89] present an analysis of various configurations of 10 transistor full adders, a sampling of which is provided in Figure 4.19. The conclusions drawn from this investigation point to ADDER9A as the lowest in power consumption, and ADDER13A as the best performing. Furthermore, the conclusions provide favourable results for the use of such configurations over a standard CMOS adder, as shown in Figure 4.20, under both power and performance analysis.

Figure 4.20  Conventional 28 transistor CMOS full adder implementations
(a) standard configuration  (b) Mirror cell configuration [84]
Shams et al. [90] have carried out a similar analysis of full adder cells for power and performance measurements. Their approach involves the decomposition of the full adder into three sub-modules, namely the XOR/XNOR, the SUM block and the CARRY block. Through the various configurations of the possible combinations of such modules, a comprehensive collection of full adder cells is developed. The circuit having the least power dissipation is depicted in Figure 4.21. In a separate investigation [91], a 10-transistor full adder (Figure 4.22) is developed having approximately one half of the power dissipation of comparable circuits.
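
The three-module decomposition can be illustrated with a behavioural sketch. The Boolean functions below are an assumption (the common multiplexer-style formulation in which the XOR/XNOR outputs select between signals); they are not taken from [90], and the function names are hypothetical.

```python
from itertools import product

def xor_xnor(a, b):
    """XOR/XNOR module: returns H = a xor b together with its complement."""
    h = a ^ b
    return h, h ^ 1

def sum_block(h, h_bar, cin):
    """SUM module: H xor Cin, written as a selection controlled by H."""
    return (h & (cin ^ 1)) | (h_bar & cin)

def carry_block(h, h_bar, a, cin):
    """CARRY module: pass Cin when H = 1, otherwise pass A."""
    return (h & cin) | (h_bar & a)

def full_adder(a, b, cin):
    """Full adder assembled from the three sub-modules; returns (sum, carry)."""
    h, h_bar = xor_xnor(a, b)
    return sum_block(h, h_bar, cin), carry_block(h, h_bar, a, cin)

def check_full_adder():
    """Exhaustively confirm a + b + cin == sum + 2 * carry."""
    for a, b, cin in product((0, 1), repeat=3):
        s, c = full_adder(a, b, cin)
        if a + b + cin != s + 2 * c:
            return False
    return True
```

Because each module is a separate function, alternative implementations of one module can be swapped in without touching the other two, which mirrors how a collection of cells is built from module combinations.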

Figure 4.21 Low power 16 transistor full adder cell

Figure 4.22 Low power 10 transistor full adder cell [91]
In one of the most complete manuscripts on the topic of full adder designs, Alioto and Palumbo [84] provide an in depth analysis of several of the most promising full adder designs, including:

- Standard CMOS (conventional and mirror cell implementations as in Figure 4.20)
- LEAP [92] (as in Figure 4.23)
- Complementary Pass-Transistor Logic (CPL as in Figure 4.24)
- Optimum circuit as defined in [93] (Figure 4.21)
- Transmission gate full adder (Figure 4.17)
- DOMINO logic full adder (Figure 4.15)

In this investigation, the authors provide simulation results of the cells, implemented in 0.35 micron technology, using power consumption, *power delay product* (PDP), voltage scaling and a novel metric used for calculating delay. Their results naturally pointed in favour of *DOMINO* logic for high performance implementations, and CPL as a high performance, lower power alternative. The lowest power dissipation belonged to the cells that had no driving capability, namely the transmission gate and pass-transmission configurations. The limitation of these cells lies in their inability to provide signal drive strength, and so they were recommended only for applications where neither high fanout nor cell cascading is present.

The analysis did fail to consider most 10 transistor full adder implementations; however, this oversight was justified by the authors' claims of signal degradation issues connected to such designs. The overall conclusions of the study revealed the advantages of using the robust conventional CMOS implementations, and the closely related 28 transistor mirrored-cell design. In the simulations involving decreasing supply voltage levels, the conventional CMOS logic circuits maintained relatively consistent performance and power consumption, which did not degrade to the same extent as in the other logic styles.
Figure 4.23  LEAP full adder cell

Figure 4.24  Complementary Pass-Transistor Logic (CPL) full adder cell
4.4 4:2 Compressor Circuits

In Chapter 3 the notion of the 4:2 compressor was introduced. Though the most primitive structure of the 4:2 compressor remains a simple two level cascade of two full adders, it was demonstrated that further optimization could be achieved through the decomposition of the cells. To this point, two gate level representations have been provided that present a reduction in the overall cell latency (Figure 3.12 on page 51 and Figure 4.13 on page 85). In this section, the 4:2 compressor is broken down further to exploit the benefits of transistor level optimization.
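
As a point of reference, the primitive two level cascade of full adders can be modelled behaviourally. This is a minimal sketch using hypothetical function names; it captures only the logic function, not any transistor level structure.

```python
from itertools import product

def full_adder(a, b, c):
    """Behavioural full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry

def compressor_cascade(x1, x2, x3, x4, cin):
    """4:2 compressor as two cascaded full adders; returns (S, C, Cout)."""
    s1, cout = full_adder(x1, x2, x3)   # first level; Cout does not depend on Cin
    s, c = full_adder(s1, x4, cin)      # second level
    return s, c, cout

def check_cascade():
    """Confirm x1 + x2 + x3 + x4 + cin == S + 2 * (C + Cout) for all 32 inputs."""
    for bits in product((0, 1), repeat=5):
        s, c, cout = compressor_cascade(*bits)
        if sum(bits) != s + 2 * (c + cout):
            return False
    return True
```

Since `cout` is produced entirely by the first level, the horizontal carry between neighbouring compressors does not ripple, which is the property that makes the cell usable in a reduction tree.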

There have been claims that the particular number system chosen, whether redundant, radix-2, or signed digit representations using various digit sets, affects the overall characteristics of the partial product reduction network. However, a recent paper by Peter Kornerup [99] elegantly disproves such notions and shows that the same 4:2 structure, with minor modifications to the external interface, may be used for all implementations. In his conclusions, he notes that no fundamentally different encoding system will allow for performance gains. Consequently, any further analysis of 4:2 compressors may be carried out at the circuit level, without the need to take number systems or digit encoding into consideration.

Although not as popular a research area as the full adder, there has been quite a large volume of papers focused on the transistor layout of the 4:2 compressor [56-58, 98-103]. Several of the more interesting and promising alternatives to the stacked full adder configuration will be presented in this section.

The first design, introduced in 1991 by Mori et al. [100], was incorporated into a 54-bit multiplier. This 58 transistor design is shown in Figure 4.25 on page 96, and in this schematic the longest delay path (1.2 ns) is identified. The downfall of this design rests in the number of inverters that are used throughout the compressor, leading to high short circuit power consumption, and in many cases unnecessary delay.
In a similar 54x54 multiplier [57], the authors present a multiplexer based approach to the design of the 4:2 compressor. The cell, implemented in 0.25 micron CMOS technology and having a latency of 460ps, features the use of complementary pass transistor logic multiplexer cells. An overly simplified rendition of the compressor is presented in Figure 4.27. The significant omission from this schematic is either the necessity for dual rail signals, or the placement of an inverter for complement formation prior to each MUX cell.

Figure 4.25 1.2 ns 4:2 compressor proposed in [100]
Figure 4.26 Precharged pass logic compressor [56]

Figure 4.27 Multiplexer cell based pass-logic 4:2 compressor [57]
In order to increase the performance of the compressor cell, Hanawa et al. [56] propose the use of precharged pass transistor logic. In their design, DOMINO logic is used in order to shorten the charge/discharge times of the pass logic circuitry at the outputs of the cell (Figure 4.26 on page 97), thus forming a 417ps latency compressor cell using 0.3 micron CMOS technology. As is typical of all dynamic logic devices, this circuit suffers from high dynamic power consumption.

Margala and Durdle [101] explored the use of not only different logic styles, but also a particular technology in their development of new 4:2 compressor cells. In their research paper, new 4:2 compressors, based on a new BiNMOS, a reduced-swing DPLBiNMOS and a BiDPL logic, are presented, where the circuits are designed and fabricated in 0.8 micron BiCMOS technology. Though the use of the BiCMOS process does present several unique advantages, unrealistic loading conditions were used in the power and delay analysis, so the reported results do not reflect how the new circuits would behave in practice. Nevertheless, the standard CMOS implementation demonstrated relatively low average power dissipation in their investigation, second only to the BiDPL.

In a separate study [102], DPL is once again employed in the construction of the gate level circuit depicted in Figure 4.28 on page 99, primarily for the reduced internal load capacitances in the critical path. A 4:2 compressor is proposed in [58] that also uses dual rail DPL, for high performance applications. The use of inverters within the signal path is eliminated in this design, thus reducing short circuit power, and inversion delays.

In order to minimize the power dissipation while avoiding the drawbacks of non-full swing outputs and signal deterioration, the authors in [103] devise a 4:2 compressor layout that incorporates a series of full swing (FS) and non-full swing (NFS) multiplexers as shown in Figure 4.29 on page 99. In this manner the signal will never propagate through a succession of non-full swing cells, and so signal integrity is maintained. The advantage of the dual rail styles stems from the simultaneous availability of both the original signal and its complement. This benefit comes at the obvious expense of twice the wire interconnect overhead for every internal signal.
Figure 4.28 Gate level framework for DPL based compressor cell in [102]

Figure 4.29 DPL compressor using FS and NFS MUX cells [103]
For high speed and low power, pass transistor logic is once again claimed to be the logic style of choice for 4:2 compressors according to Damu Radhakrishnan [98]. Structured using the same cells as in their full adder design [87] (Figure 4.18 on page 90), the authors present a compressor cell which is derived from the relationship:

\[
S = X_1 \oplus X_2 \oplus X_3 \oplus X_4 \oplus C_{in}
\]

\[
C = (X_1 \oplus X_2 \oplus X_3 \oplus X_4) C_{in}
+ \overline{(X_1 \oplus X_2 \oplus X_3 \oplus X_4)} X_4
\]

\[
C_{out} = (X_1 \oplus X_2) X_3 + \overline{(X_1 \oplus X_2)} X_1
\]

Reducing the above representation, we obtain:

\[
H_3 = X_1 \oplus X_2 \oplus X_3 \oplus X_4
\]

\[
S = (H_3) \overline{C_{in}} + (\overline{H_3}) C_{in}
\]

\[
C = (H_3) C_{in} + (\overline{H_3}) X_4
\]

A 4:2 compressor circuit is developed using the above relationships (Figure 4.30).
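
The relationships can be checked exhaustively in software. The sketch below assumes the multiplexer reading of the equations, in which the second term of both $C$ and $C_{out}$ is selected by the complement of the XOR signal; the helper names `mux` and `compressor_mux_form` are hypothetical and model only the logic, not the circuit of Figure 4.30.

```python
from itertools import product

def mux(sel, when_one, when_zero):
    """2:1 selection: returns when_one if sel == 1, otherwise when_zero."""
    return (sel & when_one) | ((sel ^ 1) & when_zero)

def compressor_mux_form(x1, x2, x3, x4, cin):
    """4:2 compressor written with XOR signals steering multiplexers."""
    h1 = x1 ^ x2
    h3 = h1 ^ x3 ^ x4
    s = mux(h3, cin ^ 1, cin)   # S    = H3 * not(Cin) + not(H3) * Cin
    c = mux(h3, cin, x4)        # C    = H3 * Cin      + not(H3) * X4
    cout = mux(h1, x3, x1)      # Cout = H1 * X3       + not(H1) * X1
    return s, c, cout

def check_mux_form():
    """The outputs must encode the count of ones among the five inputs."""
    for bits in product((0, 1), repeat=5):
        s, c, cout = compressor_mux_form(*bits)
        if sum(bits) != s + 2 * (c + cout):
            return False
    return True
```

If either complemented select were dropped, the arithmetic identity would fail for some input patterns, so the exhaustive check is a quick way to validate a transcription of the equations.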

---

Figure 4.30 Pass-transmission implementation of the 4:2 compressor [98]
4.5 Overview of Arithmetic Circuitry

This chapter has provided a brief introduction to the variations of logic styles, and design techniques that may be employed in the creation of arithmetic circuitry. With the necessary background in logic styles, coupled with a wide sampling of the most recent arithmetic circuitry from across the globe, the foundation for a systematic approach to arithmetic design is now in place.

In general, there is no consummate collection of arithmetic cells presented in any one paper. Each of the published manuscripts presenting either a survey or a novel approach to the composition of arithmetic sub-cells has inherent biases towards either a certain logic family or a particular test/simulation paradigm. This section will attempt to summarize and present a design formalism for the simulation and synthesis of arithmetic macrocells.

4.5.1 Logic Style Selection

As mentioned earlier, there exist a plethora of basic logic styles for a circuit designer to choose from. The decision must be centred around the application and the objectives of the overall design. Naturally it would be absurd to incorporate a dynamic logic style, such as DOMINO logic, in a power conscious system; however, there are several concepts which must be taken into consideration which may not be as intuitive or straightforward, and are often overlooked.

The use of pass transistor logic for the most part should be approached with great discretion. Although this logic style has a great following in arithmetic circuitry for its unsophisticated approach to non-monotonic gates, there are countless more who strongly oppose its use. The standard CPL and DPL logic families, if deployed as they are intended, offer no overall gains in any one category over conventional CMOS [77][84]. In fact, CPL cells suffer from static power consumption at the inverters of the output nodes, due to the low voltage swings at the inputs of the nodes [96]. The inevitable drawback to such logic styles is their use of dual rails; a problem which will only compound as feature sizes continue to shrink, and interconnect dimensions begin to dominate. Furthermore, having a signal and its complement present will lead to guaranteed doubling of switching activity for any signal transition.

The logic styles exhibiting the most applicable potential are in the hybrid transmission gate/pass transistor logic, which has been coined “pass-transmission logic”. The majority of the low transistor count full adder cells fall under this general classification, and present some unique advantages. Naturally, the size of such circuits is smaller, and there are fewer internal nodes, thus decreasing both capacitance, and internal switching activity; two key elements in low power design. Finally, the circuits are all single rail logic.

There are downfalls with the use of such non-conventional logic styles. The 10 transistor adders, and all of the variations thereof, exhibit threshold voltage loss (gain) issues at one point or another. Through the elimination of the transmission gate at the output nodes, the reduced transistor count full adders will inevitably have non-full swing outputs. Though some claim that reduced voltage swing is a means of reducing power consumption, it has a larger effect on signal quality, degradation and interpretation than any minimal power savings. For this reason, the most recent analysis of full adder cells [84] completely dismisses the 10 transistor class of adders as a viable solution.

Moreover, threshold voltage loss becomes an even greater challenge for deep sub-micron technologies having reduced supply voltages. The scaling of supply voltage with respect to device threshold voltage is not proportionate. The general trends have indicated that the process-nominal \( V_T \) has been scaling in proportion to \( V_{DD} \); however, the worst-case low \( V_T \) has remained constant. This \( V_T \) floor is set by various chip parameters that become increasingly significant as the overall supply voltage is decreased. Figure 4.31 represents an empirical fit to the nominal-process \( V_T \) trend [104]. It can be deduced that the threshold voltage will represent a larger relative percentage of the supply voltage with technology scaling. Consequently any threshold voltage loss (or gain) will have a greater effect on signal integrity, leading to invalid logic level interpretations. The threshold voltage has an increased dependence on channel length as the length is decreased (Figure 4.32). Threshold voltage tolerance is also worse with short channel devices. Thus any process variation at these short channel lengths will have a significant impact on the threshold voltage.
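
The growing relative weight of a threshold drop can be illustrated numerically. The sketch below uses the first-order result that an NMOS pass transistor passes a logic high no higher than $V_{DD} - V_T$ (body effect ignored). The supply and threshold values are hypothetical, chosen only for illustration, and are not read from Figure 4.31 or from [104].

```python
def pass_high_level(vdd, vt):
    """Highest level an NMOS pass transistor can pass (body effect ignored)."""
    return vdd - vt

def relative_loss(vdd, vt):
    """Fraction of the supply voltage lost to a single threshold drop."""
    return vt / vdd

# Hypothetical (V_DD, V_T) pairs in volts; illustrative values only.
EXAMPLE_POINTS = [(3.3, 0.6), (1.8, 0.5), (1.0, 0.4)]

def loss_table(points=EXAMPLE_POINTS):
    """Return (vdd, vt, passed_high, relative_loss) for each operating point."""
    return [
        (vdd, vt, pass_high_level(vdd, vt), relative_loss(vdd, vt))
        for vdd, vt in points
    ]

if __name__ == "__main__":
    for vdd, vt, high, loss in loss_table():
        print(f"VDD={vdd:.1f} V  VT={vt:.1f} V  passed high={high:.2f} V  loss={loss:.0%}")
```

With these assumed values the threshold voltage shrinks far more slowly than the supply, so the fraction of the swing lost rises from under one fifth to two fifths of $V_{DD}$.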

![Graph showing nominal $V_T$ vs $V_{DD}$](image1)

**Figure 4.31** Threshold voltage floor with decreasing supply voltage [104]

![Graph showing $V_T$ tolerance vs effective channel length](image2)

**Figure 4.32** Threshold voltage tolerance dependence on channel length [104]

Furthermore, *Narrow Channel Effects* on threshold voltage levels in deep submicron technologies may further discourage the development of pass-logic circuitry. A narrow channel device is defined as a MOS transistor having a channel width on the same order of magnitude as the maximum depletion region thickness ($x_{dm}$). The predominant narrow channel effect is an increase in threshold voltage. This is due to the thicker field oxide (FOX) that exists on the edges of the channel, which creates a shallow depletion region beneath the FOX overlap area. The gate voltage must then be capable of supporting the additional depletion area before a conduction path is established. In wider devices, the charge contribution of this fringe area is negligible in comparison to the overall channel depletion charge; however, since it must be accounted for in the narrow channel device model, the immediate effect is an increase in the threshold voltage [78].

A secondary concern with pass-logic devices in deep submicron implementations is the concept of Subthreshold Current Flow. The current flow in the channel of a typical MOS device is dependent on the creation of an inversion layer through the application of a gate bias voltage. In the standard model, if this bias voltage is insufficient (\(V_{GS} < V_T\)), a potential barrier blocks the flow of current. However, with diminishing device geometry, the potential for subthreshold current flow exists under conditions where typically no current would be flowing. This is a result of the dependency of the potential barrier on both the gate to source voltage \(V_{GS}\) and the drain to source voltage \(V_{DS}\). If a voltage is applied at the drain, a phenomenon known as Drain-Induced Barrier Lowering (DIBL) decreases the potential barrier, thus allowing for electron flow in the channel [78]. Some logic styles, pass-logic in particular, are more prone to DIBL effects.

The suggestion of POWERLESS and GROUNDLESS logic gates [89] is a clear indication that the signals must be used to drive the outputs; this is in fact the underlying principle behind pass-logic. The congenital flaw is that the signals are not decoupled, that is to say that a signal may be transmitted directly through the cell. Under ideal situations, and in inadequate simulation environments using unsophisticated models, the line and transistor losses are neglected; in practice, however, they cannot be. A signal passed from the source to the drain of a transistor will undergo degradation to a certain extent, and will eventually reach a state where the logic level may be too weak to be properly identified.

Considering that signals must be passed through transistors, sizing of transistors becomes a critical factor in the cell's operation. Having sizing dependent, otherwise known as ratio-ed, logic makes the formation of standard cell libraries with such devices more difficult. With each signal seeing a variety of input capacitances at the input of the subsequent cell, the drive strength of the preceding cell must be adequate to match the output loads. This makes it impossible to scale the cells both within a given technology, and most certainly across technologies. This is a critical dilemma when dealing with a logic style’s robustness and ease-of-use.

Besides the pass-logic families, the standard transmission gate logic adders face their own challenges. Though they offer the benefit of full voltage swing outputs, the outputs are still driven by other signals, and not by clean paths to logic ‘1’ or ‘0’ ($V_{DD}$ and $V_{SS}$ respectively). For situations where the cells must be cascaded, which is the case in most applications of arithmetic macrocells, the driving power of the cell is imperative. To accommodate this, at least one of the following methods must be exercised:

- Proper sizing of the transistors in the critical path; usually involves increasing the transistor’s size in order to meet current demands
- Buffer insertion following every cell
- Using cells with pre-buffered (inverted) outputs

The first alternative is usually avoided from a power consumption point of view, since the addition of extra transistors leads to a more efficient design than the use of abnormally large transistors [90]. On a layout level, having similarly sized transistors facilitates the implementation of a circuit. By increasing the aspect ratio of a transistor, the drain, source and gate dimensions are increased, subsequently raising the overall and parasitic capacitances. Thus the notion of regularity may be passed down from system regularity to device regularity as well.

The analysis of 10 transistor adders as presented by Bui et al. [89] clearly states that the circuits developed suffer from threshold loss and drive-strength problems, yet the authors claim that “the adders are useful in larger circuits such as multipliers despite the threshold-loss problem.” This statement is completely unjustified and misleading. Even a regular array multiplier topology will suffer from signal degradation through the cascade of reduction blocks; this problem is further aggravated in the irregular tree multiplier layout. Buffering is most certainly required, and this is one of the most commonly overlooked issues in circuit performance calculations.

One of the most widely studied aspects of dynamic power reduction in modern CMOS devices is the switching activity. A great deal of effort has been put forth towards the reduction of switching activity, and consequently the number of times a node capacitance is charged and discharged. Besides algorithmic, or desirable, switching activity, there are spurious transitions that may occur due to unequal signal propagation paths between blocks, known as glitches. Glitches add unnecessary transitions to a system due to critical races or dynamic hazards. One aspect of switching activity that is often overlooked in power analysis is glitching that may occur within the internal nodes of a block.

This is a topic of interest since it investigates the effects of avoidable charging and discharging of the capacitive nodes that form the logic block. Investigations on glitching for the most part have been focused on the superfluous transitions that may occur between logic gates, and not on the unbalanced paths that exist within the blocks themselves as the signal transitions propagate through the transistors. By reviewing several of the proposed unconventional designs, it becomes apparent that there are uneven paths throughout these configurations. More importantly, in several designs (Figure 4.30 on page 100 for example), an input signal is used to generate an intermediary signal, which is then used to pass the original input signal once again. This is a clear example of unbalanced signal paths built into the design itself.
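
The effect of unbalanced arrival times on a single node can be sketched with a discrete-time model. The example assumes an ideal zero-delay XOR gate and step inputs sampled at integer time indices; the function names are hypothetical and the model is far simpler than a transistor level simulation.

```python
def step(t_switch, n_samples):
    """A 0 -> 1 step that switches at sample index t_switch."""
    return [1 if t >= t_switch else 0 for t in range(n_samples)]

def xor_wave(a, b):
    """Output of an ideal (zero-delay) XOR gate driven by two waveforms."""
    return [x ^ y for x, y in zip(a, b)]

def count_transitions(wave):
    """Number of times the node changes value, i.e. charge/discharge events."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev != cur)

# Both inputs rise together: the XOR output should stay at 0 throughout.
balanced = xor_wave(step(2, 8), step(2, 8))

# Second input arrives three samples late: the output pulses high, then low.
skewed = xor_wave(step(2, 8), step(5, 8))
```

Here `count_transitions(balanced)` is 0 while `count_transitions(skewed)` is 2. Both cases start and end at the same logic value, so the two extra transitions are pure glitch activity that charges and discharges the node capacitance without contributing to the result.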

The more obscure logic styles, using completely different number systems or technology processes (BiCMOS for example), are not considered as part of any rational analysis. It would be imprudent to postulate that the use of unconventional number systems does not present any gains in performance, power, noise margins, scaling potential, or any other pertinent metric. However, the interfacing circuitry overhead will, for the most part, negate any gains that an individual arithmetic component may have in an overall system. For this reason, diverse number systems or technology processes must be regarded as system level designs, and not be compared with individual cells implemented using contemporary processes.

4.5.2 Simulation Setup and Environment

In order to conduct a valid comparison amongst various designs and logic styles, a fair and undistorted simulation environment is required. Unfortunately this appears to be one of the most critical oversights in most publications. Although consistency in an investigation will, for the most part, produce a legitimate comparison under the given conditions, it does not necessarily preclude experimental biases. The most cogent studies are those carried out under realistic conditions, with as sophisticated a model as possible.

In a recent study [84], the authors point out the importance of proper assessment of the parasitic capacitances that are present in any physical implementation of a logic block. To account for this in their analysis, layout level models with extracted parasitics were used. This is a requirement for the proper simulation of pass-transistor based designs, since the internal parasitics have a considerable effect on the transmitted signal that must propagate through the transistor array. Less complex models, such as PSPICE, will not properly account for this, and so will yield overly optimistic results.

Furthermore, this method of modeling allows for the proper assessment of a cell’s physical dimensions as opposed to merely providing vague transistor counts. As an example the 22 transistor LEAP and 26 transistor TG designs were determined to be 26% and 14% larger, respectively, than the 28 transistor CMOS design. The authors provide sound evidence refuting the comparison of macrocells through transistor counts as a valid measure of size. This size difference is due to the high circuit regularity of the conventional CMOS topology as opposed to the irregular, and often large transistor requirements of the other logic styles [77].

A separate issue that must be taken into consideration for pass-logic and pass-transmission circuitry is the fact that many signals are connected to a gate of one or more transistor(s) and the source of other transistor(s) at the same time. When a current is drawn from an input, switching of the signal connected to the corresponding transistor source slows down. If that signal is connected to another gate simultaneously, the switching of that transistor, and consequently another signal, will also slow down. This phenomenon, known as the source-gate effect [77], must be taken into account for proper simulation results, and circuit latency calculations.

Uneven signal path considerations should also be taken into account. The delay of the input signals should be swept through a range of values, and the effect on circuit performance analyzed. This will properly model the variations that are possible in the manufacturing process of a die. The CMOS fabrication process is highly complex, consisting of numerous stages and procedures; thus it is inaccurate to assume that all circuits will behave similarly once produced. The variations in manufacturing processes and operating conditions will inevitably alter the circuit parameters from the target or nominal values. The circuit simulations should incorporate a nominal value for a given parameter (such as W, L, temperature, VT, tox, etc.), in addition to a statistically varying component of that parameter, to accurately reflect the range or tolerance that a device has to deviations from the norm.
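
One way to picture "nominal value plus a statistically varying component" is a Monte Carlo parameter sampler. The sketch below is only an illustration: the Gaussian distribution, the nominal values and the spreads are all assumptions made for the example, not figures from any process design kit or from the simulations in this work.

```python
import random

# Hypothetical nominal device parameters (SI units) and fractional 1-sigma spreads.
NOMINAL = {"W": 1.0e-6, "L": 0.18e-6, "VT": 0.45, "tox": 4.0e-9}
SIGMA_FRACTION = {"W": 0.03, "L": 0.05, "VT": 0.08, "tox": 0.02}

def sample_parameters(nominal, sigma_fraction, rng):
    """Draw one parameter set: nominal value plus a Gaussian deviation."""
    return {
        name: rng.gauss(value, abs(value) * sigma_fraction[name])
        for name, value in nominal.items()
    }

def monte_carlo(nominal, sigma_fraction, runs, seed=0):
    """Generate `runs` parameter sets; a fixed seed keeps the study repeatable."""
    rng = random.Random(seed)
    return [sample_parameters(nominal, sigma_fraction, rng) for _ in range(runs)]
```

Each generated set would then be handed to the circuit simulator in place of the nominal values, and the spread of the resulting delay and power figures indicates how tolerant a cell is to deviations from the norm.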

In terms of the simulation test bench used, an environment which closely mimics the cell’s functional application is advised. If a cell is proposed for high order counters, or partial product reduction arrays, then it should be modeled and tested within those settings. It is insufficient to merely replace the cascading effects of such irregular topologies with idealistic capacitance models and take measurements using only a limited test structure as proposed in [96]. The test circuit shown in Figure 4.33 on page 109 [91] is acceptable only for the case where the cells are not cascaded. Through isolating the device under test, the dynamic loading effects present in pass-logic designs are eliminated. Consequently, the simulation results provide an idealistic environment for such cells where the signal is decoupled, and so is less susceptible to degradation. Moreover, the necessity for inter-stage buffering has been addressed in many papers [77][84][90][91], yet the buffers are often not included as part of the cell’s latency or power consumption.
For the most accurate representation of the cascaded environment for a full adder, or for a 4:2 compressor, a simple test bench based on an 8x8 partial product reduction matrix has been developed. Figure 4.34 and Figure 4.35 show the internal configuration of the multiplier reduction array used for the compressor and full adder respectively, while Figure 4.36 outlines the testbench setup used in the Cadence Virtuoso environment. The buffered inputs provide realistic source inputs, as opposed to a direct connection to an ideal voltage source. The individual blocks within the reduction trees may be interchanged such that a wide sampling of cells may be tested using the same configuration.

In this manner, a more dynamic simulation environment is provided which accounts for the irregular interconnections, and loads of the individual cells. Thus the sizing of the transistors, where necessary, may be done by examining all possible applications and locations of a reduction subcell. The testbench is further enhanced if the extracted layouts of the individual cells are used, where the parasitics are accurately modeled.
Figure 4.35 Full-adder distribution in an 8x8 reduction matrix

Figure 4.36 Cadence testbench for 8x8 multipliers
The proper simulation environment is incomplete without a suitable set of input stimuli. The need for a robust input transition pattern has been examined by Shams and Bayoumi [88], and in their study they have uncovered some very interesting results. By selecting a set of input signals that do not cover the entire range of possible transitions, they managed to reveal invalid average power consumption and latency values for a set of full adders. Their conclusions stated the need for input signals that not only cover every incoming combination of logic values, but also a valid permutation of all possible input transitions.

The input signal transitions exercise the various charge and discharge paths of the internal nodes of the device under test. The charge differences between subsequent states substantially affect the power consumption and latency calculations. For a full adder, having three inputs, there are $N_{3T}$ input transition combinations, where:

$$N_{3T} = 8 \times 7 = 56$$

Similarly, the number of combinations for a 5 input 4:2 compressor is:

$$N_{5T} = 32 \times 31 = 992$$

In general there is a need for:

$$N_{kT} = 2^k \times (2^k - 1)$$

input transition combinations for a system having $k$ inputs, where each of the $2^k$ input combinations must be followed by each of the remaining $2^k - 1$ combinations.
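
The full transition set can be enumerated directly as ordered pairs of distinct input vectors. This is a minimal sketch with a hypothetical function name; how the pairs are ordered into a single continuous stimulus sequence is left open.

```python
from itertools import permutations, product

def input_transitions(k):
    """All ordered (from_vector, to_vector) pairs of distinct k-bit inputs."""
    vectors = list(product((0, 1), repeat=k))
    return list(permutations(vectors, 2))

full_adder_transitions = input_transitions(3)   # three inputs
compressor_transitions = input_transitions(5)   # five inputs
```

The list lengths are 56 and 992 respectively, matching $N_{3T}$ and $N_{5T}$.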

In terms of the actual delay calculation, a new metric has been developed in [84] that incorporates the propagation delay ($\tau_{PD}$) information, in addition to the rise/fall time ($t_R$). The propagation delay alone will not account for the output signal shape, whereas the rise and fall times have no bearing on the input signals and only describe the output dynamics. The novel figure of merit, denoted as $t_{SPEED}$, represents the time required to reach the steady state output value starting from the time in which the input signal crosses the logic threshold:

$$t_{SPEED} = \tau_{PD} + \frac{t_R}{2}$$
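
The metric can be evaluated from sampled waveforms as sketched below. Several choices here are assumptions made for illustration and are not taken from [84]: the logic threshold is placed at 50% of the supply, the rise/fall time is measured between the 10% and 90% levels, and crossings are located by linear interpolation between samples.

```python
def crossing_time(times, values, level):
    """Time of the first crossing of `level`, by linear interpolation."""
    points = list(zip(times, values))
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if v0 != v1 and (v0 - level) * (v1 - level) <= 0:
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError("waveform never crosses the requested level")

def t_speed(t_pd, t_r):
    """Figure of merit: propagation delay plus half the rise/fall time."""
    return t_pd + t_r / 2.0

def t_speed_from_waveforms(t_in, v_in, t_out, v_out, vdd):
    """Extract t_PD and t_R from input/output waveforms and combine them."""
    t_pd = crossing_time(t_out, v_out, 0.5 * vdd) - crossing_time(t_in, v_in, 0.5 * vdd)
    # abs() lets the same code handle a falling output edge.
    t_r = abs(crossing_time(t_out, v_out, 0.9 * vdd) - crossing_time(t_out, v_out, 0.1 * vdd))
    return t_speed(t_pd, t_r)
```

For example, with times in picoseconds, an input ramp from 0 V to 1 V over 0 to 100 and an output ramp from 0 V to 1 V over 100 to 300 give a propagation delay of 150, a rise time of 160 and therefore a $t_{SPEED}$ of 230.
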
The calculation of power dissipation is a much more complex matter. In conventional CMOS circuits, the overall power dissipation may be calculated as the total power drawn from the supply. For pass-logic and transmission gate based circuits, the calculation is more involved, since the power drawn from the input signals must also be considered. In discussions with Dr. Graham Jullien [105], Dr. Jim Haslett [106], Dr. Wael Badawy [107] and Dr. Farid Najm [108], no single resolution to this dilemma was formulated. However, a common stance was established on the need to take into consideration the power drawn from the input source.

The decomposition of such cells into their individual constituents and capacitive nodes is far too complex an approach, and an overly involved solution. The general approach should be to consider the entire circuit as one entity [105 - 108], and determine the power drawn by the entire cell by examining the current flow from the supply using one of the two power measurement subcircuits [109] outlined in Figure 4.37.

Figure 4.37 Power measurement circuits proposed in [109] having a controlled current source and parallel RC circuit (a) current controlled (b) voltage controlled
To account for the power drawn from the inputs, the dissipation through the input buffers may be analyzed by first measuring the unloaded buffer consumption, and subtracting this value from the loaded buffer consumption. This will provide the difference in power consumption as a result of the device under test's loading conditions on the input. A more accurate measurement would be to analyze a cascaded structure's overall dissipation, and use this as a comparison amongst subcell configurations.
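
The bookkeeping described above amounts to a simple subtraction per input buffer. The sketch below uses hypothetical function names and assumes all power figures are averages in the same unit (for instance microwatts) taken from separate loaded and unloaded simulation runs.

```python
def input_power_drawn(loaded_buffers, unloaded_buffers):
    """Power attributable to the device under test loading its input buffers."""
    return sum(
        loaded - unloaded
        for loaded, unloaded in zip(loaded_buffers, unloaded_buffers)
    )

def total_cell_power(supply_power, loaded_buffers, unloaded_buffers):
    """Supply power of the cell plus the extra power it draws through its inputs."""
    return supply_power + input_power_drawn(loaded_buffers, unloaded_buffers)
```

As a usage example with made-up numbers, a cell drawing 12.0 from its supply, with two input buffers that consume 5.0 and 4.5 when loaded and 3.0 each when unloaded, would be charged a total of 15.5.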

In general the power dissipated through the buffers should be considered as part of the overall cell consumption. This is justified when one considers the fact that the buffers are an integral component of the low transistor count cells, without which they would not be able to function properly. Excluding the buffers would give such circuits an unfair advantage over conventional CMOS, since the standard CMOS circuit does not require refreshing at its output nodes. An argument may be made that in most practical applications of large scale multipliers, the stages are pipelined, thus the cascading effect is not as severe, and refreshing may not be required. To support this argument, the cells must then be loaded and driven by the pipelining latches, and the effects of poor drive capacity and non-full swing outputs must be taken into consideration for the latches involved.

Conventional CMOS logic has, to date, withstood the test of time, proving to be the most resilient logic style in terms of technology scaling, robustness, variations in operating conditions, ease of use in standard cell applications with synthesis tools, and interfacing with its environment. Consequently, it should be the targeted logic style for the majority of applications, except in extreme cases where ultra low power or high speed is of the essence and the price paid in the other factors is not a consideration.
Chapter 5
Interconnect Effects

"Conventional interconnect scaling will no longer satisfy performance requirements. Defining and finding solutions beyond copper and low k will require material innovation combined with accelerated design, packaging, and unconventional interconnect.” - SIA 2001 Technology Roadmap [9]

To date, the semiconductor industry has been able to satisfy Moore's Law through advancements in process technology alone. Moore's Law is named after Intel executive Gordon Moore, who observed that, as a general trend over the past 30 years, functionality per chip (transistor count) and microprocessor performance (clock frequency × instructions per cycle = millions of instructions per second, MIPS) double every 1.5 to 2 years.

However, according to the latest International Technology Roadmap for Semiconductors (ITRS 2001), published by the Semiconductor Industry Association in 2001 [9], breakthroughs in conventional technologies are required in order to meet the more aggressive feature-size scaling needed to maintain the historical trends. Feature size refers to the minimum channel length (the minimum poly width available) or the minimum interconnect width available for a technology. Suggestions for optical or wireless interconnects have been presented in order to alleviate the restrictions imposed by current metal layers [110]. Until those technologies appear as a feasible alternative, focus must be placed on the development of strategies for overcoming the obstacles using current processes.

The progression of fabrication technologies opens the door for more complex applications and developments targeting the integrated circuit. The monolithic integration of a large number of functions on a single chip provides:

- Less area
- Less power consumption
- Fewer system-level test requirements
- Higher reliability (on-chip interconnects)
- Higher speed (on-chip interconnects)
- Cost savings

Current designs focus on the concerted performance of interconnected transistor systems, and not on the functioning of individual transistors. With the foreknown gains of amalgamating multiple functions and systems onto a single piece of silicon, the complexity of the integrated circuit and the interaction of various sub-components is bound to increase. For this reason it is imperative to take into consideration the manner in which the transistors and subcells are interconnected, and not merely singular transistors.

The predominant motivation behind the study of interconnect effects has been the result of technology scaling effects. When physical devices are scaled from one technology to another, the transistor dimensions are reduced, as are the lengths of the local interconnects joining the transistors. Although transistor delay scales appropriately with device size, interconnect delay does not; in fact, as feature sizes shrink, wires contribute a larger portion of the total delay [12]. This is clearly outlined in the graph developed in [111] and presented in Figure 5.1 on page 117.
5.1 Projected Issues With Technology Scaling

The SIA produces a general survey of semiconductor technology approximately every 4 years, outlining the predicted technology over the next several years and identifying potential bottlenecks. The most recent issue was released in 2001, with an update in 2002 [9]. In this report, several areas have been designated as "Red Brick Walls", referring to predicted objectives that have no current solution and will not be met unless innovative solutions are developed. The sections of the 200-page report pertaining to interconnections are outlined in this section.

With predicted microprocessor speeds exceeding 28 GHz and feature sizes below 0.009 micron (9 nm) by 2016, it is reasonable to agree with the claims that existing CMOS technology as we know it today will not be able to meet these requirements. Breakthrough technologies are required to not only carry us to, but also beyond, these objectives. Currently, the interconnection density, measured as the length in metres of metal interconnects per square centimetre of active chip area, is 4843 m/cm², and it is predicted to reach 11169 m/cm² by the year 2007. Furthermore, the delays per millimetre of local metal interconnections are now 121 ps and are expected to increase by 200% by 2007, and 1600% by 2016 (depicted graphically in Figure 5.3 on page 122). Table 2.1 on page 117 summarizes a few of the values related to interconnect effects taken from the SIA roadmap.
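
The quoted growth in local wire delay can be checked against the RC delay entries of the roadmap summary table. The following is a minimal sketch, taking the 121 ps entry as the baseline; the dictionary and function names are illustrative and not part of the roadmap.

```python
# Local-wire RC delay for a 1 mm line (ps), copied from the roadmap summary table.
rc_delay_ps = {2002: 121, 2007: 342, 2016: 2008}


def percent_increase(old, new):
    """Relative growth from `old` to `new`, expressed in percent."""
    return 100.0 * (new - old) / old


for year in (2007, 2016):
    growth = percent_increase(rc_delay_ps[2002], rc_delay_ps[year])
    print(f"{year}: {growth:.0f}% above the 121 ps baseline")
```

With these entries the increases come out at roughly 183% and 1560%, in line with the rounded figures of 200% and 1600% quoted in the text.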

One often overlooked topic in the analysis of interconnects in VLSI circuitry is that of thermal issues. The flow of current through metal wires causes wire self-heating, which has a twofold effect on chip design and reliability. First, as depicted in Table 2.1 on page 117, the maximum allowable current density is of great significance in future technologies, and wire self-heating will add to this problem by further limiting the maximum allowable current density. Secondly, wire reliability, which is governed by electromigration (EM), has an exponential dependence on the inverse of metal temperature [112].
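
The exponential temperature dependence of electromigration lifetime is commonly expressed through Black's equation, $\text{MTTF} = A \, J^{-n} \exp(E_a / kT)$. The sketch below uses that equation to compare lifetimes at two operating points; whether [112] uses this exact form is an assumption, and the default activation energy and current-density exponent are hypothetical placeholders rather than values from the roadmap.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_lifetime_ratio(j_ratio, temp_k, temp_ref_k, activation_ev=0.9, n=2.0):
    """MTTF at (j_ratio * J_ref, temp_k) divided by MTTF at (J_ref, temp_ref_k).

    Follows Black's equation; the prefactor A cancels in the ratio.
    `activation_ev` and `n` are illustrative defaults, not measured values.
    """
    thermal = math.exp(
        activation_ev / BOLTZMANN_EV_PER_K * (1.0 / temp_k - 1.0 / temp_ref_k)
    )
    return j_ratio ** (-n) * thermal


# Lifetime relative to the 105 C (378.15 K) reference when self-heating
# raises the wire temperature by 10 K at unchanged current density.
print(em_lifetime_ratio(1.0, 388.15, 378.15))
```

Under these assumptions any rise in wire temperature shortens the expected lifetime, which is why self-heating tightens the allowable current density.
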
Figure 5.1 Contribution of interconnect effects to overall delay

Table 2.1 2001 SIA Roadmap Summary

<table>
<thead>
<tr>
<th>Year of Production</th>
<th>2001</th>
<th>2002</th>
<th>2003</th>
<th>2004</th>
<th>2005</th>
<th>2006</th>
<th>2007</th>
</tr>
</thead>
<tbody>
<tr>
<td>Local wiring RC delay, 1 mm line (ps)</td>
<td>66</td>
<td>121</td>
<td>176</td>
<td>198</td>
<td>256</td>
<td>303</td>
<td>342</td>
</tr>
<tr>
<td>Local Wiring Pitch (nm)</td>
<td>350</td>
<td>295</td>
<td>246</td>
<td>210</td>
<td>185</td>
<td>170</td>
<td>150</td>
</tr>
<tr>
<td>Total interconnect length (m/cm^2)</td>
<td>4086</td>
<td>4843</td>
<td>5788</td>
<td>6879</td>
<td>9068</td>
<td>10022</td>
<td>11669</td>
</tr>
<tr>
<td>Local wire aspect ratio for copper</td>
<td>1.6</td>
<td>1.6</td>
<td>1.6</td>
<td>1.7</td>
<td>1.7</td>
<td>1.7</td>
<td>1.7</td>
</tr>
<tr>
<td>Jmax (A/cm²) — wire (at 105°C)</td>
<td>9.6E+05</td>
<td>1.1E+06</td>
<td>1.3E+06</td>
<td>1.5E+06</td>
<td>1.7E+06</td>
<td>1.6E+06</td>
<td>1.5E+06</td>
</tr>
<tr>
<td>Intermediate wiring RC delay, 1 mm line (ps)</td>
<td>53</td>
<td>75</td>
<td>101</td>
<td>127</td>
<td>155</td>
<td>191</td>
<td>198</td>
</tr>
<tr>
<td>Intermediate wiring pitch (nm)</td>
<td>450</td>
<td>380</td>
<td>320</td>
<td>275</td>
<td>240</td>
<td>215</td>
<td>195</td>
</tr>
<tr>
<td>Intermediate wiring dual Damascene A/R (Cu wire)</td>
<td>4.6</td>
<td>1.6</td>
<td>1.7</td>
<td>1.7</td>
<td>1.7</td>
<td>1.7</td>
<td>1.8</td>
</tr>
<tr>
<td>MPU Printed Gate Length (nm)</td>
<td>90</td>
<td>75</td>
<td>65</td>
<td>53</td>
<td>45</td>
<td>40</td>
<td>35</td>
</tr>
<tr>
<td>Chip Frequency (MHz)</td>
<td>1664</td>
<td>2317</td>
<td>3088</td>
<td>3990</td>
<td>5173</td>
<td>5631</td>
<td>6730</td>
</tr>
<tr>
<td>MPU High-Performance Total Chip Area (mm^2)</td>
<td>310</td>
<td>310</td>
<td>310</td>
<td>310</td>
<td>310</td>
<td>310</td>
<td>310</td>
</tr>
<tr>
<td>MPU High-Performance Active Transistor Area (mm^2)</td>
<td>26.7</td>
<td>28.2</td>
<td>27.7</td>
<td>27.2</td>
<td>26.8</td>
<td>26.8</td>
<td>26.8</td>
</tr>
<tr>
<td>Equivalent Oxide Thickness - Tox (Electrical) (nm)</td>
<td>2.3</td>
<td>2.1</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>Nominal power supply voltage (Vdd) (V)</td>
<td>1.1</td>
<td>1</td>
<td>1.0</td>
<td>0.9</td>
<td>0.9</td>
<td>0.7</td>
<td>0.7</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Year of Production</th>
<th>2010</th>
<th>2013</th>
<th>2016</th>
</tr>
</thead>
<tbody>
<tr>
<td>Local wiring RC delay, 1 mm line (ps)</td>
<td>565</td>
<td>976</td>
<td>2008</td>
</tr>
<tr>
<td>Local Wiring Pitch (nm)</td>
<td>135</td>
<td>135</td>
<td>135</td>
</tr>
<tr>
<td>Total interconnect length (m/cm^2)</td>
<td>348</td>
<td>348</td>
<td>348</td>
</tr>
<tr>
<td>Local wire aspect ratio for copper</td>
<td>18</td>
<td>13</td>
<td>9</td>
</tr>
<tr>
<td>Jmax (A/cm²) — wire (at 105°C)</td>
<td>18</td>
<td>18</td>
<td>18</td>
</tr>
<tr>
<td>Interconnect RC delay 1 mm line (ps)</td>
<td>18</td>
<td>13</td>
<td>9</td>
</tr>
<tr>
<td>Chip Frequency (MHz)</td>
<td>11511</td>
<td>19348</td>
<td>28751</td>
</tr>
<tr>
<td>MPU High-Performance Total Chip Area (mm^2)</td>
<td>310</td>
<td>310</td>
<td>310</td>
</tr>
<tr>
<td>MPU High-Performance Active Transistor Area (mm^2)</td>
<td>26.8</td>
<td>26.8</td>
<td>26.8</td>
</tr>
<tr>
<td>Equivalent Oxide Thickness - Tox (Electrical) (nm)</td>
<td>0.6</td>
<td>0.5</td>
<td>0.4</td>
</tr>
<tr>
<td>Nominal power supply voltage (Vdd) (V)</td>
<td>0.6</td>
<td>0.5</td>
<td>0.4</td>
</tr>
</tbody>
</table>

White—Manufacturable Solutions Exist, and Are Being Optimized
Yellow—Manufacturable Solutions are Known
Red—Manufacturable Solutions are NOT Known
The dilemma associated with thermal issues is exacerbated by technology scaling. Increased current densities will add to the thermal issues, since the RMS value of current density is responsible for heat generation. The use of low-\(k\) dielectrics (materials having a lower thermal conductivity) and the increased number of metal layers will contribute to the thermal issues by trapping more heat in the wires. Finally, wire congestion will increase not only the capacitive coupling between adjacent wires, but also the thermal coupling.

Though the use of unconventional interconnect methodologies and materials has been suggested in the SIA Roadmap, the elimination of metal wires does not appear to be a viable option in the near future. A summary of several of the proposed alternatives is provided in [119]. The use of low-loss off-chip thin film wiring and optical interconnects appears to be the most promising alternative to the conventional metal layer medium of present day processes. These solutions, if at all attainable, will be aimed at lengthy cross-chip lines and buses, and will see limited service in local interconnect applications.

5.2 Physics of Wire Interconnects

Due to the large bandwidth and ultra fast switching speeds of modern devices, transmission line theory is required in order to properly assess signal propagation in on-chip interconnects. By investigating the transmission line effects of wire interconnects, a threefold concern of wires on circuit performance arises: added capacitive loads to driving gates, signal delay, and coupling noise. These arise due to the electrical characteristics of wires, namely resistance, capacitance and inductance, and the wire's geometry and relative position. Such effects have implications for circuit performance, power consumption, reliability and cost. Ho et al. [15] have studied the individual effect of each of these characteristics, and their summarized results are discussed below.

All types of wire interconnects exhibit resistance as charge flows through them. Wire resistance (measured per unit length) is calculated as the material resistivity divided by the cross-sectional area. Because the cross-sectional area is the product of the wire width and height, the resistance grows quadratically when both dimensions are reduced by the same factor. To counter this, wires are not scaled by an ideal scaling factor:

\[
S = \frac{1}{\text{feature size}}
\]

but by a quasi-ideal scaling factor:

\[
S = \sqrt[4]{\frac{1}{\text{feature size}}}
\]

This creates a taller wire while maintaining a constant cross sectional area [12]. With scaling effects, the wire resistance will inevitably increase as the height and the width of the wires scale down with the technology.

With an ideal scaling factor, having the overall wire dimensions decrease by a scaling factor \( S \), the wire resistance per unit length will increase by a factor of \( S^2 \) [110]. This value is reduced to \( S^{1.5} \) with the quasi-ideal scaling methods. It may be argued that the overall length of the wire interconnect will also be reduced by an equivalent scaling factor, since the individual points will now be located within closer proximity, and based on this notion, the total wire resistance will remain constant. It must be pointed out that the general trend, as outlined in the SIA roadmap, is for an increase in chip functionality and device counts as the technology scales. Hence, the assumption that chip sizes decrease along with feature sizes does not necessarily hold true.
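
The two growth rates can be reproduced from the resistance per unit length, $R = \rho / (w \cdot h)$. The sketch below assumes that quasi-ideal scaling shrinks the wire width by $1/S$ but the height only by $1/\sqrt{S}$; that split is an assumption chosen because it yields the $S^{1.5}$ figure, and the function names are illustrative.

```python
def resistance_per_length(resistivity, width, height):
    """Resistance per unit length of a rectangular wire."""
    return resistivity / (width * height)


def scaled_resistance_ratio(s, height_exponent):
    """R_scaled / R_original when width shrinks by 1/s and height by 1/s**height_exponent.

    height_exponent = 1.0 models ideal scaling; 0.5 is the assumed
    quasi-ideal case in which the wire height shrinks more slowly.
    """
    original = resistance_per_length(1.0, 1.0, 1.0)
    scaled = resistance_per_length(1.0, 1.0 / s, 1.0 / s ** height_exponent)
    return scaled / original


s = 2.0
print("ideal scaling:      ", scaled_resistance_ratio(s, 1.0))  # s**2
print("quasi-ideal scaling:", scaled_resistance_ratio(s, 0.5))  # s**1.5
```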

A system designer will take advantage of the enhanced circuit density to increase the on-chip system functionality, and with it increase the chip complexity [110]. For this reason, it is safe to expect wire resistances to increase. One means of overcoming this dilemma is through the use of *locally optimized arrays*, which will be subsequently discussed in further detail.

Wire capacitance reflects the charge that must be added to or removed from a wire in order to change its electric potential. For the most part, basic capacitance models simply take into account the parallel plate capacitor effects of wires over a plane. A generally accepted capacitance model has been introduced in [113], and is used for the per unit length capacitance estimation with edge fringing effects taken into consideration. The model has two representations, each dependent on the wire aspect ratio.

For $w \geq \frac{d}{2}$

$$C = \varepsilon_{ox} \left( \frac{w - \frac{d}{2}}{t_{ox}} \right) + \varepsilon_{ox} \left( \frac{2\pi}{\ln \left( 1 + \frac{2t_{ox}}{d} + \sqrt{\frac{2t_{ox}}{d} \left( \frac{2t_{ox}}{d} + 2 \right)} \right)} \right)$$

For $w < \frac{d}{2}$

$$C = \varepsilon_{ox} \left( \frac{w}{t_{ox}} \right) + \varepsilon_{ox} \left( \frac{\pi \left( 1 - \frac{0.0543d}{2t_{ox}} \right)}{\ln \left( 1 + \frac{2t_{ox}}{d} + \sqrt{\frac{2t_{ox}}{d} \left( \frac{2t_{ox}}{d} + 2 \right)} \right)} \right)$$

In these expressions, $w$ and $d$ are the wire width and depth, $t_{ox}$ and $\varepsilon_{ox}$ are the oxide thickness and dielectric constant, and $C$ is the capacitance of the wire per unit length. According to the SIA roadmap [9], the aspect ratio of local interconnects ($d/w$) will not exceed 2 in the foreseeable future, which implies $w \geq d/2$, and so the first expression will be the one of prime importance.
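
The two expressions can be evaluated numerically as in the following sketch. It transcribes the formulas exactly as written above; the function name and the default permittivity (relative permittivity of 3.9, i.e. SiO2) are assumptions made for the example.

```python
import math

EPS0 = 8.854187817e-12   # vacuum permittivity, F/m
EPS_OX = 3.9 * EPS0      # assumed SiO2 dielectric


def wire_capacitance_per_length(w, d, t_ox, eps_ox=EPS_OX):
    """Capacitance per unit length (F/m) of a wire of width w and depth d
    sitting t_ox above a plane, including edge fringing."""
    ratio = 2.0 * t_ox / d
    log_term = math.log(1.0 + ratio + math.sqrt(ratio * (ratio + 2.0)))
    if w >= d / 2.0:
        # First expression: reduced parallel-plate term plus full fringing term.
        plate = (w - d / 2.0) / t_ox
        fringe = 2.0 * math.pi / log_term
    else:
        # Second expression: narrow-wire form.
        plate = w / t_ox
        fringe = math.pi * (1.0 - 0.0543 * d / (2.0 * t_ox)) / log_term
    return eps_ox * (plate + fringe)


# Hypothetical geometry: 0.2 um wide, 0.3 um deep, 0.5 um above the plane.
print(wire_capacitance_per_length(0.2e-6, 0.3e-6, 0.5e-6))
```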

As discussed previously, as the aspect ratio of wires increases, the side profiles of the wires will have a greater contribution to the overall capacitive effect. Thus, the wire is better modeled by four parallel plate capacitors, plus a constant fringing term [15]. Although the parallel plate capacitance decreases with wire dimensions, the benefits are offset by the increased coupling capacitance to neighbouring wires as a result of increased wire congestion per layer [14]. This increased trend in capacitance is in direct accord with the predicted rise in interconnect RC delay per unit length as graphed in Figure 5.3 [9].

Figure 5.2 on page 122 outlines the coupling capacitances associated with metal interconnects within an integrated circuit, with particular focus on the second metal layer (M2). The complex interaction between the various wires may easily be visualized. In addition, the relative sizes and positions of the various metal layers may be observed, along with the manner in which the wire dimensions increase with the metal layers. The layers closest to the substrate are reserved for short local interconnects, and as the layers progress further away from the substrate they are used for intermediate and global lines respectively. The global lines have the largest dimensions for conductance purposes, since the resistance of the larger lines will be much less than that of the finer-pitch wires used for the local connections. The coupling capacitance between two wires, \( i \) and \( j \), may be represented as:

\[
C_{ij} = \frac{f_{ij} l_{ij}}{d_{ij}} \cdot \frac{1}{1 - \frac{w_i + w_j}{2d_{ij}}}
\]

where \( w \) is the wire size, \( f \) is the unit length fringing capacitance between the wires, \( l \) is the overlap length of the wires, and \( d \) is the distance between the wire centres [114].
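
As a brief illustration, the coupling expression can be evaluated directly. The function name and the overlap check are assumptions added for the example. The expression simplifies to $f_{ij} l_{ij}$ divided by the edge-to-edge spacing $d_{ij} - (w_i + w_j)/2$, so the coupling grows without bound as neighbouring wires approach one another.

```python
def coupling_capacitance(f_ij, l_ij, d_ij, w_i, w_j):
    """Coupling capacitance between wires i and j.

    f_ij: unit-length fringing capacitance, l_ij: overlap length,
    d_ij: centre-to-centre distance, w_i / w_j: wire sizes.
    """
    half_widths = (w_i + w_j) / 2.0
    if half_widths >= d_ij:
        raise ValueError("centre distance must exceed (w_i + w_j) / 2")
    return (f_ij * l_ij / d_ij) / (1.0 - half_widths / d_ij)


# Halving the edge-to-edge spacing doubles the coupling (arbitrary units).
print(coupling_capacitance(1.0, 10.0, 3.0, 1.0, 1.0))  # spacing 2.0
print(coupling_capacitance(1.0, 10.0, 2.0, 1.0, 1.0))  # spacing 1.0
```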

Apart from the obvious drawbacks associated with a quadratic increase in interconnect delay with wire length [15], circuit robustness, especially in terms of coupling noise, is another serious concern. CMOS devices are voltage controlled, thus the voltage noise on the wires relative to the voltage margins at the receiving gates is crucial. Capacitive coupling is the principal contributor to this type of cross-coupling noise, especially on weakly driven or excessively long wires.
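
The quadratic dependence of delay on wire length follows from the Elmore delay of a uniform distributed RC line, $r c L^2 / 2$. The sketch below illustrates it, together with the effect of breaking a wire into shorter driven segments. The function names are illustrative, and the repeater is idealised as a fixed delay, which is an assumption made for the example rather than a model from [15].

```python
def distributed_rc_delay(r_per_len, c_per_len, length):
    """Elmore delay of a uniform distributed RC line: r * c * L**2 / 2."""
    return 0.5 * r_per_len * c_per_len * length ** 2


def segmented_delay(r_per_len, c_per_len, length, segments, repeater_delay=0.0):
    """Delay of the same wire split into equal segments joined by repeaters.

    Each repeater is idealised as a fixed delay that fully isolates
    neighbouring segments (a simplifying assumption).
    """
    seg_len = length / segments
    wire = segments * distributed_rc_delay(r_per_len, c_per_len, seg_len)
    return wire + (segments - 1) * repeater_delay


# Normalised units: doubling the length quadruples the unsegmented delay,
# while four ideal segments cut the wire delay by a factor of four.
print(distributed_rc_delay(1.0, 1.0, 2.0), distributed_rc_delay(1.0, 1.0, 4.0))
print(segmented_delay(1.0, 1.0, 4.0, 4))
```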

Cross-talk induced noise is a result of undesired voltage brought about through parasitic coupling capacitances from one net onto another. It has a significant and detrimental effect in deep submicron technologies, as it leads to decreased signal integrity and increased delays. The two side capacitances of adjacent lines are data dependent and may account for upwards of 70% of the total capacitance, and in turn are the most common source of noise injection [15]. More recently, the effects of crosstalk and process variation have been combined in a new design parameter referred to as crosstalk sensitivity [114]. Crosstalk sensitivity is a measure of the effects of process variation on crosstalk. Based on the analysis in [114], it has been shown that the lower bound on crosstalk sensitivity increases quadratically with technology scaling, while crosstalk increases linearly. Thus the significance of process variation on crosstalk effects is more profound in deep submicron technologies.

Figure 5.2 Coupling capacitances associated with metal interconnects

![Figure 5.2](image_url)

Figure 5.3 Interconnect RC delay per unit length (ps / mm)

Although substantial research has been done of late on the accurate prediction and reallocation of problematic routing at the design level [114][115][116], this problem may only be moderated, not eliminated. VLSI CAD tools, in general, carry out routing in two phases. The global routing stage plans a coarse outline of the interconnections based on a grid, which provides a tight boundary as to where the final wires may be placed. Maze search or Steiner embedding techniques govern the path congestion and delay requirements of the global routing process. Detailed routing is the last step of the interconnect layout procedure, where the wires are placed within the predefined confinements according to the technology design rules. Neither routing stage is optimal for tackling crosstalk effects, since global routing is too early for accurate prediction, whereas detailed routing does not have the liberty of restructuring the overall interconnection scheme to deal with conflicting aggressor and victim nets [116].

Since coupling effects are highly dependent on the local geometries of interconnections, one solution has been to provide structural solutions in problematic areas, such as shielding wires or repowering buffers. The use of repeaters has been shown to reduce crosstalk by up to 50% in VLSI implementations. In addition, the use of larger driver circuits has been shown to reduce the amount of crosstalk noise due to the reduced effective resistance; however, larger drivers lead to increased capacitive loads, which in turn have an unfavorable effect on signal delay [117]. In Intel's 64-bit Itanium processor, 85% of the full-chip level routes require at least one repeater: of the approximately 13000 nets at the top level of the Itanium, 11000 require some type of repeater [118].

In order to keep the capacitance of interconnects relatively constant, the use of low-$k$ dielectrics has been explored. The capacitance is directly proportional to the dielectric constant, and is also proportional to the leakage current. While low-$k$ dielectrics reduce the parasitic capacitance between the metal lines, they are less mechanically, chemically, thermally, and electrically stable than the historical material of choice, deposited SiO$_2$ [119]. The use of new materials, such as copper metal and low-$k$ dielectrics, in fabrication processes helps reduce the resistance and parasitic capacitance of the metal interconnect lines. This process, known as "Dual Damascene", is used to create the multi-level, high-density metal interconnections needed for advanced, high performance ICs [120].

For the most part, inductance effects are a point of concern for global wires. This is due in part to the lower resistance per unit length of these lines, which results in the reactive component of the interconnect impedance becoming comparable to the resistive component [121]. In addition, the longer current return paths in global wires, together with the fact that they are the furthest from the substrate, create substantial mutual inductive coupling [122]. Furthermore, it has been shown that a properly designed inductive line can reduce the total power dissipated [123]. However, these inductive effects may be ignored for local interconnections, since they are pertinent only to longer global wires and clock networks, and are dominated by the local RC effect [124].

5.3 Interconnect Effects On Arithmetic Circuitry

Historically, the measure of circuit delay has been a function of the gate delay. With this frame of mind, the reduction of the gate count in the critical path was determined to be an accurate reflection of performance gains. This rationale has been the driving force behind arithmetic circuitry development over the past several decades. Recently the physical restrictions imposed by interconnections of highly irregular topologies have surfaced as a credible limiting element in high performance design.

Recent studies [11][12][13] have examined the consequences of technology scaling on arithmetic circuitry. Multiplier circuits present several unique challenges, such as sizable areas, large volumes of short but irregular interconnections, and significant amounts of switching activity within their cascading chains of sub-cells. For these reasons, multiplier architectures are highly susceptible to interconnect effects, and are of particular importance in this area of research. As depicted in Figure 5.4, the delay of a double precision multiplier is greatly affected when the presence of interconnections is taken into account.

Choe et al. [11] have predicted that with uniform scaling into deep sub-micron technology, the larger the multiplier the greater the effects of interconnect delays. Their analytic evaluation of interconnect delays demonstrated that as the operand size of multipliers increases, so too does the influence of interconnections on circuit performance. Huang et al. [13] have obtained similar results in their study of wire effects on prefix adders. Their simulations have shown that in many cases wire delay exceeds logic delay, and has a significant impact on the critical path delay. For a given technology, the contribution of wires to circuit operation increases as the datapath and operand width become wider.

In the investigation of technology scaling effects on multipliers [12], several recoding schemes have been examined leading to some conclusions regarding the influence of algorithm level design. It has been observed that the recoding schemes generating the fewest number of partial products, while having the largest volume of logic cells in the critical path, are the least vulnerable to interconnect effects. This is only natural since limiting the number of partial products in the array will lead to fewer interconnects, while the placement of logic cells will in effect act as buffers along the wires, thus reducing the interconnect lengths.

![Graph showing delay ratio vs. drawn feature size](image)

**Figure 5.4 Relative delay; binary tree to its no wire implementation for Booth 2, double precision multiplication [12]**
5.4 Locally Optimized Arrays

The analysis and understanding of the ramifications of interconnections within arithmetic circuitry will present opportunities for the development of the next generation of arithmetic circuitry. It has been observed that the schemes least affected by interconnect delay are those which have minimum length and methodical wire placement. In an invited paper by Luigi Dadda [125], the author stresses that a successful VLSI architecture will satisfy the criteria that only a few types of simple cells must be laid out using local and regular connections, avoiding long, irregular data paths. In this section, the notion of forming complex arithmetic circuits through the application of locally optimized arrays will be presented.

The notion of system partitioning has been in existence for quite some time, and it has been suggested that future microprocessors must be further partitioned into independent physical regions in order to cope with technology scaling effects [14]. The decomposition of a large system into smaller sub-blocks has been explored on a large scale, with suggested solutions to scaling effects in system-on-chip applications [126]. On a lower level, a partitioning scheme has been presented for the optimization of interconnect power. In [60], the authors advance the notion of distributed computing through their approach of exploiting locality through the subdivision of an algorithm into spatially local clusters.

The concept of “systolic arrays” closely resembles the proposed optimized array formalism. Systolic arrays are arrays of processors, or identical processing elements, which are connected in a mesh-like topology to a small number of nearest neighbours. Generally the operations will be the same in each processor, with each processor performing an operation (or small number of operations) on a data item and then passing it on to its neighbour. The lock-step data movement through the array of cells resembles the rhythmic pumping of blood in the veins. One of the primary requirements in systolic array architectures is that no unlatched signal is allowed to propagate across the cells [8]. This is the distinguishing factor between this and the locally optimized array design paradigm, which excludes this restriction.
The Pentium IV architecture represents one of the most familiar general purpose architectures, and employs the partitioning scheme for performance optimization. The double pumped (operating at twice the system clock) high-speed ALU core of the processor is kept as small as possible to minimize the metal length and loading. Only the essential hardware necessary to perform the frequent operations is included in this high-speed ALU execution loop. Functions that are not used very frequently, for most integer programs, such as the multiplier, shifts, flag logic, and branch processing, are isolated from the key low-latency ALU loop and are implemented independently [79].

The concept of partitioning may also be applied at an even lower level of abstraction, where the fundamental segments of an algorithm's composition, namely the arithmetic blocks, may be partitioned. Once again, the Pentium ALU presents an example of this through the "scattered add" formation of the fast adder, which is a cluster of 16-bit adder slices [79]. Through the use of locally optimized arrays, several of the drawbacks of scaling larger circuits may be accommodated. This holds particularly true in the case of digital multipliers, which are particularly prone to scaling effects. Though the irregular topology of large tree multipliers has in the past offered significant performance gains, the use of such schemes in deep-submicron implementations has been shown to be inefficient [21][11]. To counter this, the decomposition of large multiplications into smaller multipliers is suggested, with the details of the algorithm presented in Chapter 6.
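
The general idea of building a wide product from narrower multipliers can be illustrated with the textbook half-width decomposition below. This is only a sketch for unsigned operands: it is not necessarily the scheme developed in Chapter 6, and the function name is illustrative.

```python
def multiply_by_halves(a, b, width=64):
    """Unsigned width-bit product assembled from four half-width products.

    Writes a = a_hi * 2**h + a_lo and b = b_hi * 2**h + b_lo with h = width / 2,
    so that a * b = p_hh * 2**width + (p_lh + p_hl) * 2**h + p_ll.
    """
    if width % 2 or not (0 <= a < 1 << width and 0 <= b < 1 << width):
        raise ValueError("operands must be unsigned and fit in an even width")
    half = width // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask
    # Each product below maps onto one small (half-width) multiplier.
    p_ll = a_lo * b_lo
    p_lh = a_lo * b_hi
    p_hl = a_hi * b_lo
    p_hh = a_hi * b_hi
    # Shifted additions recombine the partial results.
    return (p_hh << width) + ((p_lh + p_hl) << half) + p_ll


a, b = 0xFEDCBA9876543210, 0x0123456789ABCDEF
assert multiply_by_halves(a, b) == a * b
```

Each of the four half-width products is independent of the others, so the small multipliers can be placed as compact, regular blocks, with only the final recombination spanning the full operand width.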

According to the literature supporting the relationship between operand size and interconnect related issues, the larger the arithmetic element, the greater the effects of interconnects. Moreover, it has been shown that the doubling of operand sizes in a multiplier increases the average power dissipation by more than a factor of four, and increases the power to area ratio [21]. The use of smaller multipliers leads to the generation of fewer partial products, leading to shorter interconnect requirements. Furthermore, the reduction in the number of partial products will in turn affect the average net length due to the fewer number of stages required in the reduction array. The signals will on average need to by-pass fewer reduction sub-blocks, and the distribution of the blocks will be more dense due to the scaled down partial product array.
The decomposition of a large tree multiplier into smaller multipliers will increase the overall design regularity, while maintaining the performance advantages of tree multipliers over the linear array multipliers. The use of smaller multipliers will increase the number of logic blocks in the critical path, and will cut down on the length of the longest nets, mimicking the insertion of small repeaters on such wires. The use of deep pipelining will achieve the same effect, but will not increase the circuit’s regularity as significantly.

To support this theory, an analysis has been carried out using various multiplier sizes implemented in 0.18 micron CMOS technology. Table 2.2 summarizes a sampling of the simulation results for Booth-recoded multipliers of various lengths. A complete summary of the analysis results, in addition to the simulation logs, is provided in Appendix B.

Table 2.2 Summary of Interconnect Analysis for Various Multiplier Sizes

<table>
<thead>
<tr>
<th>Multiplier operand width (bits)</th>
<th>64</th>
<th>54</th>
<th>32</th>
<th>16</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Number of Components</strong></td>
<td>12916</td>
<td>9391</td>
<td>3687</td>
<td>1069</td>
</tr>
<tr>
<td><strong>Number of Pins</strong></td>
<td>64481</td>
<td>46505</td>
<td>17559</td>
<td>5036</td>
</tr>
<tr>
<td><strong>Number of Nets</strong></td>
<td>14409</td>
<td>10388</td>
<td>3915</td>
<td>1140</td>
</tr>
<tr>
<td><strong>Average Number of Pins per Net</strong></td>
<td>4.48</td>
<td>4.48</td>
<td>4.49</td>
<td>4.42</td>
</tr>
<tr>
<td><strong>Total segments in regular wiring</strong></td>
<td>128059</td>
<td>82116</td>
<td>27057</td>
<td>6834</td>
</tr>
<tr>
<td><strong>Total segments in special wiring</strong></td>
<td>401</td>
<td>338</td>
<td>224</td>
<td>140</td>
</tr>
<tr>
<td><strong>Total wirelength in regular wiring (um)</strong></td>
<td>1512966.26</td>
<td>861731.66</td>
<td>230623.16</td>
<td>49271</td>
</tr>
<tr>
<td><strong>Total wirelength in special wiring (um)</strong></td>
<td>105351.24</td>
<td>78250.08</td>
<td>36160.6</td>
<td>15203.28</td>
</tr>
<tr>
<td><strong>Total wiring (um)</strong></td>
<td>1618317.5</td>
<td>939981.74</td>
<td>266783.76</td>
<td>64474.28</td>
</tr>
<tr>
<td><strong>Average regular wiring net length (um)</strong></td>
<td>11.8146</td>
<td>10.4941</td>
<td>8.5236</td>
<td>7.2097</td>
</tr>
<tr>
<td><strong>Average special wiring net length (um)</strong></td>
<td>262.7213</td>
<td>231.5091</td>
<td>161.4313</td>
<td>108.5949</td>
</tr>
<tr>
<td><strong>Average net length (um)</strong></td>
<td>112.3130</td>
<td>90.4873</td>
<td>68.1440</td>
<td>56.5564</td>
</tr>
<tr>
<td><strong>METAL 1</strong></td>
<td>146231.92</td>
<td>102118.66</td>
<td>44620.96</td>
<td>17036.44</td>
</tr>
<tr>
<td><strong>METAL 2</strong></td>
<td>273265.32</td>
<td>185962.3</td>
<td>69145.3</td>
<td>17575.84</td>
</tr>
<tr>
<td><strong>METAL 3 wirelength (um)</strong></td>
<td>475396.2</td>
<td>294621.48</td>
<td>96337.8</td>
<td>25701.42</td>
</tr>
<tr>
<td><strong>METAL 4 wirelength (um)</strong></td>
<td>388446.9</td>
<td>211710</td>
<td>40386.22</td>
<td>3398.28</td>
</tr>
<tr>
<td><strong>METAL 5 wirelength (um)</strong></td>
<td>194880.84</td>
<td>111835.58</td>
<td>16193</td>
<td>762.3</td>
</tr>
<tr>
<td><strong>METAL 6 wirelength (um)</strong></td>
<td>140096.32</td>
<td>33433.72</td>
<td>100.8</td>
<td>0</td>
</tr>
<tr>
<td><strong>Max crosstalk induced timing delta (ns)</strong></td>
<td>1.66</td>
<td>1.66</td>
<td>0.467</td>
<td>0.497</td>
</tr>
</tbody>
</table>

The results above strongly support the presumption that the complexity and volume of the interconnections increase with increasing operand width. The average net lengths and individual wire segment lengths increase dramatically with increasing multiplier size. Although it may be argued that the number of components does not increase by a factor of four as the operand width is doubled, it is clear that the advantage of partitioning a large operation lies in the increased regularity and reduced length of the wire segments. Figure 5.5 outlines the relationship between multiplier size and interconnect length. The non-linear relationship of the net length and net count to the multiplier size is an accurate reflection of the decrease in device regularity with size.

The six-fold increase in the total net length of a 64-bit multiplier over a 32-bit multiplier corresponds to a 3.5-fold increase in the capacitive coupling (crosstalk) induced timing delay. As summarized in Table 2.2, the larger multipliers experience greater coupling delay than the smaller, more regular designs. This study is merely a reflection of the current 0.18 micron CMOS technology. The coupling effects are bound to increase as feature sizes continue to shrink.
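As a quick check of the two ratios quoted above, the following minimal Python sketch recomputes them from the 64-bit and 32-bit columns of Table 2.2. The variable names are illustrative only and are not part of the original analysis flow.

```python
# Values copied from the 64-bit and 32-bit columns of Table 2.2.
total_wiring_um = {64: 1618317.5, 32: 266783.76}
crosstalk_delta_ns = {64: 1.66, 32: 0.467}

# Ratio of the 64-bit design to the 32-bit design for each metric.
wiring_ratio = total_wiring_um[64] / total_wiring_um[32]
crosstalk_ratio = crosstalk_delta_ns[64] / crosstalk_delta_ns[32]

print(f"total wiring, 64-bit vs 32-bit: {wiring_ratio:.2f}x")
print(f"crosstalk timing delta, 64-bit vs 32-bit: {crosstalk_ratio:.2f}x")
```

Both ratios come out slightly above the rounded figures of six and 3.5 used in the text.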

![Graphs showing interconnect effects](image)

**Figure 5.5** Interconnect effects with respect to multiplier width  
(a) Total chip interconnect length  (b) Average chip interconnect length

The distribution of the interconnections within the metal layers is also of significant importance. It is clear that the smaller multiplier sizes make use of the lowest metal layers, leaving the upper layers open for global routing lines. By relieving the interconnect congestion within these layers, the more compact designs allow for greater routability of the overall design. In addition, the dimensions of the upper metal layers are, by their fundamental design, aimed at global signal distribution, and offer no advantage to local signal wiring (and may even introduce detrimental inductance effects).

On a separate note, it has been demonstrated that the use of 4:2 compressors reduces the total cell and interconnect counts in partial product reduction arrays (Figure 3.17 and Figure 3.18 on page 61 respectively). As mentioned previously in Chapter 3, the 4:2 compressors provide a further advantage in interconnections, since approximately one third of the overall wire count is devoted to the short interstage horizontal carry paths. Coupled with their high distribution regularity, this makes the 4:2 compressor an ideal candidate for use in future multiplier algorithms, and a reduction subcell which promotes the locally optimized methodology for arithmetic circuit composition.

Chapter 6

Recursive Multiplication

The notion of carrying out multiplication by breaking up the operands into smaller sections has been in existence for several decades. Such schemes offer several advantages over performing standard multiplication. By breaking a large multiplication into recursions of smaller multiplications, the regularity of the design is increased, since smaller multipliers are inherently less complex. In addition, fewer, shorter interconnects are required to carry out the multiplication, with a limited number of global lines used to collect the final outputs of each recursion.

The name “Recursive Multiplier” may at first appear misleading, since in the implementation of this algorithm there are no recursions, or repeated iterations of the same procedure. The process is simply broken down into smaller sub-processes which are carried out in parallel. However, for consistency with the original authors [18], the same nomenclature is adopted.

This chapter will examine the essence of recursive multiplication, beginning with an overview and proof of this concept. Simulation results will then be presented to support its potential advantages over conventional schemes.

6.1 Overview of the Recursive Multiplication Algorithm

6.1.1 Background Information

One of the pioneering schemes for "divide and conquer", or recursive, multiplication was proposed by Karatsuba and Ofman in 1962, and translated from Russian into English in 1963 [22]. The Karatsuba-Ofman Algorithm (KOA) computes the product of two long integers by executing multiplications and additions on their divided parts. The KOA as described by Christof Paar [127] allows for a low complexity multiplier in Galois fields. A field is an algebraic structure in which the operations of addition, subtraction, multiplication, and division (except by zero) can be performed while satisfying the standard rules. A Galois field is a finite field with $p^n$ elements, where $p$ is a prime integer; it may be generated as the set of polynomials with coefficients taken modulo $p$, reduced modulo an irreducible polynomial of degree $n$ [128].

The discussion of fields and the Karatsuba-Ofman Algorithm is beyond the scope of this thesis; however, the fundamental principles of the KOA are used in the recursive algorithm presented by Danysh and Swartzlander [18]. Mathematically, the recursive algorithm may be proven by first considering two unsigned $n$-bit operands, the multiplier $X$ and multiplicand $A$:

$$X = \sum_{k=0}^{n-1} x_k \cdot 2^k$$

$$A = \sum_{k=0}^{n-1} a_k \cdot 2^k$$

By dividing each of the two operands into two $m$-bit values, where $m = n/2$, we obtain:

$$X = \sum_{k=0}^{m-1} x_k \cdot 2^k + \sum_{k=m}^{2m-1} x_k \cdot 2^k$$

$$A = \sum_{k=0}^{m-1} a_k \cdot 2^k + \sum_{k=m}^{2m-1} a_k \cdot 2^k$$

$X$ and $A$ may now be redefined as:

$$X = X_L + X_H$$

$$A = A_L + A_H$$

where:

$$X_L = \sum_{k=0}^{m-1} x_k \cdot 2^k$$

$$X_H = \sum_{k=m}^{2m-1} x_k \cdot 2^k$$

$$A_L = \sum_{k=0}^{m-1} a_k \cdot 2^k$$

$$A_H = \sum_{k=m}^{2m-1} a_k \cdot 2^k$$

The overall multiplication of A and X is given by:

$$\begin{aligned}
P &= A \cdot X \\
&= (A_L + A_H) \cdot (X_L + X_H) \\
&= A_L \cdot X_L + A_L \cdot X_H + A_H \cdot X_L + A_H \cdot X_H \\
&= P_0 + P_1 + P_2 + P_3
\end{aligned}$$

Therefore, the overall multiplication may be reduced to four smaller multiplications, and this process may be repeated using even smaller multipliers for the base multipliers. In order to minimize the delay introduced by subdividing the process, the results of the sub-multipliers, or the intermediary products, will be kept in carry save form; hence only one final fast adder will be required to yield the final product.
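The decomposition above can be checked numerically. The following is a minimal Python sketch, assuming unsigned operands and an even operand width $n$. The function names `split_operand` and `single_level_products` are hypothetical and introduced only for illustration. As in the derivation, the high halves retain their $2^m$ weighting.

```python
def split_operand(value, n):
    """Split an unsigned n-bit value into weighted low and high halves.

    The high half keeps its 2**m weight, so value == low + high.
    """
    m = n // 2
    low = value & ((1 << m) - 1)
    high = value - low
    return low, high


def single_level_products(a, x, n):
    """Return (P0, P1, P2, P3) for one level of recursion on n-bit operands."""
    a_l, a_h = split_operand(a, n)
    x_l, x_h = split_operand(x, n)
    return a_l * x_l, a_l * x_h, a_h * x_l, a_h * x_h


a, x, n = 0xBEEF, 0xCAFE, 16
p0, p1, p2, p3 = single_level_products(a, x, n)
print(p0 + p1 + p2 + p3 == a * x)
```

In hardware the four products would be generated in parallel by $n/2$-bit sub-multipliers and kept in carry save form; the sketch only confirms the arithmetic identity.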

Each of the 4 $n$-bit intermediary products in carry save format will occupy a given series of bit positions. By examining the results of the expanded multiplication outlined above, the following relationship for the positions may be deduced:

$$\begin{aligned}
P_0 &\Rightarrow [0 : n-1] \\
P_1 &\Rightarrow \left[ \frac{n}{2} : \frac{3n}{2} - 1 \right] \\
P_2 &\Rightarrow \left[ \frac{n}{2} : \frac{3n}{2} - 1 \right] \\
P_3 &\Rightarrow [n : 2n-1]
\end{aligned}$$

A dot diagram representation of the multiplication is outlined in Figure 6.1, and a schematic representation is provided in Figure 6.2. It becomes apparent that 3 intermediary products overlap at every bit position from $n/2$ to $3n/2 - 1$. Since each product is held in carry save form, this leads to 6 bits per position that must be reduced to 2 to provide one final product in carry save form. A 6:2 reduction scheme has been proposed for the recursive multiplier [18][129], which introduces at most an equivalent delay of three full adders. The reduction circuit is formed by an interconnection of variations of the reduction sub-block depicted in Figure 6.3.

**Figure 6.1** Dot diagram of a single level recursive \(n\)-bit multiplication

**Figure 6.2** A schematic of a single level recursive multiplier

6.2 6:2 Reduction Circuitry

The main function of the reduction circuit is to reduce the four results generated by the intermediary multipliers down to one value in carry save format. The 6:2 reduction block is composed of a chain of full adders. Each reduction sub-block takes anywhere from two to six input bits and generates a two bit output value, along with inter-block carry signals that propagate laterally along the reduction sub-block array. Figure 6.3 outlines a typical 6:2 reduction sub-block as proposed in [18]. Similar to the 4:2 compressor, the offset nature of the carry signals prevents carry propagation across the reduction macrocells, ensuring a maximum delay of 3 full adders for the complete process.

![Diagram of a 6:2 reduction circuit]

**Figure 6.3** A standard 6:2 reduction macrocell composed of 3 stages of full adders
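To illustrate why three full-adder levels suffice for a 6:2 reduction, the following minimal Python sketch models the reduction at the word level with carry save adders. It is an assumption-laden illustration rather than a reproduction of the macrocell wiring in Figure 6.3, which operates on individual bit columns with lateral carries; the function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """One row of full adders ((3,2) counters) applied bitwise to three words.

    Returns (sum_word, carry_word) with a + b + c == sum_word + carry_word.
    """
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carry


def reduce_six_to_two(words):
    """Reduce six words to one carry save pair in three full-adder levels."""
    w0, w1, w2, w3, w4, w5 = words
    s1, c1 = carry_save_add(w0, w1, w2)   # level 1
    s2, c2 = carry_save_add(w3, w4, w5)   # level 1, in parallel
    s3, c3 = carry_save_add(s1, c1, s2)   # level 2
    return carry_save_add(s3, c3, c2)     # level 3


words = [0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC]
final_sum, final_carry = reduce_six_to_two(words)
print(final_sum + final_carry == sum(words))
```

No carry ripples along a word at any level: each carry bit is simply shifted one position to the left, which corresponds to the offset carry signals described above.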

Kim and Swartzlander [129] introduced a set of enhanced reduction sub-blocks to be used where the reduction process takes in 2, 3, 4 or 5 bit inputs. The circuits as defined in the manuscript are presented in Figure 6.4. It should be noted that although the circuits are not entirely efficient in their objective, they provide regularity in the reduction chain, and are capable of receiving and transmitting the carry signals without disrupting the chain.

In the architecture, two $n$-bit operands are bisected, resulting in 4 $n/2$-bit sub-multiplications. The overall input to the reduction circuit arrives as a set of four $n$-bit values in carry save format, the outputs of the four intermediary multipliers. This may be more clearly seen if the dot diagram representing the overall process in Figure 6.1 on page 134 is re-analyzed. It may be observed that the first $n/2$ bits of the reduction circuit output are obtained directly from the output of the multipliers. Consequently, the reduction circuitry need only accommodate the remaining $3n/2$ bits of the product.

The reduction pattern leads to the simple expression for the allocation of the reduction sub-blocks for an $n \times n$ bit multiplier as:

- Bits 0 through $n/2 - 1$ are obtained directly as a result of the inputs.
- Bits $n/2$ through $3n/2 - 1$ are obtained via 6-input reduction blocks.
- Bits $3n/2$ through $2n - 1$ are obtained using 2-input reduction blocks.
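The allocation listed above follows from counting how many intermediary products cover each bit position. The sketch below is a minimal Python illustration under the assumption that $P_0$ occupies bits $0$ to $n-1$, $P_1$ and $P_2$ occupy bits $n/2$ to $3n/2-1$, and $P_3$ occupies bits $n$ to $2n-1$, with each product contributing two bits per position in carry save form. The function name `column_heights` is hypothetical.

```python
def column_heights(n):
    """Number of bits in each product column before the 6:2 reduction."""
    half = n // 2
    # (first_bit, last_bit) occupied by each intermediary product.
    spans = {
        "P0": (0, n - 1),
        "P1": (half, 3 * half - 1),
        "P2": (half, 3 * half - 1),
        "P3": (n, 2 * n - 1),
    }
    heights = []
    for bit in range(2 * n):
        covering = sum(1 for lo, hi in spans.values() if lo <= bit <= hi)
        heights.append(2 * covering)  # two bits per carry save product
    return heights


heights = column_heights(8)
print(heights)
```

Columns of height 2 need no reduction on the low side, columns of height 6 map to the 6-input blocks, and the upper columns of height 2 map to the 2-input blocks that absorb the lateral carries.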

An implementation of the reduction circuit superior to that proposed by the original authors has been developed. Observing the lateral carry propagation in the 2-input reduction block in Figure 6.4 on page 137, a carry signal has been grounded, and there is extensive use of Half Adder blocks. With some minor alteration, a simplified version of this block may be formed which accepts the lateral carry signals from the 6-input block preceding it and omits the carry signal to the subsequent block. With the lateral carry signal excluded, the remainder of the 2-input reduction blocks may be replaced by an array of half adders, without any carry propagation delay.

A modified version of the presented reduction scheme is outlined in Figure 6.5 on page 138. The reduction sub-blocks having 6 inputs exist from bit position $n/2$ to $3n/2 - 1$. The transition to the 2-input sub-blocks is composed of a two input reduction cell, a full adder, and a series of half adders for the remaining bits.

**Figure 6.4** 6:2 Macrocells capable of receiving a variety of input bits  
(a) 5 input  (b) 4 input  (c) 3 input  (d) 2 input

**Figure 6.5** Novel 6:2 Reduction Block

The need for the Half-Adder blocks may not be obvious at first glance; however, if the carry-out of the Full-Adder immediately preceding the cells is taken into consideration, there will be three bits in position $3n/2+1$. Consequently, the use of the Half-Adder cells will shift a carry bit laterally down the chain, allowing the final result to have at most two bits in each position. Furthermore, it should be noted that the final carry-out signal of the last Half Adder is omitted, since it is mathematically impossible to obtain a bit in the 129th position of a 64-bit multiplication. This new 6:2 reduction configuration will be used for any further analysis of the recursive multiplication algorithm.

6.3 Analysis Of The Base Multiplier

To examine the optimal size of the base multiplier required, and consequently the number of recursions, it is imperative to examine the relationship between base multiplier size and the associated delay. This analysis will begin with the delay calculation provided in [18]. It is assumed that each Full-Adder has a gate delay of 3 (2 gate delays for an XOR and 1 for other gates), and that 1 gate delay is required for the initial partial product generation. Assuming that $2[\log_2 (b) - 1]$ stages are required for a $b$-bit Dadda multiplier, we expect a delay of:

$$D_{Dadda} = 3 \times 2[\log_2 (b) - 1] + 1$$
$$= 6 \log_2 (b) - 5$$

If $n$ is the number of bits of the overall multiplier, and $b$ the bits of the base sub-multiplier, then the number of recursions required is:

$$\log_2 (\frac{n}{b}) = \log_2 (n) - \log_2 (b)$$

Since each recursion will require a 6:2 reduction stage, corresponding to 9 gate delays, the delay relationship for the overall scheme using a $b$-bit base Dadda multiplier will be:

$$D_{recursive} = 9 \log_2 (\frac{n}{b}) + 6 \log_2 (b) - 5$$
$$= 9 \log_2 (n) - 3 \log_2 (b) - 5$$

Similarly, the overall delay may be calculated for a $b$-bit Array base multiplier by defining the delay of an array multiplier as:

$$D_{Array} = 3 \times (b - 1) + 1$$
$$= 3b - 2$$

$$D_{recursive} = 9 \log_2 (\frac{n}{b}) + 3b - 2$$
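The gate-delay expressions above can be tabulated directly. The following is a minimal Python sketch of the model attributed to [18] in the text, assuming $n$ and $b$ are powers of two; the function names are hypothetical.

```python
from math import log2


def dadda_gate_delay(b):
    """Gate delays of a b-bit Dadda multiplier: 6*log2(b) - 5."""
    return 6 * log2(b) - 5


def array_gate_delay(b):
    """Gate delays of a b-bit array multiplier: 3*b - 2."""
    return 3 * b - 2


def recursive_gate_delay(n, b, base_delay):
    """9 gate delays per 6:2 reduction level plus the base multiplier delay."""
    return 9 * log2(n / b) + base_delay(b)


for b in (64, 32, 16, 8, 4):
    dadda = recursive_gate_delay(64, b, dadda_gate_delay)
    array = recursive_gate_delay(64, b, array_gate_delay)
    print(f"b={b:2d}  Dadda base: {dadda:5.0f}  Array base: {array:5.0f}")
```

Under this model every halving of a Dadda base multiplier adds a net 3 gate delays, which is the $-3 \log_2 (b)$ term in the closed form above.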

As per the analysis carried out in Section 3.2.2 on page 38, the simplification for the number of required reduction stages as defined in [18] is not the most accurate. Consequently, the above calculations will be re-examined using the more accurate representations developed in chapter 3.

To begin, the delay analysis will be considered as a function of the delay associated with one (3,2) counter. In this manner, any approximation associated with the standard gate delay is avoided; as outlined in the analysis of full adders in Chapter 4, such an approximation would introduce further inaccuracies. Thus, the simplified analysis for the delay of the recursive multiplier using a Dadda base multiplier is reduced to a function of the number of reduction stages required in the $b$-bit base multiplier:

$$D_{Dadda} = \left\lfloor \log_{1.5} \left( \frac{3b}{5} \right) \right\rfloor$$

and the number of recursions:

$$\log_2 \left( \frac{n}{b} \right) = \log_2 (n) - \log_2 (b)$$

where each reduction stage has a delay of one full adder, and each recursion will have a delay of 3 full adders. Thus the overall relation for an $n$-bit multiplication is:

$$D_{Recursive_D} = D_{Dadda} + 3 \cdot \log_2 \left( \frac{n}{b} \right)$$

$$\therefore D_{Recursive_D} = \left\lfloor \log_{1.5} \left( \frac{3b}{5} \right) \right\rfloor + 3 \cdot \log_2 \left( \frac{n}{b} \right)$$

The delay associated with a recursive multiplier having an Array base multiplier is somewhat less complicated, and is given as:

$$D_{Recursive_A} = D_{Array} + 3 \cdot \log_2 \left( \frac{n}{b} \right)$$

$$\therefore D_{Recursive_A} = (b - 1) + 3 \cdot \log_2 \left( \frac{n}{b} \right)$$
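The full-adder-delay expressions above can be evaluated with a short script. The sketch below is a minimal Python illustration, assuming $n/b$ is a power of two and taking the Dadda stage-count expression exactly as printed above; the function names are hypothetical.

```python
from math import floor, log


def recursion_levels(n, b):
    """log2(n / b), computed exactly for power-of-two ratios."""
    return (n // b).bit_length() - 1


def dadda_stages(b):
    """Reduction stages of a b-bit Dadda base multiplier, as printed above."""
    return floor(log(3 * b / 5, 1.5))


def recursive_delay_dadda(n, b):
    """Delay in full adders with a Dadda base multiplier."""
    return dadda_stages(b) + 3 * recursion_levels(n, b)


def recursive_delay_array(n, b):
    """Delay in full adders with an Array base multiplier."""
    return (b - 1) + 3 * recursion_levels(n, b)


base_sizes = (64, 32, 16, 8, 4)
array_delays = [recursive_delay_array(64, b) for b in base_sizes]
dadda_delays = [recursive_delay_dadda(64, b) for b in base_sizes]
print(array_delays)
print(dadda_delays)
```

For $n = 64$ the Array expression evaluates to 63, 34, 21, 16 and 15 full-adder delays for base sizes of 64 down to 4, which agrees with the "Overall Delay (FAs)" row of the comparison table for the 64-bit recursive multiplier.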

These simplified representations of the various delays allow an approximation of the relative delay that may be expected from the various configurations of the recursive multiplier architecture. In order to identify the optimal solution, the delays of this architecture have been analyzed for the combinations of base multiplier size and base multiplier type. Figure 6.6 (a) depicts the relationship between delay and overall operand size for various sizes of Array base multipliers. Likewise, Figure 6.6 (b) illustrates the relationship between delay and overall operand size for various sizes of Dadda base multipliers. Equivalent Array and Dadda multipliers not using the recursive architecture have been provided for benchmark comparisons in each graph.

**Figure 6.6** Delay associated with various base-multiplier sizes  
(a) Array base multipliers (b) Dadda base multipliers

Figure 6.7 provides a three-dimensional plot portraying the effects of base multiplier size on overall delay for various overall multiplier sizes using a Dadda base multiplier. From this figure, it becomes evident that the most efficient recursive multiplier corresponds to only one level of recursion, thus adding at most one full adder delay to that of a typical non-recursive Dadda tree.

Furthermore, from the analysis it is reasonable to conclude that the recursive architecture using column compression (or Dadda type) base multipliers demonstrates considerable performance gains over array base multipliers (Figure 6.6). The array base multipliers exhibit similar delays only in the case where several recursion stages are employed, at which point the benefits of regularity of the array multipliers will be completely overshadowed by the 6:2 reduction stages.

Table 2.1 on page 143 summarizes an analytic comparison of the two base-multiplier styles (Dadda and Array) for a 64-bit recursive multiplier. The complete analysis chart is provided for further reference in Appendix D. The total cell count and interconnect for the two sub-multiplier schemes are presented graphically in Figure 6.8 on page 144. In addition, Figure 6.9 on page 144 depicts the percentage increase of the various base multiplier sizes for each scheme in both cell and interconnect segment count, over a typical non-recursive 64-bit multiplier.

From these results, it can be concluded that there are obvious limitations to the extent of the application of the concept of locally optimized arrays. There is a definitive trade-off in both size and performance once the recursions become excessively small. The increase in the number of interconnect segments also poses a secondary limitation to the utility of the recursive scheme once the base multiplier becomes too small. As stated in Chapter 5, the interconnects are in fact shorter as the multiplier size decreases; however, the total number of nets may pose a problem. Since each level of recursion increases the number of required multipliers by a factor of four, the area, speed, and power trade-offs in using a significant number of small base multipliers may in fact be disadvantageous to the overall design.
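The factor-of-four growth noted above, and the percentage rows of the comparison table for the 64-bit case, can be checked with a few lines of Python. This is only a sketch: the cell counts are copied from the "Total Cell Count" row of that table, and the helper name `percent_increase` is hypothetical.

```python
n = 64
base_sizes = [64, 32, 16, 8, 4]
cell_counts = [3969, 4069, 4277, 4725, 5749]  # "Total Cell Count" row

# Each level of recursion quadruples the number of base multipliers.
multiplier_counts = [(n // b) ** 2 for b in base_sizes]


def percent_increase(values):
    """Percentage increase of each entry over the first (non-recursive) one."""
    reference = values[0]
    return [round(100.0 * (v - reference) / reference, 2) for v in values]


cell_increase = percent_increase(cell_counts)
print(multiplier_counts)
print(cell_increase)
```

The same helper applied to the interconnection and delay rows reproduces their percentage entries.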
Table 2.1. Analytic comparison of “Dadda” vs. “Array” base multipliers of different widths for a 64-bit recursive multiplier

<table>
<thead>
<tr>
<th>Base Multiplier Size</th>
<th>64</th>
<th>32</th>
<th>16</th>
<th>8</th>
<th>4</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of Multipliers</td>
<td>1</td>
<td>4</td>
<td>16</td>
<td>64</td>
<td>256</td>
</tr>
<tr>
<td>Total Cell Count</td>
<td>3969</td>
<td>4069</td>
<td>4277</td>
<td>4725</td>
<td>5749</td>
</tr>
<tr>
<td>% Increase - Cell Count</td>
<td>0.00%</td>
<td>2.52%</td>
<td>7.76%</td>
<td>19.05%</td>
<td>44.85%</td>
</tr>
<tr>
<td>Total Interconnections</td>
<td>7938</td>
<td>8073</td>
<td>8357</td>
<td>8981</td>
<td>10453</td>
</tr>
<tr>
<td>% Increase - Interconnections</td>
<td>0.00%</td>
<td>1.70%</td>
<td>5.28%</td>
<td>13.14%</td>
<td>31.68%</td>
</tr>
<tr>
<td>Overall Delay (FAs)</td>
<td>63</td>
<td>34</td>
<td>21</td>
<td>16</td>
<td>15</td>
</tr>
<tr>
<td>% Increase - Delay</td>
<td>0.00%</td>
<td>-46.03%</td>
<td>-66.67%</td>
<td>-74.60%</td>
<td>-76.19%</td>
</tr>
</tbody>
</table>
**Figure 6.8** Base multiplier comparisons for a 64-bit recursive multiplier  
(a) total cell count  (b) total interconnect segment count

**Figure 6.9** Percentage increase in cell count and interconnect count for various sizes of Array and Dadda base multipliers

Chapter 7

Reconfigurable Multiplier Architecture

The successful design of high-speed computational systems is often predicated on the realization of advanced arithmetic circuits in digital hardware. The notion of reconfigurable architectures has been regarded as a means of adapting the hardware to achieve optimal performance under various conditions. This implies a level of intelligence built into the device for physical modification in order to meet operating requirements. The principal advantage of such systems rests in the fact that hardware realizations of computing algorithms outperform their software alternatives.

The intent of a reconfigurable architecture is to provide a means by which the performance of arithmetic hardware may be enhanced according to the desired function. For example, many modern DSP chips offer variable precision [130][131], or fault tolerant arithmetic implemented using software [132]. The operation of these devices may be improved if such functions are executed directly in hardware. Since multiplication is considered to be the dominant computation in most digital signal processing (DSP) algorithms [19], a reconfigurable multiplier architecture may prove to be a desirable augmentation to existing ALUs in the quest for maximizing performance.

7.1 Introduction to DSP Multiplication

Digital signal processing has become such an ingrained element of modern society that its existence in our daily activities often goes unnoticed. It is present in one shape or form in almost all aspects of contemporary life. The continued development of this field is an area of significant research. There are typically two facets to modern processors: the software and instruction set that manipulates the signals, and the dedicated hardware that physically carries out the given commands. The hardware dictates the limitations of the instruction set and how those instructions may be carried out, thus determining the general characteristics and performance of the device.

As outlined in Figure 2.1 on page 13, multiplication is one of the most important operations in advanced signal processing, and the framework of the dedicated hardware multiplier has a significant impact on the overall system. In some specific signal processing applications, linear transforms with predefined, fixed coefficients are used (such as the DCT). In this situation, multiplier-less processing may be carried out using a combination of shift-add algorithms, look-up tables, and/or distributed arithmetic [133] (distributed arithmetic (DA) is a bit-serial operation that performs an inner (dot) product of a pair of vectors). Additionally, unconventional arithmetic techniques and number systems may be employed to take advantage of a particular attribute of the data being processed in order to improve the overall system performance; the double-base number system (DBNS) is an excellent example of this [30].

Special purpose processors offer limited programmability (such as the redefinition of local coefficients) allowing for increased processing speeds, thus trading generality for improvements in power, area and/or speed. General purpose microprocessors, microcontrollers, and DSP chips do not often have the luxury of a priori knowledge of the incoming signals, and so must be flexible in their architecture to accommodate a variety of instructions and data patterns. For this reason, conventional arithmetic circuitry must be present. The increased fabrication densities in the past few decades have opened the door for the amalgamation of special purpose and general purpose architectures, such as the TMS320C8x generation that contains several DSP cores, a RISC microcontroller and a floating point ALU [134]. Hitachi offers yet another example of a combined architecture with their DSP-enhanced microcontroller, the SH-DSP, which adds a 16-bit data path to the original microcontroller for increased functionality [132]. The two distinct data paths in this device rely on one processor core, and so concurrent functionality is not possible.

### 7.1.1 Computation Parallelism

Concurrent, or parallel, processing is an obvious solution to increased productivity within a device. By augmenting a data path with extra execution units, such as the use of two multipliers instead of just one, the amount of work performed per instruction cycle is significantly increased. Superscalar architectures issue and execute multiple instructions simultaneously, and are employed in high-performance processor designs, such as the Pentium and the PowerPC. The immense gains achievable through parallelism and the founding principles of superscalar and massively parallel architectures are thoroughly presented in [135].

Single Instruction Multiple Data (SIMD) techniques are one of the most common examples of parallel processing employed in general purpose processors, such as the MMX and AltiVec extensions on the Pentium and PowerPC processors respectively. SIMD is useful in applications with high levels of parallelism, where numeric operations may be accelerated by dissecting a large data word into shorter ones and operating on them in parallel using one single instruction. SIMD capabilities are also becoming common in high performance DSP chips [132], since these techniques can greatly increase the rate of computation for vector operations, image processing and solving partial differential equations [125].
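The idea of dissecting a wide word into independent shorter lanes can be emulated in software. The following minimal Python sketch adds four 16-bit lanes packed in a 64-bit word without letting carries cross lane boundaries. It illustrates the principle only and does not model the MMX or AltiVec instruction sets; the names `pack`, `unpack` and `packed_add16` are hypothetical.

```python
LANE_BITS = 16
LANES = 4


def pack(lanes):
    """Pack four 16-bit values into one 64-bit word (lane 0 in the low bits)."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFFFF) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & 0xFFFF for i in range(LANES)]


def packed_add16(a, b):
    """Lane-wise 16-bit addition (modulo 2**16) on packed 64-bit words."""
    low_mask = 0x7FFF7FFF7FFF7FFF
    high_mask = 0x8000800080008000
    # Add the low 15 bits of every lane; the sum cannot spill into the next lane.
    partial = (a & low_mask) + (b & low_mask)
    # Fold in the top bit of each lane without generating a carry-out.
    return partial ^ ((a ^ b) & high_mask)


result = unpack(packed_add16(pack([0xFFFF, 1, 2, 3]), pack([1, 1, 1, 1])))
print(result)
```

Lane 0 wraps around to zero without disturbing lane 1, which is the behaviour a hardware SIMD unit obtains by breaking the carry chain at the lane boundaries.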

### 7.1.2 Variable Data Width

An excellent example of the presence of variable data widths within modern architectures is the SH-DSP chip. This hybrid controller/DSP chip utilizes an extended data bus width to accommodate both operations. Since the hybrid architecture requires 16-bits for the microcontroller and 32-bits for the DSP data buses, the primary processing core must be able to accommodate both operand widths. A more common example is the Pentium IV processor, where SIMD instructions operate on 8, 16, 32 or 64-bit operands [79]. In addition, many modern DSP chips [130][131] offer variable precision arithmetic execution depending on the processor mode. As mentioned earlier, SIMD operations may split words into smaller chunks for parallel operations. Some SIMD processors support multiple data widths (16-bit, 8-bit), examples of which include the Lucent DSP16xxx, ADI ADSP-2116x, and ADI TigerSHARC [136].

The IEEE Floating Point Standard [6] outlines the requirement for the two standard word widths, namely single and double precision. It is necessary for most DSP floating point cores to be able to support both of the precision lengths (Figure 2.12 on page 28). The sign (s), exponent (e) and fraction or mantissa (f) bits total 32 and 64 bits for single and double precision respectively. The particulars of precision conversion, rounding schemes and other details of this standard are beyond the scope of the current discussion.
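For reference, the field widths defined by IEEE 754 are 1, 8 and 23 bits (sign, exponent, fraction) for single precision and 1, 11 and 52 bits for double precision. These widths are taken from the standard rather than from the text above. The minimal Python sketch below extracts the three fields using the standard `struct` module; the function name `float_fields` is hypothetical.

```python
import struct


def float_fields(value, double=False):
    """Return the (sign, exponent, fraction) bit fields of an IEEE 754 value."""
    if double:
        (bits,) = struct.unpack(">Q", struct.pack(">d", value))
        exp_bits, frac_bits = 11, 52
    else:
        (bits,) = struct.unpack(">I", struct.pack(">f", value))
        exp_bits, frac_bits = 8, 23
    sign = bits >> (exp_bits + frac_bits)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << frac_bits) - 1)
    return sign, exponent, fraction


single = float_fields(-1.5)
double = float_fields(-1.5, double=True)
print(single, double)
```

The fraction fields (23 and 52 bits, plus the implied leading one) are the operands a floating point multiplier must handle, which is why a core supporting both formats needs to accommodate two significand widths.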

As is the case in the Pentium IV [137], the majority of processors achieve double and double extended format floating point multiplication through the iterated use of single-precision multipliers. Though effective, the formulation of double-precision multiplication using iterative single-precision operations is an inefficient compromise compared to variable precision arithmetic using dedicated hardware. The floating point co-processor outlined in [130], for example, uses a 59-bit double precision multiplier, and its partial product CSA tree is partially decoupled for single precision applications.

### 7.1.3 Fault Tolerance

No single event has provoked as much interest and debate in fault tolerant arithmetic as the Pentium division bug [138]. The work of Thomas Nicely, in 1994, led to the discovery that one in every eight billion inputs to the division circuitry yielded inaccurate results. Though this "bug" appears to be a design flaw of minor significance due to its infrequent occurrence, Intel was nevertheless forced to re-issue the processor with the problem corrected. It is imperative for a scientist carrying out an analysis or mathematically intensive simulation to be able to have a high level of confidence in the end result. The same requirement may be extended to any computation device, especially those carrying out critical processes.

Checking hardware functionality is currently a more practical approach to fault tolerance. Irrespective of high performance demands, the increasing complexity of physical devices makes embedded fault detection and correction units a commercially wise decision. Fault tolerant hardware may be employed to detect a fault, and either correct the situation if a transient fault has occurred, or signal a critical error in the case of a permanent physical breakdown. The essence of fault tolerance lies in redundancy [139], whether time redundancy, where an operation is repeated in time, or hardware redundancy, where duplicate hardware carries out parallel versions of an operation. Several schemes will now be presented, including those targeted for fault tolerant multiplication [139-145].

A simple multiplication check, which may be used for basic error detection, is the comparison of residues [140]. If, for example, A × B = C, then ((A mod r) × (B mod r)) mod r = C mod r, for a random, relatively small value r. Thus if the residue computed from the operands is equal to C mod r, the multiplication is taken to have been successful. Another unique approach to error detection, limited solely to array multipliers, is presented in [141]. The authors present the bi-directional operation (BIDO) implementation of an array multiplier where the idle cells are re-used for the repeated calculation of the product. The normal and repeated operations are performed simultaneously by taking advantage of the time and space redundancy in the architecture with little hardware overhead and performance degradation.
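The residue comparison can be expressed in a few lines. The following is a minimal Python sketch of the check only, not of the hardware proposed in [140]; the function name `residue_check` and the choice $r = 7$ are assumptions made for illustration.

```python
def residue_check(a, b, c, r):
    """Return True when the claimed product c is consistent with a * b modulo r."""
    return ((a % r) * (b % r)) % r == c % r


a, b, r = 1234, 5678, 7
good = a * b
print(residue_check(a, b, good, r))       # correct product passes
print(residue_check(a, b, good + 1, r))   # an error of 1 is detected
print(residue_check(a, b, good + r, r))   # an error that is a multiple of r is missed
```

As the last line shows, the check detects an error only when the error is not a multiple of $r$, so it offers detection with limited coverage rather than a guarantee of correctness.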

One proposal for the correction of physical faults is through the employment of a fault tolerant technique referred to as Reprocessing with MicrO Delays (REMOD) [142]. This fault detection is based on the principle of node covering, in which the circuit is decomposed into an array of identical cells, each checked by a covering cell. This scheme allows for the detection of faulty cells, and the reconfiguration of the overall circuit to disengage a faulty cell from the data path for future computation. It offers single fault detection and correction for a 32-bit Wallace tree multiplier with a 43% and 48% increase in time and area respectively.

Time redundancy for error correction simply refers to the repetition of an operation one or more times to detect and restore potential faults. One of the most frequently employed forms of redundant fault tolerance is known as Triple-Modular Redundancy (TMR). This highly reliable scheme is often used for critical applications and has been proposed as a viable choice for general deep submicron systems [143]. The TMR framework involves three copies of an original logic or arithmetic unit, and a majority voting circuitry, used to select the majority output amongst the three units.
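The majority voting step of a TMR arrangement can be sketched as a bitwise function of the three module outputs. The following minimal Python example is an illustration under the assumption that the three results are available as integers of equal width; the function name `majority_vote` is hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)


correct = 0x1234 * 0x5678
faulty = correct ^ (1 << 5)  # one module with a single flipped bit

single_fault = majority_vote(correct, faulty, correct)
double_fault = majority_vote(faulty, faulty, correct)
print(single_fault == correct, double_fault == correct)
```

A single faulty module is out-voted, whereas two modules that fail in the same bit position out-vote the correct one, so the voter then passes the erroneous value.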

REcomputing with Duplication With Comparison (REDWC) may be used to detect errors, or extended to REcomputing with Triplication With Voting (RETWV) for error correction. Duplication, though hardware intensive, offers one of the more infallible means of fault tolerance. A basic RETWV multiplier is outlined in Figure 7.1 on page 151 [144], where an $n \times n$ multiplication is carried out through three iterations of $n/3 \times n$ sub-multiplications. The three sub-multipliers featured carry out three identical versions of the multiplication with their results passed on to a majority voter circuit. By exploiting time redundancy through the iterative procedure, hardware overhead has been limited in this TMR application.

The concept of RETWV is further extended in [145] with the proposal of a time-shared triple modular redundancy with alternating logic scheme. This methodology overcomes the problem of the majority voter generating a wrong output when two or more of the redundant modules have stuck-at faults in the same position. By monitoring cases where the majority voter does not receive identical inputs, and inverting the signals with potential stuck-at faults, 100% permanent fault detection is achievable. This comes at the expense of a 37% delay increase over a typical TMR scheme.

A separate issue from stuck-at faults is that of common mode failures (CMF), which arise as a result of a flaw in the original design. Naturally, three copies of the identical circuit having a CMF at the same position will yield erroneous results that go undetected in a TMR scheme. To overcome this problem, Fault Tolerance by Shifted and Rotated Operands (TOSHIRO) in TMR architectures has been proposed in [143]. The scheme proposes a standard execution in two of the units, and execution on a shifted and rotated version of the same operand in the third unit. The result is then unrotated and shifted back prior to the majority voter.

Figure 7.1  Time redundant RETWV error correcting multiplier [144]
7.2 Reconfigurable Architectures

The principles of reconfigurable hardware were substantiated with the introduction of the first class of Field Programmable Gate Arrays (FPGAs). The ability to modify hardware via software control gave the end-user an alternative to using cumbersome software subroutines for executing operations on predetermined hardware resources. Programmable or reconfigurable architectures offer the benefits of:

- shorter design time, and faster time to market
- performance gains over software alternatives
- flexibility to make changes, catering to evolving standards
- field upgradability, and modification for prolonged time to obsolescence
- re-use of the same platform for a variety of applications, and variations of the same product
- conformity to targeted application for optimal performance, power and/or functionality

Although variable precision and reconfigurable architectures have been suggested in the past [126, 146-152], the focus has predominantly been on FPGA implementations [146-149]. Such techniques do indeed offer a considerable performance edge without the need for additional software overhead; however, dedicated application specific hardware offers the potential for increased savings in resources, power and latency. The design in [146] is flexible enough to accommodate most basic arithmetic operations especially targeted at multimedia applications. A reconfigurable multiplier is outlined in [147] that uses the Karatsuba Multiplication algorithm, which is similar in style to the recursive multiplier, but is employed in finite field mathematics.
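
To make the resemblance to the recursive multiplier concrete, one level of Karatsuba multiplication can be sketched as below. The sketch works over ordinary integers for simplicity, whereas [147] applies the algorithm to finite fields; the function name and the `half_bits` parameter are illustrative assumptions.

```python
def karatsuba_one_level(a: int, b: int, half_bits: int) -> int:
    """One level of Karatsuba: three half-width products instead of four."""
    mask = (1 << half_bits) - 1
    a_hi, a_lo = a >> half_bits, a & mask
    b_hi, b_lo = b >> half_bits, b & mask

    hi = a_hi * b_hi
    lo = a_lo * b_lo
    # The middle term a_hi*b_lo + a_lo*b_hi is recovered from one extra product.
    mid = (a_hi + a_lo) * (b_hi + b_lo) - hi - lo

    return (hi << (2 * half_bits)) + (mid << half_bits) + lo
```

The operands are split into halves exactly as in a recursive multiplier, but the two cross products are replaced by a single product of sums and two subtractions.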

The Reconfigurable Arithmetic Processor (RAP), intended for DSP applications, is presented in [150]. This is one of the first proposals for dedicated hardware reconfigurability, where the functionality of the hardware is dictated by the control criteria. This scheme combines the simplicity of transmitting and executing serial arithmetic, with the performance advantages of functional parallelism. By arranging a sequence of serial arithmetic cells, the RAP is capable of interconnecting the cells in accordance with a given function, such that entire arithmetic formulas may be calculated without the intermediate results going off chip or to local memory. The RAP is improved upon in [151] by identifying several of the most dominant arithmetic operations for DSP applications, and devising a scheme to carry out these functions using efficient reconfiguration. A shared hardware architecture composed of several interacting modules, configured via control signals, is once again employed.

A reconfigurable fixed hardware parallel inner processor has been suggested [152]. The complexity associated with the multiple levels recommended, as a result of the small base multiplier sizes, is in direct contrast to the derivation of optimum delay for recursive structures presented in previous chapters. However, the concepts presented in [147] and [152] correspond to the reconfigurable multiplier architecture that will be presented in the upcoming section.

7.3 Proposed Multiplier

Relaxing the requirement of 100% correctness for devices and interconnects may dramatically reduce costs of manufacturing, verification, and test. Such a paradigm shift is likely forced in any case by technology scaling, which leads to more transient and permanent failures of signals, logic values, devices, and interconnects. Potential solutions are adaptive and self-correcting/self-repairing circuits, and the use of on-chip reconfigurability. - SIA 2001 Technology Roadmap [9]

The proposed multiplication architecture brings together the concepts of fault tolerant computing, low power design, and high throughput arithmetic in one design. The scheme utilizes a 2-bit control signal to select one of four modes of operation:

- Double Precision (64-bit) Multiplication (default)
- Single Precision (32-bit) Multiplication
- Dual Single Precision Multiplication (double throughput)
- Single Precision Fault Tolerant Multiplication through a Majority Voter

The recursive multiplier architecture with one level of recursion will be used as the foundation for the reconfigurable architecture. The advantage offered by the recursive multiplication scheme is the use of smaller multipliers to implement a larger operation, which is consistent with the results presented in the previous chapters. This structure indirectly promotes the notion of locally optimized arrays through shorter local interconnects, and a more regular integration of the sub-components on a larger scale.

In addition to the basic multipliers, a series of 2:1 multiplexers will be used to guide the signal flow through the device. Since all of the necessary components for each mode of operation are present in the design, there will be no reconfiguration time required. The device will be capable of switching between modes of operation in real time without the necessity to completely reconfigure the internal layout of a programmable device, as is the case with FPGA devices. A schematic representation of the reconfigurable multiplier is provided in Figure 7.2, in which the four sub-multipliers, the reduction circuitry, the voter and the final fast adder are clearly defined.

Figure 7.2 Outline of the reconfigurable multiplier

This architecture lends itself to four modes of operation, and thus requires a 2-bit control signal for selection. The signal and the corresponding modes of operation are summarized in Table 2.1.

<table>
<thead>
<tr>
<th>Control Signal</th>
<th>Mode of Operation</th>
</tr>
</thead>
<tbody>
<tr>
<td>00</td>
<td>Default - Double precision</td>
</tr>
<tr>
<td>01</td>
<td>Single Precision</td>
</tr>
<tr>
<td>10</td>
<td>Single Precision with Fault Tolerance</td>
</tr>
<tr>
<td>11</td>
<td>Dual Single Precision</td>
</tr>
</tbody>
</table>
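
The control encoding in the table above can be captured in a small behavioural model. The following is a sketch only; the enum and function names are hypothetical, and the count of active base multipliers per mode is taken from the descriptions in Sections 7.3.1 to 7.3.4.

```python
from enum import Enum


class Mode(Enum):
    """2-bit control signal values and the modes they select."""
    DOUBLE = 0b00                  # default, double precision
    SINGLE = 0b01                  # single precision
    SINGLE_FAULT_TOLERANT = 0b10   # single precision with majority voting
    DUAL_SINGLE = 0b11             # two single precision products at once


# Number of the four base multipliers left active in each mode.
ACTIVE_BASE_MULTIPLIERS = {
    Mode.DOUBLE: 4,
    Mode.SINGLE: 1,
    Mode.SINGLE_FAULT_TOLERANT: 3,
    Mode.DUAL_SINGLE: 2,
}


def decode_mode(control: int) -> Mode:
    """Map a 2-bit control word onto its mode of operation."""
    if not 0 <= control <= 0b11:
        raise ValueError("control signal must fit in 2 bits")
    return Mode(control)
```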

7.3.1 Double Precision Mode

The default double precision mode is simply a recursive multiplier with one level of recursion. This mode of operation reaps the benefits of the recursive multiplier architecture, while bearing no delay penalties and minimal hardware overhead. The majority voter circuitry is disengaged through the multiplexer array (Figure 7.3). The reconfigurable architecture may be of any size, with the restriction that the single precision operand width must be exactly one half of the double precision operand width. To satisfy the IEEE floating point guidelines, a double precision multiplier having 54-bit operands is suggested, with each of the base multipliers being 27 bits wide.
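
One level of recursion amounts to the following arithmetic, shown here as a behavioural Python sketch. The function name and the `width` parameter are assumptions for the example; the reduction circuitry and the final fast adder are collapsed into ordinary integer additions.

```python
def recursive_multiply(a: int, b: int, width: int = 64) -> int:
    """width x width product formed from four (width/2 x width/2) base products."""
    if width % 2 != 0:
        raise ValueError("operand width must be even")
    if not (0 <= a < (1 << width) and 0 <= b < (1 << width)):
        raise ValueError("operands must fit in the given width")

    half = width // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask

    # The four base multipliers operate on half-width operands.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi

    # Weighted sum of the sub-products (reduction followed by the final adder).
    return (hh << width) + ((lh + hl) << half) + ll
```

Calling `recursive_multiply(a, b, width=54)` models the suggested floating point configuration, in which each base product is 27 bits by 27 bits.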

7.3.2 Single Precision Mode

Single precision mode uses gating techniques to shut down three of the base multipliers, effectively cancelling over 75% of the circuit, in addition to the reduction circuitry and the majority voter (Figure 7.4). The final fast adder is also partitioned in the reconfigurable architecture, such that the upper portion of the adder may be shut down in order to avoid spurious transitions, which consume unnecessary power. The effect of this mode of operation is similar to that of clock gating techniques employed in low power design [65-68], used to black out idle portions of a circuit. Moreover, the overall latency now becomes that of the base multiplier, allowing faster operation in single precision mode than would be possible if the entire circuit were active.

The advantage of this scheme is that the single precision multiplication is carried out using a full single precision multiplier, as opposed to shutting down portions of a larger partial product reduction tree, as is proposed in other variable precision schemes [130]. In this manner, both single and double precision operations are carried out at maximum efficiency in terms of area, performance and power.

Figure 7.3 Default double precision mode

Figure 7.4 Single precision mode
7.3.3 Dual Single Precision Mode

If the input bus is configured to allow for two sets of operands occupying the low order and high order bits of the bus, then two operations may be carried out on these operands concurrently. This scenario is employed in existing architectures [130][132], especially those with high levels of parallelism as outlined earlier in this chapter. The reconfigurable architecture is ideal for such applications, where two of the base multipliers may operate in parallel on two different sets of operands, while the remaining two multipliers are inactive (Figure 7.5). This effectively doubles the system throughput, with a latency of a single precision multiplier.

Once again, with the fast adder partitioned into two identical halves, linked via a multiplexed carry signal, two single precision fast additions may be carried out in parallel. This configuration comes at little to no delay overhead in most carry look ahead (CLA) and carry skip addition schemes. The gating of signals into the idle multipliers, in addition to the 6:2 reduction and majority voter circuitries allows for power savings. The idle circuits are not entirely disconnected from the power supply in order to allow for rapid and accurate engagement into any other mode of operation.
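
The partitioned adder with a multiplexed carry can be modelled as follows. This is a behavioural sketch under assumptions: the name `partitioned_add`, the `dual` flag standing in for the mode control, and the choice to discard the final carry-out of each half are all illustrative.

```python
def partitioned_add(x: int, y: int, dual: bool, half_bits: int = 64) -> int:
    """Add two (2 * half_bits)-wide words with a 2:1 mux on the mid-point carry."""
    mask = (1 << half_bits) - 1

    lo_sum = (x & mask) + (y & mask)
    carry_out = lo_sum >> half_bits

    # 2:1 multiplexer: pass the carry for one wide addition, block it for two
    # independent half-width additions.
    carry_into_upper = 0 if dual else carry_out

    hi_sum = ((x >> half_bits) & mask) + ((y >> half_bits) & mask) + carry_into_upper
    return ((hi_sum & mask) << half_bits) | (lo_sum & mask)
```

With `dual=False` the two halves behave as a single wide adder; with `dual=True` the lower and upper halves produce two independent sums, one for each single precision product.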

Figure 7.5 Dual single precision mode
7.3.4 Single Precision Fault-tolerant Mode

Although there are numerous methods of implementing fault tolerance in digital systems, one of the most basic methods is majority voting between three duplicate values, which is also referred to as RETWV. Since this scheme is composed of four identical base multipliers, three of those may be used in conjunction with an array of 64 XOR gates and 2:1 MUX cells to form a simple single precision fault tolerant multiplier (Figure 7.6). This architecture lends itself to any variation of TMR fault detection and correction scheme. A RETWV method is employed due to its simplicity and limited hardware overhead; however, the majority voter may in fact incorporate any of the other more sophisticated schemes outlined earlier in this chapter.
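
One way in which an array of XOR gates and 2:1 MUX cells can realise a majority vote is sketched below. The exact wiring of the voter is not given in the text, so the arrangement shown (compare the first two products, and fall back on the third wherever they differ) is an assumption, as is the function name.

```python
def xor_mux_voter(p0: int, p1: int, p2: int, width: int = 64) -> int:
    """Bitwise majority of three products using one XOR and one 2:1 MUX per bit."""
    mask = (1 << width) - 1

    # XOR array: a 1 marks every bit position where p0 and p1 disagree.
    disagree = (p0 ^ p1) & mask

    # MUX array: keep the p0 bit where p0 and p1 agree, otherwise take the p2 bit,
    # which must match whichever of p0 and p1 is correct under a single fault.
    return ((p0 & ~disagree) | (p2 & disagree)) & mask
```

For any single faulty input the output equals the value shared by the other two inputs, which is the behaviour required of a TMR voter.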

Figure 7.6 Single precision - fault tolerant mode

With the theoretical framework for the reconfigurable multiplier architecture in place, the next chapter will focus on the implementation and simulation details.
Chapter 8

Modeling and Simulation

For the proper assessment of the performance characteristics of the proposed reconfigurable architecture, a valid model must be created, and compared against a benchmark model representing the state-of-the-art. For this reason a 64-bit reconfigurable multiplier has been designed using TSMC 0.18 μm, 6 metal layer technology. Additionally, a standard 64-bit Booth-recoded Wallace tree multiplier, similar to that employed in many of today's high performance processors, such as the Pentium IV [137], has been developed as a benchmark for comparison purposes.

This chapter will begin with a description of the hierarchical design of the multiplier using the Verilog Hardware Description Language (HDL). The implementation, simulation and comparison of the design against the benchmark will be the subsequent topics of discussion.

8.1 HDL Model

Verilog describes a digital design as a set of modules, which are the basic building blocks forming the complete system. This hierarchical design methodology is a fundamental concept in Verilog digital designs. The reconfigurable multiplier design features four 32-bit Booth-recoded Wallace-tree base multipliers, a 48-bit 6:2 reduction block and two 64-bit carry-look-ahead adders. The overhead from the additional features is four arrays of 32 2:1 MUX cells, two arrays of 64 2:1 MUX cells, and a series of 64 XOR gates and 2:1 MUX cells for the majority voter. All of these individual modules are enveloped by the top level module, which acts as a general input/output (I/O) interface for the multiplier, as outlined in Figure 8.1 (a). The top level module contains the clocked latching circuitry required for design synthesis (Figure 8.1 (b)), and does not affect the internal configuration of the multiplier itself.

Figure 8.1 (a) Top level module of the HDL model
(b) Top level module outlining the input output clocked-latching circuitry
The multiplier itself is partitioned into the major sections as outlined previously, and depicted graphically in Figure 8.2. The coded model is marginally more complicated than the initial model of the multiplier presented in the previous chapter. This is due to the placement of the multiplexer blocks required to direct the signal paths. Built-in Synopsys module definitions for Booth-recoded Wallace tree multipliers and carry-look-ahead adders have been used to model and synthesize portions of the code, with the aim of obtaining the most efficient synthesized netlist.

The Verilog code that defines the various modules and their interaction is provided in Appendix C. The code itself is over 1000 lines long, and consists of 17 different modules. A complete breakdown of the hierarchical expansion of the overall reconfigurable architecture, compiled by the Synopsys Design Analyzer, confirms the proper framework of the design. The standard cell components and the gate level configuration of each element may also be referenced from this file.

Figure 8.2 An illustrated representation of the HDL model of the multiplier
8.2 Implementation and Layout

The multiplier has been implemented in the TSMC 0.18 micron CMOS process using standard cell libraries provided by the Canadian Microelectronics Corporation. Semi-custom design makes use of standard cell libraries for the fabrication of custom integrated circuits. Since the designer is limited to the library cells at his disposal, having fixed layout and orientation, the design may not be fully optimized. However, through the use of powerful CAD packages, capable of quickly and effectively compiling a digital design using standard cell libraries, the development time of such chips is far shorter. This design methodology is by and large the most economic and practical means by which a custom ASIC design can reach the marketplace within a "technology window".

Full custom VLSI design allows for an increased opportunity for performance improvement of a chip, since the placement, geometry, and virtually every aspect of a design on a transistor level is adjustable and amendable by the architect. The end result will usually demonstrate superior characteristics such as low power dissipation, high operating frequencies and/or reduced silicon area. Unfortunately, the development timeline is generally far too long for such design methodologies to produce an economically viable product.

The Cadence Design Suite, including the AreaPdp and Silicon Ensemble tools, has been used for the layout placement and routing of the cells. The design has been developed to meet fabrication requirements; however, due to the large size and I/O bound nature of the end result, fabrication was not feasible. An I/O bound design refers to a design having substantially more input and output pad overhead than the actual silicon requirements for the integrated circuit core itself. This leads to inefficient fabrication due to the large area of silicon that would be wasted. Figure 8.3 on page 163 provides a view of the placed and routed reconfigurable multiplier architecture. Due to the I/O bound nature of the design, the core of the multiplier, without input buffers or output drivers, is the only component that is implemented. Since the intent of this design is to be incorporated into a larger processor, the isolated IC implementation was not explored.
8.3 Simulation Results

Table 8.1 summarizes the breakdown of the area, power and delay for the reconfigurable multiplier. The percentage contributions for each of the components of this architecture are presented in Figure 8.4 on page 164. The four base multipliers account for the majority of the area and power consumption. The final carry-look-ahead fast adders (two 64-bit adder halves forming the 128-bit adder), though minimal in terms of area, are responsible for a considerable portion of the overall latency. The complete simulation report and log files for the reconfigurable multiplier are provided in Appendix E.
Table 8.1 Breakdown of the parameters for the Reconfigurable Multiplier

<table>
<thead>
<tr>
<th></th>
<th>Area (sq um)</th>
<th>Power (mW)</th>
<th>Delay (ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Overall</td>
<td>443322.75</td>
<td>753.01</td>
<td>9.00</td>
</tr>
<tr>
<td>RecMult</td>
<td>381895.59</td>
<td>613.38</td>
<td>8.74</td>
</tr>
<tr>
<td>Fast Adder</td>
<td>10944.59</td>
<td>19.78</td>
<td>3.75</td>
</tr>
<tr>
<td>Mux Reduction</td>
<td>9025.44</td>
<td>11.52</td>
<td>0.11</td>
</tr>
<tr>
<td>Reduction</td>
<td>28670.47</td>
<td>51.76</td>
<td>1.27</td>
</tr>
<tr>
<td>Mux Majority</td>
<td>6728.42</td>
<td>8.57</td>
<td>0.00</td>
</tr>
<tr>
<td>Majority</td>
<td>5724.29</td>
<td>10.98</td>
<td>0.00</td>
</tr>
<tr>
<td>Multipliers</td>
<td>320802.38</td>
<td>510.77</td>
<td>3.49</td>
</tr>
<tr>
<td>I/O Registers</td>
<td>61427.16</td>
<td>139.64</td>
<td>0.26</td>
</tr>
</tbody>
</table>

Figure 8.4 Breakdown of the Reconfigurable Multiplier
(a) Area (b) Power (c) Delay
In order to obtain a valid measure of the characteristics and the performance of the proposed architecture, it is compared to a 64-bit conventional multiplier. A Booth-recoded Wallace-tree implementation of a multiplier having a carry-look-ahead fast adder is compiled in Synopsys, and fully laid out, in a similar fashion to the reconfigurable multiplier. Table 8.2 provides the comparison statistics of the two architectures. The first implementation of the benchmark multiplier, named *Mult64d*, produced results that significantly favoured the reconfigurable multiplier. The new proposal outperformed the conventional design in practically every statistical category, including overall latency, coupling capacitance, average net length, area, and silicon area utilization.

A second conventional multiplier (*Mult64e*) has been recompiled using more stringent design constraints for reaffirmation of the results. In this case, the conventional multiplier drastically reduced the overall performance margins, beating the proposed architecture in area and total wiring requirements, thus reducing the capacitive coupling effects. Although the reconfigurable multiplier was not compiled with such high effort settings, it managed to top both conventional implementations in area utilization, power, and delay. Furthermore, the new design provides the lowest average net length.

The delay calculations were performed using the Cadence Pearl Timing Analyzer tool. This tool provides confidence in the accuracy of the latency calculations, which incorporate the interconnect and layout effects of the two architectures. Pearl's analysis is path-oriented and reports longest paths and timing violations. To ensure accuracy, Pearl also takes the gate input slew rate and interconnect resistance and capacitance (which can be back-annotated) into account during delay calculations. Because Pearl is a static timing analyzer, it traces and analyzes all paths in the circuit. This thorough path analysis underpins the reliability of the timing verification. A dynamic timing analyzer can only check paths exercised by the simulation vectors; the quality of its results therefore depends on the quality of the supplied simulation vectors. Appendix F provides the complete simulation reports and logs of the three test multipliers.
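
The idea behind static timing analysis, finding the worst arrival time over every path without any input vectors, can be illustrated with a toy longest-path computation. This is not how Pearl is implemented; the function name is hypothetical, the graph is reduced to three blocks whose delays are the reconfigurable multiplier entries of Table 8.2, and their serial ordering is an assumption.

```python
from functools import lru_cache
from typing import Dict, Iterable, Tuple


def longest_path_delay(
    delays: Dict[str, float],
    fanin: Dict[str, Tuple[str, ...]],
    outputs: Iterable[str],
) -> float:
    """Worst-case arrival time over every path of an acyclic block graph."""

    @lru_cache(maxsize=None)
    def arrival(node: str) -> float:
        # A node settles once its latest input has arrived and its own
        # propagation delay has elapsed.
        latest_input = max((arrival(p) for p in fanin.get(node, ())), default=0.0)
        return latest_input + delays[node]

    return max(arrival(node) for node in outputs)


# Block-level delays in ns for the reconfigurable multiplier (Table 8.2).
DELAYS = {"io": 0.26, "multiplier": 4.99, "fast_adder": 3.75}
# Assumed serial ordering of the three blocks.
FANIN = {"multiplier": ("io",), "fast_adder": ("multiplier",)}

if __name__ == "__main__":
    print(f"{longest_path_delay(DELAYS, FANIN, ['fast_adder']):.2f} ns")
```

Because every path is traversed, the result does not depend on any choice of stimulus, which is the property that distinguishes static from dynamic timing analysis.
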
Table 8.2 Summary of the 64-bit multiplier implementations

<table>
<thead>
<tr>
<th></th>
<th>Reconfigurable Multiplier</th>
<th>Standard Mult64d</th>
<th>Standard Mult64e</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Summary</strong></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Number of Components</td>
<td>17158.00</td>
<td>13851.00</td>
<td>9882.00</td>
</tr>
<tr>
<td>Number of Pins</td>
<td>87986.00</td>
<td>70539.00</td>
<td>54738.00</td>
</tr>
<tr>
<td>Number of Nets</td>
<td>18908.00</td>
<td>15589.00</td>
<td>11630.00</td>
</tr>
<tr>
<td>Average Number of Pins per Net</td>
<td>4.65</td>
<td>4.52</td>
<td>4.71</td>
</tr>
<tr>
<td>Area of Chip (square DBU)</td>
<td>7.1333E+11</td>
<td>7.6971E+11</td>
<td>5.9581E+11</td>
</tr>
<tr>
<td>Area Required by all Cells</td>
<td>4.4332E+11</td>
<td>3.7372E+11</td>
<td>3.6037E+11</td>
</tr>
<tr>
<td>Area Utilization (%)</td>
<td>62.15</td>
<td>48.55</td>
<td>60.48</td>
</tr>
<tr>
<td><strong>Wiring</strong></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Total segments in regular wiring</td>
<td>147400.00</td>
<td>120407.00</td>
<td>110441.00</td>
</tr>
<tr>
<td>Total segments in special wiring</td>
<td>419.00</td>
<td>44.00</td>
<td>1140.00</td>
</tr>
<tr>
<td>Total wirelength in regular wiring (um)</td>
<td>1308570.68</td>
<td>1535832.92</td>
<td>1226989.02</td>
</tr>
<tr>
<td>Total wirelength in special wiring (um)</td>
<td>115260.48</td>
<td>16556.08</td>
<td>96835.80</td>
</tr>
<tr>
<td>Average regular wiring net length (um)</td>
<td>8.88</td>
<td>12.76</td>
<td>11.11</td>
</tr>
<tr>
<td>Average special wiring net length (um)</td>
<td>275.08</td>
<td>376.27</td>
<td>84.94</td>
</tr>
<tr>
<td>METAL 1</td>
<td>162348.96</td>
<td>70292.58</td>
<td>127593.76</td>
</tr>
<tr>
<td>METAL 2</td>
<td>278562.04</td>
<td>318176.90</td>
<td>232618.42</td>
</tr>
<tr>
<td>METAL 3</td>
<td>441000.74</td>
<td>510327.86</td>
<td>432153.96</td>
</tr>
<tr>
<td>METAL 4</td>
<td>334782.50</td>
<td>412856.72</td>
<td>314007.86</td>
</tr>
<tr>
<td>METAL 5</td>
<td>171144.40</td>
<td>181818.78</td>
<td>162043.82</td>
</tr>
<tr>
<td>METAL 6</td>
<td>35992.52</td>
<td>58916.16</td>
<td>55407.00</td>
</tr>
<tr>
<td>TOTAL WIRING (um)</td>
<td>1423831.16</td>
<td>1552389.00</td>
<td>1323824.82</td>
</tr>
<tr>
<td>Average net length (um)</td>
<td>75.30</td>
<td>99.58</td>
<td>113.83</td>
</tr>
<tr>
<td>Max crosstalk induced timing delta (ns)</td>
<td>1.34</td>
<td>2.22</td>
<td>1.19</td>
</tr>
<tr>
<td><strong>Power</strong></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Cell Internal Power (mW)</td>
<td>322.65</td>
<td>410.48</td>
<td>417.97</td>
</tr>
<tr>
<td>Net Switching Power (mW)</td>
<td>290.73</td>
<td>337.17</td>
<td>309.52</td>
</tr>
<tr>
<td>Total Dynamic Power (mW)</td>
<td>613.38</td>
<td>747.65</td>
<td>727.49</td>
</tr>
<tr>
<td>Cell Leakage Power (uW)</td>
<td>20.46</td>
<td>16.94</td>
<td>13.61</td>
</tr>
<tr>
<td><strong>Timing</strong></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Data Arrival Time (ns)</td>
<td>9.00</td>
<td>11.18</td>
<td>9.92</td>
</tr>
<tr>
<td>I/O Delay (ns)</td>
<td>0.26</td>
<td>1.10</td>
<td>2.23</td>
</tr>
<tr>
<td>Multiplier (ns)</td>
<td>4.99</td>
<td>4.88</td>
<td>4.37</td>
</tr>
<tr>
<td>Fast Adder (ns)</td>
<td>3.75</td>
<td>5.20</td>
<td>3.32</td>
</tr>
</tbody>
</table>
Table 8.3 Comparative summary of the 64-bit multipliers

<table>
<thead>
<tr>
<th></th>
<th>Reconfigurable Multiplier</th>
<th>Standard Average</th>
<th>Percentage Change</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Summary</strong></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Number of Components</td>
<td>17158.00</td>
<td>11866.50</td>
<td>30.84%</td>
</tr>
<tr>
<td>Number of Pins</td>
<td>87986.00</td>
<td>62638.50</td>
<td>28.81%</td>
</tr>
<tr>
<td>Number of Nets</td>
<td>18908.00</td>
<td>13609.50</td>
<td>28.02%</td>
</tr>
<tr>
<td>Average Number of Pins per Net</td>
<td>4.65</td>
<td>4.62</td>
<td>0.75%</td>
</tr>
<tr>
<td>Area of Chip (square DBU)</td>
<td>7.1333E+11</td>
<td>6.8276E+11</td>
<td>4.29%</td>
</tr>
<tr>
<td>Area Required by all Cells</td>
<td>4.4332E+11</td>
<td>3.6704E+11</td>
<td>17.21%</td>
</tr>
<tr>
<td>Area Utilization (%)</td>
<td>62.15</td>
<td>54.52</td>
<td>12.28%</td>
</tr>
<tr>
<td><strong>Wiring</strong></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Total segments in regular wiring</td>
<td>147400.00</td>
<td>115424.00</td>
<td>21.69%</td>
</tr>
<tr>
<td>Total segments in special wiring</td>
<td>419.00</td>
<td>592.00</td>
<td>-41.29%</td>
</tr>
<tr>
<td>Total wirelength in regular wiring (um)</td>
<td>1308570.68</td>
<td>1381410.97</td>
<td>-5.57%</td>
</tr>
<tr>
<td>Total wirelength in special wiring (um)</td>
<td>115260.48</td>
<td>56695.94</td>
<td>50.81%</td>
</tr>
<tr>
<td>Average regular wiring net length (um)</td>
<td>8.88</td>
<td>11.93</td>
<td>-34.41%</td>
</tr>
<tr>
<td>Average special wiring net length (um)</td>
<td>275.08</td>
<td>230.61</td>
<td>16.17%</td>
</tr>
<tr>
<td>METAL 1</td>
<td>162348.96</td>
<td>98943.17</td>
<td>39.06%</td>
</tr>
<tr>
<td>METAL 2</td>
<td>278562.04</td>
<td>275397.66</td>
<td>1.14%</td>
</tr>
<tr>
<td>METAL 3</td>
<td>441000.74</td>
<td>471240.91</td>
<td>-6.86%</td>
</tr>
<tr>
<td>METAL 4</td>
<td>334782.50</td>
<td>363432.29</td>
<td>-8.56%</td>
</tr>
<tr>
<td>METAL 5</td>
<td>171144.40</td>
<td>171931.30</td>
<td>-0.46%</td>
</tr>
<tr>
<td>METAL 6</td>
<td>35992.52</td>
<td>57161.58</td>
<td>-58.82%</td>
</tr>
<tr>
<td>TOTAL WIRING (um)</td>
<td>1423831.16</td>
<td>1438106.91</td>
<td>-1.00%</td>
</tr>
<tr>
<td>Average net length (um)</td>
<td>75.30</td>
<td>106.71</td>
<td>-41.70%</td>
</tr>
<tr>
<td>Max crosstalk induced timing delta (ns)</td>
<td>1.34</td>
<td>1.71</td>
<td>-27.24%</td>
</tr>
<tr>
<td><strong>Power</strong></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Cell Internal Power (mW)</td>
<td>322.65</td>
<td>414.23</td>
<td>-28.38%</td>
</tr>
<tr>
<td>Net Switching Power (mW)</td>
<td>290.73</td>
<td>323.35</td>
<td>-11.22%</td>
</tr>
<tr>
<td>Total Dynamic Power (mW)</td>
<td>613.38</td>
<td>737.57</td>
<td>-20.25%</td>
</tr>
<tr>
<td>Cell Leakage Power (uW)</td>
<td>20.46</td>
<td>15.27</td>
<td>25.33%</td>
</tr>
<tr>
<td><strong>Timing</strong></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Data Arrival Time (ns)</td>
<td>9.00</td>
<td>10.55</td>
<td>-17.22%</td>
</tr>
</tbody>
</table>
As previously mentioned, the implementation of the final fast adder makes a significant impact on the overall design. For this reason, the proposed reconfigurable implementation may be further improved upon by compiling and placing the final fast adder using the highest effort settings available on the CAD tools. This was the exact procedure followed in obtaining an improved benchmark multiplier for the previous comparative study.

The most effective comparison of the architecture would be through the analysis of two fully optimized custom layouts. Semi-custom standard cell implementations have an immense interdependence on the CAD tools used to compile, place and route the designs. The designer is thus at the mercy of the software algorithms used to implement the architecture. As was demonstrated in the study, a minor modification of the design constraints and the specification of a higher degree of effort in the initial compiling of the netlist can drastically alter the end result. Although the most accurate reflection of the proposed architecture's performance should be gauged against the initial layout (*Mult64d*), the overall results will use the average of the two Wallace multipliers as a benchmark. The results of the comparison have been summarized in Table 8.3 on page 167.
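
The percentage figures in Table 8.3 can be reproduced with the short calculation below. The normalization is inferred from the tabulated numbers rather than stated in the text: the difference between the reconfigurable multiplier and the benchmark average appears to be divided by the reconfigurable multiplier's own value. The function name is hypothetical.

```python
def percentage_change(reconfigurable: float, mult64d: float, mult64e: float) -> float:
    """Change relative to the average of the two benchmark multipliers.

    The difference is normalized by the reconfigurable multiplier's value,
    which is the convention that reproduces the entries of Table 8.3.
    """
    standard_average = (mult64d + mult64e) / 2
    return 100.0 * (reconfigurable - standard_average) / reconfigurable


if __name__ == "__main__":
    # Data arrival time (ns) and total dynamic power (mW) from Table 8.2.
    print(round(percentage_change(9.00, 11.18, 9.92), 2))
    print(round(percentage_change(613.38, 747.65, 727.49), 2))
```

A negative value therefore indicates that the reconfigurable multiplier has the smaller figure, as is the case for delay and dynamic power.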

8.4 Design Highlights

The primary consideration taken into account while conducting the research and developing the proposed architecture has been independence of:

- platform
- function
- application
- technology

Many of the "breakthrough" designs in the field of computer arithmetic, especially in digital multiplication designs, have come at the expense of heavy reliance on one of the listed factors above. A design which targets a particular function or application may take advantage of unconventional methodologies for the architecture. This is the case in filter designs. Number systems may be exploited for their various advantages in designs that target a particular platform without the need for conversion. Finally, analog based designs, such as threshold logic, multivalued logic or neural network based arithmetic, all suffer from technology dependence. A layout in one technology may not necessarily work in another.

The obvious desired attributes for the end-product would include one or more of the following characteristics:

- high-speed
- low-power
- fault tolerance
- reduced layout complexity (regularity)

The paradoxical desire to incorporate all of these attributes into one design has been a significant challenge for circuit designers. The reconfigurable architecture attempts to embody all of these characteristics, while granting the end-user the ability to determine the criteria of importance. This concept places the final decision regarding the hardware configuration in the hands of the user, without the performance and area overheads associated with FPGAs or software routines.
Chapter 9

Conclusions

9.1 Summary of Contributions

The aim of this thesis has been to provide a thorough analysis of the state-of-the-art in the area of digital multiplication. The investigations have focused on all aspects of the digital multiplier, on all levels of design abstraction. A summary of the contributions to the field of computer arithmetic will be provided in this section.

9.1.1 Algorithmic Contributions

The majority of the elegant algorithms used today in the field of computer arithmetic were developed many decades ago. Over the years, only marginal enhancements and suggestions have been provided. This thesis contributes to the algorithmic analysis, development and modeling of partial product reduction arrays.

A novel relationship between minimum full adder requirements according to column height has been introduced in Chapter 3. This scheme has been shown to more accurately reflect the number of reduction stages required according to the partial product matrix height, while maintaining algebraic simplicity. The new model substantially outperforms existing logarithmic estimates.
A second algorithmic assessment of partial product reduction arrays has been carried out using 4:2 compressors. A relationship between the partial product matrix size and compressor count and stage count has been developed. This scheme is employed in a new optimized compressor distribution scheme that effectively minimizes the number of compressor cells required in a reduction matrix, while maintaining the overall regularity of the matrix.

Finally, the various configurations of the recursive multiplier have been extensively analyzed using algebraic models of the base-multipliers. The use of tree multipliers over array multipliers has been assessed, and the various sizes of the base multipliers have been examined. It has been demonstrated that the most efficient recursive structure is composed of only one level of recursion using tree (or Dadda) base multipliers.

### 9.1.2 Architectural Contributions

In Chapter 5 the topic of wire interconnects was examined in some detail, along with the scaling issues associated with both wires and devices in general. A design methodology employing the “locally optimized array” framework is outlined for digital multipliers. The simulation results for various multiplier widths support this framework, as well as the recursive multiplier architecture in general. It has been demonstrated that, in order to overcome the repercussions of interconnects under continued device scaling, high-performance multipliers of the future will need to employ an alternate architecture, such as the one proposed.

The recursive multiplier architecture has been further explored, and a new reduction scheme is proposed which is smaller and more logically efficient than those proposed in the past. The single level recursive multiplier has then been enhanced with some gating multiplexers and a majority voter forming a novel reconfigurable multiplier architecture. This proposal has been implemented and simulated on the latest available CMOS process. Its performance and overall characteristics have been compared against a standard Booth-recoded Wallace multiplier of the same size.
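
The wiring of the voter is not detailed in this summary. As a minimal sketch, assuming the fault-tolerant configuration computes the same product on three base multipliers and votes on the results bit by bit, a majority voter can be modelled as below. `majority_vote` and its arguments are illustrative names, not identifiers from the design.

```python
def majority_vote(p0: int, p1: int, p2: int) -> int:
    """Bitwise majority of three redundant results.

    Each output bit takes the value held by at least two of the three inputs,
    so a fault confined to one of the three results is masked.
    """
    return (p0 & p1) | (p0 & p2) | (p1 & p2)


good = 0x1234_5678 * 0x9ABC_DEF0
faulty = good ^ (1 << 17)          # one result with a single flipped bit
print(majority_vote(good, faulty, good) == good)
```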

### 9.1.3 Transistor Level Contributions

Although no new transistor level designs have been suggested, a framework for the selection, design and accurate simulation of digital circuits has been established. The various logic families and transistor level arrangements have been critically analyzed for both full adders and 4:2 compressors. The shortcomings of pass-transistor logic styles have been presented, and caution has been advised in the use of such design methodologies. Furthermore, a simulation setup has been suggested for the accurate analysis of arithmetic cells in terms of power consumption, area and delay.

9.2 Conclusions

By targeting many of the contemporary requirements for digital systems such as power, performance, regularity, and fault tolerance, a multiplication architecture that embodies these characteristics has been developed. This scheme takes full advantage of the benefits of reconfigurable architectures, while offering performance, power and area characteristics that are normally associated with custom digital designs.

This design has been implemented using the TSMC 0.18 µm CMOS standard cell libraries, and its performance has been compared against a typical multiplier architecture, namely a Booth-recoded Wallace tree multiplier. The simulation results have demonstrated that, on average, the reconfigurable multiplier is extremely efficient in terms of interconnect layout. The new scheme offers savings of over 5% in regular interconnect requirements, resulting in a reduction in excess of 40% in average net lengths. This conservation of wire resources creates the opportunity for a 27% average reduction in coupling capacitance delay. Furthermore, this scheme encourages a refined approach to intercell wiring, in line with the locally optimized array paradigm. This is evident in the 39% increase in low-level, local wiring, accompanied by the elimination of over 58% of the global interconnects from the top-most metal layer.
The reconfigurable multiplier contains an average of 30% more components than the Wallace scheme, yet the overall dimensions are only increased by slightly over 4%. This is a result of the greater area utilization (12%) offered by this design. The small sacrifice in area overhead is more than compensated for by the 20% reduction in dynamic power consumption, and a 17% average decrease in overall latency.

The locally optimized array paradigm introduced in Chapter 5 has demonstrated gains in multiplier architectures, offering a potential solution to the interconnect challenges faced by digital designs in deep-submicron technologies. Moreover, a variation of this concept has been implemented and analyzed against current high performance multipliers. The proposed architecture exhibits exceptional interconnect regularity and promising overall characteristics, while offering far greater versatility than any other multiplier architecture currently available.

## Appendix A

### Complete 4:2 Compressor Analysis

<table>
<thead>
<tr>
<th>NFA</th>
<th>QE</th>
<th>QD</th>
<th>QC</th>
<th>QNCT</th>
<th>NC</th>
<th>NCA</th>
<th>DT</th>
<th>DC</th>
<th>Percentage Decrease in 4:2 cell count</th>
<th>Stage Count and Delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>23.33% 192.00% 13 13 1</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>6</td>
<td>5</td>
<td>6</td>
<td>5</td>
<td>4</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>20.00% 60.00% 49 17 6</td>
<td></td>
</tr>
<tr>
<td>16</td>
<td>20</td>
<td>14</td>
<td>15</td>
<td>14</td>
<td>12</td>
<td>10</td>
<td>12</td>
<td>2</td>
<td>14.29% 42.86% 13 13 4</td>
<td></td>
</tr>
<tr>
<td>50</td>
<td>30</td>
<td>20</td>
<td>21</td>
<td>21</td>
<td>20</td>
<td>20</td>
<td>20</td>
<td>4</td>
<td>14.29% 42.86% 13 13 4</td>
<td></td>
</tr>
<tr>
<td>125</td>
<td>50</td>
<td>44</td>
<td>52</td>
<td>50</td>
<td>50</td>
<td>50</td>
<td>50</td>
<td>4</td>
<td>11.11% 33.33% 13 13 4</td>
<td></td>
</tr>
<tr>
<td>275</td>
<td>70</td>
<td>54</td>
<td>65</td>
<td>65</td>
<td>65</td>
<td>65</td>
<td>65</td>
<td>4</td>
<td>11.11% 33.33% 13 13 4</td>
<td></td>
</tr>
<tr>
<td>625</td>
<td>110</td>
<td>90</td>
<td>100</td>
<td>100</td>
<td>100</td>
<td>100</td>
<td>100</td>
<td>4</td>
<td>7.69% 23.08% 304 294 5</td>
<td></td>
</tr>
<tr>
<td>1375</td>
<td>220</td>
<td>170</td>
<td>200</td>
<td>200</td>
<td>200</td>
<td>200</td>
<td>200</td>
<td>4</td>
<td>7.69% 23.08% 304 294 5</td>
<td></td>
</tr>
<tr>
<td>3050</td>
<td>432</td>
<td>320</td>
<td>400</td>
<td>400</td>
<td>400</td>
<td>400</td>
<td>400</td>
<td>4</td>
<td>7.69% 23.08% 304 294 5</td>
<td></td>
</tr>
<tr>
<td>6091</td>
<td>660</td>
<td>500</td>
<td>620</td>
<td>620</td>
<td>620</td>
<td>620</td>
<td>620</td>
<td>4</td>
<td>7.69% 23.08% 304 294 5</td>
<td></td>
</tr>
<tr>
<td>12092</td>
<td>870</td>
<td>1000</td>
<td>1200</td>
<td>1200</td>
<td>1200</td>
<td>1200</td>
<td>1200</td>
<td>4</td>
<td>7.69% 23.08% 304 294 5</td>
<td></td>
</tr>
<tr>
<td>24094</td>
<td>1360</td>
<td>1400</td>
<td>1600</td>
<td>1600</td>
<td>1600</td>
<td>1600</td>
<td>1600</td>
<td>4</td>
<td>7.69% 23.08% 304 294 5</td>
<td></td>
</tr>
<tr>
<td>48098</td>
<td>2800</td>
<td>2800</td>
<td>3200</td>
<td>3200</td>
<td>3200</td>
<td>3200</td>
<td>3200</td>
<td>4</td>
<td>7.69% 23.08% 304 294 5</td>
<td></td>
</tr>
<tr>
<td>960102</td>
<td>5600</td>
<td>5600</td>
<td>6400</td>
<td>6400</td>
<td>6400</td>
<td>6400</td>
<td>6400</td>
<td>4</td>
<td>7.69% 23.08% 304 294 5</td>
<td></td>
</tr>
</tbody>
</table>

**Note:** The table above represents a portion of the data for the complete 4:2 compressor analysis, detailing the theoretical bound on [4:2] codes, optimized [4:2] strategy, improvement over theoretical bound, and analysis of interconnects and stage count with delay. Each row provides specific values for these parameters, allowing for a comprehensive analysis of the compressor's performance.
<table>
<thead>
<tr>
<th>Number of (2,2) cells required</th>
<th>Number of (2,2) cells required</th>
<th>Optimized [4:2] Strategy</th>
<th>Improvement over Theoretical Bound</th>
<th>Analysis of Interconnects</th>
<th>Stage Count and Delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>NFA</td>
<td>QD</td>
<td>QC</td>
<td>NCT</td>
<td>NCA</td>
<td>DT</td>
</tr>
<tr>
<td>46</td>
<td>1968</td>
<td>1034</td>
<td>1023</td>
<td>1012</td>
<td>6568</td>
</tr>
<tr>
<td>48</td>
<td>1970</td>
<td>1038</td>
<td>1025</td>
<td>1012</td>
<td>6568</td>
</tr>
<tr>
<td>50</td>
<td>2126</td>
<td>1127</td>
<td>1128</td>
<td>1122</td>
<td>7131</td>
</tr>
<tr>
<td>52</td>
<td>2200</td>
<td>1127</td>
<td>1128</td>
<td>1122</td>
<td>7131</td>
</tr>
<tr>
<td>54</td>
<td>2652</td>
<td>1377</td>
<td>1378</td>
<td>1372</td>
<td>8613</td>
</tr>
<tr>
<td>56</td>
<td>2756</td>
<td>1430</td>
<td>1431</td>
<td>1426</td>
<td>9113</td>
</tr>
<tr>
<td>58</td>
<td>2802</td>
<td>1494</td>
<td>1495</td>
<td>1490</td>
<td>9485</td>
</tr>
<tr>
<td>60</td>
<td>2970</td>
<td>1539</td>
<td>1540</td>
<td>1534</td>
<td>9776</td>
</tr>
<tr>
<td>62</td>
<td>3090</td>
<td>1595</td>
<td>1596</td>
<td>1591</td>
<td>1006</td>
</tr>
<tr>
<td>64</td>
<td>3192</td>
<td>1652</td>
<td>1653</td>
<td>1648</td>
<td>1036</td>
</tr>
<tr>
<td>66</td>
<td>3306</td>
<td>1710</td>
<td>1711</td>
<td>1706</td>
<td>1066</td>
</tr>
<tr>
<td>68</td>
<td>3422</td>
<td>1768</td>
<td>1770</td>
<td>1766</td>
<td>1096</td>
</tr>
<tr>
<td>70</td>
<td>3540</td>
<td>1829</td>
<td>1830</td>
<td>1825</td>
<td>1126</td>
</tr>
<tr>
<td>72</td>
<td>3660</td>
<td>1889</td>
<td>1890</td>
<td>1885</td>
<td>1156</td>
</tr>
<tr>
<td>74</td>
<td>3772</td>
<td>1952</td>
<td>1953</td>
<td>1948</td>
<td>1186</td>
</tr>
<tr>
<td>76</td>
<td>3880</td>
<td>2012</td>
<td>2013</td>
<td>2008</td>
<td>1216</td>
</tr>
<tr>
<td>78</td>
<td>4032</td>
<td>2128</td>
<td>2129</td>
<td>2125</td>
<td>1246</td>
</tr>
<tr>
<td>80</td>
<td>4160</td>
<td>2211</td>
<td>2212</td>
<td>2208</td>
<td>1276</td>
</tr>
<tr>
<td>82</td>
<td>4280</td>
<td>2272</td>
<td>2273</td>
<td>2269</td>
<td>1306</td>
</tr>
<tr>
<td>84</td>
<td>4396</td>
<td>2348</td>
<td>2349</td>
<td>2345</td>
<td>1336</td>
</tr>
<tr>
<td>86</td>
<td>4512</td>
<td>2424</td>
<td>2425</td>
<td>2421</td>
<td>1366</td>
</tr>
<tr>
<td>88</td>
<td>4628</td>
<td>2500</td>
<td>2501</td>
<td>2497</td>
<td>1396</td>
</tr>
<tr>
<td>90</td>
<td>4744</td>
<td>2576</td>
<td>2577</td>
<td>2573</td>
<td>1426</td>
</tr>
<tr>
<td>92</td>
<td>4860</td>
<td>2652</td>
<td>2653</td>
<td>2649</td>
<td>1456</td>
</tr>
<tr>
<td>94</td>
<td>4976</td>
<td>2728</td>
<td>2729</td>
<td>2726</td>
<td>1486</td>
</tr>
<tr>
<td>96</td>
<td>5100</td>
<td>2804</td>
<td>2805</td>
<td>2802</td>
<td>1516</td>
</tr>
<tr>
<td>98</td>
<td>5216</td>
<td>2880</td>
<td>2881</td>
<td>2878</td>
<td>1546</td>
</tr>
<tr>
<td>100</td>
<td>5322</td>
<td>2956</td>
<td>2957</td>
<td>2954</td>
<td>1576</td>
</tr>
</tbody>
</table>

## Appendix B

### Interconnect Analysis of Various Multiplier Sizes

<table>
<thead>
<tr>
<th>Summary</th>
<th>64</th>
<th>54</th>
<th>32</th>
<th>16</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of Components</td>
<td>12916</td>
<td>9391</td>
<td>3687</td>
<td>1069</td>
</tr>
<tr>
<td>Number of Pins</td>
<td>64481</td>
<td>46505</td>
<td>17559</td>
<td>5036</td>
</tr>
<tr>
<td>Number of Nets</td>
<td>14409</td>
<td>10388</td>
<td>3915</td>
<td>1140</td>
</tr>
<tr>
<td>Average Number of Pins per Net</td>
<td>4.48</td>
<td>4.48</td>
<td>4.49</td>
<td>4.42</td>
</tr>
<tr>
<td>Area of Chip (square DBU)</td>
<td>6.522E+11</td>
<td>5.068E+11</td>
<td>2.510E+11</td>
<td>1.167E+11</td>
</tr>
<tr>
<td>Area Required by all Cells</td>
<td>3.999E+11</td>
<td>2.988E+11</td>
<td>1.261E+11</td>
<td>4.646E+10</td>
</tr>
<tr>
<td>Area Utilization (%)</td>
<td>61.32</td>
<td>58.96</td>
<td>51.05</td>
<td>39.8</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Wiring</th>
<th>64</th>
<th>54</th>
<th>32</th>
<th>16</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total segments in regular wiring</td>
<td>128059</td>
<td>82116</td>
<td>27057</td>
<td>6834</td>
</tr>
<tr>
<td>Total segments in special wiring</td>
<td>401</td>
<td>338</td>
<td>224</td>
<td>140</td>
</tr>
<tr>
<td>Total wirelength in regular wiring (um)</td>
<td>1512966.26</td>
<td>861731.66</td>
<td>230623.16</td>
<td>49271</td>
</tr>
<tr>
<td>Total wirelength in special wiring (um)</td>
<td>105351.24</td>
<td>78250.08</td>
<td>36160.6</td>
<td>15203.28</td>
</tr>
<tr>
<td>Average regular wiring net length (um)</td>
<td>11.81460311</td>
<td>10.4940774</td>
<td>8.523604243</td>
<td>7.20968686</td>
</tr>
<tr>
<td>Average special wiring net length (um)</td>
<td>262.7212968</td>
<td>231.5091124</td>
<td>161.43125</td>
<td>108.5948571</td>
</tr>
<tr>
<td>METAL 1</td>
<td>146231.92</td>
<td>102118.66</td>
<td>44620.96</td>
<td>17036.44</td>
</tr>
<tr>
<td>METAL 2</td>
<td>273265.32</td>
<td>185962.3</td>
<td>69145.3</td>
<td>17575.84</td>
</tr>
<tr>
<td>METAL 3</td>
<td>475396.2</td>
<td>294621.48</td>
<td>96337.48</td>
<td>25701.42</td>
</tr>
<tr>
<td>METAL 4</td>
<td>388446.9</td>
<td>211710</td>
<td>40386.22</td>
<td>3398.28</td>
</tr>
<tr>
<td>METAL 5</td>
<td>194880.84</td>
<td>111835.58</td>
<td>16193</td>
<td>762.3</td>
</tr>
<tr>
<td>METAL 6</td>
<td>140906.32</td>
<td>33733.72</td>
<td>100.8</td>
<td>0</td>
</tr>
<tr>
<td>TOTAL WIRING (um)</td>
<td>1618317.5</td>
<td>939981.74</td>
<td>266783.76</td>
<td>64474.28</td>
</tr>
<tr>
<td>Average net length (um)</td>
<td>112.3129641</td>
<td>90.487268</td>
<td>68.144</td>
<td>56.55638596</td>
</tr>
<tr>
<td>Max crosstalk induced timing delta (ns)</td>
<td>1.66</td>
<td>1.66</td>
<td>0.467</td>
<td>0.497</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Synopsys</th>
<th>64</th>
<th>54</th>
<th>32</th>
<th>16</th>
</tr>
</thead>
<tbody>
<tr>
<td>Power</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Cell Internal Power (mW)</td>
<td>41.6141</td>
<td>32.1372</td>
<td>15.3479</td>
<td>6.1922</td>
</tr>
<tr>
<td>Net Switching Power (mW)</td>
<td>312.7236</td>
<td>259.548</td>
<td>149.541</td>
<td>70.0727</td>
</tr>
<tr>
<td>Total Dynamic Power (mW)</td>
<td>354.3378</td>
<td>291.6852</td>
<td>164.8889</td>
<td>76.265</td>
</tr>
<tr>
<td>Timing</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Data Arrival Time (ns)</td>
<td>7.37</td>
<td>7.34</td>
<td>6.44</td>
<td>5</td>
</tr>
<tr>
<td>SE</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Total Power (20% toggle rate) (mW)</td>
<td>435.617</td>
<td>359.265</td>
<td>206.188</td>
<td>101.215</td>
</tr>
</tbody>
</table>
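
The derived rows of the summary (area utilization and average net length) follow directly from the raw report figures. The short check below recomputes them for the 64-bit design from the values listed in this appendix; the helper names are illustrative, and the assumption that the average net length is the total regular plus special wire length divided by the number of nets is inferred from the tabulated numbers rather than stated by the tool.

```python
# Values copied from the 64-bit column of the summary and from the
# Silicon Ensemble report for the 64-bit design.
chip_area_dbu2 = 652266014400      # "Area of chip" (square DBU)
cell_area_dbu2 = 399981859200      # "Area required for all cells" (square DBU)
regular_wire_um = 1512966.26       # total wirelength in regular wiring
special_wire_um = 105351.24        # total wirelength in special wiring
net_count = 14409                  # number of nets


def area_utilization_percent(cell_area: float, chip_area: float) -> float:
    """Share of the chip area occupied by cells, in percent."""
    return 100.0 * cell_area / chip_area


def average_net_length_um(total_wire_um: float, nets: int) -> float:
    """Total routed wire length divided by the number of nets."""
    return total_wire_um / nets


utilization = area_utilization_percent(cell_area_dbu2, chip_area_dbu2)
avg_net = average_net_length_um(regular_wire_um + special_wire_um, net_count)
print(f"area utilization: {utilization:.2f} %")
print(f"average net length: {avg_net:.2f} um")
```

The same two ratios can be used to cross-check the 54-, 32- and 16-bit columns against their respective reports.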
64-Bit Multiplier:

***************SILICON_ENSEMBLE DESIGN SUMMARY REPORT***************
Time: 0:12:55, 3 April 2003
Design name: SPIE64
Report file name: SPIE64_apr.summary

Number of macros: 402
Number of components: 12916
Number of pins: 64481
   Number of regular pins: 39776
   Number of special pins: 24305
   Number of unused pins: 353
Number of nets: 14409
Average number of pins per net: 4.48
Number of subnets: 737
   Number of regular pins for subnets: 385
   Number of special pins for subnets: 0
   Number of virtual pins for subnets: 1089
Average number of pins per subnet: 2.00
Number of routing tracks available: 2666
Number of GCELLS per layer: 17835

** UTILIZATION OF ALL ROW TYPES

<table>
<thead>
<tr>
<th>Type</th>
<th>Number</th>
<th>Length</th>
<th>Area</th>
<th>%_Row_Space</th>
</tr>
</thead>
<tbody>
<tr>
<td>core Rows</td>
<td>114</td>
<td>80582040</td>
<td>496385366400</td>
<td></td>
</tr>
<tr>
<td>core Cells</td>
<td>12276</td>
<td>64932120</td>
<td>399981859200</td>
<td></td>
</tr>
</tbody>
</table>

Area of chip: 652266014400 (square DBU)
Area required for all cells: 399981859200 (square DBU)
Area utilization of all cells: 61.32%

***************SILICON_ENSEMBLE WIRING REPORT***************
Time: 0:12:59, 3 April 2003
Design name: SPIE64
Report file name: SPIE64_apr.wires

Total vias in regular wiring: 120310
Total segments in regular wiring: 128059
Total vias in special wiring: 1190
Total segments in special wiring: 401

LAYER name: metal1
  Total wire length: 146231.92 microns
  Length of regular wires: 56000.68 microns
  Length of special wires: 90231.24 microns
LAYER name: metal2
  Total wire length: 273265.32 microns
  Length of regular wires: 258145.32 microns
  Length of special wires: 15120.00 microns
LAYER name: metal3
  Total wire length: 475396.20 microns
  Length of regular wires: 475396.20 microns
  Length of special wires: 0.00 microns
LAYER name: metal4
  Total wire length: 388446.90 microns
  Length of regular wires: 388446.90 microns
  Length of special wires: 0.00 microns
LAYER name: metal5
  Total wire length: 194880.84 microns
  Length of regular wires: 194880.84 microns
  Length of special wires: 0.00 microns
LAYER name: metal6
  Total wire length: 140980.32 microns
  Length of regular wires: 140980.32 microns
  Length of special wires: 0.00 microns

Total wire length in regular wiring: 1512966.26 microns
Total wire length in special wiring: 105351.24 microns

Total wire length in regular+special wiring: 1618317.50 microns

CROSSTALK:
  0 nets claimed N coupling caps, but had different no.
  28 nets had more coupling than total capacitance
  0 pin lists with fewer pins than they said they had.
  0 nets with bizarre cap/unit length
  28 nets had implausible coupling ratios.
  2 wires had implausibly small Rs.
  131 rising drive and 131 falling drive implausibly small.
  14405 nets processed, total wire 1520766.4
  0 unknown nets encountered (maybe with repetitions)
  130 constant nets, length 16591.2 ( 1.1)
  0 nets had errors, length 0.0 ( 0.0)

Max crosstalk induced timing delta is 1.66n, (0 > 1us)
Sum of all error voltages is 0.000 volts

SYNOPSYS SPIE64:
Number of ports: 385
Number of nets: 1025
Number of cells: 641
Number of references: 3

Combinational area: 376566.437500
Noncombinational area: 23417.858469
Net Interconnect area: undefined (Wire load has zero net area)
Total cell area: 399984.281250
Total area: undefined

Cell Internal Power = 41.6141 mW (12%)
Net Switching Power = 312.7236 mW (88%)

-------------------
Total Dynamic Power = 354.3378 mW (100%)
Cell Leakage Power = 26.5031 uW

54-Bit Multiplier:
****************************SILICON_ENSEMBLE DESIGN SUMMARY REPORT***************************
Time: 12:34:56, 26 March 2003
Design name: SPIE54
Report file name: SPIE54.summary

Number of macros: 342
Number of components: 9391
Number of pins: 46505
  Number of regular pins: 28671
  Number of special pins: 17840
  Number of unused pins: 294
Number of nets: 10388
Average number of pins per net: 4.48
Number of subnets: 618
  Number of regular pins for subnets: 325
  Number of special pins for subnets: 0
  Number of virtual pins for subnets: 911
Average number of pins per subnet: 2.00
Number of routing tracks available: 2350
Number of GCELLS per layer: 13824

** UTILIZATION OF ALL ROW TYPES
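
<table>
<thead>
<tr>
<th>Type</th>
<th>Number</th>
<th>Length</th>
<th>Area</th>
<th>%_Row_Space</th>
</tr>
</thead>
<tbody>
<tr>
<td>core Rows</td>
<td>99</td>
<td>60570180</td>
<td>373112308800</td>
<td></td>
</tr>
<tr>
<td>core Cells</td>
<td>8770</td>
<td>48514620</td>
<td>298850059200</td>
<td>80.10</td>
</tr>
</tbody>
</table>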

Area of chip: 506872766400 (square DBU)
Area required for all cells: 298850059200 (square DBU)
Area utilization of all cells: 58.96%

****************************SILICON_ENSEMBLE WIRING REPORT***************************
Time: 2:48:06, 31 March 2003
Design name: SPIE54
Report file name: SPIE54.wires

Total vias in regular wiring: 77755
Total segments in regular wiring: 82116
Total vias in special wiring: 728
Total segments in special wiring: 338

LAYER name: metal1
  Total wire length: 102118.66 microns
  Length of regular wires: 33135.46 microns
  Length of special wires: 68983.20 microns
LAYER name: metal2
  Total wire length: 185962.30 microns
  Length of regular wires: 176695.42 microns
  Length of special wires: 9266.88 microns
LAYER name: metal3
  Total wire length: 294621.48 microns
  Length of regular wires: 294621.48 microns
  Length of special wires: .00 microns
LAYER name: metal4
  Total wire length: 211710.00 microns
  Length of regular wires: 211710.00 microns
  Length of special wires: .00 microns
LAYER name: metal5
  Total wire length: 111835.58 microns
  Length of regular wires: 111835.58 microns
  Length of special wires: .00 microns
LAYER name: metal6
  Total wire length: 33733.72 microns
  Length of regular wires: 33733.72 microns
  Length of special wires: .00 microns
Total wirelength in regular wiring: 861731.66 microns
Total wirelength in special wiring: 78250.08 microns
Total wirelength in regular+special wiring: 939981.74 microns

CROSSTALK:
  0 nets claimed N coupling caps, but had different no.
  14 nets had more coupling than total capacitance
  0 pin lists with fewer pins than they said they had.
  0 nets with bizarre cap/unit length
  14 nets had implausible coupling ratios.
  2 wires had implausibly small Rs.
  111 rising drive and 111 falling drive implausibly small.
1036 nets processed, total wire 866884.9
  0 unknown nets encountered (maybe with repetitions)
  110 constant nets, length 10532.4, ( 1.2%)
  0 nets had errors, length .0, ( 0.0%)
Max crosstalk induced timing delta is 1.66n, (0 > 1us)

SYNOPSYS SPIE54:

Number of ports: 325
Number of nets: 865
Number of cells: 541
Number of references: 3

Combinational area: 279092.906250
Noncombinational area: 19758.616406
Net Interconnect area: undefined (Wire load has zero net area)
Total cell area: 298861.718750
Total area: undefined

Cell Internal Power = 32.1372 mW (11%)
Net Switching Power = 259.5480 mW (89%)
-------------------------
Total Dynamic Power = 291.6852 mW (100%)
Cell Leakage Power = 20.1685 uW

<table>
<thead>
<tr>
<th>Point</th>
<th>Incr</th>
<th>Path</th>
</tr>
</thead>
<tbody>
<tr>
<td>clock clk (rise edge)</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>multiplier_1/out_low_70 (INTERMULT54)</td>
<td>0.00</td>
<td>7.34 r</td>
</tr>
<tr>
<td>OUT1_reg_70_/D (DFFPQ1)</td>
<td>0.00</td>
<td>7.34 r</td>
</tr>
<tr>
<td>data arrival time</td>
<td>7.34</td>
<td></td>
</tr>
<tr>
<td>clock clk (rise edge)</td>
<td>10.00</td>
<td>10.00</td>
</tr>
<tr>
<td>clock network delay (propagated)</td>
<td>0.00</td>
<td>10.00</td>
</tr>
<tr>
<td>clock uncertainty</td>
<td>-0.50</td>
<td>9.50</td>
</tr>
<tr>
<td>OUT1_reg_70_/CK (DFFPQ1)</td>
<td>0.00</td>
<td>9.50 r</td>
</tr>
<tr>
<td>library setup time</td>
<td>-0.08</td>
<td>9.42</td>
</tr>
<tr>
<td>data required time</td>
<td>9.42</td>
<td></td>
</tr>
<tr>
<td>data required time</td>
<td>9.42</td>
<td></td>
</tr>
<tr>
<td>data arrival time</td>
<td>7.34</td>
<td></td>
</tr>
<tr>
<td>slack (MET)</td>
<td>2.09</td>
<td></td>
</tr>
</tbody>
</table>

### 32-Bit Multiplier:

**SILICON_ENSEMBLE DESIGN SUMMARY REPORT**

Time: 11:56:57, 26 March 2003  
Design name: SPIE32  
Report file name: SPIE32.summary

- **Number of macros**: 210  
- **Number of components**: 3687  
- **Number of pins**: 17559  
  - **Number of regular pins**: 10746  
  - **Number of special pins**: 6642  
  - **Number of unused pins**: 11  
- **Number of nets**: 3915  
- **Average number of pins per net**: 4.49  
- **Number of subnets**: 363  
  - **Number of regular pins for subnets**: 193  
  - **Number of special pins for subnets**: 0  
  - **Number of virtual pins for subnets**: 533  
- **Average number of pins per subnet**: 2.00  
- **Number of routing tracks available**: 1654  
- **Number of GCELLS per layer**: 6840

**UTILIZATION OF ALL ROW TYPES**

<table>
<thead>
<tr>
<th>Type</th>
<th>Number</th>
<th>Length</th>
<th>Area</th>
<th>%_Row_Space</th>
</tr>
</thead>
<tbody>
<tr>
<td>core Rows</td>
<td>65</td>
<td>26040300</td>
<td>160408248000</td>
<td></td>
</tr>
<tr>
<td>core Cells</td>
<td>1321</td>
<td>20809140</td>
<td>128184302400</td>
<td>79.91</td>
</tr>
</tbody>
</table>

Area utilization of all cells: 51.05%

**SILICON_ENSEMBLE WIRING REPORT**

Design name: SPIE32  
Report file name: SPIE32.wires

- **Total vias in regular wiring**: 25931  
- **Total segments in regular wiring**: 27057  
- **Total vias in special wiring**: 350  
- **Total segments in special wiring**: 224

**LAYER name: metal1**

- **Total wire length**: 44620.96 microns  
  - **Length of regular wires**: 12973.96 microns  
  - **Length of special wires**: 31647.00 microns

**LAYER name: metal2**

- **Total wire length**: 69145.30 microns  
  - **Length of regular wires**: 64631.70 microns  
  - **Length of special wires**: 4513.60 microns

**LAYER name: metal3**

- **Total wire length**: 96337.48 microns  
  - **Length of regular wires**: 96337.48 microns  
  - **Length of special wires**: 0.00 microns

**LAYER name: metal4**

- **Total wire length**: 40386.22 microns  
  - **Length of regular wires**: 40386.22 microns  
  - **Length of special wires**: 0.00 microns

**LAYER name: metal5**

- **Total wire length**: 16193.00 microns  
  - **Length of regular wires**: 16193.00 microns  
  - **Length of special wires**: 0.00 microns

**LAYER name: metal6**

- **Total wire length**: 100.80 microns  
  - **Length of regular wires**: 100.80 microns  
  - **Length of special wires**: 0.00 microns

**Total wirelength in regular wiring**: 230623.16 microns  
**Total wirelength in special wiring**: 36160.60 microns

**CROSSTALK**

- 0 nets claimed N coupling caps, but had different no.  
- 10 nets had more coupling than total capacitance  
- 0 pins lists with fewer pins than they said they had.  
- 0 nets with bizarre cap/unit length  
- 10 nets had implausible coupling ratios.  
- 2 wires had implausibly small Rs.  
- 67 rising drive and 67 falling drive implausibly small.  
- 3915 nets processed, total wire 232417.7

- 0 unknown nets encountered (maybe with repetitions)  
- 56 constant nets, length 5419.3, (2.53%)  
- 0 nets had errors, length 0.0, (0.0%)  

Max crosstalk induced timing delta is 0.467n, (0 > 1us)

SYNOPSYS SPIE32:
Number of ports: 193
Number of nets: 513
Number of cells: 321
Number of references: 3

Combinational area: 116475.921875
Noncombinational area: 11708.927724
Net Interconnect area: undefined (Wire load has zero net area)
Total cell area: 128114.825662
Total area: undefined

Cell Internal Power = 15.3479 mW (9%)
Net Switching Power = 149.5410 mW (91%)

Total Dynamic Power = 164.8889 mW (100%)
Cell Leakage Power = 9.2980 uW

<table>
<thead>
<tr>
<th>Point</th>
<th>Incr</th>
<th>Path</th>
</tr>
</thead>
<tbody>
<tr>
<td>clock clk [rise edge]</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>multiplier_1/out_low_44 (INTERMULT32)</td>
<td>0.00</td>
<td>6.44 r</td>
</tr>
<tr>
<td>OUT1_reg_44_/D (DFFPQ1)</td>
<td>0.00</td>
<td>6.44 r</td>
</tr>
<tr>
<td>data arrival time</td>
<td></td>
<td>6.44</td>
</tr>
<tr>
<td>clock clk [rise edge]</td>
<td>10.00</td>
<td>10.00</td>
</tr>
<tr>
<td>clock network delay (propagated)</td>
<td>0.00</td>
<td>10.00</td>
</tr>
<tr>
<td>clock uncertainty</td>
<td>-0.50</td>
<td>9.50</td>
</tr>
<tr>
<td>OUT1_reg_44_/CK (DFFPQ1)</td>
<td>0.00</td>
<td>9.50 r</td>
</tr>
<tr>
<td>library setup time</td>
<td>-0.08</td>
<td>9.42</td>
</tr>
<tr>
<td>data required time</td>
<td></td>
<td>9.42</td>
</tr>
<tr>
<td>data required time</td>
<td></td>
<td>9.42</td>
</tr>
<tr>
<td>data arrival time</td>
<td></td>
<td>6.44</td>
</tr>
<tr>
<td>slack (MET)</td>
<td></td>
<td>2.98</td>
</tr>
</tbody>
</table>

16-bit Multiplier:
***************************SILICON_ENSEMBLE DESIGN SUMMARY REPORT***************************

Time: 21:34:48, 30 March 2003
Design name: SPIE16
Report file name: SPIE16.summary

Number of macros: 113
Number of components: 1069
Number of pins: 5036
Number of regular pins: 1096
Number of special pins: 1940
Number of unused pins: 0
Number of nets: 1140
Average number of pins per net: 4.42
Number of subnets: 0
Number of routing tracks available: 1128
Number of GCELLS per layer: 3224

** UTILIZATION OF ALL ROW TYPES

<table>
<thead>
<tr>
<th>Type</th>
<th>Number</th>
<th>Length</th>
<th>Area</th>
<th>%_Row_Space</th>
</tr>
</thead>
<tbody>
<tr>
<td>core Rows</td>
<td>39</td>
<td>9395100</td>
<td>57873816000</td>
<td></td>
</tr>
<tr>
<td>core Cells</td>
<td>970</td>
<td>7543140</td>
<td>46465742400</td>
<td>80.29</td>
</tr>
</tbody>
</table>

Area of chip: 116751835200 (square DBU)
Area required for all cells: 46465742400 (square DBU)
Area utilization of all cells: 39.80%

***************************SILICON_ENSEMBLE WIRING REPORT***************************

Time: 21:34:56, 30 March 2003
Design name: SPIE16
Report file name: SPIE16.wires

Total vias in regular wiring: 6674
Total segments in regular wiring: 6834
Total vias in special wiring: 176
Total segments in special wiring: 140

LAYER name: metal1
Total wire length: 17036.44 microns
Length of regular wires: 4171.72 microns
Length of special wires: 12864.72 microns
LAYER name: metal2  
Total wire length: 17575.84 microns  
Length of regular wires: 15273.28 microns  
Length of special wires: 2302.56 microns  

LAYER name: metal3  
Total wire length: 25701.42 microns  
Length of regular wires: 25701.42 microns  
Length of special wires: .00 microns  

LAYER name: metal4  
Total wire length: 3398.28 microns  
Length of regular wires: 3398.28 microns  
Length of special wires: .00 microns  

LAYER name: metal5  
Total wire length: 762.30 microns  
Length of regular wires: 762.30 microns  
Length of special wires: .00 microns  

Total wirelength in regular wiring: 49271.00 microns  
Total wirelength in special wiring: 15203.28 microns  
Total wirelength in regular+special wiring: 64474.28 microns  

CROSSTALK:  
0 nets claimed N coupling caps, but had different no.  
2 nets had more coupling than total capacitance  
0 pin lists with fewer pins than they said they had.  
0 nets with bizarre cap/unit length  
2 nets had implausible coupling ratios.  
2 wires had implausibly small Rs.  
35 rising drive and 35 falling drive implausibly small.  
140 nets processed, total wire 48744.3  
0 unknown nets encountered (maybe with repetitions)  
34 constant nets, length 2044.3, ( 4.3%)  
0 nets had errors, length 0.0, ( 0.0%)  
Max crosstalk induced timing delta is 0.497ns, (0 > 1us)  

SYNOPSYS SPIE16:  
Number of ports: 97  
Number of nets: 257  
Number of cells: 161  
Number of references: 3  
Combinational area: 40611.406250  
Noncombinational area: 5854.463867  
Net Interconnect area: undefined (Wire load has zero net area)  
Total cell area: 46465.871054  
Total area: undefined  
Cell Internal Power = 6.1922 mW (8%)  
Net Switching Power = 70.0727 mW (92%)  
------------------  
Total Dynamic Power = 76.2650 mW (100%)  
Cell Leakage Power = 3.7045 uW  

<table>
<thead>
<tr>
<th>Point</th>
<th>Incr</th>
<th>Path</th>
</tr>
</thead>
<tbody>
<tr>
<td>clock clk [rise edge]</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>multiplier_1/out_low(21) [INTERMULT16]</td>
<td>0.00</td>
<td>5.00 r</td>
</tr>
<tr>
<td>OUT1_reg_21_/D [DFFPQ1]</td>
<td>0.00</td>
<td>5.00 r</td>
</tr>
<tr>
<td>data arrival time</td>
<td>5.00</td>
<td></td>
</tr>
<tr>
<td>clock clk [rise edge]</td>
<td>10.00</td>
<td>10.00</td>
</tr>
<tr>
<td>clock network delay (propagated)</td>
<td>0.00</td>
<td>10.00</td>
</tr>
<tr>
<td>clock uncertainty</td>
<td>-0.50</td>
<td>9.50</td>
</tr>
<tr>
<td>OUT1_reg_21_/CK [DFFPQ1]</td>
<td>0.00</td>
<td>9.50 r</td>
</tr>
<tr>
<td>library setup time</td>
<td>-0.07</td>
<td>9.43</td>
</tr>
<tr>
<td>data required time</td>
<td>9.43</td>
<td></td>
</tr>
<tr>
<td>data required time</td>
<td>9.43</td>
<td></td>
</tr>
<tr>
<td>data arrival time</td>
<td>5.00</td>
<td></td>
</tr>
<tr>
<td>slack (MET)</td>
<td>4.43</td>
<td></td>
</tr>
</tbody>
</table>
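
The slack figures in the timing reports above can be reproduced from the other entries of each path report. The sketch below assumes the usual setup check, in which the required time is the clock period minus the clock uncertainty and the library setup time; the function name `setup_slack` is illustrative. The 16-bit setup time is taken as 0.07 ns, the value implied by the 9.50 ns and 9.43 ns entries of that report.

```python
def setup_slack(clock_period: float, clock_uncertainty: float,
                setup_time: float, arrival_time: float) -> float:
    """Setup slack in ns: data required time minus data arrival time."""
    required_time = clock_period - clock_uncertainty - setup_time
    return required_time - arrival_time


# (library setup time, data arrival time) in ns, read from the reports above.
reports = {54: (0.08, 7.34), 32: (0.08, 6.44), 16: (0.07, 5.00)}

for width, (setup, arrival) in reports.items():
    slack = setup_slack(10.00, 0.50, setup, arrival)
    print(f"{width}-bit: slack = {slack:.2f} ns")
```

Recomputing from the rounded figures gives 2.08 ns for the 54-bit design against the 2.09 ns printed in its report, presumably because the tool subtracts unrounded values; the 32-bit and 16-bit results match the printed 2.98 ns and 4.43 ns.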

## Appendix C

### Verilog HDL Code for the Reconfigurable Multiplier

******************************************************************************

Author: Pedram Mokrian
Title: RecursiveMultiplierV4.v
Purpose: Main file for Thesis Multiplier

Comments: File revised Feb 5, 2003
Latest file to be used for the Reconfigurable Multiplier

Updates:
March 3, 2002
Change the output size of the fast adder to overcome the previous discrepancies.
Reformatted the text and positioning of the code for enhanced legibility

******************************************************************************

module RecursiveMultiplierV4 (AH,AL,XH,XL,OUT_HIGH,OUT_LOW,CONTROL,clk,reset);

    // High and Low order bits of input A (MULTIPLICAND)
    input [32:1] AH, AL;

    // High and Low order bits of input X (MULTIPLIER)
    input [32:1] XH, XL;

    // Clock and Control Signals
    input clk,reset;
    input [2:1] CONTROL;

    // High and Low Order Output Bits
    output [64:1] OUT_HIGH, OUT_LOW;

/******************************************************************************

NOMENCLATURE
no suffix -> top module ports
_GATED    -> clocked signals that are wired to the RecMult module
******************************************************************************/

    reg [64:1] OUT_LOW, OUT_HIGH;

    wire [64:1] OUT_LOW_GATED, OUT_HIGH_GATED;
    reg [32:1] AH_GATED, AL_GATED, XH_GATED, XL_GATED;

    // The reset is active low: all registers are cleared when reset = 0
    always @ (posedge clk or negedge reset)
    begin
        if (!reset) begin
            AH_GATED <= 32'b0;
            AL_GATED <= 32'b0;
            XH_GATED <= 32'b0;
            XL_GATED <= 32'b0;
            OUT_HIGH <= 64'b0;
            OUT_LOW  <= 64'b0;
        end // if (!reset)

        else begin
            AH_GATED <= AH;
            AL_GATED <= AL;
            XH_GATED <= XH;
            XL_GATED <= XL;
            OUT_HIGH <= OUT_HIGH_GATED;
            OUT_LOW  <= OUT_LOW_GATED;
        end // else: !if(!reset)
    end // always @ (posedge clk or negedge reset)

RecMult recmult
(AH_GATED,AL_GATED,XH_GATED,XL_GATED,OUT_HIGH_GATED,OUT_LOW_GATED,CONTROL);
endmodule


module RecMult (AH,AL,XH,XL,OUT_HIGH,OUT_LOW,CONTROL);

    // High and Low order bits of input A (MULTIPLICAND)
    input [32:1] AH;
    input [32:1] AL;

    // High and Low order bits of input X (MULTIPLIER)
    input [32:1] XH;
    input [32:1] XL;

    // Control Signal
    input [2:1] CONTROL;

    // High and Low Order Output Bits
    output [64:1] OUT_HIGH;
    output [64:1] OUT_LOW;

INTERMULT multiplier_1
(MULT_1_IN_LOW, MULT_1_IN_HIGH, MULT_1_OUT_LOW, MULT_1_OUT_HIGH);
INTERMULT multiplier_2
(MULT_2_IN_LOW, MULT_2_IN_HIGH, MULT_2_OUT_LOW, MULT_2_OUT_HIGH);
INTERMULT multiplier_3
(MULT_3_IN_LOW, MULT_3_IN_HIGH, MULT_3_OUT_LOW, MULT_3_OUT_HIGH);
INTERMULT multiplier_4
(MULT_4_IN_LOW, MULT_4_IN_HIGH, MULT_4_OUT_LOW, MULT_4_OUT_HIGH);

/**************************
The output of the Multipliers and the REDUCTION AND MAJORITY stage
**************************/

wire [64:1] RED_1_OUT_LOW;
wire [64:1] RED_1_OUT_HIGH;
wire [64:1] RED_2_OUT_LOW;
wire [64:1] RED_2_OUT_HIGH;
wire [64:1] RED_3_OUT_LOW;
wire [64:1] RED_3_OUT_HIGH;
wire [64:1] RED_4_OUT_LOW;
wire [64:1] RED_4_OUT_HIGH;
wire [64:1] MAJ_2_OUT_LOW;
wire [64:1] MAJ_2_OUT_HIGH;
wire [64:1] MAJ_3_OUT_LOW;
wire [64:1] MAJ_3_OUT_HIGH;
wire [64:1] MAJ_4_OUT_LOW;
wire [64:1] MAJ_4_OUT_HIGH;
wire [64:1] MAJ_LOW1;
wire [64:1] MAJ_LOW2;
wire [64:1] RED_LOW1;
wire [64:1] RED_LOW2;
wire [64:1] RED_HIGH1;
wire [64:1] RED_HIGH2;

/****************************************
The outputs of the multipliers are guided by the MUX_Reduction and MUX_Majority
multiplexers

MUX_Reduction (input1, input2, input3, input4, input5, input6, input7, input8,
output1, output2, output3, output4, output5, output6, output7,
output8, control)
****************************************/

// This MUX provides inputs to the Reduction block if control = 00
MUX_Reduction redmux (MULT_1_OUT_LOW, MULT_1_OUT_HIGH, MULT_2_OUT_LOW,
MULT_2_OUT_HIGH, MULT_3_OUT_LOW, MULT_3_OUT_HIGH, MULT_4_OUT_LOW,
MULT_4_OUT_HIGH, RED_1_OUT_LOW, RED_1_OUT_HIGH, RED_2_OUT_LOW, RED_2_OUT_HIGH,
RED_3_OUT_LOW, RED_3_OUT_HIGH, RED_4_OUT_LOW, RED_4_OUT_HIGH, CONTROL);

// This MUX provides inputs to the Majority block if control = 10
MUX_Majority majmux (MULT_2_OUT_LOW, MULT_2_OUT_HIGH, MULT_3_OUT_LOW,
MULT_3_OUT_HIGH, MULT_4_OUT_LOW, MULT_4_OUT_HIGH, MAJ_2_OUT_LOW, MAJ_2_OUT_HIGH,
MAJ_3_OUT_LOW, MAJ_3_OUT_HIGH, MAJ_4_OUT_LOW, MAJ_4_OUT_HIGH, CONTROL);
MAJORITY Majority_Voter(MAJ_2_OUT_LOW, MAJ_2_OUT_HIGH, MAJ_3_OUT_LOW, MAJ_3_OUT_HIGH, MAJ_4_OUT_LOW, MAJ_4_OUT_HIGH, MAJ_LOW1, MAJ_LOW2);

REDUCTION Reduction_6_to_2 (RED_1_OUT_LOW, RED_1_OUT_HIGH, RED_2_OUT_LOW, RED_2_OUT_HIGH, RED_3_OUT_LOW, RED_3_OUT_HIGH, RED_4_OUT_LOW, RED_4_OUT_HIGH, RED_LOW1, RED_HIGH1, RED_LOW2, RED_HIGH2);

// Register definition for the inputs to the two FAST_ADDERS
reg [64:1] FASTadder_LOW1;
reg [64:1] FASTadder_LOW2;
reg [64:1] FASTadder_HIGH1;
reg [64:1] FASTadder_HIGH2;

// carry over signal used between the two adder sections
wire CARRY_OVER, C_OUT;
wire DUMMY;
assign DUMMY = 1'b0;

wire [64:1] OUT_HIGH;
wire [64:1] OUT_LOW;

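// The inputs to the final fast adders are selected by the combinational always block below.
// Every signal read in the block is listed so that simulation matches the synthesized logic.

always@ (CONTROL or MULT_1_OUT_LOW or MULT_1_OUT_HIGH or MULT_4_OUT_LOW or
         MULT_4_OUT_HIGH or MAJ_LOW1 or MAJ_LOW2 or RED_LOW1 or RED_LOW2 or
         RED_HIGH1 or RED_HIGH2)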
begin
  if (CONTROL == 2'b01) // Dual Single Precision Mode
  begin
    FASTadder_LOW1 = MULT_4_OUT_LOW;
    FASTadder_LOW2 = MULT_4_OUT_HIGH;
    FASTadder_HIGH1 = MULT_1_OUT_LOW;
    FASTadder_HIGH2 = MULT_1_OUT_HIGH;
  end

  else if (CONTROL == 2'b10) // Single Precision Fault Tolerant
  begin
    FASTadder_LOW1 = MAJ_LOW1;
    FASTadder_LOW2 = MAJ_LOW2;
    FASTadder_HIGH1 = 64'b0;
    FASTadder_HIGH2 = 64'b0;
  end // 10

  else if (CONTROL == 2'b11) // Single Precision
  begin
    FASTadder_LOW1 = MULT_4_OUT_LOW;
    FASTadder_LOW2 = MULT_4_OUT_HIGH;
    FASTadder_HIGH1 = 64'b0;
    FASTadder_HIGH2 = 64'b0;
  end // 11

  else // Double Precision Mode (DEFAULT MODE)
  begin
    FASTadder_LOW1 = RED_LOW1;
    FASTadder_LOW2 = RED_LOW2;
    FASTadder_HIGH1 = RED_HIGH1;
    FASTadder_HIGH2 = RED_HIGH2;
  end // 00
end // always@ (CONTROL)
// instantiation of the two fast adder modules

FASTADDER
FinalLow(DUMMY, FASTadder_LOW1, FASTadder_LOW2, OUT_LOW, CARRY_OVER);

FASTADDER
FinalHigh(CARRY_OVER, FASTadder_HIGH1, FASTadder_HIGH2, OUT_HIGH, C_OUT);

endmodule // RecMult

module MUX3 (input1, input2, input_solo, output1, output2, control);
    input [1:0] control;
    input [31:0] input_solo, input1, input2;
    output [31:0] output1, output2;

    reg [31:0] output1, output2;
    wire [31:0] ZERO;
    assign ZERO=32'b0;

    always@ (input_solo or input1 or input2 or control)
    begin
        if (control == 2'b00)
            begin
                output1 = input1;
                output2 = input_solo;
            end
        else if (control == 2'b10)
            begin
                output1 = input2;
                output2 = input_solo;
            end
        else
            begin
                output1 = ZERO;
                output2 = ZERO;
            end
    end // always@ (input_solo or input1 or input2 or control)
endmodule // MUX3

module MUX2 (input_solo, input2, input3, output1, output2, control);
    input [1:0] control;
    input [31:0] input_solo, input2, input3;
    output [31:0] output1, output2;

    reg [31:0] output1, output2;
    wire [31:0] ZERO;
    assign ZERO=32'b0;

    always@ (input_solo or input2 or input3 or control)
    begin
        if (control == 2'b00)
            begin
                output1 = input_solo;
                output2 = input2;
            end
        else if (control == 2'b10)
            begin
                output1 = input_solo;
                output2 = input3;
            end
        else
            begin
                output1 = ZERO;
                output2 = ZERO;
            end
    end // always@ (input_solo or input2 or input3 or control)
endmodule // MUX2

module MUX1 (input1, input2, output1, output2, control);
  input [1:0] control;
  input [31:0] input1, input2;
  output [31:0] output1, output2;

  reg [31:0] output1, output2;
  wire [31:0] ZERO;
  assign ZERO=32'b0;

  always@ (input1 or input2 or control)
  begin
    if (control == 2'b00)
      begin
        output1 = input1;
        output2 = input2;
      end
    else if (control == 2'b01)
      begin
        output1 = input1;
        output2 = input2;
      end
    else
      begin
        output1 = ZERO;
        output2 = ZERO;
      end
  end // always@ (input1 or input2 or control)
endmodule // MUX1

/******************************************************************************
The OUTPUTMUX Modules
These mux cells are used to guide the outputs of the 4 multipliers
*******************************************************************************/

module MUX_Reduction (input1, input2, input3, input4, input5, input6, input7,
input8, output1, output2, output3, output4, output5, output6, output7, output8,
control);
  input [1:0] control;
  input [63:0] input1, input2, input3, input4, input5, input6, input7, input8;
  output [63:0] output1, output2, output3, output4, output5, output6, output7, output8;

  reg [63:0] output1, output2, output3, output4, output5, output6, output7, output8;
  wire [63:0] ZERO;
  assign ZERO = 64'b0;

  always@(input1 or input2 or input3 or input4 or input5 or input6 or input7 or input8 or control)
    begin
        if (control == 2'b00)
            begin
                output1 = input1;
                output2 = input2;
                output3 = input3;
                output4 = input4;
                output5 = input5;
                output6 = input6;
                output7 = input7;
                output8 = input8;
            end
        else
            begin
                output1 = ZERO;
                output2 = ZERO;
                output3 = ZERO;
                output4 = ZERO;
                output5 = ZERO;
                output6 = ZERO;
                output7 = ZERO;
                output8 = ZERO;
            end
    end // always@
endmodule // MUX_Reduction

module MUX_Majority (input1, input2, input3, input4, input5, input6,
                      output1, output2, output3, output4, output5, output6, control);

    input [1:0] control;
    input [63:0] input1, input2, input3, input4, input5, input6;
    output [63:0] output1, output2, output3, output4, output5, output6;

    reg [63:0] output1, output2, output3, output4, output5, output6;
    wire [63:0] ZERO;
    assign ZERO = 64'b0;

    always @(input1 or input2 or input3 or input4 or input5 or input6 or control)
        begin
            if (control == 2'b10)
                begin
                    output1 = input1;
                    output2 = input2;
                    output3 = input3;
                    output4 = input4;
                    output5 = input5;
                    output6 = input6;
                end
            else
                begin
                    output1 = ZERO;
                    output2 = ZERO;
                    output3 = ZERO;
                    output4 = ZERO;
                    output5 = ZERO;
                    output6 = ZERO;
                end // else: !if(control == 2'b10)
        end // always@ (input1 or input2 or input3 or input4 or input5 or input6 or control)
endmodule // MUX_Majority

/******************************************************************************
The INTERMULT module instantiates the DW02_multp_instance module
and truncates the 2 MSBs of each output, since they are sign bits,
and the tc signal should be set low.
*******************************************************************************/

module INTERMULT (in_low, in_high, out_low, out_high);
  input [31:0] in_low, in_high;
  output [63:0] out_low, out_high;

  wire [65:0] multp_low, multp_high;
  wire DUMMY;
  assign DUMMY = 1'b0;

  DW02_multp_instance inter_mult(in_low, in_high, DUMMY, multp_low, multp_high);

  assign out_low = multp_low [63:0];
  assign out_high = multp_high [63:0];
endmodule // INTERMULT

/******************************************************************************
This module is used to instantiate the partial product multiplier
supplied in the Synopsys Foundation Design Libraries.
*******************************************************************************/

module DW02_multp_instance (ina, inb, intc, prod0, prod1);

// parameter definition for a 32 bit multiplier
parameter a_width = 32;
parameter b_width = 32;
parameter out_width = 66;

  input [a_width-1 : 0] ina;
  input [b_width-1 : 0] inb;
  input intc;
  output [out_width-1:0] prod0, prod1;

// Instantiation of DW02_multp.v
DW02_multp #(a_width, b_width, out_width) U1
  (.a(ina), .b(inb), .tc(intc), .out0(prod0), .out1(prod1));
endmodule // DW02_multp_instance

module FASTADDER (C_IN, IN1, IN2, OUT, C_OUT);
    input [63:0] IN1, IN2;
    input C_IN;
    output [63:0] OUT;
    output C_OUT;

    reg [64:0] SUMOUT;
    reg C_OUT;
    reg [63:0] OUT;

    always@(IN1 or IN2 or C_IN)
    begin
        SUMOUT = IN1 + IN2 + C_IN;
        C_OUT = SUMOUT[64];
        OUT = SUMOUT[63:0];
    end
endmodule // FASTADDER
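
The carry chaining used by the FinalLow and FinalHigh instances in RecMult can be illustrated with a small simulation. The sketch below is not part of the thesis code: `fastadder_sketch` and `tb_fastadder_chain` are hypothetical names, and the adder is a behavioural stand-in that assumes only the FASTADDER port order listed above. It checks that two 64-bit adders linked through a carry-over signal reproduce a single 128-bit addition.

```verilog
// Hypothetical stand-in with the same port order as FASTADDER.
module fastadder_sketch (C_IN, IN1, IN2, OUT, C_OUT);
    input         C_IN;
    input  [63:0] IN1, IN2;
    output [63:0] OUT;
    output        C_OUT;
    // The 65-bit left-hand side keeps the carry out of bit 63.
    assign {C_OUT, OUT} = IN1 + IN2 + C_IN;
endmodule

// Hypothetical testbench: chain two 64-bit adders and compare with a 128-bit sum.
module tb_fastadder_chain;
    reg  [127:0] a, b;
    wire [63:0]  sum_lo, sum_hi;
    wire         carry_over, c_out;
    wire [127:0] expected = a + b;

    // Low half has its carry-in grounded; high half consumes the carry-over.
    fastadder_sketch lo (1'b0,       a[63:0],   b[63:0],   sum_lo, carry_over);
    fastadder_sketch hi (carry_over, a[127:64], b[127:64], sum_hi, c_out);

    initial begin
        // Operands chosen so that the low half overflows into the high half.
        a = {64'd1, 64'hFFFF_FFFF_FFFF_FFFF};
        b = {64'd2, 64'd1};
        #1;
        if ({sum_hi, sum_lo} !== expected)
            $display("FAIL: got %h, expected %h", {sum_hi, sum_lo}, expected);
        else
            $display("PASS: %h", {sum_hi, sum_lo});
        $finish;
    end
endmodule
```

In the single-precision modes the high adder inputs are zeroed, so only the low chain contributes to the result.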

module MAJORITY (low1, high1, low2, high2, low3, high3, output_low, output_high);
    input [64:1] low1, low2, low3;
    input [64:1] high1, high2, high3;

    output [64:1] output_low, output_high;
    reg [64:1] output_low, output_high;

    always@(low1 or low2 or low3 or high1 or high2 or high3)
    begin
        // The majority check performs an XNOR on inputs 1 and 2: if they agree,
        // input 1 is passed, else input 3 is passed as the final result.

        output_low[22] = (low1[22] ~^ low2[22]) ? low1[22] : low3[22];
        output_low[27] = (low1[27] ~^ low2[27]) ? low1[27] : low3[27];
        output_low[29] = (low1[29] ~^ low2[29]) ? low1[29] : low3[29];
        output_low[31] = (low1[31] ~^ low2[31]) ? low1[31] : low3[31];
        output_low[33] = (low1[33] ~^ low2[33]) ? low1[33] : low3[33];
        output_low[34] = (low1[34] ~^ low2[34]) ? low1[34] : low3[34];
        output_low[37] = (low1[37] ~^ low2[37]) ? low1[37] : low3[37];
        output_low[38] = (low1[38] ~^ low2[38]) ? low1[38] : low3[38];
        output_low[40] = (low1[40] ~^ low2[40]) ? low1[40] : low3[40];
        output_low[41] = (low1[41] ~^ low2[41]) ? low1[41] : low3[41];
        output_low[51] = (low1[51] ~^ low2[51]) ? low1[51] : low3[51];
        output_low[52] = (low1[52] ~^ low2[52]) ? low1[52] : low3[52];
        output_low[53] = (low1[53] ~^ low2[53]) ? low1[53] : low3[53];
        output_low[54] = (low1[54] ~^ low2[54]) ? low1[54] : low3[54];
        output_low[56] = (low1[56] ~^ low2[56]) ? low1[56] : low3[56];
        output_low[57] = (low1[57] ~^ low2[57]) ? low1[57] : low3[57];
        output_low[58] = (low1[58] ~^ low2[58]) ? low1[58] : low3[58];
        output_low[59] = (low1[59] ~^ low2[59]) ? low1[59] : low3[59];
        output_low[60] = (low1[60] ~^ low2[60]) ? low1[60] : low3[60];
        output_low[61] = (low1[61] ~^ low2[61]) ? low1[61] : low3[61];
        output_low[63] = (low1[63] ~^ low2[63]) ? low1[63] : low3[63];
        output_low[64] = (low1[64] ~^ low2[64]) ? low1[64] : low3[64];

        output_high[16] = (high1[16] ~^ high2[16]) ? high1[16] : high3[16];
        output_high[18] = (high1[18] ~^ high2[18]) ? high1[18] : high3[18];
        output_high[22] = (high1[22] ~^ high2[22]) ? high1[22] : high3[22];
        output_high[23] = (high1[23] ~^ high2[23]) ? high1[23] : high3[23];
        output_high[26] = (high1[26] ~^ high2[26]) ? high1[26] : high3[26];
        output_high[27] = (high1[27] ~^ high2[27]) ? high1[27] : high3[27];
        output_high[29] = (high1[29] ~^ high2[29]) ? high1[29] : high3[29];
        output_high[31] = (high1[31] ~^ high2[31]) ? high1[31] : high3[31];
        output_high[33] = (high1[33] ~^ high2[33]) ? high1[33] : high3[33];
        output_high[34] = (high1[34] ~^ high2[34]) ? high1[34] : high3[34];
        output_high[37] = (high1[37] ~^ high2[37]) ? high1[37] : high3[37];
        output_high[38] = (high1[38] ~^ high2[38]) ? high1[38] : high3[38];
        output_high[40] = (high1[40] ~^ high2[40]) ? high1[40] : high3[40];
        output_high[41] = (high1[41] ~^ high2[41]) ? high1[41] : high3[41];
        output_high[42] = (high1[42] ~^ high2[42]) ? high1[42] : high3[42];
        output_high[43] = (high1[43] ~^ high2[43]) ? high1[43] : high3[43];
        output_high[44] = (high1[44] ~^ high2[44]) ? high1[44] : high3[44];
        output_high[47] = (high1[47] ~^ high2[47]) ? high1[47] : high3[47];
        output_high[48] = (high1[48] ~^ high2[48]) ? high1[48] : high3[48];
        output_high[50] = (high1[50] ~^ high2[50]) ? high1[50] : high3[50];
        output_high[51] = (high1[51] ~^ high2[51]) ? high1[51] : high3[51];
        output_high[52] = (high1[52] ~^ high2[52]) ? high1[52] : high3[52];
        output_high[53] = (high1[53] ~^ high2[53]) ? high1[53] : high3[53];
        output_high[54] = (high1[54] ~^ high2[54]) ? high1[54] : high3[54];
        output_high[56] = (high1[56] ~^ high2[56]) ? high1[56] : high3[56];
        output_high[57] = (high1[57] ~^ high2[57]) ? high1[57] : high3[57];
        output_high[58] = (high1[58] ~^ high2[58]) ? high1[58] : high3[58];
        output_high[59] = (high1[59] ~^ high2[59]) ? high1[59] : high3[59];
        output_high[60] = (high1[60] ~^ high2[60]) ? high1[60] : high3[60];
        output_high[61] = (high1[61] ~^ high2[61]) ? high1[61] : high3[61];
        output_high[63] = (high1[63] ~^ high2[63]) ? high1[63] : high3[63];
        output_high[64] = (high1[64] ~^ high2[64]) ? high1[64] : high3[64];

end // always (low1 or low2 or low3 or high1 or high2 or high3)
endmodule // MAJORITY
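
The per-bit selection in the MAJORITY module can also be written once over whole vectors. The following is a minimal sketch under stated assumptions, not the thesis implementation: `majority_sketch`, `tb_majority_sketch`, and the signal names are hypothetical. It selects input 1 wherever inputs 1 and 2 agree and input 3 elsewhere, and compares the result against the usual two-out-of-three AND-OR form while a single bit fault is injected into one copy.

```verilog
// Hypothetical vector-wide voter; names are illustrative only.
module majority_sketch (in1, in2, in3, voted);
    input  [64:1] in1, in2, in3;
    output [64:1] voted;
    // agree[i] is 1 when in1[i] and in2[i] match (bitwise XNOR).
    wire   [64:1] agree = in1 ~^ in2;
    // Pass in1 where the first two copies agree, otherwise fall back to in3.
    assign voted = (agree & in1) | (~agree & in3);
endmodule

module tb_majority_sketch;
    reg  [64:1] in1, in2, in3;
    wire [64:1] voted;
    // Reference: classic two-out-of-three majority function.
    wire [64:1] golden = (in1 & in2) | (in1 & in3) | (in2 & in3);

    majority_sketch dut (in1, in2, in3, voted);

    initial begin
        in1 = 64'hDEAD_BEEF_0123_4567;
        in2 = in1;
        in3 = in1;
        in2[17] = ~in2[17];   // inject a single-bit fault into the second copy
        #1;
        if (voted !== in1 || voted !== golden)
            $display("FAIL: voted=%h golden=%h", voted, golden);
        else
            $display("PASS: fault masked, voted=%h", voted);
        $finish;
    end
endmodule
```

The two forms agree because, whenever the first two copies differ, the third copy decides the majority.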
The reduction module reduces the four 64-bit products, delivered in carry-save format by the intermediary multipliers, to two 128-bit values. Each 128-bit value is split into two 64-bit halves, so the output consists of four 64-bit sections. This module is active only in double-precision operation.
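
The alignment that the reduction stage relies on can be checked behaviourally. The sketch below is an illustration only and does not model the carry-save structure: the testbench name, the signal names, and the pairing of operand halves to products are assumptions. It confirms that four 32x32 products placed at bit offsets 0, 32, 32 and 64 add up to the full 64x64 product, which matches the bit positions noted in the REDUCTION module comments (in1 at 1 to 64, in2 and in3 at 33 to 96, in4 at 65 to 128).

```verilog
// Hypothetical behavioural check of the partial-product alignment.
module tb_alignment_sketch;
    reg  [63:0] a, x;
    wire [31:0] ah = a[63:32], al = a[31:0];
    wire [31:0] xh = x[63:32], xl = x[31:0];

    // Four 32x32 products; the 64-bit targets keep every product bit.
    wire [63:0] p_ll = al * xl;   // weight 2^0
    wire [63:0] p_lh = al * xh;   // weight 2^32
    wire [63:0] p_hl = ah * xl;   // weight 2^32
    wire [63:0] p_hh = ah * xh;   // weight 2^64

    // Place each product at its offset inside a 128-bit word and add.
    wire [127:0] recombined = {64'b0, p_ll}
                            + ({64'b0, p_lh} << 32)
                            + ({64'b0, p_hl} << 32)
                            + {p_hh, 64'b0};
    wire [127:0] direct = a * x;

    initial begin
        a = 64'hFFFF_FFFF_FFFF_FFFF;
        x = 64'h1234_5678_9ABC_DEF0;
        #1;
        if (recombined !== direct)
            $display("FAIL: %h vs %h", recombined, direct);
        else
            $display("PASS: %h", direct);
        $finish;
    end
endmodule
```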

The RED_BLOCK1 module is the first reduction module used in the REDUCTION module

```verilog
module RED_BLOCK1 (C2,C1,C0,S2,S1,S0,COUT1,COUT2,COUT3,CP,SP);
    input  C2,C1,C0,S2,S1,S0;
    output COUT1,COUT2,COUT3,CP,SP;
    wire   FA1_SUM;
    wire   FA2_SUM;
    FULLADDER fa1(S2,S1,S0,FA1_SUM,COUT1);
    FULLADDER fa2(C2,C1,C0,FA2_SUM,COUT2);
    HALFADDER fa3(FA1_SUM,FA2_SUM,SP,COUT3);
    assign CP = 1'b0;
endmodule // RED_BLOCK1
```

The RED_BLOCK6 module is used as part of the REDUCTION module

```verilog
module RED_BLOCK6 (C2,C1,C0,S2,S1,S0,CIN1,CIN2,CIN3,COUT1,COUT2,COUT3,CP,SP);
    input  C2,C1,C0,S2,S1,S0,CIN1,CIN2,CIN3;
    output COUT1,COUT2,COUT3,CP,SP;
    wire   FA1_SUM;
    wire   FA2_SUM;
    wire   FA3_SUM;
    FULLADDER fa1(S2,S1,S0,FA1_SUM,COUT1);
    FULLADDER fa2(C2,C1,C0,FA2_SUM,COUT2);
    FULLADDER fa3(CIN1,FA1_SUM,FA2_SUM,FA3_SUM,COUT3);
    FULLADDER fa4(CIN2,CIN3,FA3_SUM,SP,CP);
endmodule // RED_BLOCK6
```

The RED_BLOCK2 module is used as part of the REDUCTION module

```verilog
module RED_BLOCK2 (S1,S0,CIN1,CIN2,CIN3,COUT,CP,SP);
    input  S1,S0,CIN1,CIN2,CIN3;
    output COUT,CP,SP;
    wire   FA1_SUM;
    FULLADDER fa1(S1,S0,CIN1,FA1_SUM,COUT);
    FULLADDER fa2(CIN2,CIN3,FA1_SUM,SP,CP);
endmodule // RED_BLOCK2
```
/*********************************************************
 The FULLADDER module is used as part of the REDUCTION module
 *********************************************************/
module FULLADDER (a, b, c, SUM, CARRY);

  input  a;
  input  b;
  input  c;
  output SUM;
  output CARRY;

  reg    SUM, CARRY;

  always @(a or b or c)
  begin
    CARRY = (a & b) | (a & c) | (b & c);
    SUM = (a ^ b ^ c);
  end // always @(a or b or c)
endmodule // FULLADDER

/*********************************************************
 The HALFADDER module is used as part of the REDUCTION module
 *********************************************************/
module HALFADDER (a, b, SUM, CARRY);

  input  a;
  input  b;
  output SUM;
  output CARRY;

  reg    SUM, CARRY;

  always @(a or b)
  begin
    CARRY = (a & b);
    SUM = (a ^ b);
  end // always @(a or b)
endmodule // HALFADDER

/*********************************************************
 The REDUCTION module definition
 *********************************************************/
module REDUCTION (in1_L, in1_H, in2_L, in2_H, in3_L, in3_H, in4_L, in4_H,
                  out1_L, out1_H, out2_L, out2_H);

  input [64:1] in1_L;
  input [64:1] in1_H;
  input [64:1] in2_L;
  input [64:1] in2_H;
  input [64:1] in3_L;
  input [64:1] in3_H;
  input [128:65] in4_L;
  input [128:65] in4_H;

  output [64:1] out1_L;
output [64:1] out1_H;
output [64:1] out2_L;
output [64:1] out2_H;

// in1 is in the lowest bit position (from 1 to 64)
// in2 and in3 are from bit positions 33 to 96
// in4 is in the highest bit position (from 65 to 128)

// these two registers are used to hold the final output values of the reduction
// circuit and then to pass them directly onto the output pins of the module

wire [128:1] redL;
wire [128:0] redH;
wire [96:33] C1, C2, C3;
wire COUT;

// calling the 6 input reduction block module
// module RED_BLOCK6 (C2,C1,C0,S2,S1,S0,CIN1,CIN2,CIN3,COUT1,COUT2,COUT3,CP,SP);

// the first block has no carry inputs, so its port list matches RED_BLOCK1
RED_BLOCK1
block1 (in1_L[33], in1_H[33], in2_L[1], in2_H[1], in3_L[1], in3_H[1], C1[33], C2[33], C3[33], redH[33], redL[33]);
RED_BLOCK6
block2 (in1_L[34], in1_H[34], in2_L[2], in2_H[2], in3_L[2], in3_H[2], C1[33], C2[33], C3[33], C1[34], C2[34], C3[34], redH[34], redL[34]);
RED_BLOCK6
block3 (in1_L[35], in1_H[35], in2_L[3], in2_H[3], in3_L[3], in3_H[3], C1[34], C2[34], C3[34], C1[35], C2[35], C3[35], redH[35], redL[35]);
RED_BLOCK6
block4 (in1_L[36], in1_H[36], in2_L[4], in2_H[4], in3_L[4], in3_H[4], C1[35], C2[35], C3[35], C1[36], C2[36], C3[36], redH[36], redL[36]);
RED_BLOCK6
block5 (in1_L[37], in1_H[37], in2_L[5], in2_H[5], in3_L[5], in3_H[5], C1[36], C2[36], C3[36], C1[37], C2[37], C3[37], redH[37], redL[37]);
RED_BLOCK6
block6 (in1_L[38], in1_H[38], in2_L[6], in2_H[6], in3_L[6], in3_H[6], C1[37], C2[37], C3[37], C1[38], C2[38], C3[38], redH[38], redL[38]);
RED_BLOCK6
block7 (in1_L[39], in1_H[39], in2_L[7], in2_H[7], in3_L[7], in3_H[7], C1[38], C2[38], C3[38], C1[39], C2[39], C3[39], redH[39], redL[39]);
RED_BLOCK6
block8 (in1_L[40], in1_H[40], in2_L[8], in2_H[8], in3_L[8], in3_H[8], C1[39], C2[39], C3[39], C1[40], C2[40], C3[40], redH[40], redL[40]);
RED_BLOCK6
block9 (in1_L[41], in1_H[41], in2_L[9], in2_H[9], in3_L[9], in3_H[9], C1[40], C2[40], C3[40], C1[41], C2[41], C3[41], redH[41], redL[41]);
RED_BLOCK6
block10 (in1_L[42], in1_H[42], in2_L[10], in2_H[10], in3_L[10], in3_H[10], C1[41], C2[41], C3[41], C1[42], C2[42], C3[42], redH[42], redL[42]);
RED_BLOCK6
block11 (in1_L[43], in1_H[43], in2_L[11], in2_H[11], in3_L[11], in3_H[11], C1[42], C2[42], C3[42], C1[43], C2[43], C3[43], redH[43], redL[43]);
RED_BLOCK6
block12 (in1_L[44], in1_H[44], in2_L[12], in2_H[12], in3_L[12], in3_H[12], C1[43], C2[43], C3[43], C1[44], C2[44], C3[44], redH[44], redL[44]);
RED_BLOCK6
block13 (in1_L[45], in1_H[45], in2_L[13], in2_H[13], in3_L[13], in3_H[13], C1[44], C2[44], C3[44], C1[45], C2[45], C3[45], redH[45], redL[45]);
RED_BLOCK6
block52 (in4_L[84], in4_H[84], in2_L[52], in2_H[52], in3_L[52], in3_H[52], C1[83], C2[83], C3[83], C1[84], C2[84], C3[84], redH[84], redL[84]);
RED_BLOCK6
block53 (in4_L[85], in4_H[85], in2_L[53], in2_H[53], in3_L[53], in3_H[53], C1[84], C2[84], C3[84], C1[85], C2[85], C3[85], redH[85], redL[85]);
RED_BLOCK6
block54 (in4_L[86], in4_H[86], in2_L[54], in2_H[54], in3_L[54], in3_H[54], C1[85], C2[85], C3[85], C1[86], C2[86], C3[86], redH[86], redL[86]);
RED_BLOCK6
block55 (in4_L[87], in4_H[87], in2_L[55], in2_H[55], in3_L[55], in3_H[55], C1[86], C2[86], C3[86], C1[87], C2[87], C3[87], redH[87], redL[87]);
RED_BLOCK6
block56 (in4_L[88], in4_H[88], in2_L[56], in2_H[56], in3_L[56], in3_H[56], C1[87], C2[87], C3[87], C1[88], C2[88], C3[88], redH[88], redL[88]);
RED_BLOCK6
block57 (in4_L[89], in4_H[89], in2_L[57], in2_H[57], in3_L[57], in3_H[57], C1[88], C2[88], C3[88], C1[89], C2[89], C3[89], redH[89], redL[89]);
RED_BLOCK6
block58 (in4_L[90], in4_H[90], in2_L[58], in2_H[58], in3_L[58], in3_H[58], C1[89], C2[89], C3[89], C1[90], C2[90], C3[90], redH[90], redL[90]);
RED_BLOCK6
block59 (in4_L[91], in4_H[91], in2_L[59], in2_H[59], in3_L[59], in3_H[59], C1[90], C2[90], C3[90], C1[91], C2[91], C3[91], redH[91], redL[91]);
RED_BLOCK6
block60 (in4_L[92], in4_H[92], in2_L[60], in2_H[60], in3_L[60], in3_H[60], C1[91], C2[91], C3[91], C1[92], C2[92], C3[92], redH[92], redL[92]);
RED_BLOCK6
block61 (in4_L[93], in4_H[93], in2_L[61], in2_H[61], in3_L[61], in3_H[61], C1[92], C2[92], C3[92], C1[93], C2[93], C3[93], redH[93], redL[93]);
RED_BLOCK6
block62 (in4_L[94], in4_H[94], in2_L[62], in2_H[62], in3_L[62], in3_H[62], C1[93], C2[93], C3[93], C1[94], C2[94], C3[94], redH[94], redL[94]);
RED_BLOCK6
block63 (in4_L[95], in4_H[95], in2_L[63], in2_H[63], in3_L[63], in3_H[63], C1[94], C2[94], C3[94], C1[95], C2[95], C3[95], redH[95], redL[95]);
RED_BLOCK6
block64 (in4_L[96], in4_H[96], in2_L[64], in2_H[64], in3_L[64], in3_H[64], C1[95], C2[95], C3[95], C1[96], C2[96], C3[96], redH[96], redL[96]);

// module RED_BLOCK2 (S1,S0,CIN1,CIN2,CIN3,COUT,CP,SP);
RED_BLOCK2
block65 (in4_L[97], in4_H[97], C1[96], C2[96], C3[96], COUT, redH[97], redL[97]);

// module FULLADDER (a,b,c,SUM,CARRY);
FULLADDER block66 (in4_L[98], in4_H[98], COUT, redL[98], redH[98]);

// module HALFADDER (a,b,SUM,CARRY);
HALFADDER block67 (in4_L[99], in4_H[99], redL[99], redH[99]);
HALFADDER block68 (in4_L[100], in4_H[100], redL[100], redH[100]);
HALFADDER block69 (in4_L[101], in4_H[101], redL[101], redH[101]);
HALFADDER block70 (in4_L[102], in4_H[102], redL[102], redH[102]);
HALFADDER block71 (in4_L[103], in4_H[103], redL[103], redH[103]);
HALFADDER block72 (in4_L[104], in4_H[104], redL[104], redH[104]);
HALFADDER block73 (in4_L[105], in4_H[105], redL[105], redH[105]);
HALFADDER block74 (in4_L[106], in4_H[106], redL[106], redH[106]);
HALFADDER block75 (in4_L[107], in4_H[107], redL[107], redH[107]);
HALFADDER block76 (in4_L[108], in4_H[108], redL[108], redH[108]);
HALFADDER block77(in4_L[109], in4_H[109], redL[109], redH[109]);
HALFADDER block78(in4_L[110], in4_H[110], redL[110], redH[110]);
HALFADDER block79(in4_L[111], in4_H[111], redL[111], redH[111]);
HALFADDER block80(in4_L[112], in4_H[112], redL[112], redH[112]);
HALFADDER block81(in4_L[113], in4_H[113], redL[113], redH[113]);
HALFADDER block82(in4_L[114], in4_H[114], redL[114], redH[114]);
HALFADDER block83(in4_L[115], in4_H[115], redL[115], redH[115]);
HALFADDER block84(in4_L[116], in4_H[116], redL[116], redH[116]);
HALFADDER block85(in4_L[117], in4_H[117], redL[117], redH[117]);
HALFADDER block86(in4_L[118], in4_H[118], redL[118], redH[118]);
HALFADDER block87(in4_L[119], in4_H[119], redL[119], redH[119]);
HALFADDER block88(in4_L[120], in4_H[120], redL[120], redH[120]);
HALFADDER block89(in4_L[121], in4_H[121], redL[121], redH[121]);
HALFADDER block90(in4_L[122], in4_H[122], redL[122], redH[122]);
HALFADDER block91(in4_L[123], in4_H[123], redL[123], redH[123]);
HALFADDER block92(in4_L[124], in4_H[124], redL[124], redH[124]);
HALFADDER block93(in4_L[125], in4_H[125], redL[125], redH[125]);
HALFADDER block94(in4_L[126], in4_H[126], redL[126], redH[126]);
HALFADDER block95(in4_L[127], in4_H[127], redL[127], redH[127]);
HALFADDER block96(in4_L[128], in4_H[128], redL[128], redH[128]);

// The first 32 bits are taken directly from the inputs
assign out1_L [64:1] = {redL [64:33] , in1_L [32:1]};
assign out1_H [64:1] = redL [128:65];

// There is no carry signal in the 32nd position thus, the signal is grounded.
assign out2_L [64:1] = {redH [63:33] , 1'b0 , in1_H [32:1]};
assign out2_H [64:1] = redH [127:64];
endmodule // REDUCTION
Appendix D
Recursive Multiplier Base-Multiplier Analysis

[Tables: base-multiplier analysis for k = 2, 4, 8, 16 and 32, followed by the recursive multiplier analysis. The cell values are not recoverable from the source.]

Appendix E
Component Breakdown of the Reconfigurable Multiplier

Reconfigurable Multiplier Breakdown from Synopsys:

Top I/O interface Module
****************************************************
Report : area
Design : RecursiveMultiplierV4
Version: 2001.08-SP2
Date : Tue Apr 1 12:47:38 2003
****************************************************

Number of ports: 260
Number of nets: 644
Number of cells: 385
Number of references: 4

Combinational area: 424568.093750
Noncombinational area: 18754.662109
Net Interconnect area: undefined (Wire load has zero net area)
Total cell area: 443322.750000
Total area: undefined

Cell Internal Power = 78.0747 mW (56%)
Net Switching Power = 61.5631 mW (44%)
Total Dynamic Power = 139.6377 mW (100%)
Cell Leakage Power = 25.9288 uW

****************************************************
Report : timing
-path full
-delay max
-max_paths 1
Design : RecursiveMultiplierV4
Version: 2001.08-SP2
Date : Tue Apr 1 12:47:44 2003
****************************************************

Point Incr Path
-------------------
OUT_HIGH_reg_64_/CK (DFPRQ1) 0.00 0.00 r
OUT_HIGH_reg_64_/O (DFPRQ1) 0.37 0.37 f
U449/E (BUFX032) 0.35 0.72 f
OUT_HIGH_64_ (out) 0.00 0.72 f
data arrival time 0.72

(Path is unconstrained)

RecMult Module (actual multiplier module)
*******************************************
Report : area
Design : RecMult
Version: 2001.08-SP2
Date : Wed Apr 2 13:58:29 2003
*******************************************

Number of ports: 258
Number of nets: 2544
Number of cells: 315
Number of references: 20

Combinational area: 381895.593750
Noncombinational area: 0.000000
Net Interconnect area: undefined (Wire load has zero net area)
Total cell area: 381895.593750
Total area: undefined

Cell Internal Power = 322.6490 mW (53%)
Net Switching Power = 290.7263 mW (47%)
Total Dynamic Power = 613.3752 mW (100%)
Cell Leakage Power = 20.4450 uW

******************************************************************************

Report : timing
-path full
-delay max
-max_paths 1
Design : RecMult
Version: 2001.08-SP2
Date : Wed Apr 2 10:43:41 2003
******************************************************************************

<table>
<thead>
<tr>
<th>Point</th>
<th>Incr</th>
<th>Path</th>
</tr>
</thead>
<tbody>
<tr>
<td>input external delay</td>
<td>0.00</td>
<td>0.00 f</td>
</tr>
<tr>
<td>CONTROL_2_(in)</td>
<td>0.00</td>
<td>0.00 f</td>
</tr>
<tr>
<td>muxi/control_1_ (MUX1)</td>
<td>0.00</td>
<td>0.00 f</td>
</tr>
<tr>
<td>muxi/U23/Z (INVDD1)</td>
<td>0.45</td>
<td>0.45 r</td>
</tr>
<tr>
<td>muxi/U35/Z (AND2D1)</td>
<td>0.29</td>
<td>0.73 r</td>
</tr>
<tr>
<td>muxi/output1_31_ (MUX1)</td>
<td>0.00</td>
<td>0.73 r</td>
</tr>
<tr>
<td>multiplier_1/in_low_31_ (INTERMULT_0)</td>
<td>0.00</td>
<td>0.73 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/ina_31_ (DW02_mult_instance_0)</td>
<td>0.00</td>
<td>0.73 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/a_31_ (DW02_mult_instance_0)</td>
<td>0.00</td>
<td>0.73 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U2_BC/A_31_ (DW02_mult_instance_0)</td>
<td>0.00</td>
<td>0.73 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U2_BC/EMC_15/c (DW02_mult_instance_0)</td>
<td>0.00</td>
<td>0.73 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U2_BC/EMC_15/U4/3 (EXOR2D1)</td>
<td>0.48</td>
<td>1.21 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U2_BC/EMC_15/U10/2 (INB2D1)</td>
<td>0.22</td>
<td>1.44 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U2_BC/EMC_15/shift1 (DW02_mult_instance_0)</td>
<td>0.00</td>
<td>1.44 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U2_BC/A_coded_46_ (DW02_mult_instance_0)</td>
<td>0.00</td>
<td>1.44 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U231/2 (BUFD1)</td>
<td>0.79</td>
<td>2.23 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U266/0 (NAND2D1)</td>
<td>0.14</td>
<td>2.37 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U1036/2 (NAND2D1)</td>
<td>0.16</td>
<td>2.53 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U1035/2 (EXOR2D1)</td>
<td>0.43</td>
<td>2.96 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U2_WT/PP_array_S15_ (DW02_mult_instance_0)</td>
<td>0.00</td>
<td>2.96 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U2_WT/U351/2 (ADFULD1)</td>
<td>0.29</td>
<td>3.26 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U2_WT/U36/2 (MOLCD2D1)</td>
<td>0.64</td>
<td>3.90 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U2_WT/U165/2 (ADFULD1)</td>
<td>0.44</td>
<td>4.97 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U2_WT/U322/2 (ADHALFD1)</td>
<td>0.41</td>
<td>5.39 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U2_WT/U144/2 (ADFULD1)</td>
<td>0.52</td>
<td>5.90 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/U2_WT/out0_36_ (DW02_mult_instance_0)</td>
<td>0.00</td>
<td>5.90 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/U1/cut0_36_ (DW02_mult_instance_0)</td>
<td>0.00</td>
<td>5.90 r</td>
</tr>
<tr>
<td>multiplier_1/inter_mul/prod0_36_ (DW02_mult_instance_0)</td>
<td>0.00</td>
<td>5.90 r</td>
</tr>
<tr>
<td>multiplier_1/out_low_36_ (INTRMULT_0)</td>
<td>0.00</td>
<td>5.90 r</td>
</tr>
<tr>
<td>redmux/input_36_ (MUX_Reduction)</td>
<td>0.00</td>
<td>5.90 r</td>
</tr>
<tr>
<td>redmux/US46/2 (AR82D1)</td>
<td>0.18</td>
<td>6.09 f</td>
</tr>
<tr>
<td>redmux/output1_36_ (MUX_Reduction)</td>
<td>0.00</td>
<td>6.09 f</td>
</tr>
<tr>
<td>Reduction_6_to_2/ini_1_37_ (REDUCTION)</td>
<td>0.00</td>
<td>6.09 f</td>
</tr>
<tr>
<td>Reduction_6_to_2/block5/C2 (RED_BLOCKS_24)</td>
<td>0.00</td>
<td>6.09 f</td>
</tr>
<tr>
<td>Reduction_6_to_2/block5/f2a/a (FULLADDER_98)</td>
<td>0.00</td>
<td>6.09 f</td>
</tr>
<tr>
<td>Reduction_6_to_2/block5/f2a/08/2 (EXOR2D1)</td>
<td>0.51</td>
<td>6.60 f</td>
</tr>
<tr>
<td>Reduction_6_to_2/block5/f2a/SUM (FULLADDER_98)</td>
<td>0.00</td>
<td>6.60 f</td>
</tr>
<tr>
<td>Reduction_6_to_2/block5/f2a/c (FULLADDER_97)</td>
<td>0.00</td>
<td>6.60 f</td>
</tr>
<tr>
<td>Reduction_6_to_2/block5/f3a/07/2 (OR2D1)</td>
<td>0.23</td>
<td>6.82 f</td>
</tr>
<tr>
<td>Reduction_6_to_2/block5/f3a/09/2 (OA122D2D1)</td>
<td>0.31</td>
<td>7.13 f</td>
</tr>
<tr>
<td>Reduction_6_to_2/block5/f3a/CARRY (FULLADDER_97)</td>
<td>0.00</td>
<td>7.13 f</td>
</tr>
<tr>
<td>Reduction_6_to_2/block5/COVT3 (RED_BLOCKS_24)</td>
<td>0.00</td>
<td>7.13 f</td>
</tr>
<tr>
<td>Reduction_6_to_2/block5/CIN3 (RED_BLOCKS_24)</td>
<td>0.00</td>
<td>7.13 f</td>
</tr>
</tbody>
</table>

Component Breakdown of the Reconfigurable Multiplier


<table>
<thead>
<tr>
<th>Component</th>
<th>Type</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reduction_6_to_2/block6/f4/a/b</td>
<td>(FULLADDER_230)</td>
<td>0.00</td>
</tr>
<tr>
<td>Reduction_6_to_2/block6/f4/U/2</td>
<td>(EXOR2D0)</td>
<td>0.04</td>
</tr>
<tr>
<td>Reduction_6_to_2/block6/f4/SUM</td>
<td>(FULLADDER_230)</td>
<td>0.00</td>
</tr>
<tr>
<td>Reduction_6_to_2/block6/f4/5P</td>
<td>(RED_BLOCK5_57)</td>
<td>0.00</td>
</tr>
<tr>
<td>Reduction_6_to_2/out1_L_38</td>
<td>(REDUCTION)</td>
<td>0.00</td>
</tr>
<tr>
<td>U195/Z</td>
<td>(AG222D2)</td>
<td>0.20</td>
</tr>
<tr>
<td>FinalLow/IN1_37 (FASTADDER_1)</td>
<td></td>
<td>0.00</td>
</tr>
<tr>
<td>FinalLow/add519/B_37 (FASTADDER_1_DW01_add_65_1)</td>
<td>0.00</td>
<td>7.13 f</td>
</tr>
<tr>
<td>FinalLow/add519/U185/Z (INVDD1)</td>
<td></td>
<td>0.08</td>
</tr>
<tr>
<td>FinalLow/add519/U222/Z (NAND2D)</td>
<td></td>
<td>0.13</td>
</tr>
<tr>
<td>FinalLow/add519/U510/Z (INVDD1)</td>
<td></td>
<td>0.09</td>
</tr>
<tr>
<td>FinalLow/add519/U20/Z (NOR2D1)</td>
<td></td>
<td>0.18</td>
</tr>
<tr>
<td>FinalLow/add519/U195/Z (NOR2D2)</td>
<td></td>
<td>0.06</td>
</tr>
<tr>
<td>FinalLow/add519/U43/Z (NOR2D1)</td>
<td></td>
<td>0.07</td>
</tr>
<tr>
<td>FinalLow/add519/U42/Z (NOR2D2)</td>
<td></td>
<td>0.07</td>
</tr>
<tr>
<td>FinalLow/add519/U274/Z (NOR2D2)</td>
<td></td>
<td>0.14</td>
</tr>
<tr>
<td>FinalLow/add519/U273/Z (NOR2D2)</td>
<td></td>
<td>0.08</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn154152/Z (INVDD2)</td>
<td>0.10</td>
<td>8.91 r</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn155559/Z (NOR2D4)</td>
<td>0.07</td>
<td>8.98 f</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn155161/Z (NOR2D4)</td>
<td>0.08</td>
<td>9.06 r</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn155716/Z (NAN3MID2)</td>
<td>0.19</td>
<td>9.43 r</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn155718/Z (NAN3MID2)</td>
<td>0.08</td>
<td>9.51 f</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn155721/Z (INVDD2)</td>
<td>0.15</td>
<td>9.67 r</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn150776/Z (NOR3D4)</td>
<td>0.09</td>
<td>9.76 f</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn150778/Z (NOR2D4)</td>
<td>0.08</td>
<td>9.95 r</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn151092/Z (NOR2D4)</td>
<td>0.13</td>
<td>10.15 r</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn151097/Z (NOR2D4)</td>
<td>0.06</td>
<td>10.21 f</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn151004/Z (NOR2D4)</td>
<td>0.08</td>
<td>10.29 r</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn151006/Z (NOR2D4)</td>
<td>0.06</td>
<td>10.38 f</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn151008/Z (NOR2D4)</td>
<td>0.14</td>
<td>10.49 r</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn155739/Z (INVDD2)</td>
<td>0.06</td>
<td>10.54 f</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn154090_TEMP_0.2_0.3_3/Z (NAN2MID2)</td>
<td>0.16</td>
<td>10.70 r</td>
</tr>
<tr>
<td>FinalLow/add519/add519__cell_19613_syn154090_TEMP_0.2_0.3_3/Z (NAN2MID2)</td>
<td>0.00</td>
<td>10.70 r</td>
</tr>
<tr>
<td>FinalLow/add519/C_OUT (FASTADDER_1)</td>
<td></td>
<td>0.00</td>
</tr>
<tr>
<td>FinalHigh/C_IN (FASTADDER_0)</td>
<td></td>
<td>0.00</td>
</tr>
<tr>
<td>FinalHigh/add519/C1 (FASTADDER_0_DW01_add_65_1)</td>
<td>0.00</td>
<td>10.70 r</td>
</tr>
<tr>
<td>FinalHigh/add519/UI12/Z (INVDD2)</td>
<td></td>
<td>0.10</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn158921/Z (NOR3D4)</td>
<td>0.17</td>
<td>10.97 r</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn160706/Z (NOR2D4)</td>
<td>0.08</td>
<td>11.05 f</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn162381/Z (NAN2D4)</td>
<td>0.16</td>
<td>11.20 r</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn162347/Z (NAN3MID2)</td>
<td>0.12</td>
<td>11.33 f</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn162349/Z (NAN2D4)</td>
<td>0.19</td>
<td>11.52 r</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn162316/Z (NAN2MID2)</td>
<td>0.10</td>
<td>11.62 r</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn162318/Z (NAN3MID2)</td>
<td>0.13</td>
<td>11.75 r</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn16484/Z (NAN2MID2)</td>
<td>0.09</td>
<td>11.84 r</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn16686/Z (NAN2D2)</td>
<td>0.10</td>
<td>11.94 r</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn16687/Z (INVDD2)</td>
<td>0.08</td>
<td>12.02 f</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn157192/Z (NOR2D4)</td>
<td>0.08</td>
<td>12.10 r</td>
</tr>
<tr>
<td>FinalHigh/add519/us87/Z (INVDD2)</td>
<td></td>
<td>0.06</td>
</tr>
<tr>
<td>FinalHigh/add519/us86/Z (NAN2D2)</td>
<td></td>
<td>0.11</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn161703/Z (NAN2D2)</td>
<td>0.07</td>
<td>12.34 f</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn161705/Z (NAN2D2)</td>
<td>0.12</td>
<td>12.46 r</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn161711/Z (NAN2D2)</td>
<td>0.07</td>
<td>12.54 f</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn161713/Z (NAN2D2)</td>
<td>0.10</td>
<td>12.64 r</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn161714/Z (INVDD2)</td>
<td>0.08</td>
<td>12.72 f</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn172212/Z (NOR2D4)</td>
<td>0.10</td>
<td>12.82 r</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn157218/Z (INVDD4)</td>
<td>0.06</td>
<td>12.86 f</td>
</tr>
<tr>
<td>FinalHigh/add519/0105/Z (NAN2D2)</td>
<td></td>
<td>0.09</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn15723/Z (INVDD2)</td>
<td>0.08</td>
<td>13.03 f</td>
</tr>
<tr>
<td>FinalHigh/add519/_cell_20115_syn157223/Z (NOR2D4)</td>
<td>0.08</td>
<td>13.11 r</td>
</tr>
<tr>
<td>FinalHigh/add519/o355/Z (INVDD2)</td>
<td></td>
<td>0.05</td>
</tr>
</tbody>
</table>

Fast Adder Module:
*******************************************************************************
Report : area
Design : FASTADDER_1
Version: 2001.08-SP2
Date : Tue Apr 1 17:55:05 2003
*******************************************************************************
Number of ports: 194
Number of nets: 194
Number of cells: 1
Number of references: 1

Combinational area: 10944.590629
Noncombinational area: 0.000000
Net Interconnect area: undefined (Wire load has zero net area)
Total cell area: 10944.590629
Total area: undefined

Cell Internal Power = 9.4226 mW (48%)
Net Switching Power = 10.3596 mW (52%)
---------
Total Dynamic Power = 19.7821 mW (100%)
Cell Leakage Power = 502.1571 nW

*******************************************************************************
Report : timing
-path full
-delay max
-max_paths 1
Design : FASTADDER_1
Version: 2001.08-SP2
Date : Tue Apr 1 17:55:12 2003
*******************************************************************************
Point        Incr     Path
input external delay
IBI_0 (in)    0.00      0.00 f
add_S19/B_0 (FASTADDER_1_DW01_add_65_1) 0.00 0.00 f
add_S19/U494/Z (NAN2D1) 0.12 0.12 r
add_S19/U494/Z (INV301) 0.09 0.21 f
add_S19/U203/Z (NAN2D1) 0.09 0.30 r
add_S19/U177/Z (AND2D1) 0.23 0.53 r
add_S19/U111/Z (NAN2D1) 0.08 0.61 f
add_S19/U410/Z (NAN2D1) 0.20 0.81 r
add_S19/U215/Z (OR2D0) 0.21 1.02 r
add_S19/U413/Z (NAN3D0) 0.14 1.16 f
add_S19/add_S19_cell_19613_syn155045/Z (NAN3M1D2) 0.10 1.26 r
add_S19/U135/Z (NAN2D1) 0.08 1.34 r
add_S19/U62/Z (NAN3M1D1) 0.14 1.48 r
add_S19/U619/Z (AND2D0) 0.32 1.80 r
add_S19/add_S19_cell_19613_syn151301/Z (NOR3D2) 0.08 1.88 f
add_S19/U175/Z (NOR2D1) 0.18 2.05 r
add_S19/U174/Z (NOR3D2) 0.07 2.12 f
add_S19/U475/Z (INV301) 0.09 2.21 r
add_S19/U474/Z (NAN2D1) 0.08 2.29 f
add_S19/U473/Z (INV301) 0.17 2.46 r
add_S19/add_S19_cell_19613_syn151324/Z (NOR3D2) 0.10 2.55 f
add_S19/add_S19_cell_19613_syn151326/Z (NOR2D4) 0.09 2.64 r
add_S19/U127/Z (NOR2D2) 0.05 2.69 f
add_S19/U465/Z (INV301) 0.07 2.77 r
add_S19/U631/Z (AND2D0) 0.35 3.12 r
add_S19/add_S19_cell_19613_syn151420/Z (NOR3D2) 0.11 3.23 f
add_S19/add_S19_cell_19613_syn151422/Z (NOR2D4) 0.21 3.44 r
add_S19/add_S19_cell_19613_syn151442/Z (NOR3D4) 0.07 3.51 f
add_S19/add_S19_cell_19613_syn151444/Z (NOR2D4) 0.10 3.62 r
add_S19/add_S19_cell_19613_syn151445/Z (NOR2D4) 0.06 3.68 f

Majority Voter Module:

Report : area
Design : MAJORITY
Version: 2001.08-SP2
Date : Tue Apr 1 17:52:00 2003

Number of ports: 512
Number of nets: 640
Number of cells: 256
Number of references: 2

Combinational area: 5724.288086
Noncombinational area: 0.000000
Net Interconnect area: undefined (Wire load has zero net area)
Total cell area: 5724.288086
Total area: undefined

Cell Internal Power = 5.7532 mW (52%)
Net Switching Power = 5.2250 mW (48%)

Total Dynamic Power = 10.9782 mW (100%)
Cell Leakage Power = 298.5331 nW

6:2 Reduction Module:

Report : area
Design : REDUCTION
Version: 2001.08-SP2
Date : Tue Apr 1 17:56:13 2003

Number of ports: 768
Number of nets: 895
Number of cells: 96
Number of references: 96

Combinational area: 28670.464844
Noncombinational area: 0.000000
Net Interconnect area: undefined (Wire load has zero net area)
Total cell area: 28670.464844
Total area: undefined

Cell Internal Power = 31.7210 mW (61%)
Net Switching Power = 20.0382 mW (39%)

Total Dynamic Power = 51.7592 mW (100%)
Cell Leakage Power = 1.7344 uW

******************************************************************************
Report: timing
-path full
-delay max
-max_paths 1
Design: REDUCTION
Version: 2001.08-SP2
Date: Tue Apr 1 17:56:27 2003
******************************************************************************

<table>
<thead>
<tr>
<th>Point</th>
<th>Incr</th>
<th>Path</th>
</tr>
</thead>
<tbody>
<tr>
<td>input external delay</td>
<td>0.00</td>
<td>0.00 f</td>
</tr>
<tr>
<td>in1_L_37 (in)</td>
<td>0.00</td>
<td>0.00 f</td>
</tr>
<tr>
<td>block5/faz2/s (FULLADDER_98)</td>
<td>0.00</td>
<td>0.00 f</td>
</tr>
<tr>
<td>block5/faz2/18/9 (EXOR3D1)</td>
<td>0.48</td>
<td>0.48 f</td>
</tr>
<tr>
<td>block5/faz2/SUM (FULLADDER_98)</td>
<td>0.00</td>
<td>0.48 f</td>
</tr>
<tr>
<td>block5/faz2/c (FULLADDER_97)</td>
<td>0.00</td>
<td>0.48 f</td>
</tr>
<tr>
<td>block5/faz2/17/9 (OR2D1)</td>
<td>0.22</td>
<td>0.69 f</td>
</tr>
<tr>
<td>block5/faz2/18/0 (OA122M22D1)</td>
<td>0.29</td>
<td>0.99 f</td>
</tr>
<tr>
<td>block5/faz2/CARRY (FULLADDER_97)</td>
<td>0.00</td>
<td>0.99 f</td>
</tr>
<tr>
<td>block5/COUNT (RED_BLOCK6_24)</td>
<td>0.00</td>
<td>0.99 f</td>
</tr>
<tr>
<td>block5/COUNT (RED_BLOCK6_24)</td>
<td>0.00</td>
<td>0.99 f</td>
</tr>
<tr>
<td>block5/faz2/18 (FULLADDER_230)</td>
<td>0.00</td>
<td>0.99 f</td>
</tr>
<tr>
<td>block5/faz2/18/9 (EXOR3D1)</td>
<td>0.48</td>
<td>1.46 f</td>
</tr>
<tr>
<td>block5/faz2/SUM (FULLADDER_230)</td>
<td>0.00</td>
<td>1.46 f</td>
</tr>
<tr>
<td>block5/faz2/18 (FULLadder_57)</td>
<td>0.00</td>
<td>1.46 f</td>
</tr>
<tr>
<td>out1_L_38 (out)</td>
<td>0.00</td>
<td>1.46 f</td>
</tr>
<tr>
<td>data arrival time</td>
<td></td>
<td>1.46</td>
</tr>
</tbody>
</table>

6:2 Reduction Cell MUX Module:
******************************************************************************
Report: area
Design: MUX_Reduction
Version: 2001.08-SP2
Date: Wed Apr 2 12:53:26 2003
******************************************************************************

Number of ports: 1026
Number of nets: 1075
Number of cells: 561
Number of references: 5

Combinational area: 9025.444336
Noncombinational area: 0.000000
Net Interconnect area: undefined (Wire load has zero net area)
Total cell area: 9025.444336
Total area: undefined

Cell Internal Power = 5.1077 mW (44%)
Net Switching Power = 6.4080 mW (56%)

Total Dynamic Power = 11.5157 mW (100%)
Cell Leakage Power = 514.8147 nW

******************************************************************************
Report: timing
-path full
-delay max
-max_paths 1
Design: MUX_Reduction
Version: 2001.08-SP2
Date: Wed Apr 2 12:53:46 2003

Majority Voter MUX Module:

Report : area
Design : MUX_Majority
Version: 2001.08-SP2
Date : Wed Apr 2 12:54:02 2003

Number of ports: 770
Number of nets: 807
Number of cells: 421
Number of references: 5

Combinational area: 6726.423828
Noncombinational area: 0.000000
Net Interconnect area: undefined (Wire load has zero net area)
Total cell area: 6726.423828
Total area: undefined

Cell Internal Power = 3.7859 mW (44%)
Net Switching Power = 4.7839 mW (56%)

Total Dynamic Power = 8.5698 mW (100%)
Cell Leakage Power = 378.6657 nW

Report : timing
-path full
-delay max
-max_paths 1
Design : MUX_Majority
Version: 2001.08-SP2
Date : Wed Apr 2 12:54:16 2003

<table>
<thead>
<tr>
<th>Point</th>
<th>Incr</th>
<th>Path</th>
</tr>
</thead>
<tbody>
<tr>
<td>input external delay</td>
<td>0.00</td>
<td>0.00 f</td>
</tr>
<tr>
<td>control_0_ (in)</td>
<td>0.00</td>
<td>0.00 f</td>
</tr>
<tr>
<td>U518/2 (OR2D1)</td>
<td>0.21</td>
<td>0.21 f</td>
</tr>
<tr>
<td>U539/2 (INVD1)</td>
<td>0.27</td>
<td>0.27 r</td>
</tr>
<tr>
<td>U542/2 (BUF4D1)</td>
<td>0.22</td>
<td>0.22 r</td>
</tr>
<tr>
<td>U403/2 (BUF4D1)</td>
<td>0.59</td>
<td>1.19 r</td>
</tr>
<tr>
<td>U402/2 (AND2D1)</td>
<td>0.19</td>
<td>1.39 r</td>
</tr>
<tr>
<td>output3_31_ (out)</td>
<td>0.00</td>
<td>1.39 r</td>
</tr>
<tr>
<td>data arrival time</td>
<td></td>
<td>1.39</td>
</tr>
</tbody>
</table>

(Path is unconstrained)
Appendix F
Simulation Reports and Logs

RECONFIGURABLE MULTIPLIER:

********************************************************************
SILICON_ENSEMBLE DESIGN SUMMARY REPORT
********************************************************************
Time: 12:45:12, 1 April 2003
Design name: RecursiveMultiplierV4
Report file name: RecursiveMultiplierV4.summary

Number of macros: 308
Number of components: 17158
Number of pins: 87986
  Number of regular pins: 52424
  Number of special pins: 35332
  Number of unused pins: 230
Number of nets: 18908
Average number of pins per net: 4.65
Number of subnets: 486
  Number of regular pins for subnets: 257
  Number of special pins for subnets: 0
  Number of virtual pins for subnets: 715
Average number of pins per subnet: 2.00
Number of routing tracks available: 2788
Number of GCELLS per layer: 19328
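
The pin figures in the summary above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that regular, special, and unused pins partition the total pin count, and that the reported average is simply total pins divided by nets. Neither assumption is stated in the report itself, and the variable names are hypothetical.

```python
# Figures copied from the RecursiveMultiplierV4 design summary.
total_pins = 87986
special_pins = 35332
unused_pins = 230
nets = 18908

# Assumption: regular pins are whatever remains after special and unused pins.
regular_pins = total_pins - special_pins - unused_pins

# Assumption: the reported average is total pins over total nets.
avg_pins_per_net = total_pins / nets

print(f"regular pins      : {regular_pins}")
print(f"avg pins per net  : {avg_pins_per_net:.2f}")
```

Under these assumptions the average agrees with the 4.65 pins per net reported by the tool.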

********************************************************************
SILICON_ENSEMBLE DESIGN SUMMARY REPORT
********************************************************************
Time: 12:45:13, 1 April 2003
Design name: RecursiveMultiplierV4
Report file name: RecursiveMultiplierV4.summary

** NET STATISTICS OF PIN COUNTS

Number of 2-pin nets: 14641
Number of 3-pin nets: 1078
Number of 4-pin nets: 960
Number of 5-pin nets: 64
Number of 6-pin nets: 22
Number of 7-pin nets: 26
Number of 8-pin nets: 22
Number of 9-pin nets: 42
Number of 10-pin nets: 148
Number of 11-pin nets: 137
Number of 12-pin nets: 139
Number of 13-pin nets: 214
Number of 14-pin nets: 24
Number of 15-pin nets: 67
Number of 16-pin nets: 197
Number of 144-pin nets: 1
Number of 257-pin nets: 2
Number of 16667-pin nets: 2

Time: 12:45:13, 1 April 2003
Design name: RecursiveMultiplierV4
Report file name: RecursiveMultiplierV4.summary

** UTILIZATION OF ALL ROW TYPES

<table>
<thead>
<tr>
<th>Type</th>
<th>Number</th>
<th>Length</th>
<th>Area &amp; Row_Space</th>
</tr>
</thead>
<tbody>
<tr>
<td>core Rows</td>
<td>120</td>
<td>89258400</td>
<td>549831744000</td>
</tr>
<tr>
<td>core Cells</td>
<td>16666</td>
<td>71967720</td>
<td>443321155200</td>
</tr>
</tbody>
</table>

Area of chip: 713332065600 (square DBU)
Area required for all cells: 443321155200 (square DBU)
Area utilization of all cells: 62.15%
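
The utilization figure can be reproduced from the row table. The following is a minimal sketch, assuming that every core row has the same height (so that area equals length times row height) and that utilization is cell area divided by chip area; the variable names are hypothetical and the values are in database units (DBU).

```python
# Values taken from the row-utilization table and the chip area line.
row_length = 89258400       # total length of core rows (DBU)
row_area = 549831744000     # total area of core rows (square DBU)
cell_length = 71967720      # total length occupied by core cells (DBU)
chip_area = 713332065600    # area of chip (square DBU)

# Assumption: all rows share one height, so height = area / length.
row_height = row_area // row_length

# Cell area follows from the occupied length and the common row height.
cell_area = cell_length * row_height

utilization = 100 * cell_area / chip_area
print(f"row height  : {row_height} DBU")
print(f"cell area   : {cell_area} square DBU")
print(f"utilization : {utilization:.2f}%")
```

With these inputs the computed utilization rounds to the 62.15% given in the report.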

********************************************************************
SILICON_ENSEMBLE WIRING REPORT
********************************************************************
Time: 12:45:59, 1 April 2003
Design name: RecursiveMultiplierV4
Report file name: RecursiveMultiplierV4.wires

** (only DETAILED wiring are reported for REGULAR nets)

Total vias in regular wiring: 141193
Total segments in regular wiring: 147400
Total vias in special wiring: 1250
Total segments in special wiring: 419

LAYER name: metal1
  Total wire length: 162348.96 microns
  Length of regular wires: 62947.68 microns
  Length of special wires: 99401.28 microns
LAYER name: metal2
  Total wire length: 278562.04 microns
  Length of regular wires: 262702.84 microns
  Length of special wires: 15859.20 microns
LAYER name: metal3
  Total wire length: 441000.74 microns
  Length of regular wires: 441000.74 microns
  Length of special wires: .00 microns
LAYER name: metal4
  Total wire length: 334782.50 microns
  Length of regular wires: 334782.50 microns
  Length of special wires: .00 microns
LAYER name: metal5
  Total wire length: 171144.40 microns
  Length of regular wires: 171144.40 microns
  Length of special wires: .00 microns
LAYER name: metal6
  Total wire length: 35992.52 microns
  Length of regular wires: 35992.52 microns
  Length of special wires: .00 microns

Total wirelength in regular wiring: 1308570.68 microns
Total wirelength in special wiring: 115260.48 microns
Total wirelength in regular+special wiring: 1423831.16 microns

CROSSTALK:
  0 nets claimed N coupling caps, but had different no.
  55 nets had more coupling than total capacitance
  0 pin lists with fewer pins than they said they had.
  0 nets with bizarre cap/unit length
  56 nets had implausible coupling ratios.
  2 wires had implausibly small Rs.
  134 rising drive and 134 falling drive implausibly small.
  18908 nets processed, total wire 1318432.0
  0 unknown nets encountered (maybe with repetitions)
  294 constant nets, length 83577.8, ( 6.3%)
  1 nets had errors, length 244.8, ( 0.0%)
Max crosstalk induced timing delta is 1.34ns, (0 > 1us)
Sum of all error voltages is 0.083 volts

Mult64:
Standard Multiplier Testbench (Booth Recoded Wallace Tree)

***********************************************************************************************************************
Report : area
Design : Mult64
Version: 2000.11-SP1
Date : Sat Nov 16 20:22:10 2002
***********************************************************************************************************************

Number of ports: 287
Number of nets: 641
Number of cells: 305
Number of references: 3

Combinational area: 358108.718750
Noncombinational area: 15611.904297
Net Interconnect area: undefined (Wire load has zero net area)
Total cell area: 373720.625000
Total area: undefined

Top Module:
Cell Internal Power = 22.9755 mW (16%)
Net Switching Power = 122.0531 mW (84%)

*************************************************************
<table>
<thead>
<tr>
<th>Point</th>
<th>Incr</th>
<th>Path</th>
</tr>
</thead>
<tbody>
<tr>
<td>clock clk (rise edge)</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>clock network delay (propagated)</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>A_bottom_reg_60_0/CK (DFEFFQ1)</td>
<td>0.00</td>
<td>0.00 r</td>
</tr>
<tr>
<td>A_bottom_reg_60_0/Q (DFEFFQ1)</td>
<td>0.45</td>
<td>0.45 r</td>
</tr>
<tr>
<td>mult/in_low_60_0 (Multiplier64)</td>
<td>0.00</td>
<td>0.45 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/A_59 (Multiplier64_DW02_mult_64_64_0)</td>
<td>0.00</td>
<td>0.45 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/A_59 (Multiplier64_DW02_multp_64_64_130_0)</td>
<td>0.00</td>
<td>0.45 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/A_59 (Multiplier64_DW02_booth_64_1_0)</td>
<td>0.00</td>
<td>0.45 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/ENC_30/0 (Multiplier64_DW_bthenc_8)</td>
<td>0.00</td>
<td>0.45 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/ENC_30/1 (EXOR2D1)</td>
<td>0.64</td>
<td>1.09 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/ENC_30/0 (EXOR2D1)</td>
<td>0.38</td>
<td>1.47 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/ENC_30/shift1 (Multiplier64_DW_bthenc_8)</td>
<td>0.00</td>
<td>1.47 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/A_coded_91 (Multiplier64_DW02_booth_64_1_0)</td>
<td>0.00</td>
<td>1.47 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/U5065/0 (BUFD1)</td>
<td>0.79</td>
<td>2.26 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/U58/0 (NAN2D1)</td>
<td>0.15</td>
<td>2.41 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/U2258/0 (NAN2D1)</td>
<td>0.16</td>
<td>2.57 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/U2258/0 (EXOR2D1)</td>
<td>0.49</td>
<td>3.06 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/UW/pp_array_1990 (Multiplier64_DW_mtree_64_64_0)</td>
<td>0.00</td>
<td>3.06 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/UW/U2_1_70_5/5 (ADFULD1)</td>
<td>0.53</td>
<td>3.59 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/UW/U77/2 (MIXB2D0)</td>
<td>0.71</td>
<td>4.30 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/UW/U4_1_70_0/3 (ADFULD1)</td>
<td>0.66</td>
<td>4.95 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/UW/U4_1_70_3/2 (ADFULD1)</td>
<td>0.52</td>
<td>5.47 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/UW/U4_1_70_4/1 (ADFULD1)</td>
<td>0.49</td>
<td>5.97 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/UW/U4_1_70_5/0 (ADFULD1)</td>
<td>0.54</td>
<td>6.50 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/UW/U1_0_0A/1_70_7/5 (ADFULD1)</td>
<td>0.52</td>
<td>7.02 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/UW/U1_0/0A/0_70_7/5 (ADFULD1)</td>
<td>0.52</td>
<td>7.54 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/UW/U1/0/0A/0_0/70_7/5 (Multiplier64_DW_mtree_64_64_0)</td>
<td>0.00</td>
<td>7.54 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/BC/UW/0_70 (Multiplier64_DW02_multp_64_64_130_0)</td>
<td>0.00</td>
<td>7.54 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/A_69 (Multiplier64_DW01_add_127_0)</td>
<td>0.00</td>
<td>7.54 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/0_1_2_69/2 (NOR2D1)</td>
<td>0.16</td>
<td>7.70 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/1_5_0_70/0 (NOR2D1)</td>
<td>0.20</td>
<td>7.90 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/1_5_1_70/0 (NAN2D1)</td>
<td>0.12</td>
<td>8.01 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/1_4_2_70/2 (OAI1D1)</td>
<td>0.29</td>
<td>8.30 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/1_4_3_78/2 (OAI1D1)</td>
<td>0.28</td>
<td>8.58 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/1_14/0 (OAI1D1)</td>
<td>0.85</td>
<td>9.43 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/1_4_5_95/2 (OAI1D1)</td>
<td>0.18</td>
<td>9.61 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/1_4_6_95/2 (OAI1D1)</td>
<td>0.17</td>
<td>9.78 r</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/0_5_96/2 (EXOR2D1)</td>
<td>0.56</td>
<td>10.34 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/0P/0 (Multiplier64_DW01_add_127_0)</td>
<td>0.00</td>
<td>10.34 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/PRODUCT_97 (Multiplier64_DW02_mult_64_64_0)</td>
<td>0.00</td>
<td>10.34 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/0P/0 (Multiplier64)</td>
<td>0.00</td>
<td>10.34 f</td>
</tr>
<tr>
<td>mult/mul_42/UI/ADD0/UI/0P/0 (DFEFFQ1)</td>
<td>0.00</td>
<td>10.34 f</td>
</tr>
<tr>
<td>data arrival time</td>
<td>10.34</td>
<td></td>
</tr>
<tr>
<td>clock clk (rise edge)</td>
<td>15.00</td>
<td>15.00</td>
</tr>
<tr>
<td>clock network delay (propagated)</td>
<td>0.00</td>
<td>15.00</td>
</tr>
<tr>
<td>clock uncertainty</td>
<td>-0.50</td>
<td>14.50</td>
</tr>
</tbody>
</table>
OUTL_reg_96_/CK (BPPFG1) 0.00 14.50 r
library setup time -0.17 14.33
data required time 14.33

----------
data required time 14.33
data arrival time -10.34

slack (MET) 3.99
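
The slack follows from the clock period, the clock uncertainty, the library setup time, and the data arrival time. The sketch below is a hedged illustration rather than the tool's own calculation: the function name `setup_slack` is hypothetical, and the 0.50 ns uncertainty is an assumption inferred from the step between the 15.00 ns clock edge and the 14.50 ns path value in the report.

```python
def setup_slack(period, uncertainty, setup, arrival):
    """Return setup slack in ns: data required time minus data arrival time."""
    required = period - uncertainty - setup
    return round(required - arrival, 2)


# Values read from the Mult64 timing report (ns); uncertainty is inferred.
slack = setup_slack(period=15.00, uncertainty=0.50, setup=0.17, arrival=10.34)
print(f"slack = {slack:.2f} ns")
```

A positive result means the path meets its constraint; a negative one would indicate a setup violation.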

********************************************************************
SILICON_ENSEMBLE DESIGN SUMMARY REPORT
********************************************************************
Time: 15:24:10, 31 March 2003
Design name: Mult64
Report file name: Mult64d_routed.summary

Number of macros: 281
Number of components: 13851
Number of pins: 70539
Number of regular pins: 43590
Number of special pins: 26714
Number of unused pins: 235
Number of nets: 15589
Average number of pins per net: 4.52

Number of subnets: 491
Number of regular pins for subnets: 257
Number of special pins for subnets: 0
Number of virtual pins for subnets: 725
Average number of pins per subnet: 2.00
Number of routing tracks available: 2896
Number of GCELLS per layer: 20861

********************************************************************
SILICON_ENSEMBLE DESIGN SUMMARY REPORT
********************************************************************
Time: 15:24:10, 31 March 2003
Design name: Mult64
Report file name: Mult64d_routed.summary

** UTILIZATION OF ALL ROW TYPES

<table>
<thead>
<tr>
<th>Type</th>
<th>Number</th>
<th>Length</th>
<th>Area % Row_Space</th>
</tr>
</thead>
<tbody>
<tr>
<td>core Rows</td>
<td>126</td>
<td>97879320</td>
<td>602936611200</td>
</tr>
<tr>
<td>core Cells</td>
<td>1357</td>
<td>60668520</td>
<td>373718083200</td>
</tr>
</tbody>
</table>

Area of chip: 769707892800 (square DBU)
Area required for all cells: 373718083200 (square DBU)
Area utilization of all cells: 48.55%

********************************************************************
SILICON_ENSEMBLE WIRING REPORT
********************************************************************
Time: 15:24:15, 31 March 2003
Design name: Mult64
Report file name: Mult64d_routed.wires

Total vias in regular wiring: 114122
Total segments in regular wiring: 120407
Total vias in special wiring: 32
Total segments in special wiring: 44

LAYER name: metal1
Total wire length: 70292.58 microns
Length of regular wires: 66979.36 microns
Length of special wires: 3313.20 microns
LAYER name: metal2
Total wire length: 318176.90 microns
Length of regular wires: 364434.02 microns
Length of special wires: 33242.88 microns

LAYER name: metal3
Total wire length: 510327.86 microns
Length of regular wires: 510327.86 microns
Length of special wires: 0.00 microns

LAYER name: metal4
Total wire length: 412856.72 microns
Length of regular wires: 412856.72 microns
Length of special wires: 0.00 microns

LAYER name: metal5
Total wire length: 181818.78 microns
Length of regular wires: 181818.78 microns
Length of special wires: 0.00 microns

LAYER name: metal6
Total wire length: 58916.16 microns
Length of regular wires: 58916.16 microns
Length of special wires: 0.00 microns

Total wirelength in regular wiring: 152632.92 microns
Total wirelength in special wiring: 36556.08 microns
Total wirelength in regular+special wiring: 155289.00 microns

CROSSTALK:
130 nets had no input C - consult log file for details
19 inconsistencies in HyperExtract output
  0 nets claimed N coupling caps, but had different no.
  19 nets had more coupling than total capacitance
  0 pin lists with fewer pins than they said they had.
  0 nets with bizarre cap/unit length
20 nets had implausible coupling ratios.
  2 wires had implausibly small Rs.
131 rising drive and 131 falling drive implausibly small.
15598 nets processed, total wire 1543183.4
  0 unknown nets encountered (maybe with repetitions)
100 constant nets, length 1155.3, (0.7%)
  0 nets had errors, length 0.0, (0.0%)
Max crosstalk induced timing delta is 2.22ns, (0 > 1us)
Sum of all error voltages is 0.000 volts

Mult64:
Standard Multiplier Testbench (Booth Recoded Wallace Tree)

******************************
Report : area
Design : Mult64
Version: 2001.08-SP2
Date : Wed Apr 2 13:53:12 2003
******************************

Number of ports: 267
Number of nets: 641
Number of cells: 385
Number of references: 4

Combinational area: 344678.187500
Noncombinational area: 15689.156250
Net Interconnect area: undefined (Wire load has zero net area)
Total cell area: 360367.343750
Total area: undefined

Mult64 top level:
  Cell Internal Power = 34.8979 mW (16%)
  Net Switching Power = 180.9021 mW (84%)
  Total Dynamic Power = 215.8000 mW (100%)
  Cell Leakage Power = 19.1866 uW

Multiplier64 (part of Mult64):
  Cell Internal Power = 417.9735 mW (57%)
  Net Switching Power = 309.5211 mW (43%)
  Total Dynamic Power = 727.4946 mW (100%)
  Cell Leakage Power = 13.6060 uW
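
The dynamic power totals and percentage splits above can be checked directly. This is a small sketch, assuming that total dynamic power is the sum of cell internal power and net switching power and that the percentages are rounded to whole numbers; `dynamic_power` is a hypothetical helper, not part of any tool.

```python
def dynamic_power(internal_mw, switching_mw):
    """Return (total mW, internal %, switching %) for one module."""
    total = internal_mw + switching_mw
    internal_pct = round(100 * internal_mw / total)
    switching_pct = round(100 * switching_mw / total)
    return round(total, 4), internal_pct, switching_pct


# Figures copied from the Mult64 power reports (mW).
top = dynamic_power(34.8979, 180.9021)     # Mult64 top level
core = dynamic_power(417.9735, 309.5211)   # Multiplier64 (part of Mult64)

print("Mult64 top level:", top)
print("Multiplier64    :", core)
```

Both results agree with the totals and percentage splits listed in the reports.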

Report : timing
-path full
-delay max
-max_paths 1
Design : Mult64
Version: 2001.08-SP2
Date : Wed Apr 2 13:57:02 2003

<table>
<thead>
<tr>
<th>Point</th>
<th>Incr</th>
<th>Path</th>
</tr>
</thead>
<tbody>
<tr>
<td>clock clk (rise edge)</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>clock network delay (ideal)</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>A_bottom_reg_16/CK (DFPQPI)</td>
<td>0.00</td>
<td>0.00 r</td>
</tr>
<tr>
<td>A_bottom_reg_16/Q (DFPQPI)</td>
<td>0.49</td>
<td>0.49 r</td>
</tr>
<tr>
<td>mult/mul_42/A/17 (Multiplier64)</td>
<td>0.00</td>
<td>0.49 r</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/a[17] (Multiplier64_DW02_multp_64_64_130_0)</td>
<td>0.00</td>
<td>0.49 r</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1_BC/A[17] (Multiplier64_DW02_booth_64_1_0)</td>
<td>0.00</td>
<td>0.49 r</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1_BC/ENC_8/c (Multiplier64_DW_bthenc)</td>
<td>0.00</td>
<td>0.49 r</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1_BC/ENC_8/U14/Z (INVDD1)</td>
<td>0.08</td>
<td>0.57 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1_BC/ENC_8/U12/Z (AND2D1)</td>
<td>0.17</td>
<td>0.74 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1_BC/ENC_8/U19/Z (M2D1)</td>
<td>0.39</td>
<td>1.13 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1_BC/ENC_8/shift1 (Multiplier64_DW_bthenc)</td>
<td>0.00</td>
<td>1.13 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1_BC/A_coded[25] (Multiplier64_DW02_booth_64_1_0)</td>
<td>0.00</td>
<td>1.13 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U10679/Z (INVDD1)</td>
<td>0.29</td>
<td>1.43 r</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U17173/Z (INVDD2)</td>
<td>0.11</td>
<td>1.53 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U12350/Z (OA123502D1)</td>
<td>0.31</td>
<td>1.84 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U0853/Z (EXOR2D1)</td>
<td>0.48</td>
<td>2.33 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1 WT/PP_array[578] (Multiplier64_DW_mtree_64_64_0)</td>
<td>0.00</td>
<td>2.33 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1 WT/U4_1_66_2/8 (ADPU0D1)</td>
<td>0.54</td>
<td>2.87 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1 WT/U4_1_66_3_6/8 (ADPU0D1)</td>
<td>0.53</td>
<td>3.41 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1 WT/U4_1_66_2_3/8 (ADPU0D1)</td>
<td>0.52</td>
<td>3.93 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1 WT/U4_1_66_3_2/8 (ADPU0D1)</td>
<td>0.52</td>
<td>4.44 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1 WT/U4_1_66_4_1/8 (ADPU0D1)</td>
<td>0.49</td>
<td>4.94 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1 WT/U4_1_66_5_1/8 (ADPU0D1)</td>
<td>0.54</td>
<td>5.47 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1 WT/U4_1_66_6_0/8 (ADPU0D1)</td>
<td>0.52</td>
<td>5.99 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1 WT/U4_1_66_7_0 (ADPU0D1)</td>
<td>0.54</td>
<td>6.53 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/U1 WT/out[66] (Multiplier64_DW_mtree_64_64_0)</td>
<td>0.00</td>
<td>6.53 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_MULTI1/a[65] (Multiplier64_DW01_add_127_0)</td>
<td>0.00</td>
<td>6.53 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_ADDI/U0_2_65/Z (NORX2D1)</td>
<td>0.30</td>
<td>6.83 r</td>
</tr>
<tr>
<td>mult/mul_42/U2_ADDI/U0_5_66/Z (NORX2D1)</td>
<td>0.12</td>
<td>6.96 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_ADDI/U1_4_1_66/Z (AOI2D1)</td>
<td>0.42</td>
<td>7.38 r</td>
</tr>
<tr>
<td>mult/mul_42/U2_ADDI/U1_4_2_70/Z (AOI2D1)</td>
<td>0.20</td>
<td>7.58 r</td>
</tr>
<tr>
<td>mult/mul_42/U2_ADDI/U1_4_3_78/Z (AOI2D1)</td>
<td>0.39</td>
<td>7.96 r</td>
</tr>
<tr>
<td>mult/mul_42/U2_ADDI/U1_4_2_78/Z (INVDD1)</td>
<td>0.08</td>
<td>8.04 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_ADDI/U0_7_1/Z (OA02D1)</td>
<td>0.41</td>
<td>8.44 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_ADDI/U1_4_5_96/Z (AOI2D1)</td>
<td>0.26</td>
<td>8.70 r</td>
</tr>
<tr>
<td>mult/mul_42/U2_ADDI/U1_4_6_95/Z (OA02D1)</td>
<td>0.25</td>
<td>8.94 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_ADDI/U0_5_96/Z (EXOR2D2)</td>
<td>0.40</td>
<td>9.35 f</td>
</tr>
<tr>
<td>mult/mul_42/U2_ADDI/SUM[98] (Multiplier64_DW01_add_127_0)</td>
<td>0.00</td>
<td>9.35 f</td>
</tr>
<tr>
<td>mult/mul_42/PRODUCT[97] (Multiplier64_DW02_multp_64_64_0)</td>
<td>0.00</td>
<td>9.35 f</td>
</tr>
<tr>
<td>mult/out[98] (Multiplier64)</td>
<td>0.00</td>
<td>9.35 f</td>
</tr>
<tr>
<td>OUT1_reg_98/D (DFPQPI)</td>
<td>0.00</td>
<td>9.35 f</td>
</tr>
<tr>
<td>data arrival time</td>
<td>9.35</td>
<td></td>
</tr>
<tr>
<td>clock clk (rise edge)</td>
<td>10.00</td>
<td>10.00</td>
</tr>
<tr>
<td>clock network delay (ideal)</td>
<td>0.00</td>
<td>10.00</td>
</tr>
<tr>
<td>clock uncertainty</td>
<td>-0.50</td>
<td>9.50</td>
</tr>
<tr>
<td>OUT1_reg_98/CK (DFPQPI)</td>
<td>0.00</td>
<td>9.50 r</td>
</tr>
<tr>
<td>library setup time</td>
<td>-0.15</td>
<td>9.35</td>
</tr>
<tr>
<td>data required time</td>
<td>9.35</td>
<td></td>
</tr>
<tr>
<td>data arrival time</td>
<td>-9.35</td>
<td></td>
</tr>
<tr>
<td>slack (MET)</td>
<td>0.00</td>
<td></td>
</tr>
</tbody>
</table>
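
The closing rows of the timing report follow the usual setup-check arithmetic: the data required time is the clock period less the clock uncertainty and the library setup time, and the slack is the data required time less the data arrival time. A minimal sketch of that calculation, using the values reported for this path (all in ns), is given below. The function name `setup_slack` and its parameter names are illustrative assumptions and are not part of the tool flow.

```python
from decimal import Decimal


def setup_slack(clock_period, clock_uncertainty, setup_time, data_arrival):
    """Return (data_required, slack) for a single-cycle setup check.

    Hypothetical helper: all arguments are Decimal values in ns.
    """
    # Latest time at which data may arrive at the capturing register.
    data_required = clock_period - clock_uncertainty - setup_time
    # Non-negative slack means the constraint is met.
    slack = data_required - data_arrival
    return data_required, slack


# Values taken from the Mult64 timing report above.
required, slack = setup_slack(
    clock_period=Decimal("10.00"),
    clock_uncertainty=Decimal("0.50"),
    setup_time=Decimal("0.15"),
    data_arrival=Decimal("9.35"),
)
print("data required time:", required)
print("slack:", slack)
```

Decimal arithmetic is used here only so that the two-decimal figures from the report combine without binary floating-point rounding.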

***************SILICON_ENSEMBLE DESIGN SUMMARY REPORT***************

Time: 23:36:38, 2 April 2003
Design name: Multi64
Report file name: Multi64e.summary

Number of macros: 297
Number of components: 9862
Number of pins: 54738
  Number of regular pins: 35717
  Number of special pins: 18796
  Number of unused pins: 255
Number of nets: 11630
Average number of pins per net: 4.71
Number of subnets: 861
  Number of regular pins for subnets: 257
  Number of special pins for subnets: 0
  Number of virtual pins for subnets: 705
Average number of pins per subnet: 2.00
Number of routing tracks available: 2548
Number of GCELLs per layer: 16146

** NET STATISTICS OF PIN COUNTS

Number of 2-pin nets: 9567
Number of 3-pin nets: 703
Number of 4-pin nets: 243
Number of 5-pin nets: 100
Number of 6-pin nets: 186
Number of 7-pin nets: 79
Number of 8-pin nets: 21
Number of 9-pin nets: 28
Number of 10-pin nets: 9
Number of 11-pin nets: 20
Number of 12-pin nets: 22
Number of 13-pin nets: 39
Number of 14-pin nets: 53
Number of 15-pin nets: 81
Number of 16-pin nets: 476
Number of 257-pin nets: 1
Number of 9399-pin nets: 2

***************SILICON_ENSEMBLE DESIGN SUMMARY REPORT***************

Time: 23:36:38, 2 April 2003
Design name: Multi64
Report file name: Multi64e.summary

** UTILIZATION OF ALL ROW TYPES

<table>
<thead>
<tr>
<th>Type</th>
<th>Number</th>
<th>Length</th>
<th>Area</th>
<th>%_Row_Space</th>
</tr>
</thead>
<tbody>
<tr>
<td>cores</td>
<td>309</td>
<td>73162980</td>
<td>456683956800</td>
<td>59.96</td>
</tr>
<tr>
<td>core Cells</td>
<td>9398</td>
<td>58501080</td>
<td>360366652800</td>
<td>79.96</td>
</tr>
</tbody>
</table>

Area of chip: 595844049600 (square DBU)
Area required for all cells: 360366652800 (square DBU)
Area utilization of all cells: 60.48%

***************SILICON_ENSEMBLE WIRING REPORT***************

Time: 23:36:45, 2 April 2003
Design name: Multi64
Report file name: Multi64e.wires

Total vias in regular wiring: 105125
Total segments in regular wiring: 110441
Total vias in special wiring: 3140
Total segments in special wiring: 386

LAYER name: metal1
  Total wire length: 127593.76 microns
    Length of regular wires: 45205.96 microns
    Length of special wires: 82387.80 microns
LAYER name: metal2
  Total wire length: 232618.42 microns
    Length of regular wires: 218170.42 microns
    Length of special wires: 14448.00 microns
LAYER name: metal3
  Total wire length: 432153.96 microns
    Length of regular wires: 432153.96 microns
    Length of special wires: .00 microns
LAYER name: metal4
  Total wire length: 314007.86 microns
    Length of regular wires: 314007.86 microns
    Length of special wires: .00 microns
LAYER name: metal5
  Total wire length: 162943.82 microns
    Length of regular wires: 162943.82 microns
    Length of special wires: .00 microns
LAYER name: metal6
  Total wire length: 55407.00 microns
    Length of regular wires: 55407.00 microns
    Length of special wires: .00 microns

Total wirelength in regular wiring: 1226999.02 microns
Total wirelength in special wiring: 96835.80 microns
Total wirelength in regular+special wiring: 1323824.82 microns

CROSSTALK:

0 nets claimed N coupling caps, but had different no.
4 nets had more coupling than total capacitance
0 pin lists with fewer pins than they said they had.
0 nets with bizarre cap/unit length
4 nets had implausible coupling ratios.
2 wires had implausibly small Rm.
131 rising drive and 131 falling drive implausibly small.
11630 nets processed, total wire 1233651.7
0 unknown nets encountered (maybe with repetitions)
130 constant nets, length 10179.0, (0.8%)
0 nets had errors, length 0.0, (0.0%)
Max crosstalk induced timing delta is 1.19ns, (0 > 1us)
Sum of all error voltages is 0.000 volts

Mult64e power: 231.069mW

DELAY ANALYSIS: From Pearl

RecursiveMultiplierV4
  input register    0.26
  MUX1              0.12
  Multiplier        3.49
  MUX2              0.11
  Reduction         1.27
  FastAdder_low     1.75
  FastAdder_high    2.00

Mult64d
  input register    1.10
  multiplier        4.60
  adder             5.20

Mult64e
  input register    2.23
  multiplier        4.37
  adder             3.32

Vita Auctoris

Pedram Mokrian received the B.A.Sc. degree in Electrical Engineering from the University of Windsor, graduating with first class honours (11.92/13) in 2001. A recognized member of the Dean's list and the President's Roll over his entire academic career at Windsor, Pedram has also gained invaluable industrial experience through his enrollment in the co-op program. His professional employment includes positions at Ford Motor Company in Windsor, and research and development placements in Ottawa at JDS Uniphase and at Nortel Networks as a hardware design engineer.

In the Fall of 2001, Pedram began an M.A.Sc. degree in Electrical Engineering under the supervision of Dr. Majid Ahmadi at the University of Windsor. During this period, he has received a University of Windsor Tuition Scholarship, an Ontario Graduate Scholarship, an Ontario Graduate Scholarship in Science and Technology, and most recently a National Research Council of Canada Post-Graduate Fellowship. His area of specialization has been the VLSI implementation of arithmetic algorithms, with a focus on the impact of technology scaling on digital multiplication architectures.

Pedram intends to commence his studies towards a Ph.D. degree in the Fall of 2003.