# Digital Integrated Circuits (83-313)

## Lecture 6: Arithmetic Circuits

Semester B, 2016-17

Lecturer: Dr. Adam Teman

TAs: Itamar Levi, Robert Giterman

Emerging Nanoscaled Integrated Circuits and Systems Labs

7 May 2017



Disclaimer: This course was prepared, in its entirety, by Adam Teman. Many materials were copied from sources freely available on the internet. When possible, these sources have been cited; however, some references may have been cited incorrectly or overlooked. If you feel that a picture, graph, or code example has been copied from you and either needs to be cited or removed, please feel free to email <a href="mailto:adam.teman@biu.ac.il">adam.teman@biu.ac.il</a> and I will address this as soon as possible.

#### **Lecture Content**





## DataPaths





## Intel Microprocessor

Itanium has 6 integer execution units like this





## Bit-Sliced Design

**Control** 



Integer Datapath (IEU), Bypass, Control, Register File

Fetzer, Orton, ISSCC'02



Design for energy efficiency!



Data-In

## Adders





## Serial Adder Concept

- At time $i$, read $a_i$ and $b_i$; produce $s_i$ and $c_{i+1}$.
- The internal state stores $c_i$. The initial carry bit $c_0$ is set to $c_{in}$.
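The serial adder concept can be sketched in software as a loop that processes one bit pair per time step. This is a minimal behavioral model, not a circuit description; the helper names `serial_add`, `to_bits`, and `from_bits` are chosen here for illustration.

```python
def serial_add(a_bits, b_bits, c_in=0):
    """Bit-serial addition, LSB first. Returns (sum_bits, carry_out)."""
    carry = c_in                      # internal state: c_0 = c_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):  # time step i reads a_i and b_i
        sum_bits.append(a ^ b ^ carry)               # s_i
        carry = (a & b) | (a & carry) | (b & carry)  # c_{i+1}
    return sum_bits, carry


def to_bits(value, width):
    """Little-endian (LSB first) bit list of a non-negative integer."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


# 11 + 6 = 17: the 4-bit sum is 1 and the final carry is 1.
sum_bits, carry_out = serial_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(sum_bits), carry_out)
```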





#### Basic Addition Unit – Full Adder

| x | y | Cin | S | Cout |
|---|---|-----|---|------|
| 0 | 0 | 0   | 0 | 0    |
| 0 | 0 | 1   | 1 | 0    |
| 0 | 1 | 0   | 1 | 0    |
| 0 | 1 | 1   | 0 | 1    |
| 1 | 0 | 0   | 1 | 0    |
| 1 | 0 | 1   | 0 | 1    |
| 1 | 1 | 0   | 0 | 1    |
| 1 | 1 | 1   | 1 | 1    |

$$S = x \oplus y \oplus C_{in} \Rightarrow S = P \oplus C_{in}$$

$$C_{out} = xy + xC_{in} + yC_{in} \Rightarrow C_{out} = G + P \cdot C_{in}$$

Generate: $G = x \cdot y$

Propagate: $P = x \oplus y \approx x + y$

Kill: $K = \overline{x} \cdot \overline{y}$
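The generate/propagate/kill formulation can be checked exhaustively against the truth table. The following is a small sketch; the function name `full_adder` and its return layout are assumptions made for this example.

```python
from itertools import product


def full_adder(x, y, c_in):
    """Return (S, Cout, (G, P, K)) for single-bit inputs."""
    g = x & y                # generate
    p = x ^ y                # propagate
    k = (1 - x) & (1 - y)    # kill
    s = p ^ c_in
    c_out = g | (p & c_in)
    return s, c_out, (g, p, k)


for x, y, c_in in product((0, 1), repeat=3):
    s, c_out, (g, p, k) = full_adder(x, y, c_in)
    assert 2 * c_out + s == x + y + c_in                 # arithmetic meaning
    assert c_out == (x & y) | (x & c_in) | (y & c_in)    # majority form
    assert c_out == g | ((x | y) & c_in)                 # OR also works for the carry
    assert g + p + k == 1                                # exactly one of G, P, K is set
```

Note that the OR form of propagate is valid only for the carry; the sum still needs the XOR form.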



## **Full-Adder Implementation**

A full-adder is therefore a majority gate and a 3-input XOR:



Total: 32 Transistors



## Ripple Carry Adder



- The $C_{out}$ output of the full adder is clearly on the critical path.
- Can we exploit this to improve the design?

$$S = A \oplus B \oplus C_{in} = ABC_{in} + \left(A + B + C_{in}\right) \overline{C}_{out}$$

$$t_{adder} = (N-1)t_{carry} + t_{sum} \Rightarrow t_d = O(N)$$
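As a sanity check, the identity that builds the sum from $\overline{C}_{out}$ can be verified for all inputs and then used in a behavioral ripple-carry model. This is a sketch; `majority`, `sum_from_carry`, `ripple_carry_add`, and `ripple_delay` are names invented here, and the delay values are unitless placeholders.

```python
from itertools import product


def majority(a, b, c):
    """Carry-out of a full adder."""
    return (a & b) | (a & c) | (b & c)


def sum_from_carry(a, b, c_in):
    """S = A*B*Cin + (A + B + Cin) * NOT(Cout)."""
    c_out = majority(a, b, c_in)
    return (a & b & c_in) | ((a | b | c_in) & (1 - c_out))


# The identity matches the 3-input XOR for every input combination.
assert all(sum_from_carry(a, b, c) == a ^ b ^ c
           for a, b, c in product((0, 1), repeat=3))


def ripple_carry_add(a, b, n, c_in=0):
    """N-bit ripple-carry addition. Returns (sum, carry_out)."""
    carry, total = c_in, 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        total |= sum_from_carry(ai, bi, carry) << i
        carry = majority(ai, bi, carry)
    return total, carry


def ripple_delay(n, t_carry, t_sum):
    """Worst-case delay: (N-1) carry stages followed by one sum stage."""
    return (n - 1) * t_carry + t_sum
```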



## **Full-Adder Implementation**

#### **24 Transistors**





$$LE_{C_i} = \frac{2+4+2+4+6+3}{3} = 7$$

...BUT ~64 stages to propagate, i.e. $PE_{\text{opt}} = 4^{64}$



## **Exploiting the Inversion Property**





$$\bar{S}(A,B,C_{i}) = S(\bar{A},\bar{B},\overline{C}_{i})$$

$$\overline{C_o}(A,B,C_i) = C_o(\overline{A},\overline{B},\overline{C_i})$$

Even cell Odd cell



We saved the inverter, so  $PE_{\text{opt}}=4^{32}$ 
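The inversion property can be confirmed by brute force, and a behavioral model shows why it removes the inverter from the carry path. This is a sketch under an assumption: each cell is modeled as an inverting stage, with even cells fed true inputs and odd cells fed inverted inputs. All function names are invented for this example.

```python
from itertools import product


def fa_sum(a, b, c):
    return a ^ b ^ c


def fa_carry(a, b, c):
    return (a & b) | (a & c) | (b & c)


def inv(bit):
    return 1 - bit


def inversion_property_holds():
    """Check both identities for all 8 input combinations."""
    return all(
        inv(fa_sum(a, b, c)) == fa_sum(inv(a), inv(b), inv(c))
        and inv(fa_carry(a, b, c)) == fa_carry(inv(a), inv(b), inv(c))
        for a, b, c in product((0, 1), repeat=3)
    )


def ripple_add_alternating(a, b, n):
    """Ripple adder made of inverting cells with no inverter between cells."""
    wire = 0   # carry wire: holds c_i entering even cells, NOT c_i entering odd cells
    total = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if i % 2 == 0:
            # Even cell: true inputs, so the cell produces NOT s_i and NOT c_{i+1}.
            s_i = inv(inv(fa_sum(ai, bi, wire)))   # sum re-inverted off the carry path
            wire = inv(fa_carry(ai, bi, wire))
        else:
            # Odd cell: inverted inputs, so the cell produces true s_i and c_{i+1}.
            s_i = inv(fa_sum(inv(ai), inv(bi), wire))
            wire = inv(fa_carry(inv(ai), inv(bi), wire))
        total |= s_i << i
    carry_out = wire if n % 2 == 0 else inv(wire)
    return total, carry_out
```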



## Sizing the Mirror Adder

 $LE_{Cin} = \frac{4+2}{3} = 2$ 

Problem: How can we make a high speed bitslice layout?

- If we upsize each stage according to Logical Effort, we will have non-identical bitslices.
- Such upsizing will result in huge gates.

$$EF_{FA,Cin} = \frac{LE_{Cin} \cdot C_{L,Cout}}{C_{Cin}} \Rightarrow \frac{C_{L,Cout}}{C_{Cin}} \bigg|_{EF=4} = 2$$

$$C_{L,Cout} = 6 + C_{Cin} + 6 + 9 \Rightarrow C_{Cin} = 21$$

- Why not design the adder to inherently achieve optimal Electrical Effort ( $EF_{opt}=4$ )?
  - Assume everything not on the carry path can be sized like a minimum inverter!
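Spelling out the arithmetic behind the two sizing equations may help. This is a sketch that takes $LE_{Cin} = 2$ from the mirror adder and lumps the fixed loads $6 + 6 + 9$ into a single constant of 21:

$$\begin{aligned} EF_{FA,Cin} = 4,\ LE_{Cin} = 2 &\Rightarrow C_{L,Cout} = 2\,C_{Cin} \\ C_{L,Cout} = C_{Cin} + 21 &\Rightarrow 2\,C_{Cin} = C_{Cin} + 21 \Rightarrow C_{Cin} = 21,\ C_{L,Cout} = 42 \end{aligned}$$

As a check, $LE_{Cin} \cdot C_{L,Cout} / C_{Cin} = 2 \cdot 42 / 21 = 4$.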



## Manchester Carry-Chain Adder



Static Circuits



Dynamic Circuit



$$t_P = 0.69 \sum_{i=1}^{N} C_i \cdot \left(\sum_{j=1}^{i} R_j\right) = 0.69 \frac{N(N+1)}{2} RC \quad \text{where } R_j = R,\ C_i = C$$
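The closed form can be compared against the general double sum numerically. This is a small sketch; `elmore_delay` and `manchester_chain_delay` are illustrative names, and the R and C values are unitless placeholders.

```python
import math


def elmore_delay(resistances, capacitances):
    """0.69 * sum_i C_i * (R_1 + ... + R_i) for an RC chain."""
    total, upstream_r = 0.0, 0.0
    for r, c in zip(resistances, capacitances):
        upstream_r += r          # resistance between the source and node i
        total += c * upstream_r
    return 0.69 * total


def manchester_chain_delay(n, r, c):
    """Closed form for N identical stages: 0.69 * N(N+1)/2 * RC."""
    return 0.69 * n * (n + 1) / 2 * r * c


for n in (1, 4, 16):
    assert math.isclose(elmore_delay([1.0] * n, [1.0] * n),
                        manchester_chain_delay(n, 1.0, 1.0))
```

The quadratic growth in $N$ is why long pass-transistor carry chains are usually broken up with buffers.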

Propagate/Generate Row



## Carry-Skip (Carry Bypass) Adder





M Sections of (N/M) Bits Each

$$t_{adder} = t_{setup} + \left(\frac{N}{M} - 1\right) t_{carry} + \left(M - 1\right) t_{bypass} + t_{sum}$$

Idea: If (P0 and P1 and P2 and P3 = 1) then  $C_{o3} = C_0$ , else "kill" or "generate".
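The bypass idea can be modeled behaviorally: each section ripples internally, and when every bit in the section propagates, the section's carry-out is taken directly from its carry-in. This is a sketch with invented names (`carry_skip_add`, `carry_skip_delay`); the delay function simply evaluates the expression above with M sections.

```python
def carry_skip_add(a, b, n, section_bits, c_in=0):
    """N-bit addition with carry bypass per section. Returns (sum, carry_out)."""
    carry, total = c_in, 0
    for start in range(0, n, section_bits):
        section_c_in = carry
        block_propagate = 1
        for i in range(start, min(start + section_bits, n)):
            ai, bi = (a >> i) & 1, (b >> i) & 1
            p, g = ai ^ bi, ai & bi
            total |= (p ^ carry) << i
            carry = g | (p & carry)      # ripple inside the section
            block_propagate &= p         # P0 and P1 and ... for this section
        if block_propagate:
            # Bypass mux: same value as the rippled carry, but available earlier.
            carry = section_c_in
    return total, carry


def carry_skip_delay(n, m, t_setup, t_carry, t_bypass, t_sum):
    """Delay for M sections of N/M bits each, per the expression above."""
    return t_setup + (n / m - 1) * t_carry + (m - 1) * t_bypass + t_sum
```

The bypass does not change the logical result; it only shortens the path the carry takes through a fully propagating section.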





## **Carry-Select Adder**



Let's guess the answer for each value of the carry.



N-bit input with M CSA blocks

$$t_{adder} = t_{setup} + \frac{N}{M}t_{carry} + M \cdot t_{mux} + t_{sum}$$
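Guessing the answer for each value of the carry can be sketched as follows: every block computes two results, one per possible carry-in, and the real carry selects between them. The names `ripple_block`, `carry_select_add`, and `carry_select_delay` are assumptions for this example.

```python
def ripple_block(a, b, width, c_in):
    """Ripple-carry addition of one block. Returns (sum, carry_out)."""
    carry, total = c_in, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    return total, carry


def carry_select_add(a, b, n, block_bits, c_in=0):
    """N-bit carry-select addition. Returns (sum, carry_out)."""
    carry, total = c_in, 0
    for start in range(0, n, block_bits):
        width = min(block_bits, n - start)
        mask = (1 << width) - 1
        a_blk, b_blk = (a >> start) & mask, (b >> start) & mask
        guess0 = ripple_block(a_blk, b_blk, width, 0)   # result if carry-in is 0
        guess1 = ripple_block(a_blk, b_blk, width, 1)   # result if carry-in is 1
        block_sum, carry = guess1 if carry else guess0  # multiplexer
        total |= block_sum << start
    return total, carry


def carry_select_delay(n, m, t_setup, t_carry, t_mux, t_sum):
    """Delay for an N-bit input with M blocks, per the expression above."""
    return t_setup + (n / m) * t_carry + m * t_mux + t_sum
```

In hardware the two guesses are computed in parallel, so only the multiplexer chain sits between blocks.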



## Square Root Carry Select





### Carry Lookahead Adder – Basic Idea

- Problem: $C_{o,k}$ takes approximately $k$ gate delays to ripple.
- Question: Can we calculate the carry without any ripple?





## Logarithmic CLA (Tree Adder)

- Can we reduce the complexity of calculating $P_i$, $G_i$?

$$P_{1:0} = P_1 \cdot P_0 \quad G_{1:0} = G_1 + P_1 \cdot G_0$$

$$\Rightarrow C_{out,1} = G_{1:0} + P_{1:0}C_{in,0}$$

$$P_{3:2} = P_3 \cdot P_2 \quad G_{3:2} = G_3 + P_3 \cdot G_2$$



$$\Rightarrow C_{out,3} = G_{3:2} + P_{3:2}C_{in,2}$$

$$P_{3:0} = P_{3:2} \cdot P_{1:0} \quad G_{3:0} = G_{3:2} + P_{3:2} \cdot G_{1:0}$$

$$\Rightarrow C_{out,3} = G_{3:0} + P_{3:0}C_{in,0}$$
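The group propagate/generate equations all share one combining rule, which can be applied in $\log_2 N$ levels. The sketch below uses a radix-2 parallel-prefix arrangement with doubling distance (in the style of a Kogge-Stone tree); that particular arrangement is an assumption of this example, as are the names `combine` and `tree_add`.

```python
def combine(high, low):
    """Merge (P, G) of a more-significant group with a less-significant one."""
    p_hi, g_hi = high
    p_lo, g_lo = low
    return p_hi & p_lo, g_hi | (p_hi & g_lo)


def tree_add(a, b, n, c_in=0):
    """N-bit addition using logarithmic carry lookahead. Returns (sum, carry_out)."""
    # Bitwise (P_i, G_i) pairs.
    pg = [(((a >> i) ^ (b >> i)) & 1, ((a >> i) & (b >> i)) & 1) for i in range(n)]
    prefix = list(pg)            # becomes (P_{i:0}, G_{i:0}) for every i
    distance = 1
    while distance < n:          # one pass per tree level
        prefix = [combine(prefix[i], prefix[i - distance]) if i >= distance
                  else prefix[i]
                  for i in range(n)]
        distance *= 2
    # carries[i] is the carry into bit i: C = G_{i-1:0} + P_{i-1:0} * C_in.
    carries = [c_in] + [g | (p & c_in) for p, g in prefix]
    total = sum((pg[i][0] ^ carries[i]) << i for i in range(n))
    return total, carries[n]
```

For example, `combine((P3_2, G3_2), (P1_0, G1_0))` yields exactly the $P_{3:0}$, $G_{3:0}$ pair given above.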







## Logarithmic CLA (Tree Adder)

- Many ways to construct these CLA or tree adders, based on:
  - Radix: How many bits combined in each gate
  - Tree Depth: How many stages of logic to the final carry ($\geq \log_{radix} N$)
  - Fanout: Maximal logic branching in tree





#### Subtraction

To subtract in two's complement, just remember that:

$$-x = \overline{x} + 1 \quad \longrightarrow \quad A - B = A + \overline{B} + 1$$

- So to subtract:
  - Invert one of the operands.
  - Add a carry in to the first bit.
- Therefore, to provide an adder/subtractor:
  - Add an XOR gate to each bit of the B input.
  - Drive both the XOR gates and the carry-in with the sub/add selector.
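The adder/subtractor described above can be modeled in a few lines. This is a behavioral sketch; `add_sub` and `to_signed` are illustrative names, and Python integer addition stands in for the adder itself.

```python
def add_sub(a, b, n, sub):
    """N-bit adder/subtractor. sub=0 computes A + B, sub=1 computes A - B.

    Returns (result, carry_out), with the result as an unsigned N-bit pattern.
    """
    mask = (1 << n) - 1
    b_xor = (b ^ (mask if sub else 0)) & mask   # XOR every B bit with the selector
    raw = (a & mask) + b_xor + sub              # the selector is also the carry-in
    return raw & mask, raw >> n


def to_signed(value, n):
    """Interpret an N-bit pattern as a two's complement number."""
    return value - (1 << n) if value >> (n - 1) else value


print(to_signed(add_sub(3, 5, 8, sub=1)[0], 8))
```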



## Multipliers





## Grade School Multiplication



### Multiplication using serial addition





## **Binary Multiplication**





#### Serial Shift and Add





## **Array Multiplier**



## **Many Critical Paths**



$$t_{mult} \approx t_{AND} + \left[ \left( M - 1 \right) + \left( N - 2 \right) \right] t_{carry} + \left( N - 1 \right) t_{sum}$$
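The delay expression can be evaluated alongside a behavioral model that forms the partial products with AND gates and sums them. This is a sketch; treating `x` as the M-bit operand and `y` as the N-bit operand is an assumption of this example, as are the function names and the unitless delays.

```python
def array_multiply(x, y, m, n):
    """Multiply an M-bit x by an N-bit y by summing N shifted partial products."""
    product = 0
    for j in range(n):
        y_j = (y >> j) & 1
        # One row of AND gates: x_i AND y_j for every bit i of x.
        partial = sum((((x >> i) & 1) & y_j) << i for i in range(m))
        product += partial << j
    return product


def array_multiplier_delay(m, n, t_and, t_carry, t_sum):
    """Evaluate t_AND + [(M-1) + (N-2)] * t_carry + (N-1) * t_sum."""
    return t_and + ((m - 1) + (n - 2)) * t_carry + (n - 1) * t_sum
```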



## **Carry-Save Multiplier**





## **Multiplier Floorplan**





## **Booth Recoding**

- Multiplying by '0' is redundant.
- Can we reduce the number of partial products?

Based on the observation that

$$\sum_{i=0}^{n-1} 2^i = 2^n - 1$$

$$\begin{array}{rr} 1000 & (8) \\ -0001 & (1) \\ \hline 0111 & (7) \end{array}$$

$$\begin{array}{rr} 01000000 & (64) \\ -00001000 & (8) \\ \hline 00111000 & (56) \end{array}$$

- We can turn sequences of 1's into sequences of 0's. For example: $0111 = 1000 - 0001$.
- So we can introduce a '-1' bit and recode the multiplier:
  - For example, the number 56: $00111000 = 0100\bar{1}000$, i.e. $64 - 8$.



## Radix-2 Booth Recoding

- Parse multiplier from left to right
  - For each change from 0 to 1, encode a '1'
  - For each change from 1 to 0, encode a '-1'
  - For bit 0, assume that bit $i=-1$ is a 0
- Example:  $0011 \ 0111 \ 0011 = 0x373$
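The recoding rules above amount to computing each digit as the bit to the right minus the bit itself. The following sketch applies that to the example; `booth_radix2` and `digits_value` are illustrative names, and the multiplier is assumed to have a 0 in its most significant bit.

```python
def booth_radix2(multiplier, n):
    """Return the N Booth digits, MSB first, each in {-1, 0, 1}."""
    bits = [(multiplier >> i) & 1 for i in range(n)]
    digits = []
    for i in range(n):
        right = bits[i - 1] if i > 0 else 0   # bit -1 is assumed to be 0
        # 0 followed by 1 gives +1; 1 followed by 0 gives -1; no change gives 0.
        digits.append(right - bits[i])
    return digits[::-1]


def digits_value(digits, radix=2):
    """Value of an MSB-first signed-digit number."""
    value = 0
    for d in digits:
        value = value * radix + d
    return value


digits = booth_radix2(0x373, 12)
print(digits)                 # [0, 1, 0, -1, 1, 0, 0, -1, 0, 1, 0, -1]
print(digits_value(digits))   # 883, i.e. 0x373
```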



## Modified (Radix-4) Booth Recoding

Radix-2 Booth recoding doesn't work for parallel hardware implementations:

- A worst case (010101010101010) doesn't reduce the number of partial products.
- Variable-length recoders (according to the length of the '1' strings) cannot be implemented efficiently.
- Instead, just assume a constant-length recoder.
  - First apply standard Booth recoding.
  - Next encode each pair of bits:
    1. Within a sequence
    2. Beginning of a 1's-sequence
    3. End of a 1's-sequence
- This can be summarized in a truth table:

**Partial Product Selection Table**

| Multiplier Bits | Recoded Bits      |
|-----------------|-------------------|
| 000             | 0                 |
| 001             | + Multiplicand    |
| 010             | + Multiplicand    |
| 011             | +2 × Multiplicand |
| 100             | -2 × Multiplicand |
| 101             | - Multiplicand    |
| 110             | - Multiplicand    |
| 111             | 0                 |
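The selection table maps directly onto a lookup keyed by three overlapping multiplier bits. The sketch below recodes the earlier example 0x373 and uses the digits to multiply; `RECODE`, `booth_radix4`, and `booth_multiply` are names invented here, and an even bit width with a 0 in the most significant bit is assumed.

```python
# (b_{i+1}, b_i, b_{i-1}) -> multiple of the multiplicand
RECODE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_radix4(multiplier, n):
    """Return the N/2 radix-4 Booth digits, MSB first (n must be even)."""
    def bit(i):
        return (multiplier >> i) & 1 if i >= 0 else 0   # bit -1 is 0

    digits = [RECODE[(bit(i + 1), bit(i), bit(i - 1))] for i in range(0, n, 2)]
    return digits[::-1]


def booth_multiply(multiplicand, multiplier, n):
    """Accumulate one partial product per radix-4 digit."""
    product = 0
    for digit in booth_radix4(multiplier, n):
        product = product * 4 + digit * multiplicand
    return product


print(booth_radix4(0x373, 12))   # [1, -1, 2, -1, 1, -1]
```

A 12-bit multiplier therefore needs 6 partial products instead of 12, and each one is 0, ±1, or ±2 times the multiplicand, which requires only a shift and an optional negation.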

## Modified (Radix-4) Booth Recoding

- For example, let's take our previous example:
  - Radix-2 Booth recoding of $0011\ 0111\ 0011$ gives the digit pairs $(0,1)\ (0,-1)\ (1,0)\ (0,-1)\ (0,1)\ (0,-1)$.
  - This comes out: 1 -1 2 -1 1 -1.
- We could have done this directly with the Partial Product Selection Table, by reading overlapping 3-bit groups of 001101110011 (with a 0 appended on the right): 001, 110, 011, 110, 001, 110.



To implement this we need pretty simple hardware:

## **Tree Multipliers**

 Can we further reduce the multiplier delay by employing logarithmic (tree) structures?





## Wallace-Tree Multiplier








## Pipelining Multipliers

- Pipelining can be applied to most multiplier structures:





## **Further Reading**

- Rabaey, et al., "Digital Integrated Circuits" (2<sup>nd</sup> Edition)
- Elad Alon, Berkeley EE141 (online)
- Weste, Harris, "CMOS VLSI Design" (4<sup>th</sup> Edition)

