# Digital VLSI Design

## Lecture 8: Clock Tree Synthesis

Semester A, 2018-19 Lecturer: Dr. Adam Teman

January 12, 2019





Disclaimer: This course was prepared, in its entirety, by Adam Teman. Many materials were copied from sources freely available on the internet. When possible, these sources have been cited; however, some references may have been cited incorrectly or overlooked. If you feel that a picture, graph, or code example has been copied from you and either needs to be cited or removed, please feel free to email <u>adam.teman@biu.ac.il</u> and I will address this as soon as possible.

## Where are we in the design flow?

#### • We have:

- Synthesized our design into a technology mapped gatelevel netlist
- Designed a floorplan for physical implementation.
- And provided a location for each and every gate.

### During all stages:

 We analyzed timing constraints and optimized the design according to these constraints.

#### • However...

- Until now, we have assumed an ideal clock.
- Now we have <u>all sequential elements placed</u>, so we have to provide them with a *real* clock signal.



## **Trivial Approach?**

- Question:
  - Why not just route the clock net to all sequential elements, just like any other net?

Clock

- Answer...
  - Timing
  - Power
  - Area
  - Signal Integrity
  - etc...



### **Lecture Outline**





# Implications of Clocking

Timing, power, area, signal integrity





#### **Implications** Implications on Timing on Timing Implications on Power • Let's remember our famous timing constraints: Implications $T + \delta_{\text{skew}} > t_{\text{CQ}} + t_{\text{logic}} + t_{\text{setup}} + \delta_{\text{margin}}$ • Max Delay: on SI mr $t_{\rm CQ} + t_{\rm logic} - \delta_{\rm margin} > t_{\rm hold} + \delta_{\rm skew}$ • Min Delay: Launch Clock Capture Clock Skew and Jitter Launch Clock Max Delay 0,411 10/0/ **Capture Clock**

## **Clock Parameters**

#### Skew

• Difference in clock arrival time at two different registers.

#### • Jitter

• Difference in clock period between different cycles.

### Slew

- Transition  $(t_{rise}/t_{fall})$  of clock signal.
- Insertion Delay
  - Delay from clock source until registers.





## Implications on Timing

• So skew and jitter eat away at our timing margins:



Implications

on Timing

Implications

on Power

## Implications on Power

- Let's remember how to calculate dynamic power:
  - $P_{\rm dyn} = f \cdot C_{\rm eff} \cdot V_{\rm DD}^2 \qquad C_{\rm eff} \triangleq \alpha \cdot C_{\rm total} = \alpha_{\rm clock} \cdot C_{\rm clock} + \alpha_{\rm others} \cdot C_{\rm others}$
  - The activity factor ( $\alpha$ ) of the clock network is 100%!
- The clock capacitance consists of:
  - Clock generation (i.e., PLL, clock dividers, etc.)
  - Clock elements (buffers, muxes, clock gates)
  - Clock wires
  - Clock load of sequential elements
- Clock networks are huge
  - And therefore, the clock is responsible for a large percentage of the total chip power.



Control

15% 10%

Implications

on Timing

Implications

on Power

on SI

Implications

on Area

I/O Drivers

plications

## Implications on Signal Integrity

**Signal Integrity** is an obvious requirement for the clock network:

- Noise on the clock network can cause:
  - In the worst case, additional clock edges
  - Lower coupling can still slow down or speed up clock propagation
  - Irregular clock edges can impede register operation
- Slow clock transitions (slew rate):
  - Susceptibility to noise (weak driver)
  - Poor register functionality (worse  $t_{cq}$ ,  $t_{setup}$ ,  $t_{hold}$ )
- Too fast clock transitions
  - Overdesign  $\rightarrow$  power, area, etc.
  - Bigger aggressor to other signals
- Unbalanced drivers lead to increased skew.

Implications on Power

Implications on SI

Implications on Area

#### Best practice:

Keep t<sub>rise</sub> and t<sub>fall</sub> between 10-20% of clock period (e.g., 100-200ps @1GHz)

## Implications on Area

- To reiterate, clock networks consist of:
  - Clock generators
  - Clock elements
  - Clock wires
- All of these consume area
  - Clock generators (e.g., PLL) can be very large
  - Clock buffers are distributed all over the place
  - Clock wires consume a lot of routing resources
- Routing resources are most vital
  - Require low RC (for transition and power)
    - Benefit of using high, wide metals
  - Need to connect to every clock element (FF)
    - Distribution all over the chip
    - Need Via stack to go down from high metals



Source: Universitat Bonn

© Adam Teman, 2018



# **Clock Distribution**

So how do we build a clock tree?





## The Clock Routing Problem

Given a source and *n* sinks:

- Connect all sinks to the source by an interconnect network so as to minimize:
  - Clock Skew =  $\max_{i,j} |t_i t_j|$
  - Delay =  $\max_i (t_i)$
  - Total wirelength
  - Noise and coupling effect

#### The Challenge:

#### Synchronize millions (billions) of separate elements

- Within a time scale on order of ~10 ps
- At distances spanning 2-4 cm
  - Ratio of synchronizing distance to element size on order of 10<sup>5</sup>
- Reference: light travels <1 cm in 10 ps



#### Clock Tree goals:

 Minimize Skew
 Meet target insertion delay (min/max)

#### Clock Tree constraints:

- 1) Maximum Transition  $\checkmark$
- 2) Maximum load cap 🗸
- 3) Maximum Fanout 🧹
- 4) Maximum Buffer Levels 🗸

## **Technology Trends**

#### • Timing

- Higher clock frequency → Lower skew
- Higher clock frequency → Faster transitions
- Jitter PLL's get better with CMOS scaling
- but other sources of noise increase
  - Power supply noise more important
  - Switching-dependent temperature gradients

### New Interconnect Materials

- Copper Interconnect → Lower RC → Better slew and potential skew
- Low-k dielectrics → Lower clock power, better latency/skew/slew rates

#### Power

15

- Heavily pipelined design  $\rightarrow$  more registers  $\rightarrow$  more capacitive load for clock
- Larger chips  $\rightarrow$  more wire-length needed to cover the entire die
- Complexity → more functionality and devices → more clocked elements
- Dynamic logic  $\rightarrow$  more clocked elements



### **Approaches to Clock Synthesis**

- Broad Classification:
  - Clock Tree
  - Clock Mesh (Grid)
  - Clock Spines



### **Clock Trees**

- Naïve approach:
  - Route an individual clock net to each sink and balance the RC-delay
  - However, this would burn excessive power and the large RC of each net would cause signal integrity issues.

**RC-Tree** 

 $-\infty$ 

- Instead use a buffered tree
  - Short nets mean lower RC values
  - Buffers restore the signal for better slew rates
  - Lower total insertion delay
  - Less total switching capacitance



Clock Tree

Clock Grid

Clock Spine

## **Building an actual Clock Tree**

- Perfectly balanced approach: H-Tree
  - One large central driver
  - Recursive H-style structure to match wire-lengths
  - Halve wire width at branching points to reduce reflections
- More realistic:

18

- Tapered H-Tree, but still hard to do.
- Standard CTS approach:
  - Flip flops aren't distributed evenly.
  - Try to build a balanced tree





clock / PLL

sequential

elements

Clock Tree

Clock Grid

Clock Spine



IBM, ISSCC 2000

19

## Lower Skew – Clock Grids

### Advantages:

- Skew determined by grid density and not overly sensitive to load position
- Clock signals are available everywhere
- Tolerant to process variations
- Usually yields extremely low skew values

### Disadvantages

- Huge amounts of wiring & power
  - Wire cap large
  - Strong drivers needed pre-driver cap large
  - Routing area large
- To minimize all these penalties, make grid pitch coarser
  - Skew gets worse

20

- Losing the main advantage
- Don't overdesign let the skew be as large as tolerable
- Still grids seem non-feasible for SoC's





Clock Tree

## DEC Alpha – Generations of Grids

#### • 21064 (EV4) – 1992

- 0.75µm, 200MHz, 1.7M trans.
- Big central driver, clock grid
  - 240ps skew

### • 21164 (EV5) - 1995

- 0.5 µm, <u>300MHz</u>, <u>9.3M</u> trans.
- Central driver, two final drivers, clock grid
  - Total driver size 58 cm!
  - 120ps skew
- 21264 (EV6) 1998
  - 0.35 μm, 600MHz, 15.2M trans.
  - 4 skew areas for gating
    - Total driver size: <u>40 cm</u>
    - 75ps skew



© Adam Teman, 2018

21

## **Clock Spines**

- Clock grids are too power (and routing) hungry.
- A different approach is to use spines
  - Build an H-Tree to each spine
  - Radiate local clock distribution from spines
- Pentium 4 (2001) used the clock spine approach.







Clock Tree

## Summary of main clock dist. approaches

Three basic routing structures for global clock:

#### • H-tree

- Low skew, smallest routing capacitance, low power
- Floorplan flexibility is poor.

### Grid or mesh

- Low skew, increases routing capacitance, worse power
- Alpha uses global clock grid and regional clock grids

### Spine

- Small RC delay because of large spine width
- Spine has to balance delays; difficult problem
- Routing cap lower than grid but may be higher than H-tree.

| Structure | Skew    | Cap/area/<br>power | Floorplan<br>Flexibility |
|-----------|---------|--------------------|--------------------------|
| H-Tree    | Low/Med | Low                | Low                      |
| Grid      | Low     | High               | High                     |
| Spine     | High    | Medium             |                          |

## Active Skew Management (Deskewing)

### Alpha EV7



### Intel Itanium



© Adam Teman, 2018

## **Clock Concurrent Optimization**

### • What is the main goal of CTS?

- We are trying as hard as we can to minimize skew.
- And on the way, we are burning power, wasting area, suffering from high insertion delay, etc.
- But is minimal skew our actual goal?
- Why are we minimizing skew in the first place?



### Maybe we should forget about skew and focus on our real goals?

- We need to meet timing and DRV constraints.
- Minimizing skew was just to correlate post-CTS and pre-CTS timing.
- But maybe we should just consider timing, while building our clock tree.
- This new approach is known as Clock Concurrent Optimization (CCOpt)

## **Clock Concurrent Optimization** <sup>1</sup>

### CCOpt Methodology:

- First, build a clock tree, in order to fix DRVs.
- Then check timing (setup and hold) and fix any violations.

### Why is this a good approach?

- Most timing paths are local.
- Therefore, they probably come from the same clock branch and don't need much skew balancing to start with.

small skew

and jitter

### Less skew balancing leads to:

- Lower insertion delay (power, jitter)
- Fewer clock buffers (power, area) /
- Distribution of peak current (less IR Drop)
- A heavy dose of useful skew (performance)

26

large skew and iitter medium

skew

and iitter

leak pour

### **The Chip Hall of Fame**

• Digital introduced a ton of novelty, including clocking, with the Alpha architecture, starting with the

# DEC Alpha 21064

- 21064 = "A 21<sup>st</sup> Century ready, 64-bit, gen. 0 uarchitecture.
- Release date: 1992 64 bit Alpha Architecture
- Process: Digital Equipment Co. CMOS-4 0.75um
- 3-metal layers, 3.3V power supply, 150-200 MHz
- 1.6 Million Transistors , 232 mm<sup>2</sup> die size, price \$3,375
- At the time of introduction, the Alphas were the world's fastest chips.
- Phased out after DEC was acquired by Compaq in 1998 and chose to use Intel Itanium architecture, instead.

Not yet inducted to the IEEE Chip Hall of Fame







# **Clock Tree Synthesis in EDA**





## Starting Point Before CTS

### • So remember:

- Our design is placed, and therefore, we know the location of all clock sinks (register and macro clock pins)
- All clock pins are driven by a single clock source, which we considered ideal until now.
- We may have several clocks and some logic on the clock network, such as clock gates, muxes, clock dividers, etc.

Clock -

### We must now buffer the clock nets to:

- Meet DRV constraints:
  - Max fanout, max capacitance, max transition, max length
- Meet clocking goals:
  - Minimum skew, minimum insertion delay



**DRV** = Design

**Rule Violations** 

Design Import

Floorplan

Placement

CTS

## Some basic understanding

- What makes up a clock tree?
  - A starting point
    - The clock port or the output pin of the clock generator.
  - Leaf/end points
    - Clock pins of registers, memories, etc.
  - Buffers/Inverters
    - Required to ensure clock transitions and meet skew requirements.
  - Special logic
    - Such as multiplexers, integrated clock gates, clock dividers

### We need a "language" to define these (and other) components

• The following slides will introduce you to the CCOpt naming conventions for these elements.



## **CTS Definitions – Sources and Sinks**

### **Clock Source**

- The pin that a clock fans out from.
- This can be:
  - A primary input (port) to our design
  - An output pin of an IP (e.g., PLL)
  - An output pin of a gate (e.g., clock mux, clock gate).

### Clock Sink

- All pins that receive the clock signal.
- This can be:
  - Clock input of a register (FF, latch)
  - Clock input of an IP (e.g., SRAM)
  - Primary output (if the clock is driven outside the block)



## **CTS Definitions – Trees and Skew Groups**

### Clock Tree

- The root of a circuit graph for buffering.
- Skew Group
  - A subset of clock sinks to consider for skew balancing/analysis. By default, all the sinks of a clock tree are in the same skew group.
  - However, they can be divided into several skew groups, sinks of several clocks can belong to the same skew group (e.g., clock and generated clock), and a sink can even belong to more than one skew group.
- Basic CCOpt Commands:

create\_clock\_tree -name clk -source [get\_ports CLK] -no\_skew\_group create\_skew\_group -name clk -sources [get\_ports CLK] -auto\_sinks update\_skew\_group -skew\_group clk -add\_sinks [get\_pins FF1/CK]



am Teman, 2018

## **CTS Definitions – Stop/Ignore/Exclude Pins**

### Stop Pins

- A leaf of a clock tree. The clock net will be buffered up to the stop pin but not beyond it.
- All clock sinks are implicit stop pins, and therefore, will be considered for skew balancing/analysis.
- To define additional pins as stop pins in CCOpt, use:

set\_db pin:INV1/A .cts\_sink\_type stop

Ignore Pins

- Ignore pins are pins on the clock net that will not be considered a sink in any skew group.
- The clock net will be buffered up to the ignore pin but not beyond it.

set\_db pin:div\_ff1/CK .cts\_sink\_type ignore

Implicit

**Stop Pins** 

Implicit

Ignore Pin

## **CTS Definitions – Stop/Ignore/Exclude Pins**

CLKGATE

dam Teman, 2018

Implicit

**Through Pin** 

#### • Exclude Pins

• Exclude pins are similar to ignore pins, but the clock net will not be buffered up to an exclude pin.

set\_db pin:div\_ff1/CK .cts\_sink\_type exclude

#### • Through Pins

- Through pins are pins, which would otherwise be considered stop pins, but we want the clock to propagate through them (for buffering and adding pins to the skew group).
- In the new versions of Innovus, there is no explicit command to define a through pin. Just call it an ignore pin...

### **CTS Definitions – Insertion Delay Pin**

- In some cases, we would like to provide the clock to a certain stop pin earlier or later than the average insertion delay.
  - For example, if a macro block has some internal insertion delay, we would like to provide the clock early for skew balancing.

nsertion delay

Such a pin is also known as a "float pin"

 For example, to provide the clock to an SRAM called "mem"150ps earlier than the average insertion delay :

set\_db pin:mem/CK .cts\_pin\_insertion\_delay 150ps

### **CTS Definitions – Pin Type Summary**

|                                | Part of the<br>Clock<br>Tree | Considered<br>for DRV<br>fixing | Considered for<br>Delay<br>Balancing | Clock<br>propagates<br>beyond it |
|--------------------------------|------------------------------|---------------------------------|--------------------------------------|----------------------------------|
| Stop Pin                       | Yes                          | Yes                             | Yes                                  | No                               |
| Ignore Pin                     | Yes                          | Yes                             | No                                   | No                               |
| Exclude Pin                    | No                           | No                              | No                                   | No                               |
| Float Pin<br>(insertion_delay) | Yes                          | Yes                             | Yes –<br>according to<br>constraint  | No                               |

# **Clock Net Routing**

- Clock nets are very important in terms of signal integrity
  - A glitch on a clock net will cause an <u>additional clock edge</u>! Clock signal
  - Slow transitions will cause <u>deteriorated setup/hold times of registers</u>.
  - Fast transitions are strong aggressors to neighboring nets.
- Therefore:
  - We will usually pre-route the clock nets during CTS.
    - First choice of routing tracks
  - Use higher, thicker metals for clock routing
    - Lower resistance
    - Less capacitance to the substrate
  - Apply shielding to clock nets!
  - Consider adding <u>DeCaps</u> next to clock buffers.



Vdd



Vdd

CLK

GND

### How do we route clock nets?

- In Innovus, we can define special routing rules for specific nets:
  - These are called Non-Default Rules (NDRs)
  - For example, double-width or double-spacing.
- We can tell Innovus to use a certain NDR, by creating a Routing Type
  - The routing type enables us to define preferred layers and shielding.
- In Innovus, we differentiate between three types of clock nets:
  - **Top** The initial branch of the clock tree. Very wide and high.
  - Trunks The main branches of the clock tree. Wide and high.
  - Leaf The bottom levels of the clock tree. Closer to the logic.
- So, we will define NDRs and routing rules and then apply them to clock nets.

### Shielding and Non-Default Routing Rules

#### • First, define non-default routing rules (NDR):

Double width and double spacing

create\_route\_rule -name CTS\_2W2S \
 -spacing\_multiplier 2 -width\_multiplier 2

#### Then, create a routing type for CTS

create\_route\_type -name cts\_trunk -non\_default\_rule CTS\_2W2S \
 -top\_preferred\_layer M7 -bottom\_preferred\_layer M6 \
 -shield\_net VSS -bottom\_shield\_layer M6

#### Finally, apply property to *trunk* type clock nets

set\_db cts\_route\_type\_trunk cts\_trunk

### **Analyzing Clock Trees**

- Before running clock tree synthesis, analyze each clock tree in the design to determine:
  - What the clock root is.
  - What the desired clock sinks and clock tree exceptions are.
  - Whether the clock tree contains preexisting cells, such as clock gating cells.
  - Whether the clock tree converges, either with itself (a convergent clock path) or with another clock tree (an overlapping clock path).
  - Whether the clock tree has timing relationships with other clock trees in the design, such as inter-clock skew requirements.
  - What the DRV constraints are (maximum fanout, maximum transition time, and maximum capacitance).
  - What are the library cells to use for implementing the clock tree.
  - What the routing constraints (routing rules and metal layers) are.



### **Clock Tree Optimizations**

### Issue with Post CTS Interface Timing

- Before CTS, clock is ideal.
  - We define I/O constraints without thinking about the clock influence

```
set_input_delay -clock clk $IN_DLY [all_inputs]
set_output_delay -clock clk $OUT_DLY [all_outputs]
```

- But after CTS:
  - We added positive skew to the in2reg paths.
  - We added negative skew to the reg2out paths.
- Therefore:
  - Add the average insertion delay to the clock paths to the I/O ports.



## **Reducing Clock Distribution Problems**

- Use latch-based design
  - Time borrowing helps reduce impact of clock uncertainty
  - Timing analysis is more difficult
  - Rarely used in fully synthesized ASICs, but sometimes in datapaths of otherwise synthesized ASICs
- Make logical partitioning match physical partitioning
  - Limits global communication where skew is usually the worst
  - Helps break distribution problem into smaller sub-problems
- Use globally asynchronous, locally synchronous design
  - Divides design into synchronous regions which communicate through asynchronous channels
  - Requires overhead for inter-domain communication
- Use asynchronous design
  - Avoids clocks all together
  - Incurs its own forms of control overhead







<sup>©</sup> Adam Teman, 2018



### Additional Subjects: Clock Generation





### **Clock Generation**

#### Where does the clock come from?

- The easiest way to generate a clock is by using a ring oscillator or some other non-stable circuit, but these are susceptible to PVT variations.
- Therefore, clocks are generally generated <u>off-chip</u> using a crystal and an oscillation circuit.
- However, usually only <u>one off-chip clock</u> can be used (a single frequency) and the frequency that can be input to the chip is limited to around 100 MHz.
- Therefore, on-chip local clock generation is employed, usually with a PLL or DLL.



### **Local Clock Generation**

#### • Externally generated clocks suffer from two primary problems:

- Frequency is limited, i.e., a *clock multiplier* is needed.
- Clock phase is uncontrolled, such that communication with the <u>external clock</u> <u>domain</u> is unsynchronized.
   Chip 1
- To solve both of these problems, Data Digital a Phase-Locked Loop (PLL) Digital System System is used. reference If clock multiplication  $f_{system} = N \times f_{crystal}$ clock Divider PLL is not required, a PLL Clock delay-locked loop (DLL) Buffer is a more simple solution. f<sub>crystal</sub> , 200<Mhz Crystal

Oscillator

### Local Clock Generation

#### How does a PLL work?

- A phase detector (PD) produces a signal proportional to the phase difference between the input and output clocks.
- A loop filter (LF) converts the phase error into a control signal (voltage).
- A voltage-controlled oscillator (VCO), creates a new clock signal based on the error signal.
- What about a DLL?
  - Same principle, but instead of <u>changing</u> <u>the frequency</u>, it just delays the clock until the <u>phase is equal</u>.





# Additional Subjects: Clock Domain Crossing





# Clock Domain Crossing (CDC)

- Most system-on-chips have several components that communicate with <u>different external interfaces</u> and run on <u>different clock frequencies</u>.
  - In other words, they have multiple asynchronous clock domains.
- Asynchronous clocks <u>cannot communicate with each other</u> in a straightforward fashion:



# Problems with CDC

The main problems with passing data between asynchronous domains are:

- Metastability:
  - A setup/hold violation in the <u>capture register</u>. May cause:
    - High propagation delay at the fanout.
    - High current flow in the chip (even burnout).
    - Different values of the signal at different parts of the fanout.

#### Data Loss:

50

 New data in the source may be generated w/o the data being captured by the destination.

#### • Data Incoherency:

• Data may be captured late, causing several coherent signals to be in different states.

What is the probability of metastability?

Let us define:

- $T_w$  an error window (setup+hold) around the clock cycle
- f clock frequency
- $f_{\rm D}$  Frequency of data change

Now assume that a data (D) change can come anywhere in the clock cycle relative to the capture clock, so:

 $Rate(meta) = f \cdot f_{\rm D} \cdot T_{\rm w}$ 

Assume:

 $f = 1 \text{GHz} \quad f_{\text{D}} = 0.1 f \quad T_{w} = 20 \text{ ps}$  $Rate(\text{meta}) = \frac{2 \cdot 10^6}{\text{sec}}$ 

© Adam Teman, 2018

### Solutions: Synchronizers

- By cascading two or more flip flops, we create a simple synchronizer.
  - The signal has one (or more) clock cycles to stabilize.
  - However, there is a probability that the signal will not settle within the cycle time (T).



# What is the probability of failure?

The probability for metastability to pass is:

 $P(t>S) = e^{-S/\tau}$ 

with  $\tau$  a parameter of the flip flop

We want the metastability to dissipate at least  $t_{setup}$  before the next clock edge, so:

$$P(\text{failure}) = P(\text{meta}) \cdot P(\text{exit}) = \frac{T_w}{T} \cdot e^{-\frac{T - t_{\text{setup}}}{\tau}}$$

And the mean time between failures (MTBF) is the inverse of the failure rate:

$$MTBF = \frac{1}{Rate(failure)} = \frac{1}{f \cdot f_D \cdot T_w} e^{-\frac{T - t_{setup}}{\tau}}$$

For our previous example, this is about 10<sup>24</sup> years...

© Adam Teman, 2018

### Are synchronizers enough?

#### • No!

- We may have taken care of metastability, but data loss and data incoherence are still there.
- So we need to design our logic accordingly.
- To eliminate data loss:
  - Slow to fast clock we won't lose any data.
  - Fast to slow clock hold source data for several cycles.
- But for data coherence, we need more thinking:
  - Handshake protocols.
  - First-in First-out (FIFO) interfaces
  - Other solutions (Gray code, Multiplexers, etc.)



### Main References

- Berkeley EE141
- Rabaey "Digital Integrated Circuits"
- Synopsys University Courseware
- IDESA
- Gil Rahav
- Dennis Sylvester, UMICH
- MIT 6.375 Complex Digital Systems
- Horowitz, Stanford
- Ginosar, "Metastability and Synchronizers: A Tutorial"