## COMMUNICATION RELIABILITY IN NETWORK ON CHIP DESIGNS

A Thesis

by

## REESHAV KUMAR

### Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree of

## MASTER OF SCIENCE

August 2011

Major Subject: Electrical Engineering

Approved by:

| Chair of Committee, | Seong Gwan Choi       |
|---------------------|-----------------------|
| Committee Members,  | Jiang Hu              |
|                     | Jose Silva Martinez   |
|                     | Eun Jung Kim          |
| Head of Department, | Costas N. Georghiades |

#### ABSTRACT

Communication Reliability in Network on Chip Designs. (August 2011)

Reeshav Kumar, B.E., Birla Institute of Technology & Science, India

Chair of Advisory Committee: Dr. Seong Gwan Choi

The performance of low-latency Network on Chip (NoC) architectures, which incorporate fast bypass paths to reduce communication latency, is limited by crosstalk-induced skewing of signal transitions on link wires. As a result of crosstalk interactions between wires, signal transitions belonging to the same flit or bit vector arrive at the destination at different times and are likely to violate the setup and hold time constraints of the design. This thesis proposes a two-step technique, TransSync-RecSync, to dynamically eliminate packet errors resulting from inter-bit-line transition skew. The proposed approach adds minimally to router complexity and involves no wire overhead. The actual throughput of NoC designs with asynchronous bypass is evaluated, and the benefits of augmenting such schemes with the proposed design are studied. The TransSync, TransSync-2-lines and RecSync schemes described here are found to improve the average communication latency by 26%, 20% and 38%, respectively, in a 7×7 mesh NoC with an asynchronous bypass channel.

This work also evaluates the bit-error ratio (BER) performance of several existing crosstalk avoidance and error correcting schemes and compares it to that of the proposed schemes. Both the TransSync and RecSync schemes are dynamic in nature and can be switched on and off on-the-fly. The proposed schemes can therefore be employed to impart unequal error protection (UEP) against intra-flit skewing on NoC links. In UEP, a larger fraction of the energy budget is spent in protecting those parts of the data being transmitted on the link which have a higher priority, while less effort is expended in protecting relatively less important parts of the data. This allows us to achieve the prescribed level of performance at lower power. The benefits of the presented technique are illustrated using an H.264 video decoder system-on-chip (SoC) employing a NoC architecture. We show that for Akiyo test streams transmitted over 3mm long link wires, the power consumption can be reduced by as much as 20% with UEP, at the cost of an acceptable degradation in average peak signal to noise ratio (PSNR).

To My Parents

#### ACKNOWLEDGMENTS

It is a pleasure to thank the many people who made this thesis possible.

I owe my deepest gratitude to my advisor, Dr. Seong Gwan Choi, for his support and direction over the last two years of my master's study. Without his continued encouragement and sound advice, this work would not have been possible.

I would also like to thank Drs. Hu, Silva and Kim for serving on my committee and for being excellent teachers.

I am indebted to many of my student colleagues for providing a stimulating and fun environment in which I had the opportunity to learn and grow. I am especially grateful to Yoon Seok Yang for teaching me the basics of Networks on Chip. I really enjoyed working with him.

A number of colleagues and friends have made my stay in College Station memorable. I would like to thank Ayush, Ravindra, Akshay, Anurag, Sapan, Dibakar and Hrishikesh for their constant support and affection.

Lastly, and most importantly, I wouldn't have been able to compile this work without the unconditional love and support of my family.

# TABLE OF CONTENTS

| CHAPTER | Title |
|---------|-------|
| I   | INTRODUCTION |
| II  | INTRA FLIT SKEW |
|     | A. The Problem of Intra-Flit Skew |
|     | B. Crosstalk Coupling on Link Wires |
|     | C. Combating Crosstalk |
|     | 1. Physical Techniques to Combat Crosstalk |
|     | 2. Crosstalk Avoidance Codes |
|     | 3. Variable Cycle Transmission and Re-transmission |
|     | 4. Error Correcting Codes |
|     | 5. TransSync |
|     | 6. TransSync 2 Lines |
|     | 7. RecSync |
|     | D. Results and Discussion |
|     | 1. Simulation Environment |
|     | 2. Performance Analysis |
|     | 3. Overhead Analysis |
|     | 4. System Level Results |
| III | INTERCONNECT BER IN THE PRESENCE OF CROSSTALK AND VARIATIONS |
|     | A. Sources of Variation |
|     | 1. Device Variations |
|     | a. Film Thickness |
|     | b. Lateral Device Dimensions |
|     | c. Doping |
|     | 2. Interconnect Variations |
|     | a. Line Width (W) and Space (S) |
|     | b. Metal Thickness (T) |
|     | c. ILD Thickness (H) |
|     | d. Contact Via Size and Resistance |
|     | e. Metal Resistivity |
|     | 3. Environmental Variations |
|     | a. Supply Voltage |
|     | b. Temperature |
|     | B. BER Analysis |
|     | 1. Crosstalk Failure Mechanism |
|     | a. Timing Failure |
|     | b. Functional Failure |
|     | 2. Average BER |
|     | C. Results and Discussion |
|     | 1. Simulation Environment |
|     | 2. Schemes Studied |
|     | 3. Results |
| IV  | UNEQUAL ERROR PROTECTION (UEP) ON LINK |
|     | A. UEP for H.264/AVC Video Streams on NoC Links |
|     | B. Results and Discussion |
|     | 1. Performance Metric |
|     | 2. Simulation Environment |
|     | 3. PSNR and Merit |
|     | 4. UEP Results |
| V   | CONCLUSION |
|     | REFERENCES |
|     | VITA |

# LIST OF FIGURES

| FIGURE | Caption |
|--------|---------|
| 1  | ITRS 2009 projections for future SOC networking driver MC/AE platform performance. |
| 2  | GALS topology with bypassing of flits on hops between source and destination. |
| 3  | The problem of intra-flit skew on links. |
| 4  | Plot of transmission delay and intra-flit skew vs. wire length for interconnects that have repeaters driving optimal length wire segments. |
| 5  | Plot of intra-flit skew vs. wire length for different driving repeater strengths when repeaters are present every 1mm length of wire segments. |
| 6  | Wire model for flits that are bypassed asynchronously over intervening hops. |
| 7  | Timing diagram showing the operation of TransSync module. |
| 8  | Circuit schematic of TransSync scheme. |
| 9  | Circuit schematic of TransSync-2-lines scheme. |
| 10 | Timing diagram for the operation of RecSync scheme. |
| 11 | Circuit schematic of RecSync logic. |
| 12 | Signal diagram explaining the working of RecSync logic. |
| 13 | Comparison of $L_{max}$ and transmission delay for the different schemes studied. |
| 14 | Comparison of area and energy overhead of the different schemes studied. |
| 15 | Plot of transmission latency vs. number of hops traveled by flits for different schemes. |
| 16 | Interconnect structure. |
| 17 | Three segment wire model for crosstalk coupling. |
| 18 | Energy and delay for different states for 2mm long interconnect wires. |
| 19 | Sensitivity of delay and glitch amplitude to individual parameters on repeater-inserted links. |
| 20 | Probability distribution function (PDF) of timing window and delay of different delay states for 1mm long interconnect wires. |
| 21 | Contribution of timing errors to BER vs. link length. |
| 22 | BER vs. link length for different designs. |
| 23 | Power overhead for different schemes. |
| 24 | Area overhead for different schemes. |
| 25 | H.264 data partition and prioritization. |
| 26 | Simulation setup for evaluating the quality of reconstructed frames with different protection schemes on links. |
| 27 | PSNR results for Akiyo test streams with different protection schemes on 2mm and 3mm long links. |
| 28 | PSNR results for Foreman test streams with different protection schemes on 2mm and 3mm long links. |
| 29 | PSNR results for Mobile test streams with different protection schemes on 2mm and 3mm long links. |
| 30 | Merit of different protection schemes on 2mm long link wires. |
| 31 | Merit of different protection schemes on 3mm long link wires. |
| 32 | Results for 2mm long link wires for the UEP schemes analyzed. |
| 33 | Results for 3mm long link wires for the UEP schemes analyzed. |

# LIST OF TABLES

| TABLE | Caption |
|-------|---------|
| I     | Interconnect states. |
| II    | Output of TransSync Module for different crosstalk interaction cases. |
| III   | Output of TransSync-2-lines logic. |
| IV    | Interconnect parameters used in the simulations. |
| V     | Number of hops that a flit can travel without requiring synchronization, latency overhead and latency formula for different schemes for 'n' hops of travel. |
| VI    | Interconnect states. |
| VII   | Nominal values and variations for circuit parameters. |
| VIII  | CCD analysis points. |
| IX    | Distribution of priority parts in test video streams. |
| X     | Protection schemes employed on link with data partitions for different UEP configurations studied. |

#### CHAPTER I

#### INTRODUCTION

Advances in technology have sustained growth in the performance of integrated circuits, in keeping with the predictions of Moore's law, over the last three decades. Increases in integration density have long supported higher microprocessor performance through higher clock rates and micro-architectural innovations. However, the dramatic increase in power density with device scaling in Deep Sub-Micron (DSM) technologies has made it infeasible to scale operating frequencies any further. The fundamental limits on power dissipation in circuits, the diminishing returns from the super-scalar execution model exploiting instruction-level parallelism, and the increasing gate density and cost of wires in advanced integrated circuit technologies require that we look for new ways to harness their capabilities effectively.

Many-core and multi-core micro-architectures constructed from several simpler processing elements (PEs) have been touted as the way forward to meet ever-increasing performance needs at manageable power levels and affordable cost. As a result, the development of complex Multi-Processor Systems-on-Chip (MPSoCs) and Chip Multi-Processors (CMPs) has been receiving increasing attention [1], [2], [3]. Fig. 1 shows the International Technology Roadmap for Semiconductors (ITRS) 2009 [4] projections for the anticipated growth over time in the performance and number of cores for the System on Chip (SoC) Networking Driver targeting the mid-range switching/routing workload segment of the embedded networking space, with a 30W power envelope. As per these projections, the number of cores is expected to increase by $1.4\times$ every year for a constant die area. This implies that the performance of a future 22nm system (with 80+ cores) would be more than $20\times$ the performance of an 8-core implementation at 45nm in 2009.

Fig. 1. ITRS 2009 projections for future SOC networking driver MC/AE platform performance.

The journal model is IEEE Transactions on Automatic Control.
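As a rough consistency check on the core-count projection quoted above, the growth can be written out explicitly. This is only a back-of-the-envelope sketch: it assumes the $1.4\times$ annual growth compounds from the 8-core 2009 baseline, and the symbol $N(t)$ is introduced here purely for illustration.

$$N(t) = 8 \times 1.4^{\,t-2009}, \qquad N(t) \ge 80 \iff t - 2009 \ge \frac{\ln 10}{\ln 1.4} \approx 6.8$$

Under this assumption the 80-core mark is reached roughly seven years after 2009, since $8 \times 1.4^{7} \approx 84$.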

Present-day CMPs and MPSoCs encompass several intellectual property (IP) blocks and customized processing units (PUs). Traditional multi-drop buses do not scale sufficiently to meet the bandwidth demands of such many-core systems [5]. Network on chip (NoC) designs offer large communication bandwidth and are therefore quickly replacing traditional bus-based and ad-hoc on-chip interconnect implementations to become the design paradigm for CMPs/SoCs/MPSoCs [6] [7] [8]. NoCs comprise several nodes, and each node can have one or more processing elements (PEs). These nodes, which may be placed several millimeters apart on the die, are connected together by regular point-to-point links. Each node has a routing element that enables it to communicate with other nodes. The NoC infrastructure thus comprises the physical layer, the data link layer and the network layer of the OSI protocol stack, where the router plays the key role in facilitating on-chip communication between nodes [5, 9, 10]. The regular structure of NoCs makes them energy efficient and greatly simplifies the design of modular and reliable high-performance systems. A comparison of the pros and cons of several popular NoC topologies can be found in [11][12].

In DSM technology nodes, with reduced voltage swings, increasing clock frequencies and large propagation delays, signal transitions on interconnect wires have become increasingly susceptible to crosstalk-induced performance degradation. Thus the integrity of signals transmitted not just on global wires but also on intermediate and local wires needs careful consideration. This is especially true of NoC designs employing globally asynchronous locally synchronous (GALS) clocking [13, 14, 15, 16], where network traversal latency is very high since flits might need to be synchronized at every hop between the source and the destination.

Several recent works have dwelt on improving network performance for GALS NoC designs by allowing flits to be asynchronously bypassed at intervening hops between the source and destination by incorporating some specialized clock routing [17, 18, 19, 20]. The presence of long distance traffic in 2-D mesh networks where a sizable fraction of the total traffic travels distances well over the mean topology hop distance for TRIPS [22], RAW [23] and cache coherent CMP applications [24] was established in [21].

In NoC designs employing fast bypass paths, sizable fractions of the traffic would therefore move over several hops, often even the entire network dimension, on these bypass paths without undergoing synchronization at any of the intervening hops. There may be a considerable amount of skew between the bits of the same flit or bit vector asynchronously bypassed over several hops in such low-latency NoC designs. We refer to the uncertainty in the delay experienced by different signal transitions of a bit vector on link wires, and the consequent separation in their arrival times at the receiving node, as the problem of "intra-flit skew". Since GALS designs typically employ source synchronous communication strategies [25, 26, 27, 28] at clock boundaries, intra-flit skew can lead to timing and functional failures from the violation of setup and hold constraints, causing a large fraction of the received packets to be erroneous.

Signal errors that contribute to intra-flit skew can be classified into permanent, intermittent or temporary depending on their duration. Immutable malfunctions caused by permanent physical changes like manufacturing defects, aging or electromigration lead to permanent errors. Long-duration errors occurring in the same position, that are not permanent errors, are known as intermittent errors. Transient or soft errors result in short duration signal integrity loss and reliability degradation. Soft errors on link wires are caused by crosstalk coupling, external radiation, power supply fluctuations, process variations and temperature variations, electromagnetic interference and combinations of these sources. The aforementioned sources can affect signal integrity by inducing voltage glitches in the system and by causing transition delay uncertainty. Voltage glitches that exceed the noise margin and last for sufficient duration can lead to logical errors. Delay uncertainty is the uncertainty in the timing of a signal transition arrival at a node and leads to reduction in the frequency at which the system can operate to guarantee error-free operation.

Although it may not be possible to quantify the exact contribution of each of the above-mentioned sources to signal integrity loss due to soft errors, crosstalk noise may be regarded as the most critical factor affecting the reliability of on-chip interconnect in large NoC/SoC designs. Fault tolerance schemes such as hardware, time and information redundancy, which have traditionally been employed to mitigate permanent and intermittent errors, are often inadequate to deal with crosstalk-induced soft errors. This work focuses on the impact of crosstalk on intra-flit skew and presents novel solutions, with minimal energy and area overhead, to improve signal integrity on links for NoC designs employing long bypass paths for low-latency communication. The proposed "TransSync" scheme tries to preemptively minimize intra-flit skew by intelligently skewing signal transitions at the transmitting node on-the-fly so that they become aligned at the receiving node after experiencing intra-flit skew on the links. The "RecSync" scheme tries to eliminate the intra-flit skew accrued on link wires at the receiving node by forcing all the transitions to become aligned before they are relayed to the receiving buffers. The proposed schemes also give designers the liberty of dynamically implementing varying levels of protection against intra-flit-skew-induced errors on links for data of different priority. Considerable energy savings can be achieved at the cost of a relatively small loss of system performance by using unequal error protection (UEP), where less energy is expended in protecting less important portions of the data.

The remainder of this thesis is organized as follows. Chapter II formulates the problem of intra-flit skew and explores the benefits of employing the TransSync and RecSync schemes on low-latency NoC designs. Chapter III presents a formal analysis of the bit error ratio (BER) performance of TransSync-RecSync and compares it to that of a host of crosstalk prevention and error correcting schemes. Chapter IV introduces the idea of unequal error protection (UEP) on links against crosstalk-induced errors and evaluates the merit of UEP with TransSync and RecSync on an MPSoC video decoder design using a suite of H.264/AVC test video streams.

#### CHAPTER II

#### INTRA FLIT SKEW

In this chapter, we examine the problem of intra-flit skewing due to crosstalk on link wires and evaluate solutions to it. The performance of many-core NoCs, which have abundant processing resources, is limited by the latency of communication between PEs. The globally asynchronous locally synchronous (GALS) clocking strategy [13, 14, 15, 16] was proposed because synchronizing an entire design to a single clock source limits the development of large many-core designs [6]. In a GALS NoC, each node may be running off a different clock source, and therefore data flits need to be synchronized to the local clock domain of each intervening hop on their journey between the source and destination nodes. The latency overhead of synchronization at intervening hops in inter-tile communication can account for a major portion of the communication latency, exacerbating the already substantial packet latency of larger NoC designs [29]. Communication latency thus seriously limits overall network performance at a time when more and more applications are communication limited.

Several previous works have tried to reduce communication latency in NoCs by reducing the global hop count [30] [31] [32] [33]. A popular approach explored by recent NoC research tries to minimize communication latency by incorporating fast paths employing asynchronous bypass between nodes, thereby reducing the per-hop latency for traffic traveling on these fast paths. Fig. 2 shows an instance in a NoC with an asynchronous bypass path where a flit bypasses four hops on its way from source to destination.

A regular k-ary n-cube interconnection network augmented with express channels, which allow non-local messages to bypass nodes and thereby improve network latency and throughput, was proposed by Dally in [17]. Ogras et al. explore the possibility of improving network latency and throughput by maximizing the value of the critical traffic workload in regular 2-D mesh NoCs via insertion of application-specific long-range links [18]. Krishna et al. in [19] propose express virtual channels (EVCs), which allow packets to virtually bypass intermediate routers along their path in a completely non-speculative fashion, thereby lowering the per-hop delay. Jain et al. propose an asynchronous bypass channel (ABC) router design for GALS NoCs that targets both hop count and per-hop latency to reduce network latency [20]. The improved latency and throughput figures in the above-mentioned designs rely on flits being bypassed across intervening hops without suffering any synchronization latency.

Fig. 2. GALS topology with bypassing of flits on hops between source and destination.

#### A. The Problem of Intra-Flit Skew

Not all the signal transitions of the bit vector being transmitted on the link arrive at the destination at the same time. As shown in Fig. 3, all the signal transitions are perfectly aligned when they are transmitted by the source node at time $t_s$, but different bits on the link experience different amounts of skew and arrive at the destination out of synchronization. We refer to this phenomenon as "intra-flit skew". The skewing of data transitions on the bus has to be accounted for in any source synchronous communication scheme with asynchronous bypass that assumes that the signal transitions for bits belonging to the same flit are always synchronized. The problem of intra-flit skew necessitates periodic synchronization of the bit vector on the link, thereby limiting the expected improvement in latency and throughput from asynchronous bypass in GALS NoCs.

Fig. 3. The problem of intra-flit skew on links.

Fig. 4 and Fig. 5 plot the variations of intra-flit skew ( $\delta$ ) versus the length of link for interconnect wires that have been optimized for minimum propagation delay by insertion of optimal sized repeaters and for different driving repeater sizes respectively. From Fig. 4 and Fig. 5, it is evident that the value of maximum skew increases with wire length even with insertion of repeater between optimal length interconnects. The situation is much worse for NoCs since it is not feasible to insert repeaters between optimal length bus wires. In NoCs designs employing special clocking schemes and



Fig. 4. Plot of transmission delay and intra-flit skew vs. wire length for interconnects that have repeaters driving optimal length wire segments.

asynchronous bypass, where the separation between PEs is of the order of millimeters, the skew accumulating between bit lines for flits traveling several hops may be a considerable fraction of the cycle time itself. . Driving buffers may be inserted along the length of wire segments between nodes. As shown in Fig. 6, all the bits in a flit are transmitted by the source at the same time  $t_s$ . Different bits suffer different traversal delay on bus and arrive at the destination at different time. The flit becomes available at the destination at time  $t_d$  only after the last bit arrives. The transition skew and transmission delay accruing on wire segments between the PEs are further aggravated by combination delay incurred by the control and switching logic (multiplexers, etc) on each of the intervening hops, as shown in Fig. 6, which may be different for different bits of a flit. To prevent packet errors, flits which are supposed to be bypassed asynchronously at intervening hops would therefore need synchronization every time they travel a certain number of hops to eliminate the accumulated transition skew. Thus in practice, it is not possible to realize the figures of average network traversal latency expected with asynchronous bypass channels in NoC. Even with synchronization at regular interval, it might not be possible to



Fig. 5. Plot of intra-flit skew vs. wire length for with different driving repeater strengths when repeaters are present every 1mm length of wire segments.

completely eliminate packet errors resulting from skew between bit-lines. Some factors contributing to intra-flit skew on link wires are: crosstalk coupling between link wires, design irregularities/constraints in physical routing implementation, timing variations at switching and multiplexing logic, variations in process parameters, power supply fluctuation and noise, temperature variations, physical changes like electromigration, and external radiation. While it may be difficult to quantify the exact contribution of each factor listed above to intra-flit skew, crosstalk coupling between adjacent wires can safely be regarded as the most significant contributor. Different bit transitions on links suffer different crosstalk interactions and therefore experience variable traversal delays, arriving at the destination at different times. In this work, we therefore focus on crosstalk to deal with the problem of intra-flit skew on links in NoCs with long asynchronous bypass paths.



Fig. 6. Wire model for flits that are bypassed asynchronously over intervening hops.

#### B. Crosstalk Coupling on Link Wires

Crosstalk noise is the result of unavoidable coupling capacitance between adjacent wires. With technology scaling, wire width and spacing scale at a higher rate than wire thickness to achieve higher integration. Increased aspect ratios in DSM have led to increased coupling capacitance between neighboring bit-lines, such that coupling capacitances account for more than half of the total wire capacitance. Crosstalk coupling effects can induce significant voltage glitches on a victim line, potentially inducing logic errors. Crosstalk-induced delay uncertainty is caused by the dependence of coupling capacitance on signal switching patterns. Besides degrading the reliability of on-chip interconnect, crosstalk coupling also leads to increased link power consumption. A first order coupling capacitance model, shown in Fig. 6, that takes into account the switching activity of the line and only its immediately adjacent neighboring lines has traditionally been used to model the crosstalk between bit-lines. In this model, the effective capacitance of a switching line varies with the switching activity of the neighboring wires in accordance with the Miller coupling factor (MCF). Based on this understanding, interconnects can be broadly classified into nine distinct states [34] that differ from each other primarily in terms of

| Switching pattern | Total wire capacitance | Occurrence probability |
|-------------------|------------------------|------------------------|
| $(\uparrow,\uparrow,\uparrow),(\downarrow,\downarrow,\downarrow)$ | $0C_C + C_g$ | $\frac{1}{32}$ |
| $(-,\uparrow,\uparrow),(-,\downarrow,\downarrow),(\uparrow,\uparrow,-),(\downarrow,\downarrow,-)$ | $1C_C + C_g$ | $\frac{1}{8}$ |
| $(\uparrow,\downarrow,\downarrow),(\downarrow,\uparrow,\uparrow),(\uparrow,\uparrow,\downarrow),(\downarrow,\downarrow,\uparrow),(-,\uparrow,-),(-,\downarrow,-)$ | $2C_C + C_g$ | $\frac{3}{16}$ |
| $(\downarrow,\uparrow,-),(\uparrow,\downarrow,-),(-,\uparrow,\downarrow),(-,\downarrow,\uparrow)$ | $3C_C + C_g$ | $\frac{1}{8}$ |
| $(\uparrow,\downarrow,\uparrow),(\downarrow,\uparrow,\downarrow)$ | $4C_C + C_g$ | $\frac{1}{32}$ |
| $(-,-,-)$ | | $\frac{1}{16}$ |
| $(\uparrow,-,\downarrow),(\downarrow,-,\uparrow)$ | $0C_C + 0C_g$ | $\frac{1}{8}$ |
| $(-,-,\uparrow),(-,-,\downarrow),(\uparrow,-,-),(\downarrow,-,-)$ | $1C_C + 0C_g$ | $\frac{1}{4}$ |
| $(\uparrow,-,\uparrow),(\downarrow,-,\downarrow)$ | $2C_C + 0C_g$ | $\frac{1}{16}$ |

Table I. Interconnect states.

the total line capacitance that a signal transition on a wire experiences, depending on the switching activity on the wire and its nearest neighbors. The different crosstalk interactions between the wires of an on-chip bus are listed in Table I. In this table, $\uparrow$, $\downarrow$ and $-$ stand for a $0\rightarrow 1$ transition, a $1\rightarrow 0$ transition and no transition on a wire, respectively. The total line capacitance seen by a signal transition has two components: $C_g$, the capacitance of the line with respect to ground, and $C_C$, which is due to crosstalk coupling with the adjacent wires and whose magnitude varies in accordance with the MCF. Depending on the switching behavior of a wire and its neighbors, the effective capacitance of a wire can range from $C_g$ to $C_g + 4C_C$. The minimum coupling interaction (MCF = 0) is experienced by lines whose neighbors switch in the same direction as the line itself, and the maximum crosstalk coupling (MCF = 4) is suffered by lines whose neighbors both switch in the opposite direction. Therefore, signal transitions which encounter $4C_C$ coupling experience the largest wire delay, while those that encounter $0C_C$ coupling suffer the smallest wire delay.
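The classification of a switching wire by its Miller coupling factor can be sketched in a few lines of Python. This is an illustrative sketch, not part of the thesis tooling: the function names are hypothetical, and the enumeration assumes that the previous and current bits on the three wires are independent and uniformly random. Under that assumption it reproduces the occurrence probabilities listed in Table I for the five cases in which the middle wire switches.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def transition(prev_bit, cur_bit):
    """Return +1 for a 0->1 transition, -1 for 1->0 and 0 for no transition."""
    return cur_bit - prev_bit


def miller_coupling_factor(left, victim, right):
    """Multiplier of C_C seen by a switching victim wire.

    Each neighbour contributes 0 if it switches with the victim,
    1 if it is quiet and 2 if it switches against the victim.
    """
    if victim == 0:
        raise ValueError("the victim wire must be switching")
    return abs(victim - left) + abs(victim - right)


def mcf_distribution():
    """Enumerate all previous/current bit values on three adjacent wires.

    Assumption: all 64 combinations are equally likely.
    """
    counts = defaultdict(int)
    for bits in product((0, 1), repeat=6):
        prev, cur = bits[:3], bits[3:]
        left, victim, right = (transition(p, c) for p, c in zip(prev, cur))
        if victim != 0:
            counts[miller_coupling_factor(left, victim, right)] += 1
    return {mcf: Fraction(n, 64) for mcf, n in sorted(counts.items())}


if __name__ == "__main__":
    for mcf, prob in mcf_distribution().items():
        print(f"{mcf}C_C + C_g: probability {prob}")
```

The five probabilities sum to 1/2, which is the probability that the middle wire switches at all under the stated assumption.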

#### C. Combating Crosstalk

#### 1. Physical Techniques to Combat Crosstalk

Several techniques have been proposed to minimize or eliminate the impact of capacitive coupling on bus lines. Net ordering, wire segment reordering and buffer insertion techniques for the physical design of interconnect for noise and delay optimization are presented in [35, 36]. Hirose et al. in [37] propose intentional skewing of signal transition timing on adjacent wires of a repeater-inserted bus to minimize crosstalk noise. Such static skewing of signal transitions on adjacent wires of a wide link can significantly increase packet transmission latency. Shielding of the wires and increasing the inter-wire spacing (DBS) are other options explored for reducing the impact of crosstalk by Arunachalam et al. in [38]. The active shielding method proposed by Kaul et al. places shielding wires on both sides of the target wire and drives them with the same transition direction as the target wire for fast propagation [39]. These layout-level techniques incur a significant increase in link area, which is unacceptable considering the scarcity and cost of routing resources in higher metal layers.

#### 2. Crosstalk Avoidance Codes

Victor et al. in [40] present the theory behind crosstalk-prevention codes (CPC) and the methods for generating these codewords. Pande et al. in [41] propose the use of crosstalk avoidance codes (CAC), together with a modified data packet structure that incorporates CAC schemes in the NoC data stream and reduces the number of coding-decoding operations, to address both crosstalk and energy dissipation in networks-on-chip (NoC). The design of a reliable NoC which maps data into a selected crosstalk avoidance code (SCAC) for message transmission and protects the state and control registers of routers with triple modular redundancy (TMR) to avoid large crosstalk-induced delay is presented in [42]. These encoding schemes, however, require additional wires and circuitry to implement CODECs, thereby incurring sizable area overhead and increased routing requirements.

#### 3. Variable Cycle Transmission and Re-transmission

Lin et al. in [43] propose a crosstalk aware interconnect technique for on-chip processor buses where data is transmitted at different rates depending on data pattern classification and signaling is used on the bus to indicate start and end of transmission. A similar approach put forth by Sankaran et al. [44] detects worst case crosstalk interaction and postpones transmission of current data by one cycle by resetting the wires to ensure that the impact of signal transition pattern on coupling capacitance is always 2Cc or less. These designs employing variable cycle transmission and retransmissions may not be feasible for typical NoC architectures. Also, these approaches aim only at minimizing worst case coupling interactions and not so much at maximizing the time for which data is stable every cycle.

#### 4. Error Correcting Codes

Error correction schemes like Hamming and Cyclic Redundancy Check (CRC) have traditionally been employed to deal with soft errors introduced into the system by the communication channel between the source and the destination. While the overhead involved in such traditional schemes is easily amortized in wireless communication architectures, they are too expensive to be adopted into the NoC framework. Error correction codes are able to detect and correct erroneous bits based on an information theoretic model [51]. Unified frameworks of coding schemes for systems-on-chip that combine crosstalk avoidance codes (CAC) and simple error control codes (ECC) to solve delay, power and reliability problems jointly were proposed in [46], [47], [48]. Joint error correction coding schemes using duplication with parity (DAP) and triplication error correction (TEC) with Green bus coding for crosstalk avoidance, to guard against crosstalk-induced errors, are presented in [49] and [50], respectively. Ying et al. in [42] present the design of a reliable NoC which maps data into a selected crosstalk avoidance code (SCAC) for message transmission while protecting the state and control registers of routers with triple modular redundancy (TMR) to avoid large crosstalk-induced delay. A solution to multiple logic errors and crosstalk-induced delay uncertainty in on-chip interconnects using the combination of Hamming product codes and type-II HARQ with skewed transitions was presented in [52].

Now that we have reviewed the merits and shortcomings of the existing solutions to combat crosstalk on adjacent link wires, we present the novel TransSync-RecSync solution to intra-flit skew in the following sections.

#### 5. TransSync

TransSync calculates the expected link traversal delay for each signal transition at the transmitting node before it is sent out and delays it accordingly, so that its expected time of arrival at the destination is the same as that of a transition suffering worst-case (4Cc MCF) crosstalk interaction on the link wires. Thus, signal transition skew between the bit-lines of a link is preemptively eliminated on the fly. At the sending node, the transmission of signal transitions which are likely to suffer the least crosstalk interaction on link wires (0Cc MCF) is delayed temporally so that they



Fig. 7. Timing diagram showing the operation of TransSync module.

reach the destination at the same time as transitions suffering 4Cc MCF interaction would. Signal transitions which are likely to suffer 4Cc MCF interaction on wires are not delayed at all at the sending node. This scheme of dynamically delaying signal transitions at a sending node can be implemented with a simple combinational logic block added to the routers that uses the value of the current bit-vector on the link wires and the value of the bit-vector transmitted in the previous cycle to estimate the expected wire delay for each signal transition. The TransSync scheme achieves this by employing a six-input combinational block, which takes the current and previous values on a bit-line and its immediate neighbors, to detect the type of crosstalk interaction (0Cc, 1Cc, 2Cc, 3Cc or 4Cc) for each bit of the bus. The state of each wire of the link is encoded onto the three outputs (p, q and r) of the TransSync module for the wire. Fig. 7 depicts the timing diagram of the working of the proposed TransSync design and Table II lists the output of the TransSync logic block for the different crosstalk interaction cases. A signal transition transmitted by a node at time $t_s$ is actually transmitted on the link at time $t_1$ by the TransSync module. It is to be noted that while we study the overhead of the logic to estimate the expected wire delay for every



Fig. 8. Circuit schematic of TransSync scheme.

signal of the bit-vector on link, this computation can be trivially performed by the PE at the sending nodes themselves.
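The behaviour of the six-input block can be modelled in software as follows. This is a minimal behavioural sketch, not the gate-level circuit of Fig. 8: the function and dictionary names are hypothetical, the (p, q, r) codes are copied from Table II, and returning `None` for a wire that does not switch is an assumption, since Table II does not list that case.

```python
# (p, q, r) codes per crosstalk case, copied from Table II.
PQR_BY_CASE = {
    0: (1, 1, 1),  # 0Cc: fastest transition
    1: (1, 1, 0),  # 1Cc
    2: (1, 0, 0),  # 2Cc
    3: (0, 1, 0),  # 3Cc
    4: (0, 0, 0),  # 4Cc: slowest transition, not delayed
}


def transsync_outputs(prev, cur):
    """Return the (p, q, r) code for the middle wire of three adjacent wires.

    prev and cur are (left, wire, right) bit triples holding the values
    sent in the previous and the current cycle: six inputs in total.
    """
    left, wire, right = (c - p for p, c in zip(prev, cur))
    if wire == 0:
        return None  # assumption: a quiet wire needs no delay code
    # Each neighbour adds 0 (same direction), 1 (quiet) or 2 (opposite).
    case = abs(wire - left) + abs(wire - right)
    return PQR_BY_CASE[case]


if __name__ == "__main__":
    print(transsync_outputs((0, 0, 0), (1, 1, 1)))  # all three wires rise
    print(transsync_outputs((1, 0, 1), (0, 1, 0)))  # neighbours fall, wire rises
```

In hardware the same mapping would be a fixed truth table per bit-line, so the sketch is mainly useful for checking the switching patterns against Table II.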

Once the expected transmission delay for all the signal transitions of the bit-vector on the link has been determined, their transmission needs to be delayed accordingly at the sending node. The transmission of a signal at the sending node can be delayed in a variety of ways. The transmission of the different MCF cases can be delayed by varying the current available to charge/discharge the wire capacitance, that is, by modulating the current in the multi-stage ratioed logic of the repeaters driving the link wires [59]. This is similar to the use of current-starved inverters in voltage-controlled oscillators. A simpler and more effective method of dynamically introducing the required delays is to make all signal transitions charge/discharge shunt capacitors whose capacitance values are controlled by the "p, q, r" outputs of the TransSync logic before the signals are fed into the repeaters driving the link wires. We use this method of delaying transmission by modulating load capacitances in this work. The number and values of the capacitors employed for this purpose determine the accuracy with which de-skewing can be done. Since the $C_C$ and $C_g$ associated with individual bit-lines of links between hops change with process parameters and temperature, it may not be possible to obtain a precise estimate of their values during different periods of operation of the NoC. However, a rough estimate of these capacitances, obtained during the design stage of the NoC, is sufficient to limit the intra-flit skew to an acceptable bound. Fig. 8 depicts the schematic of the circuit implementing the proposed TransSync scheme.

| Switching pattern | Crosstalk case | p | q | r |
|-------------------|----------------|---|---|---|
| $(\uparrow,\uparrow,\uparrow),(\downarrow,\downarrow,\downarrow)$ | 0Cc | 1 | 1 | 1 |
| $(-,\uparrow,\uparrow),(-,\downarrow,\downarrow),(\uparrow,\uparrow,-),(\downarrow,\downarrow,-)$ | 1Cc | 1 | 1 | 0 |
| $(-,\uparrow,-),(-,\downarrow,-)$ | 2Cc | 1 | 0 | 0 |
| $(\downarrow,\downarrow,\uparrow),(\downarrow,\uparrow,\uparrow),(\uparrow,\downarrow,\downarrow),(\uparrow,\uparrow,\downarrow)$ | 2Cc | 1 | 0 | 0 |
| $(-,\downarrow,\uparrow),(-,\uparrow,\downarrow),(\downarrow,\uparrow,-),(\uparrow,\downarrow,-)$ | 3Cc | 0 | 1 | 0 |
| $(\downarrow,\uparrow,\downarrow),(\uparrow,\downarrow,\uparrow)$ | 4Cc | 0 | 0 | 0 |

Table II. Output of TransSync Module for different crosstalk interaction cases.

The header flit can be used to determine the number of hops a particular packet is likely to bypass asynchronously on its way to the destination. To implement the TransSync scheme, every router only needs to incorporate the logic to enable TransSync during the transmission of only those flits which stand to gain from TransSync, i.e. only those flits which are likely to travel distances greater than those for which the skew between bit-lines remains within acceptable bounds.



Fig. 9. Circuit schematic of TransSync-2-lines scheme.

#### 6. TransSync 2 Lines

The overhead of the TransSync design can be considerably reduced by taking into account the crosstalk coupling between only two adjacent wires to predict the expected transmission delay for a signal transition on a wire. This reduces the number of inputs to the TransSync logic from six to four and the number of outputs from three to two. Table III lists the output of the modified TransSync scheme, "TransSync-2-lines", for different crosstalk interaction cases and Fig. 9 depicts the schematic of the modified TransSync-2-lines design.

Table III. Output of TransSync-2-lines logic.

| Switching pattern                                         | Crosstalk case (MCF) | p | q |
|-----------------------------------------------------------|----------------------|---|---|
| $(\uparrow,\uparrow),(\downarrow,\downarrow)$             | 0Cc                  | 1 | 1 |
| $(-,\uparrow),(-,\downarrow),(\uparrow,-),(\downarrow,-)$ | 1Cc                  | 1 | 0 |
| $(\uparrow,\downarrow),(\downarrow,\uparrow)$             | 2Cc                  | 0 | 0 |

With both TransSync and TransSync-2-lines schemes, the header flit can be used to determine the number of hops a particular packet is likely to bypass asynchronously on its way to the destination. To implement either scheme, every router only needs to incorporate the logic to enable it during the transmission of those flits which stand to gain from it, i.e. those flits which are likely to travel distances greater than those for which the skew between bit-lines remains within acceptable bounds.

#### 7. RecSync



Fig. 10. Timing diagram for the operation of RecSync scheme.

The RecSync module eliminates the intra-flit skew accumulating between link wires at the receiving node by waiting for the last signal transition to arrive before relaying the flits further. The fastest arriving signal transition on the link wires (0Cc MCF case) triggers the RecSync module to enter a "wait" mode. The wait mode of the RecSync module precludes the destination node from receiving the incoming data. The RecSync module is allowed to wake up from the wait mode only after the slowest arriving signal transition, suffering 4Cc MCF crosstalk interaction, has arrived. Once the RecSync module is out of the wait state, it is guaranteed that all the signal transitions have already reached the destination, and the receiving node is allowed to sample the incoming data.



Fig. 11. Circuit schematic of RecSync logic.

Fig. 10 explains the operation of the RecSync scheme. A flit transmitted at time $t_s$ arrives at the RecSync block at the destination hop at $t_2$ and, after the required processing, is sampled by the destination node only at time $t_d$, when the control signal to latch the incoming data is issued by the RecSync module. Fig. 11 shows a circuit implementing the RecSync scheme while Fig. 12 depicts the timing diagram for the operation of RecSync. The signal arriving on every bit-line is passed through a delay element that delays the incoming signal by a preprogrammed interval of time $t_D$. The logical EXCLUSIVE OR of the current value on the bit-line and its delayed value from the delay element is generated. A logical high on the EXCLUSIVE OR output indicates a transition on the bit-line, and the width of the EXCLUSIVE OR output pulse is proportional to the delay introduced by the delay element. If the delay $t_D$ provided by the delay element on the bit-lines (and consequently the width of the pulse at the EXCLUSIVE OR outputs of all bit-lines undergoing a transition) is greater than the temporal separation between signal transitions suffering 0Cc MCF crosstalk interaction and those suffering 4Cc MCF interaction on link wires, then the separation between the rising edges of the EXCLUSIVE OR output pulses for the



Fig. 12. Signal diagram explaining the working of RecSync logic.

two cases will be less than the width of these pulses. As shown in Fig. 12, if the signal transition on wire-p suffers 0Cc coupling while that on wire-q suffers 4Cc crosstalk coupling, and if the delay element is designed such that $t_D$ is greater than the separation between the arrival times of the signals on wire-p and wire-q for a given wire length, then the logical NOR of the corresponding EXCLUSIVE OR outputs is low until all signal transitions have arrived. If, during every cycle, the destination node is allowed to sample the link wires for incoming data only when the NOR output is high, intra-flit skew induced errors can be avoided. This NOR signal can in fact be used as the clock to latch the incoming data.
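The timing condition described above can be checked with a small behavioural model. This is a sketch under simplifying assumptions, not the circuit of Fig. 11: gate delays are ignored, each EXCLUSIVE OR pulse is taken to last exactly $t_D$, and the function name and the arrival times in the usage example are hypothetical.

```python
def recsync_window(arrival_times_ps, t_delay_ps):
    """Return (start, end) of the interval during which the NOR output is low.

    Each switching bit-line produces an XOR pulse that is high from its
    arrival time until t_delay_ps later. The NOR of all XOR outputs is low
    while any pulse is high. Requiring t_delay_ps to exceed the skew between
    the earliest and latest arrivals ensures every pulse overlaps the first
    one, so the NOR output stays low until after the last transition.
    """
    if not arrival_times_ps:
        raise ValueError("at least one bit-line must switch")
    first = min(arrival_times_ps)
    last = max(arrival_times_ps)
    if t_delay_ps <= last - first:
        raise ValueError("t_D must exceed the worst-case intra-flit skew")
    return first, last + t_delay_ps


if __name__ == "__main__":
    # Hypothetical arrival times for 0Cc, 2Cc and 4Cc transitions, in ps.
    start, end = recsync_window([100.0, 130.0, 180.0], t_delay_ps=100.0)
    print(f"NOR low from {start} ps to {end} ps; data may be latched after that")
```

The model also makes the cost visible: the data can only be latched $t_D$ after the slowest transition, which is the de-skewing latency of the scheme.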

Unlike TransSync, the RecSync scheme is also capable of handling transition skew on links resulting from sources other than crosstalk coupling, such as the wire length difference between bit-lines due to routing and other constraints. The RecSync scheme need be applied only when the number of hops over which a flit has bypassed latching into FIFO queues is more than that for which the intra-flit skew on link wires is within an acceptable bound. As with TransSync, the number of hops that the arriving data has been bypassed over can be inferred from the header flit.

The delay of the delay element in Fig. 11 can be programmed depending on the tolerance required and the number of hops over which flits are to be bypassed asynchronously without requiring synchronization. The greater the delay of the delay element, the greater the skew between 0Cc and 4Cc interaction cases that can be eliminated by the RecSync module, and hence the greater the number of hops over which the flits can be bypassed asynchronously. Increasing the delay of the delay element, however, increases the de-skewing latency of the RecSync module. To minimize the delay due to the wide NOR gate in the RecSync design, multi-stage logic implementing the NOR function can be used instead. Also, for very wide links, the RecSync scheme can be applied to overlapping subsets of wires to ensure that the delay of the wide NOR gate in the design does not become prohibitive.

#### D. Results and Discussion

#### 1. Simulation Environment

To obtain the timing information about signal transition skew using the first order wire model, transient analysis was carried out using Spectre in Cadence. The interconnect parameters for 65nm technology from the Berkeley Predictive Technology Model (PTM) [45] used in the simulations are listed in Table IV. While the interconnect parameters (R, $C_C$ and $C_g$) used in the simulations are for the PTM 65nm process, the PTM 45nm bulk CMOS model [45] was used to construct static CMOS gates. All the gates used in the simulations have the minimum feature size as their transistor channel length, while the widths of the pull-up and pull-down transistors in the static gates were chosen for equal rise and fall propagation delays. A power supply of 1.1V was used and simulations were performed for intermediate and global wires. A positive edge triggered system with a clock period ($T_{clk}$) of 1ns was simulated. The repeaters used for driving the link wires were taken to be six times the minimum sized inverter in this technology. Transmission delay for a signal transition was measured as the time between the points at which the signal reaches 50% of its final value at the sending and receiving nodes. The transmission latency for a flit that is asynchronously bypassed over intervening hops is the sum of the wire latency, the combinational delay of the control and switching logic at the intervening nodes, and the extra synchronization latency.

The upper bound on intra-flit skew ' $\delta$ ' was obtained using the following formula:

$$\delta \le T_{clk} - (T_{setup} + T_{hold} + T_{Uncertainty}) \tag{2.1}$$

$T_{setup}$ and $T_{hold}$ in Eqn. (2.1) are the setup and hold times respectively of the synchronous elements in the design, while $T_{Uncertainty}$ accounts for the uncertainty in arrival times resulting from variations in process parameters, on-chip variation in temperature and supply voltage, and constraints in physical routing implementation, amongst other factors. Although it is difficult to precisely quantify the $T_{Uncertainty}$ component in Eqn. (2.1), it is known that this component is proportional to the wire length and consequently to the wire delay. For the purpose of simulations, this component was assumed to be 10% of the clock period. We refer to the wire length for which the intra-flit skew is equal to the bounding value in Eqn. (2.1) as $L_{max}$. This is the maximum distance a flit can travel on the fast bypass paths without needing to be synchronized. If a flit needs to travel a distance smaller than $L_{max}$, the inter-bit-line transition skew is within the tolerable limit and no synchronization is needed, whereas for distances greater than $L_{max}$ the TransSync-RecSync schemes have to be employed to ensure error-free operation. Link wires in the baseline design, as mentioned earlier, are driven by 6X sized driving repeaters that are present every 1mm length of wire segment. We evaluate the performance of the proposed schemes and compare them with those of the DBS [38], CPC [40], SHD and "Buffer" schemes. CPC is a crosstalk prevention coding scheme that encodes three bits of information onto five adjacent wires to eliminate 4Cc crosstalk coupling on link wires. In DBS scheme, the



Fig. 13. Comparison of  $L_{max}$  and transmission delay for the different schemes studied.

inter-wire separation is doubled to reduce the coupling capacitance between adjacent wires. In the SHD scheme, every bit-line is shielded by GND lines on both sides, while in the Buffer scheme the size of the repeaters driving the links is doubled to 12X. While the schemes that we compare our designs against are radically different from one another, they are also orthogonal to the proposed solutions, since TransSync and RecSync can be used to reduce crosstalk noise in conjunction with any of these schemes.

#### 2. Performance Analysis

For the baseline design,  $L_{max}$  was found to be 2.23mm and the corresponding value of  $\delta$  was found to be 856.41ps. Fig. 13 compares values of  $L_{max}$  and the average transmission delay on 2mm long wires for the different schemes evaluated. The results presented in Fig. 13 have been normalized to the results obtained for the baseline design. Average delays for the different schemes reported in Fig. 13 have been obtained as the sum of the signal transmission delays for the different interconnect states weighted by their occurrence probabilities shown in Table I. As seen in Fig. 13, the



Fig. 14. Comparison of area and energy overhead of the different schemes studied.

largest $L_{max}$ is achieved with the TransSync scheme. The $L_{max}$ values of SHD and DBS are greater than that achievable with CPC, and the SHD and DBS schemes also speed up the transmission of signal transitions on wires because of the reduced coupling between adjacent wires. Also, it may be observed from Fig. 13 that while the $L_{max}$ of TransSync-2-lines is smaller than that of TransSync, it is still comparable to that of CPC. The delay element of the RecSync scheme is programmed to eliminate the intra-flit skew accruing over the travel of data on 2mm long link wires. The average transmission latency is also reduced if the size, and hence the drive strength, of the driving repeaters is doubled (Buffer). However, the Buffer scheme provides very little improvement in $L_{max}$ over the baseline design.
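As a quick consistency check on Eqn. (2.1), the baseline figures quoted in this section can be plugged into the bound. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the reported baseline skew of 856.41ps meets the bound of Eqn. (2.1) with equality, which is how $L_{max}$ is defined. Under that assumption, the combined setup and hold time implied by the 1ns clock and the 10% uncertainty budget can be backed out.

```python
def skew_bound_ps(t_clk_ps, t_setup_ps, t_hold_ps, t_uncertainty_ps):
    """Upper bound on intra-flit skew from Eqn. (2.1), in picoseconds."""
    return t_clk_ps - (t_setup_ps + t_hold_ps + t_uncertainty_ps)


T_CLK_PS = 1000.0                    # 1ns clock period used in the simulations
T_UNCERTAINTY_PS = 0.10 * T_CLK_PS   # assumed to be 10% of the clock period
BASELINE_DELTA_PS = 856.41           # skew reported at L_max for the baseline

# Assumption: the reported skew equals the bound, so setup + hold is the rest.
implied_setup_plus_hold_ps = T_CLK_PS - T_UNCERTAINTY_PS - BASELINE_DELTA_PS

if __name__ == "__main__":
    print(f"implied T_setup + T_hold = {implied_setup_plus_hold_ps:.2f} ps")
    bound = skew_bound_ps(T_CLK_PS, implied_setup_plus_hold_ps, 0.0, T_UNCERTAINTY_PS)
    print(f"skew bound = {bound:.2f} ps")
```

The individual setup and hold times are not stated in the text, so only their sum can be inferred this way.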

#### 3. Overhead Analysis

Fig. 14 shows the energy consumption and area of 2mm long links for the schemes studied, normalized to those of the baseline design. CPC and DBS lead to a reduction of 13% and 17% respectively in the overall energy consumption on links when compared to the baseline design. The increased $L_{max}$ with the TransSync and RecSync schemes comes at the cost of increased energy consumption. The TransSync-2-lines scheme leads to a 4% reduction in the overall link energy consumption. The TransSync and RecSync schemes have a small area overhead, which is greater only than that of the Buffer scheme.

#### 4. System Level Results

We have worked with the assumption that the nodes on the chip are placed 1mm apart. Table V lists the summary of the simulation results. ABC in Table V stands for a generic NoC architecture (like [17, 18, 19, 20]) that incorporates asynchronous bypass to reduce communication latency. Flits that are bypassed over more than two hops in such designs require synchronization every two hops, each synchronization costing 2.5 cycles [20], to eliminate the accrued intra-flit skew. TransSync allows this hop count to increase to 4 and incurs an extra latency of 0.37 cycles for every 4 hops of travel. If RecSync is employed every two hops, then the synchronization latency is reduced to 0.77 cycles. Table V summarizes the number of hops a flit can travel without synchronization and the extra latency incurred by the different schemes. The synchronization latency of RecSync is relatively high because of the wide NOR gate included in the design. This latency, however, is still smaller than that incurred by synchronization using conventional bi-FIFO schemes every two hops.
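The latency formulas of Table V for the ABC, TransSync and RecSync schemes share one structure, which the following sketch makes explicit. It is an illustrative helper rather than the simulator used for the results: the function names are hypothetical, and the 0.75 cycles per hop, the synchronization intervals and the per-synchronization costs are the figures quoted in this section and in Table V.

```python
PER_HOP_CYCLES = 0.75  # wire traversal plus switching logic, per hop (Table V)


def latency_cycles(n_hops, interval, sync_cost, pre_cost=0.0):
    """No-load latency, in cycles, of a flit bypassed over n_hops hops.

    interval  -- hops a flit may travel before it must be synchronized
    sync_cost -- cycles paid at each intermediate synchronization
    pre_cost  -- cycles added at the sender per interval (TransSync delay)
    """
    if n_hops < 1:
        raise ValueError("a flit travels at least one hop")
    n_syncs = (n_hops - 1) // interval   # floor((n - 1) / interval)
    n_pre = n_hops // interval           # floor(n / interval)
    return PER_HOP_CYCLES * n_hops + pre_cost * n_pre + sync_cost * n_syncs


def abc(n_hops):
    return latency_cycles(n_hops, interval=2, sync_cost=2.5)


def transsync(n_hops):
    return latency_cycles(n_hops, interval=4, sync_cost=2.5, pre_cost=0.37)


def recsync(n_hops):
    return latency_cycles(n_hops, interval=2, sync_cost=0.77)


if __name__ == "__main__":
    for n in (2, 4, 8, 12):
        print(n, abc(n), transsync(n), recsync(n))
```

Because both floor terms are zero for short paths, the single expression also covers the small-hop-count case that Table V lists separately.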

Fig. 15 represents the information listed in Table V as a plot of latency vs. the number of hops traveled by a flit under no-load, no-congestion network traffic on a 7X7 mesh NoC, using the assumptions in [20]. The maximum number of hops that a flit can travel on the 7X7 mesh is 12 and, from [20], the sum of the wire traversal latency and the delay of the switching and forwarding logic is taken to be 0.75 cycles per hop of travel. The study has been carried out under no-load and no-congestion traffic conditions since the asynchronous bypass paths are more likely to help in



Fig. 15. Plot of transmission latency vs. number of hops traveled by flits for different schemes.

bringing down communication latencies under these conditions. In the presence of congestion in the network or under high injection rates, the high-speed bypass paths would fail to serve their purpose, since most of the data would have to be queued into the FIFO buffers at intermediate nodes between the source and destination nodes while waiting for allocation of resources before being forwarded downstream. The communication latency on a conventional GALS NoC is very high since flits have to be synchronized to a new clock domain at every hop. The asynchronous bypass paths in ABC designs promise to improve performance by eliminating unnecessary synchronizations. The average network latency for an ideal ABC design on a 7X7 mesh was found to be 12.10 cycles. It is evident from Fig. 15 that the actual average communication latency in ABC is 22.20 cycles, which is far from the expected value because of inter-bit-line skew. The CPC, DBS, SHD and Buffer schemes have similar performance, with average communication latencies in the vicinity of 17 clock cycles. The average communication latencies for the TransSync and TransSync-2-lines schemes are 16.27 cycles and 17.74 cycles respectively. Although the performance of TransSync-2-lines is slightly worse than that of the original TransSync design, the overheads associated with it are much smaller. It may also be observed that the RecSync scheme is very effective, as it brings the average communication latency down to 13.79 cycles, which is very close to that expected of an ideal ABC design. It may also be noted that there is no sizable improvement in latency when RecSync is coupled with either the TransSync or the TransSync-2-lines scheme.

Table IV. Interconnect parameters used in the simulations.

| Wire type    | Width ($\mu m$)  | Space ($\mu m$) | Thickness ($\mu m$) | Height$_{ILD}$ ($\mu m$) | $k_{ILD}$ | $C_c$ (fF/mm) | $C_g$ (fF/mm) | $R_l$ ($\Omega$/mm) |
|--------------|------------------|----------------|----------------------|-----------------------------------|--------------------|--------------|--------------|----------------------|
| Local        | 0.1              | 0.1            | 0.2                  | 0.2                               | 2.2                | 53.683       | 19.419       | 1100                 |
| Intermediate | 0.14             | 0.14           | 0.35                 | 0.2                               | 2.2                | 57.649       | 26.888       | 448.979              |
| Global       | 0.45             | 0.45           | 1.2                  | 0.2                               | 2.2                | 73.222       | 82.031       | 40.74                |

Table V. Number of hops that a flit can travel without requiring synchronization, latency overhead and latency formula for different schemes for 'n' hops of travel.

| Scheme | $L_{max}$ (# of hops) | Latency overhead of scheme | Synchronization latency (after $L_{max}$ hops) | Latency formula for 'n' hops of travel |
|--------|-----------------------|----------------------------|------------------------------------------------|----------------------------------------|
| ABC | 2 | - | 2.5 cycles | $0.75 \times n$, if $n < 3$; $0.75 \times n + 2.5 \times floor(\frac{n-1}{2})$, if $n \ge 3$ |
| CPC | 3 | 0.16 cycles | 2.5 cycles | $0.75 \times n + 0.16$, if $n < 3$; $0.75 \times n + 0.16 \times floor(\frac{n}{3}) + 2.5 \times floor(\frac{n-1}{3})$, if $n \ge 3$ |
| CHS | 3 | - | 2.5 cycles | $0.63 \times n$, if $n < 3$; $0.63 \times n + 2.5 \times floor(\frac{n-1}{3})$, if $n \ge 3$ |
| DBS | 3 | - | 2.5 cycles | $0.61 \times n$, if $n < 3$; $0.61 \times n + 2.5 \times floor(\frac{n-1}{3})$, if $n \ge 3$ |
| Buffer | 3 | - | 2.5 cycles | $0.71 \times n$, if $n < 3$; $0.71 \times n + 2.5 \times floor(\frac{n-1}{3})$, if $n \ge 3$ |
| TransSync | 4 | 0.37 cycles | 2.5 cycles | $0.75 \times n + 0.37 \times floor(\frac{n}{4}) + 2.5 \times floor(\frac{n-1}{4})$, if $n \ge 4$ |
| TransSync-2-lines | 3 | 0.22 cycles | 2.5 cycles | $0$, if $n < 3$; $0.75 \times n + 0.22 \times floor(\frac{n}{3}) + 2.5 \times floor(\frac{n-1}{3})$, if $n \ge 3$ |
| RecSync | 2 | 0.77 cycles | 0 | $0.75 \times n$, if $n < 3$; $0.75 \times n + 0.77 \times floor(\frac{n-1}{2})$, if $n \ge 3$ |
| RecSync + TransSync | 4 | RecSync + TransSync | 0 | $0.75 \times n$, if $n < 4$; $0.75 \times n + 0.37 \times floor(\frac{n}{4}) + 0.77 \times floor(\frac{n-1}{4})$, if $n \ge 4$ |
| RecSync + TransSync-2-lines | 3 | RecSync + TransSync-2-lines | 0 | $0.75 \times n$, if $n < 3$; $0.75 \times n + 0.37 \times floor(\frac{n}{3}) + 0.77 \times floor(\frac{n-1}{3})$, if $n \ge 3$ |
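The latency formulas in Table V are straightforward to evaluate numerically. The following is a minimal sketch, not the simulation flow used in this work, that encodes three of the rows (ABC, CPC and RecSync) as listed in the table; the function names are hypothetical and chosen only for illustration.

```python
import math

def latency_abc(n):
    """ABC row of Table V: 2.5-cycle synchronization, L_max = 2 hops."""
    if n < 3:
        return 0.75 * n
    return 0.75 * n + 2.5 * ((n - 1) // 2)

def latency_cpc(n):
    """CPC row of Table V: 0.16-cycle coding overhead, L_max = 3 hops."""
    if n < 3:
        return 0.75 * n + 0.16
    return 0.75 * n + 0.16 * (n // 3) + 2.5 * ((n - 1) // 3)

def latency_recsync(n):
    """RecSync row of Table V: 0.77-cycle overhead, no synchronization latency."""
    if n < 3:
        return 0.75 * n
    return 0.75 * n + 0.77 * ((n - 1) // 2)

def latency_table(max_hops=12):
    """Latency (in cycles) per hop count; 12 is the longest path on a 7X7 mesh."""
    return {
        n: {"ABC": latency_abc(n), "CPC": latency_cpc(n), "RecSync": latency_recsync(n)}
        for n in range(1, max_hops + 1)
    }

if __name__ == "__main__":
    for hops, row in latency_table().items():
        print(hops, {name: round(value, 2) for name, value in row.items()})
```

These are per-path latencies as a function of hop count only; the network-wide averages quoted in the text depend on additional factors that this sketch does not model.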

#### CHAPTER III

# INTERCONNECT BER IN THE PRESENCE OF CROSSTALK AND VARIATIONS

Crosstalk noise is the most critical factor affecting communication reliability in on-chip interconnect in NoC/MPSoC designs. In Chapter II, we explored how intra-flit skew limits transmission of signal transitions on long parallel wires on chip. Maximum intra-flit skew, which is the separation between transitions suffering 0Cc and 4Cc MCF interaction, was used as a metric to study signal integrity on wires and the impact of crosstalk on packet errors. In this chapter, we present a formal methodology to estimate the bit-error rate (BER) on interconnect wires in the presence of crosstalk and variations. Since variability greatly impacts the extent of crosstalk coupling between parallel wires, it is imperative to review the sources of variability and their effect on the integrity of signal transitions in long wires.

## A. Sources of Variation

The performance of integrated circuits varies with any deviation in the circuit's structural or electrical parameters from the desired values. Environmental factors that arise during the operation of a circuit and affect its performance are variations in power supply, switching activity and temperature. Physical factors brought about by limitations in processing and masking during manufacture result in randomly distributed spatial variations in device and interconnect structures. Variations in structural parameters affect both performance and yield in integrated circuits. These variations have both inter-die and intra-die components ([60], [61]). Inter-die variations are the differences in parameter values across similar dies on the same wafer, while intra-die variations are randomly distributed spatial variations in parameters on the same die. The contribution of each component along with a model of each variation is presented in [62] and [63]. A brief review of the sources of variation and their impact on interconnect reliability is given in the following sections.

# 1. Device Variations

Some important sources of variation relating to the geometric structure and internal material parameters of MOSFETs, resistors and capacitors in the circuit are:

#### a. Film Thickness

Oxide film thickness $(t_{ox})$ is very critical to a MOSFET's operation, and any variations in the oxide thickness can lead to variations in $V_T$, which can in turn lead to large deviations from the MOSFET's nominal transfer characteristics. Oxide thickness, however, is a relatively well controlled parameter in most processes. In addition to geometric sources, the presence of any mobile charges in the gate oxide can introduce bias dependent variation in $V_T$, which can be about 10% of the nominal device $V_T$ for the technology [64].

#### b. Lateral Device Dimensions

Some factors which give rise to variations in the length and width of devices are: photolithography proximity effects due to systematic pattern dependencies; photo system variations due to mask errors; lens irregularities; and plasma etch variations. The resulting variations may be spatially correlated. MOSFETs are particularly sensitive to effective channel length since this parameter directly impacts the output current characteristics of the device [65].



Fig. 16. Interconnect structure.

#### c. Doping

Junction depth and doping profiles may vary with implant dose, energy or angle variations. This has a direct bearing on the effective channel length and therefore the threshold voltage of the devices. Random placement and dopant concentration fluctuations due to the discrete location of dopant atoms in the channel and source/drain regions can also lead to variation of $V_T$ across devices on the chip. Variations in $V_T$ lead to variations in leakage currents and noise margins. Techniques like halo implant [66] and retrograde channel doping [67] employed in deep sub-micron technologies further increase the impact of implant and diffusion variation. Variations in thermal anneal and gate doping can cause variation in effective gate oxide thickness, which in turn can result in loss of matching.

## 2. Interconnect Variations

As with devices, deviations in material properties as well as in the vertical and lateral dimensions of interconnect structures give rise to variation in interconnect behavior. Some important sources of interconnect variation are:



Fig. 17. Three segment wire model for crosstalk coupling.

# a. Line Width (W) and Space (S)

Photolithography and etch dependencies give rise to deviations in the width and spacing of patterned lines. Fig. 16 depicts the interconnect structure for local and intermediate wires. Deviations in line width and spacing lead to variations in line resistance (R in Fig. 17), capacitance with respect to the layer above or below (Cg in Fig. 17) as well as the line to line coupling capacitance (Cc in Fig. 17). Since poly and connecting wire features often have non-rectangular shapes [68], patterning variations affect electrical properties such as timing and noise even after OPC [69].

# b. Metal Thickness (T)

Although the thickness of sputtered or deposited metal films in conventional metal interconnects is well controlled, it can vary between and across wafers. On the other hand, variations in dishing and erosion can have a significant impact on the final thickness of sputtered lines in the copper polishing process [71].

## c. ILD Thickness (H)

Differences in the effective density of raised components in different regions of the chip in CMP/MPSoC can give rise to strong variations in the density of deposited and polished oxide films across the die [72]. This variation in inter-layer dielectric thickness can lead to variations in the behavior of interconnect wires across the die.

The expressions for line resistance and capacitance per unit length as a function of wire width (W), wire spacing (S), metal thickness (T) and ILD thickness (H) for multi-level interconnect structures are given in [70].

#### d. Contact Via Size and Resistance

The resistance of ohmic contacts and vias is very sensitive to the etching and cleaning process and can vary between wafers. Silicide formation and grain structure of poly and metal lines depend greatly on the deposition and anneal process. Any variations in these processes can also lead to large deviations in contact and line resistance.

#### e. Metal Resistivity

The resistivity and dielectric constant of metal lines can vary between wafers. Pattern dependent and directional effects may become especially important. However, the resistivity of metal lines is generally well controlled.

#### 3. Environmental Variations

# a. Supply Voltage

Variations in supply voltage across the die are due to ohmic voltage drop (IR drop) [73], electro-migration [74] and Ldi/dt voltage drop (ground bounce and Vdd sag) [75]. While ohmic drop results in a static and often deterministic variation in the supply voltage across the chip, factors like ground bounce and Vdd sag vary dynamically with switching activity and are more difficult to predict. Electro-migration can lead to variations in the power supply to different parts of the chip with time as the circuit is operated. Electrical properties of a circuit like delay and noise margin are very strongly dependent on the supply voltage.

#### b. Temperature

Temperature fluctuations across the die present a major performance and packaging challenge. Power density can vary across a microprocessor die depending on the functionality of the logic blocks on different parts of the chip [62]. The mobility of charge carriers in the silicon substrate and the resistivity of metal lines vary with temperature. Therefore the performance of both devices and interconnect depends on temperature, with higher temperature causing performance degradation. Variations in temperature between two communicating blocks placed on different parts of the chip may cause performance mismatch and functional failures. Higher power density results in the formation of hot-spots on chip, which also exert more demand on the power distribution grid, resulting in resistive and inductive voltage drops. This in turn leads to time dependent, dynamic supply voltage variations. Temperature variations across the die lead to variations in sub-threshold leakage across the die, and to dynamic variations in power delivery demand across the power distribution grid.

## B. BER Analysis

Bit error rate refers to the probability that a bit transmitted on an interconnect wire will be received incorrectly. The total capacitance of a wire changes with the switching

| State    | Switching pattern                                                                                                                 | MCF (Cc) | Occurrence probability $(o_i)$ | Failure mechanism |
|----------|-----------------------------------------------------------------------------------------------------------------------------------|------|---------------------|------------|
| $A_1$    | $(\uparrow,\uparrow,\uparrow),(\downarrow,\downarrow,\downarrow)$                                                                 | 0    | 1/32                | Timing     |
| $A_2$    | $(-,\uparrow,\uparrow),(-,\downarrow,\downarrow),(\uparrow,\uparrow,-),(\downarrow,\downarrow,-)$                                 | 1    | 1/8                 | Timing     |
| $A_3$    | $(\uparrow,\downarrow,\downarrow),(\downarrow,\uparrow,\uparrow),(\uparrow,\uparrow,\downarrow),(\downarrow,\downarrow,\uparrow)$ | 2    | 1/16                | Timing     |
| $A_4$    | $(-,\uparrow,-),(-,\downarrow,-)$                                                                                                 | 2    | 1/8                 | Timing     |
| $A_5$    | $(\downarrow,\uparrow,-),(\uparrow,\downarrow,-),(-,\uparrow,\downarrow),(-,\downarrow,\uparrow)$                                 | 3    | 1/8                 | Timing     |
| $A_6$    | $(\uparrow,\downarrow,\uparrow),(\downarrow,\uparrow,\downarrow)$                                                                 | 4    | 1/32                | Timing     |
| $A_7$    | (-,-,-)                                                                                                                           | 0    | 1/16                |            |
| $A_8$    | $(\uparrow,-,\downarrow),(\downarrow,-,\uparrow)$                                                                                 | 0    | 1/8                 | Functional |
| $A_9$    | $(-,-,\uparrow),(-,-,\downarrow),(\uparrow,-,-),(\downarrow,-,-)$                                                                 | 1    | 1/4                 | Functional |
| $A_{10}$ | $(\uparrow,-,\uparrow),(\downarrow,-,\downarrow)$                                                                                 | 2    | 1/16                | Functional |

Table VI. Interconnect states.

activity on the wire and its neighboring wires. Depending on the type of switching taking place on a wire and its immediate adjacent neighbors, the interconnect wire can be uniquely mapped to one of the ten distinct states [58] listed in Table VI. Fig. 18 shows the energy and delay associated with the different states on 2mm long interconnect wires.
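To illustrate how the MCF column of Table VI changes the load seen by a wire, the sketch below computes an effective capacitance of the form $C_g + MCF \times C_c$ for the intermediate-wire values in the interconnect parameter table, together with a rough intrinsic wire delay. It rests on assumptions that are not part of the analysis in this chapter: the $0.38 \times R \times C$ estimate for a distributed RC line is a textbook approximation, drivers and repeaters are ignored, and the function names are hypothetical.

```python
# Intermediate-wire values from the interconnect parameter table (per mm).
R_PER_MM = 448.979       # line resistance, ohm/mm
CG_PER_MM = 26.888e-15   # capacitance to ground, F/mm
CC_PER_MM = 57.649e-15   # coupling capacitance to one neighbor, F/mm

def effective_capacitance(mcf, length_mm=1.0):
    """Effective wire capacitance (F) for a given Miller coupling factor."""
    return (CG_PER_MM + mcf * CC_PER_MM) * length_mm

def distributed_rc_delay(mcf, length_mm=1.0):
    """Approximate 50% delay (s) of an unbuffered distributed RC wire.

    Uses the common 0.38*R*C estimate; this is an illustrative assumption,
    not the Spectre-based delay extraction described in the text.
    """
    resistance = R_PER_MM * length_mm
    return 0.38 * resistance * effective_capacitance(mcf, length_mm)

if __name__ == "__main__":
    for mcf in range(5):
        c_ff = effective_capacitance(mcf) * 1e15
        d_ps = distributed_rc_delay(mcf) * 1e12
        print(f"MCF={mcf}: C_eff={c_ff:.3f} fF, wire delay ~ {d_ps:.2f} ps")
```

The spread between the MCF = 0 and MCF = 4 values is what gives rise to the transition skew between the fastest and slowest switching patterns.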

A simplified independent bit-error model for noise effects on a wire assumes a Gaussian-distributed noise voltage. The probability of a single wire being erroneous during a transition in this model is expressed by a Gaussian pulse function [51]. This Gaussian error model was extended by [52] to include spatial burst errors on wires in DSM technologies, where multiple adjacent wires are erroneous due to the same noise source. This information theoretic error model underestimates the errors caused by crosstalk coupling between wires. Halak et al. [77] presented a methodology to evaluate the reliability of interconnect by studying the impact of parametric variability on crosstalk induced wire delay and static noise in interconnect. We will use the approach put forth in [77] to evaluate the performance of the proposed RecSync-TransSync techniques.



Fig. 18. Energy and delay for different states for 2mm long interconnect wires.

## 1. Crosstalk Failure Mechanism

Errors resulting from crosstalk interactions between adjacent wires on long NoC links, especially with asynchronous bypass paths, can be classified into two categories: timing (delay) errors $(A_1 - A_6)$ and functional (glitch) errors $(A_7 - A_{10})$.

## a. Timing Failure

Timing failures $(A_1 - A_6)$ result from the uncertainty in the arrival time of a signal transition on link wires. Statistical timing analysis [76] will be used to estimate the contribution of timing errors to BER using an approach similar to that used in [77]. However, since we are performing the analysis for globally asynchronous locally synchronous (GALS) NoCs with long asynchronous bypass channels, the approach in [77] is modified slightly. Such NoC designs typically employ source synchronous communication, where the sending node transmits the bit information along with the source clock information on interconnect wires. The receiving node uses this source clock information from the bus to synchronize the incoming data to its clock domain. Let the skew accumulated at the receiving node between a data transition and the fastest arriving signal transition (0Cc MCF) on link wires be $\delta$. The incoming data transition is latched correctly in a flip-flop at the receiving node only if $\delta$ satisfies the following condition:

$$0 \le \delta < T_{window} \tag{3.1}$$

where

$$T_{window} = T_{Clk} - T_{seq} \tag{3.2}$$

$T_{Clk}$ is the clock period and the sequencing overhead $(T_{seq})$ comprises the setup and hold times of the synchronous elements in the design. The skew $\delta$ for each of the delay error states $(A_1 - A_6)$ has a normal distribution with its own $\mu_{\delta}$ and $\sigma_{\delta}$ values. Also, $T_{window}$ is normally distributed about a mean $\mu_{T_{window}}$ with standard deviation $\sigma_{T_{window}}$. If the probability distribution function (PDF) of $\delta_i$ is $g_i$ and the cumulative distribution function (CDF) of $T_{window}$ is f, then the probability of error for delay state 'i' is given by:

$$p_i = \int_{-\infty}^{\infty} g_i(x) f(x) \, dx \tag{3.3}$$
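As an illustration of Eqn 3.3, the sketch below evaluates the integral numerically for a normally distributed skew and a normally distributed timing window, and compares it with the closed form $\Phi\left((\mu_{\delta} - \mu_{T_{window}})/\sqrt{\sigma_{\delta}^2 + \sigma_{T_{window}}^2}\right)$, which holds when the two quantities are independent normal variables. The function names and the numerical values in the usage section are hypothetical and are not simulation results from this work.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu, sigma):
    """Cumulative distribution of N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def error_probability(mu_d, sigma_d, mu_w, sigma_w, steps=20000, span=10.0):
    """Eqn 3.3 by the trapezoidal rule over mu_d +/- span*sigma_d.

    g is the PDF of the skew (mu_d, sigma_d); f is the CDF of the timing
    window (mu_w, sigma_w).
    """
    lo = mu_d - span * sigma_d
    step = 2.0 * span * sigma_d / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * step
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * normal_pdf(x, mu_d, sigma_d) * normal_cdf(x, mu_w, sigma_w)
    return total * step

def error_probability_closed_form(mu_d, sigma_d, mu_w, sigma_w):
    """Closed form of Eqn 3.3, assuming independent normal distributions."""
    return normal_cdf(mu_d - mu_w, 0.0, math.hypot(sigma_d, sigma_w))

if __name__ == "__main__":
    # Hypothetical values in ns, chosen only to exercise the functions.
    args = (0.60, 0.05, 0.80, 0.03)
    print(error_probability(*args), error_probability_closed_form(*args))
```

The same routine applies to the glitch states by substituting the distributions of glitch amplitude and switching threshold for those of skew and timing window.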

## b. Functional Failure

Functional failures $(A_7 - A_{10})$ arise when the amplitude of the glitch induced on a quiet wire by switching activity on neighboring wires exceeds the switching threshold. If the threshold voltage is $V_{Th}$ and the amplitude of the crosstalk induced glitch is $A_{Glitch}$, then functional failures result if the following condition does not hold:

$$|A_{Glitch}| < |V_{Th}| \tag{3.4}$$

The error probability $(p_i)$ for the glitch states can also be calculated using Eqn 3.3 where, for glitch states, $g_i$ and f are the PDF of $A_{Glitch}$ and the CDF of $V_{Th}$ respectively.

## 2. Average BER

The average bit error rate can now be calculated using the error probability  $(p_i)$  and occurrence probability  $(o_i$  in Table VI) for interconnect states as:

$$BER = \sum_{i=1}^{10} p_i \cdot o_i \tag{3.5}$$
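Eqn 3.5 is a weighted sum over the ten interconnect states. A minimal sketch follows, using the occurrence probabilities of Table VI; the per-state error probabilities passed in are placeholders supplied by the caller, the function name is hypothetical, and treating a state with no entry as contributing zero error is an assumption of this sketch.

```python
from fractions import Fraction

# Occurrence probabilities o_i, copied from Table VI.
OCCURRENCE = {
    "A1": Fraction(1, 32), "A2": Fraction(1, 8), "A3": Fraction(1, 16),
    "A4": Fraction(1, 8), "A5": Fraction(1, 8), "A6": Fraction(1, 32),
    "A7": Fraction(1, 16), "A8": Fraction(1, 8), "A9": Fraction(1, 4),
    "A10": Fraction(1, 16),
}

def average_ber(error_prob):
    """Eqn 3.5: sum of p_i * o_i over all states.

    error_prob maps state names to p_i; states that are missing from the
    mapping are assumed to contribute no errors.
    """
    return sum(float(o_i) * error_prob.get(state, 0.0)
               for state, o_i in OCCURRENCE.items())

if __name__ == "__main__":
    # Placeholder p_i values, not results from this work.
    example = {"A5": 1e-9, "A6": 1e-6, "A9": 1e-12}
    print(average_ber(example))
```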

Obtaining BER estimates for crosstalk induced errors necessitates the estimation of nominal delay and glitch amplitude values for the different interconnect states. These can be obtained from analytical formulas or SPICE iterations at nominal or mean values of the parameters listed in Section B. The DoE method [78], which allows development of response surface models (RSM) for outputs as second order functions of variability parameters, is used to obtain $\sigma_{\delta}$ and $\sigma_{A_{Glitch}}$. DoE helps in devising the sequence of experimental iterations to be performed to determine the relative effect of each parameter variation on the output. Central composite design (CCD), which requires $(2^{k} + 2k + 1)$ runs for 'k' inputs, is used to obtain the design points for the experimental runs as in [77]. A second order regression model can then be built for the dependence of the output (y) on the different input parameters $(x_i)$:

$$y = \beta_0 + \sum_{i=1}^k \beta_i x_i + \sum_{i=1}^k \beta_{ii} x_i^2 + \sum_{i<j} \beta_{ij} x_i x_j \tag{3.6}$$

The sensitivity of output to each input parameter  $(S_i)$  is obtained as the absolute value of partial derivative of Eqn 3.6 with respect to the input parameter normalized to the mean value of output. The variance of output is then calculated using Eqn 3.7.

$$\sigma^2 = \sum_{i=1}^k S_i^2 \sigma_i^2 \tag{3.7}$$
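The sensitivity and variance calculation can be sketched as follows for a second order model with linear, quadratic and pairwise interaction terms. The coefficient layout, the function names and the choice of evaluation point are assumptions made for illustration; the coefficients themselves would come from the regression fit.

```python
def partial_derivatives(x, beta_lin, beta_quad, beta_cross):
    """Partial derivatives of a second order response surface at point x.

    beta_lin[i] multiplies x_i, beta_quad[i] multiplies x_i^2, and
    beta_cross maps index pairs (i, j) with i < j to the x_i * x_j coefficient.
    """
    grads = []
    for i in range(len(x)):
        g = beta_lin[i] + 2.0 * beta_quad[i] * x[i]
        for (a, b), coeff in beta_cross.items():
            if a == i:
                g += coeff * x[b]
            elif b == i:
                g += coeff * x[a]
        grads.append(g)
    return grads

def output_variance(x, y_mean, beta_lin, beta_quad, beta_cross, sigmas):
    """Eqn 3.7 with S_i = |dy/dx_i| / |y_mean| evaluated at point x."""
    grads = partial_derivatives(x, beta_lin, beta_quad, beta_cross)
    sensitivities = [abs(g) / abs(y_mean) for g in grads]
    return sum((s * sd) ** 2 for s, sd in zip(sensitivities, sigmas))

if __name__ == "__main__":
    # Placeholder coefficients for a two-input model, not fitted values.
    var = output_variance(
        x=(0.0, 0.0), y_mean=100.0,
        beta_lin=[10.0, 5.0], beta_quad=[1.0, 2.0], beta_cross={(0, 1): 3.0},
        sigmas=[1.0, 1.0],
    )
    print(var)
```

Because each sensitivity is normalized to the mean output, the resulting variance is expressed relative to the nominal value of the output.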

| Parameter                        | Nominal Value                  | Variation           |
|----------------------------------|--------------------------------|---------------------|
| Height$_{ILD}$ $(h)$             | $0.2\,\mu m$                   | 20%                 |
| $k_{ILD}$                        | 2.2                            | 20%                 |
| Wire resistivity $(\rho)$        | $2.2 \times 10^{-8}\,\Omega m$ | 20%                 |
| Wire width $(w)$                 | $0.14\,\mu m$                  | 20%                 |
| Wire space $(s)$                 | $0.14\,\mu m$                  | 20%                 |
| Wire thickness $(t)$             | $0.35\,\mu m$                  | 20%                 |
| Transistor length $(L_{eff})$    | 45nm                           | 10%                 |
| Power supply $(V_{DD})$          | 1.1V                           | 10%                 |
| Temperature $(T)$                | $27^{\circ}C$                  | $12^{\circ}C - 42^{\circ}C$ |

Table VII. Nominal values and variations for circuit parameters.

#### C. Results and Discussion

#### 1. Simulation Environment

Simulations were performed in the Cadence Analog Environment using Spectre to obtain $\delta$ and $A_{Glitch}$ for the different interconnect states. A 45nm bulk CMOS model card for devices and 65nm interconnect parameters from Berkeley PTM [45] were used in the simulations. BER values were obtained for different protection schemes on NoC links implemented as repeater inserted intermediate wires. The driving repeaters on links are 4X the unit inverter size in this technology and are inserted after every 1mm of wire length. The distributed nature of wires was modeled using a 3-segment $\pi$ model with a first order model for crosstalk coupling, as shown in Fig. 17. A power supply of 1.1V and a mesochronous GALS clocking scheme with a 1GHz clock frequency were assumed for the design. To obtain the variance of $\delta$ and $A_{Glitch}$, CCD analysis was performed with the parameter values listed in Table VII. To reduce the number of input parameters for the CCD analysis, all the interconnect parameters were mapped to a single variable 'x1' in Table VIII using expressions for D2M delay and static noise peak amplitude from [79]. The variances of D2M delay and static noise peak amplitude were found to be 9.628% and 6.315% respectively. CCD analysis was then performed with only four input parameters at the design points listed in Table VIII to obtain $\sigma_{\delta}$ and $\sigma_{A_{Glitch}}$. In Table VIII, the points 0, +1 and -1 stand for the nominal value, (nominal value + $\sigma$) and (nominal value - $\sigma$) of each parameter respectively.

| Observation # | x1 | $L_{eff}$: x2 | $V_{DD}$: x3 | T: x4 |
|---------------|----|---------------|--------------|-------|
| 1             | -1 | -1            | -1           | -1    |
| 2             | -1 | -1            | -1           | 1     |
| 3             | -1 | -1            | 1            | -1    |
| 4             | -1 | -1            | 1            | 1     |
| 5             | -1 | 1             | -1           | -1    |
| 6             | -1 | 1             | -1           | 1     |
| 7             | -1 | 1             | 1            | -1    |
| 8             | -1 | 1             | 1            | 1     |
| 9             | 1  | -1            | -1           | -1    |
| 10            | 1  | -1            | -1           | 1     |
| 11            | 1  | -1            | 1            | -1    |
| 12            | 1  | -1            | 1            | 1     |
| 13            | 1  | 1             | -1           | -1    |
| 14            | 1  | 1             | -1           | 1     |
| 15            | 1  | 1             | 1            | -1    |
| 16            | 1  | 1             | 1            | 1     |
| 17            | -1 | 0             | 0            | 0     |
| 18            | 1  | 0             | 0            | 0     |
| 19            | 0  | -1            | 0            | 0     |
| 20            | 0  | 1             | 0            | 0     |
| 21            | 0  | 0             | -1           | 0     |
| 22            | 0  | 0             | 1            | 0     |
| 23            | 0  | 0             | 0            | -1    |
| 24            | 0  | 0             | 0            | 1     |
| 25            | 0  | 0             | 0            | 0     |

Table VIII. CCD analysis points.
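The design points of Table VIII follow a regular pattern: the $2^k$ corner points, $2k$ axial points and one center point. The sketch below generates them in coded units for 'k' inputs, with the axial points placed at -1 and +1 as they appear in the table; the function name is hypothetical.

```python
from itertools import product

def ccd_points(k):
    """Central composite design points in coded units (-1, 0, +1).

    Returns the 2^k factorial corners, then the 2k axial points at +/-1,
    then the center point, for a total of 2^k + 2k + 1 runs.
    """
    corners = [tuple(p) for p in product((-1, 1), repeat=k)]
    axial = []
    for i in range(k):
        for level in (-1, 1):
            point = [0] * k
            point[i] = level
            axial.append(tuple(point))
    center = [(0,) * k]
    return corners + axial + center

if __name__ == "__main__":
    for number, point in enumerate(ccd_points(4), start=1):
        print(number, point)
```

For the four inputs (x1 to x4) used here, this yields the 25 observations of Table VIII in the same order.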

# 2. Schemes Studied

We compare the performance and overhead of TransSync and RecSync to those of several crosstalk avoidance and error correcting schemes. The baseline design comprises only links with driving repeaters that are 4X the size of the smallest inverter with equal rise and fall times in the technology used. The repeaters are inserted after every 1mm of wire length. All the schemes analyzed apply some sort of protection over the baseline design. In DBS [38], the spacing between interconnect wires is doubled, that is, the pitch is increased to 1.5X to minimize crosstalk coupling between adjacent wires. CPC [40] is a crosstalk prevention coding scheme that encodes three bits onto five bits, while 'Green' [49] is a similar coding technique that maps three bits onto four bits. The XOR scheme interleaves adjacent bit-lines with the logical exclusive-OR of the signals on the bit-lines, eliminating 4Cc interaction between wire triplets. SHD provides passive shielding by alternating each bit of useful information in link wires with a ground line. DAP [49] requires duplication of each bit on the bus along with a parity bit, while in TEC [50] the same information is transmitted on three adjacent wires such that the data on the middle wire experiences minimum crosstalk interaction. The impact of increasing the size of driving repeaters on crosstalk induced errors on link wires is explored in the *Buffer* scheme. In *Buffer* the size of the driving repeaters is 8X, which is twice the size of the driving repeaters in the baseline design.

The 'TransSync' scheme was presented in Chapter II for mitigating crosstalk induced signal transition skew. The power and area overhead of the 'TransSync' design can be reduced by taking into account crosstalk coupling due to signal transitions on only two adjacent wires, instead of considering the switching on a wire and its immediate adjacent neighbors as proposed. TransSync-2-lines is a modified 'TransSync' scheme that reduces the number of inputs from six to four by considering crosstalk interaction between only two adjacent wires to minimize transition skew on link wires. For 'RecSync', the delay element was programmed to have a delay equivalent to the maximum signal transition skew for the given link length plus a tolerance of 15%. The CDF of the delay provided by the delay element in the 'RecSync' design is used instead of the CDF



Fig. 19. Sensitivity of delay and glitch amplitude to individual parameters to on repeater inserted links.

of  $T_{window}$  in Eqn 3.3 to estimate the BER.

#### 3. Results

Fig. 19 shows the sensitivity of delay and glitch amplitude to individual parameters on repeater inserted links. Both delay and glitch amplitude due to crosstalk noise on repeater inserted links were found to be most sensitive to power supply variations, with sensitivities of 28.59% and 32.94% respectively. The contribution of interconnect parameters to delay and glitch amplitude was found to be 23.94% and 19.44% respectively, while varying the operating temperature between $12^{\circ}C$ and $42^{\circ}C$ had the least impact, with contributions of 0.035% and 0.014% respectively.

$\sigma_{\delta}$ and $\sigma_{A_{Glitch}}$ were found to be 4.36% and 4.72% of the nominal values respectively. Jitter, or variation in the clock period, depends on the type of clock distribution network and results from the same sources that cause variations in signal transition delay and glitch amplitude on links. Since the correlation between the sources of variability and clock jitter is not as straightforward as that for delay or glitch amplitude, we have assumed a normal distribution for the clock period with $\mu$ = 1ns and a $3\sigma$ variation of 10%.
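As a worked example of the clock model, and assuming that the stated $3\sigma$ figure of 10% means that three standard deviations equal 10% of the mean clock period, the standard deviation of the clock period would be:

$$3\sigma_{clk} = 0.10 \times \mu = 0.1\,ns \quad \Rightarrow \quad \sigma_{clk} \approx 33.3\,ps$$

Here $\sigma_{clk}$ is a symbol introduced only for this illustration.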



Fig. 20. Probability distribution function (PDF) of timing window and delay of different delay states for 1mm long interconnect wires.

Fig. 20 shows the probability distribution function (PDF) of $T_{window}$ and the delay of the timing error states A1 - A6 VI for 1mm long links. Similar plots were obtained for $V_{Th}$ and $A_{Glitch}$ of functional error states A8 - A10, and average BER values were obtained for link lengths of 0.5mm, 1mm, 2mm and 3mm. The contribution of timing errors to average BER for different link lengths is shown in Fig. 21.



Fig. 21. Contribution of timing errors to BER vs. link length.

It may be seen from Fig. 21 that for links longer than 1mm, timing errors constitute the majority of the total crosstalk induced errors on wires. This underscores the significance of the TransSync and RecSync schemes, which aim to reduce BER in long asynchronous bypass paths by reducing timing errors.

Fig. 22, Fig. 23 and Fig. 24 present the average BER, power overhead and area overhead vs. link length for the schemes explained in Section D.2. The advantage of the '*TransSync*' and '*RecSync*' schemes can be seen in links longer than 2mm. While '*DBS*', '*CPC*' and '*Green*' actually consume less power than the baseline design, the average BER in these schemes is still very high. It may also be noticed that the BER improves only slightly on doubling the size of the driving repeaters.

Fig. 22. BER vs. link length for different designs.

For 2mm long links, the BER with the 'TransSync' and 'TransSync 2 lines' schemes is similar to that of the 'TEC' scheme but with a much smaller power and area overhead. The power and area overhead of the 'RecSync' design is smaller than that of 'TransSync' at the cost of a slightly increased BER. It may also be noticed that 'TransSync 2 lines' offers almost the same BER performance as 'TransSync' at a much smaller power and area cost. For packets that are bypassed asynchronously over distances of 3mm or longer, only the 'RecSync' scheme may be able to deliver an acceptable BER performance. The area and power overhead of the 'RecSync' scheme are quite similar to those of *Buffer*.



Fig. 23. Power overhead for different schemes.



Fig. 24. Area overhead for different schemes.

#### CHAPTER IV

# UNEQUAL ERROR PROTECTION (UEP) ON LINK

The previous chapters explain how the TransSync and RecSync schemes can help in reducing crosstalk induced errors on long link wires, and explore the promise shown by these schemes in enabling true low latency communication in large NoC designs. In this chapter, we present the concept of unequal error protection (UEP) against crosstalk induced errors on links. For any application running in an NoC/MPSoC setting, data transmitted on links between nodes can be classified into different priority categories. Very few errors can be tolerated in high priority parts, while relatively less effort may be spent in protecting low priority parts against crosstalk induced errors on link wires.

The design of the TransSync and RecSync schemes allows them to be switched on and off dynamically as and when required. UEP on link exploits this fact by employing TransSync and RecSync to provide the high priority data parts with higher levels of protection against crosstalk induced errors, while switching off the protection schemes for the low priority parts of the data, since errors in these parts have a relatively small bearing on the overall quality of the application. We show that overall energy consumption can be substantially reduced at the cost of a slight degradation in performance.

It may not always be possible to place all the nodes that communicate frequently very close to each other in a many/multi-core system. Client-server applications, where several nodes run the front-end of the application while the back-end runs on several separate nodes, are an example. The concept of UEP on link against crosstalk induced errors using the TransSync and RecSync techniques can be used for reducing the energy consumption of all such applications running on an SoC, where different parts of the application run on separate nodes that are placed several millimeters apart on the die and need to communicate with each other. We evaluate the merits of UEP on links for a video decoder on NoC with a suite of H.264/AVC test data streams. We choose H.264/AVC as the case study because the codec inherently allows data partitioning and prioritization, making it a very suitable candidate for UEP.

#### A. UEP for H.264/AVC Video Streams on NoC Links

H.264/MPEG-4 Advanced Video Coding (AVC) [53] is a commonly used standard for video compression, employed for recording and distribution of high definition video. The block-oriented, motion-compensation-based H.264/MPEG-4 AVC is commonly used as the codec standard for Blu-ray discs, streaming of video content over the internet, digital video broadcasting and real-time video-conferencing.

In equal error protection (EEP) schemes, all received frames are treated equally. Unequal error protection (UEP) has been used to improve energy efficiency and error robustness in mobile multimedia devices and services. UEP schemes employ data partitioning (DP) to divide a normal slice into up to three prioritized parts (data partitions A, B and C). The use of UEP with the MPEG-4 error resilience tools for video streams sent over high error rate wireless channels was presented in [54]. Stockhammer et al. [55] showed that by using data partitioning in H.264/AVC, the percentage of entirely lost frames can be lowered and the probability of poor quality decoded video can be reduced when compared to non-scalable video coding.

We propose the extension of UEP with data partitioning to deal with crosstalk induced errors in video streams on NoC links. More resources are expended in decoding higher priority data so as to lower bit errors in high priority data, while fewer resources are utilized in decoding lower priority data. By providing different levels of protection to different parts of the video data, which have unequal degrees of importance, UEP helps in reducing the energy consumption of the decoding process without a significant loss in reconstructed video quality. In order to achieve UEP, layered video coding schemes, which encode the video streams into two or more layers with different priorities, are supported by most video compression standards, such as MPEG-2, MPEG-4 and H.264.

Fig. 25. H.264 data partition and prioritization.

Fig. 25 shows the NAL unit structure, which consists of partitioned H.264 video data (partition A, partition B, partition C, and parameter sets) for a UEP scheme. The transmission of frames belonging to partition A on NoC links necessitates the use of power expensive crosstalk error reduction schemes that are capable of completely removing such errors. On the other hand, during the transmission of partition B and C data, either less expensive schemes can be employed for crosstalk avoidance and error correction, or the protection schemes can be switched off. This can result in considerable power savings without significantly affecting the reconstructed video quality. Most protection schemes for error avoidance and error correction can be turned on and off on the fly as and when required. A learning phase can be incorporated into the design, during which video streams are first transmitted with the highest possible protection. The quality of the decoded video is then compared to the requirements dictated by the application. If the application can endure a degradation in video quality compared to what is offered by the current protection level, the next less power expensive method is tried. Varying degrees of protection are employed for the different priority data streams in each level. The search returns the minimum level of protection required for each partition priority so that the quality of the decoded video meets the required standard at the least power expense.
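The learning phase described above can be sketched as a simple downward search over protection levels. This is a minimal sketch, not the implementation used in this work: the function name `learn_protection_level`, the callback `measure_psnr`, the level names and the PSNR values are all hypothetical placeholders, and the sketch assumes that quality never improves when moving to a cheaper level.

```python
from typing import Callable, Optional, Sequence


def learn_protection_level(
    levels_by_power_desc: Sequence[str],
    measure_psnr: Callable[[str], float],
    required_psnr_db: float,
) -> Optional[str]:
    """Return the cheapest level that still meets the quality requirement.

    levels_by_power_desc lists the protection levels from the most to the
    least power expensive. measure_psnr is a hypothetical callback that
    transmits a stream at the given level and reports the decoded PSNR.
    """
    chosen = None
    for level in levels_by_power_desc:
        if measure_psnr(level) >= required_psnr_db:
            chosen = level  # still acceptable, so try a cheaper level next
        else:
            break  # quality is no longer acceptable, stop the search
    return chosen


if __name__ == "__main__":
    # Placeholder measurements, not results from this thesis.
    measured = {"high": 38.0, "medium": 36.0, "low": 33.0, "off": 20.0}
    levels = ["high", "medium", "low", "off"]
    print(learn_protection_level(levels, measured.__getitem__, 35.0))
```

The function returns `None` when even the highest protection level cannot meet the requirement, which lets the caller distinguish an infeasible target from a successful search.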

#### B. Results and Discussion

## 1. Performance Metric

Peak Signal to Noise Ratio (PSNR) is a measure of distortion in still images due to effects such as lossy compression. We use average PSNR as a metric to evaluate the quality of the reconstructed video stream due to crosstalk induced errors. PSNR of a reconstructed image can be calculated using Eqn. 4.1 and Eqn. 4.2.

$$PSNR = 10\log_{10}\frac{(2^n - 1)^2}{MSE}$$
(4.1)

$$MSE = \frac{1}{M \times N} \sum_{x,y} [p(x,y) - p'(x,y)]^2$$
(4.2)

where MSE is the mean square error computed from the pixel values of the original image, $p(x, y)$, and the reconstructed image, $p'(x, y)$. $n$ in Eqn. 4.1 is the number of bits describing the color of each pixel, and $M$ and $N$ in Eqn. 4.2 represent the width and height of the image respectively.
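A minimal Python sketch of the PSNR computation for a single grayscale frame is given below. It uses the conventional definition in which the peak pixel value, $2^n - 1$, is squared. The function names `mse` and `psnr` and the list-of-rows image representation are assumptions made for illustration.

```python
import math
from typing import Sequence


def mse(original: Sequence[Sequence[int]],
        reconstructed: Sequence[Sequence[int]]) -> float:
    """Mean square error between two images of identical size."""
    rows = len(original)
    cols = len(original[0])
    total = 0.0
    for y in range(rows):
        for x in range(cols):
            diff = original[y][x] - reconstructed[y][x]
            total += diff * diff
    return total / (rows * cols)


def psnr(original: Sequence[Sequence[int]],
         reconstructed: Sequence[Sequence[int]],
         bits_per_pixel: int = 8) -> float:
    """PSNR in dB; returns infinity when the two images are identical."""
    error = mse(original, reconstructed)
    if error == 0:
        return math.inf
    peak = 2 ** bits_per_pixel - 1
    return 10 * math.log10(peak ** 2 / error)


if __name__ == "__main__":
    reference = [[0, 0], [0, 0]]
    damaged = [[0, 0], [0, 255]]  # one fully corrupted pixel out of four
    print(round(psnr(reference, damaged), 2))
```

The average PSNR of a video stream is then the mean of the per-frame values.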



Fig. 26. Simulation setup for evaluating the quality of reconstructed frames with different protection schemes on links.

## 2. Simulation Environment

Fig. 27. PSNR results for Akiyo test streams with different protection schemes on 2mm and 3mm long links.

Fig. 28. PSNR results for Foreman test streams with different protection schemes on 2mm and 3mm long links.

We first evaluate the PSNR of the reconstructed video streams with equal error protection (EEP), where the different crosstalk avoidance and error correction schemes described in Chapter III are employed to guard against crosstalk induced errors on link for all data partitions. A suite of test video streams is encoded and partitioned using H.264 video encoding software [56]. The encoded frames are then impaired by randomly flipping bits in accordance with the BER of the protection scheme employed on the link wires before they are sent to the sender/receiver circuit. The BER due to crosstalk induced errors on link wires with the different protection schemes can be found in Chapter III. The received frames are sent to an H.264 software decoder [56], which reconstructs the video frames so that the average PSNR can be measured. Damaged macro-blocks are concealed using motion copy [57] during the frame reconstruction process. Fig. 26 depicts the schematic of the setup used for evaluating the quality of reconstructed video streams with different protection schemes on link under EEP.
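The impairment step, in which bits of an encoded frame are flipped at random according to a given BER, can be sketched as follows. This is an illustrative sketch only: the function name `flip_bits` and the assumption that every bit is flipped independently with probability equal to the BER are not taken from the simulation setup itself.

```python
import random


def flip_bits(frame: bytes, ber: float, rng: random.Random) -> bytes:
    """Flip each bit of the frame independently with probability ber."""
    damaged = bytearray(frame)
    for index in range(len(damaged)):
        for bit in range(8):
            if rng.random() < ber:
                damaged[index] ^= 1 << bit
    return bytes(damaged)


if __name__ == "__main__":
    rng = random.Random(2011)  # fixed seed so that a run is repeatable
    encoded = bytes(range(16))  # stand-in for an encoded frame
    received = flip_bits(encoded, 1e-2, rng)
    flipped = sum(bin(a ^ b).count("1") for a, b in zip(encoded, received))
    print(flipped, "of", len(encoded) * 8, "bits flipped")
```

Passing the random number generator explicitly keeps the experiment reproducible across protection schemes, since each scheme can be evaluated with the same seed.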

## 3. PSNR and Merit

Fig. 29. PSNR results for Mobile test streams with different protection schemes on 2mm and 3mm long links.

Fig. 30. Merit of different protection schemes on 2mm long link wires.

Fig. 27, Fig. 28 and Fig. 29 show the average PSNR of the reconstructed Akiyo, Foreman and Mobile test video streams respectively, when different schemes are used to guard against crosstalk induced errors on 2mm and 3mm long link wires. In the baseline design, the link wires are driven by 4X sized driving repeaters that are present every 1mm of link length. A description of the different schemes for which average PSNRs have been reported here can be found in Chapter III. From Fig. 27, Fig. 28 and Fig. 29, it may be observed that the average PSNR of the reconstructed video streams is unacceptable in the baseline design due to its high BER, making it imperative to use some protection scheme to reduce errors due to crosstalk. Amongst all the designs studied, the reconstructed frames had the highest average PSNR values when the TransSync and RecSync schemes were used to combat crosstalk on link wires. This trend is more apparent when PEs on the network are separated by 3mm long links. While RecSync clearly performs better than all other schemes, the average PSNR of the reconstructed streams with the XOR and TEC protection schemes is somewhat similar to that of TransSync. XOR and TEC, however, come at the cost of increased area and energy consumption overhead. To get a more holistic picture of the performance of any protection scheme and its associated overhead, we define a design metric called *Merit* (Eqn. 4.3), which is the ratio of the average PSNR of the reconstructed video stream with the protection scheme employed on link to the product of the normalized area and normalized energy consumption of the scheme. The area and energy consumption of all schemes are normalized to those of the baseline design to obtain the Merit figures.

$$Merit = \frac{PSNR}{NormalizedArea \times NormalizedEnergy}$$
(4.3)

The design metric *Merit* proposed here accords equal importance to the area and energy overheads of the protection schemes. The metric may be easily modified, depending on the design requirements, when only one of the two overheads is of importance, and the scheme with the best *Merit* may be chosen to obtain the best performance within the given constraints on area or power budget. Fig. 30 and Fig. 31 show the *Merit* of different protection schemes for 2mm and 3mm long link wires. From these figures, it may be noted that although the PSNR performance of the TransSync and RecSync schemes is better than that of the TransSync 2 lines scheme, the TransSync 2 lines scheme has the best *Merit* for all the test video streams studied because of its small energy and area overheads. Also, while the PSNR performance of XOR and TEC is similar to that of the TransSync and RecSync schemes, the *Merit* of these schemes is small because of their large area and energy overheads.
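Eqn. 4.3 translates directly into code. The sketch below is illustrative: the function name `merit` and the numbers in the usage example are hypothetical placeholders, not measurements from this work.

```python
def merit(psnr_db: float, area: float, energy: float,
          baseline_area: float, baseline_energy: float) -> float:
    """Merit as in Eqn. 4.3: PSNR over normalized area times normalized energy."""
    normalized_area = area / baseline_area
    normalized_energy = energy / baseline_energy
    return psnr_db / (normalized_area * normalized_energy)


if __name__ == "__main__":
    # Placeholder values: a scheme with twice the baseline area and 1.5 times
    # the baseline energy that delivers an average PSNR of 36 dB.
    print(merit(36.0, area=2.0, energy=3.0,
                baseline_area=1.0, baseline_energy=2.0))
```

If only one overhead matters for a given design, the corresponding normalized term can simply be dropped from the denominator, as noted in the text.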



Fig. 31. Merit of different protection schemes on 3mm long link wires.

| Data Partition | Foreman | Akiyo | Mobile |
|----------------|---------|-------|--------|
| $Priority_A$   | 44.8%   | 44.6% | 47.1%  |
| $Priority_B$   | 29.6%   | 40.4% | 19.8%  |
| $Priority_C$   | 25.4%   | 14.6% | 32.9%  |

Table IX. Distribution of priority parts in test video streams.

## 4. UEP Results

Having analyzed the PSNR performance and the overheads of different protection schemes on link with equal error protection (EEP) for all data priority partitions, we now investigate the benefits of unequal error protection (UEP) on link. The partition and prioritization for H.264 data is shown in Fig. 25, and the distribution of priority parts in the test video streams studied is shown in Table IX.

Table X shows the different configurations of protection schemes studied to evaluate the benefits of UEP on link against crosstalk induced errors. We choose to implement UEP with only the TransSync 2 lines, TransSync and RecSync schemes, since these schemes have the highest *Merit* (Fig. 30, Fig. 31) amongst all the schemes studied. Also, it is much simpler to dynamically switch these protection schemes on and off for different priority data on the same link.

| Configuration | $Priority_A$ | $Priority_B$ | $Priority_C$ |
|---------------|--------------|--------------|--------------|
| Case 1        | Baseline     | Baseline     | Baseline     |
| Case 2        | TransSync    | Baseline     | Baseline     |
| Case 3        | TransSync    | TransSync    | Baseline     |
| Case 4        | TransSync    | TransSync    | TransSync    |
| Case 5        | TransSync 2l | Baseline     | Baseline     |
| Case 6        | TransSync 2l | TransSync 2l | Baseline     |
| Case 7        | TransSync 2l | TransSync 2l | TransSync 2l |
| Case 8        | RecSync      | Baseline     | Baseline     |
| Case 9        | RecSync      | RecSync      | Baseline     |
| Case 10       | RecSync      | RecSync      | RecSync      |
| Case 11       | RecSync      | TransSync    | Baseline     |
| Case 12       | RecSync      | TransSync    | TransSync    |
| Case 13       | RecSync      | RecSync      | TransSync    |
| Case 14       | RecSync      | TransSync 2l | Baseline     |
| Case 15       | RecSync      | TransSync 2l | TransSync 2l |
| Case 16       | RecSync      | RecSync      | TransSync 2l |

Table X. Protection schemes employed on link with data partitions for different UEP configurations studied (TransSync 2l denotes TransSync 2 lines).

Fig. 32. Results for 2mm long link wires for the UEP schemes analyzed.

Applying UEP on link wires to the transmitted H.264 video data only makes sense if the BER performance of the scheme employed for protecting higher priority data parts is better than that of the protection schemes used for lower priority data parts. Consequently, more energy is typically spent on securing the higher priority parts of the data. Table X lists all the UEP configurations obtained by combining RecSync, TransSync and TransSync 2 lines such that the protection scheme for a lower priority part is never stronger than that for a higher priority part. Case 4, Case 7 and Case 10 are EEP schemes employing the same level of protection for all priority data parts. Case 15 is an example of UEP which uses the RecSync module for eliminating intra-flit skew for only $Priority_A$ data parts. During the transmission of $Priority_B$ and $Priority_C$ data parts in Case 15, the RecSync module is switched off at the receiving node and the TransSync 2 lines circuit is switched on at the transmitting node.
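The sixteen configurations of Table X can be reproduced by a short enumeration. The sketch below rests on two assumptions that are inferred from the table rather than stated in the text: the schemes are ranked Baseline, TransSync 2 lines, TransSync, RecSync in increasing order of protection, and TransSync and TransSync 2 lines are never combined within the same configuration. The names `STRENGTH` and `uep_configurations` are hypothetical.

```python
from itertools import product
from typing import List, Tuple

# Assumed ranking of protection strength, inferred from Table X.
STRENGTH = {"Baseline": 0, "TransSync 2l": 1, "TransSync": 2, "RecSync": 3}


def uep_configurations() -> List[Tuple[str, str, str]]:
    """Enumerate (Priority_A, Priority_B, Priority_C) scheme assignments."""
    configs = []
    for a, b, c in product(STRENGTH, repeat=3):
        # A lower priority part never gets stronger protection.
        non_increasing = STRENGTH[a] >= STRENGTH[b] >= STRENGTH[c]
        # Assumed rule: the two TransSync variants are not mixed.
        mixes_variants = {"TransSync", "TransSync 2l"} <= {a, b, c}
        if non_increasing and not mixes_variants:
            configs.append((a, b, c))
    return configs


if __name__ == "__main__":
    for config in uep_configurations():
        print(config)
```

The enumeration order differs from the case numbering in Table X, but the set of configurations is the same.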

Fig. 33. Results for 3mm long link wires for the UEP schemes analyzed.

Fig. 32 and Fig. 33 present the BER performance of the different UEP configurations analyzed, arranged in increasing order of their average power consumption from left to right, for 2mm and 3mm long links respectively. The tolerable average BER level varies between applications. Fig. 32 and Fig. 33 enable the designer to select the UEP configuration with the least energy consumption that meets a specified BER performance. For example, if the PEs are placed 3mm apart on the die and the application requires a minimum average PSNR of 35dB (dotted yellow line in Fig. 33) for Akiyo streams, then the leftmost UEP configuration in Fig. 33 for which the average PSNR is greater than 35dB should be chosen as the solution. For the given example, Case 15 offers an average PSNR of 35.04dB at approximately 20% lower overall energy consumption than the configuration that offers 38dB of average PSNR. We have therefore presented a design methodology which allows designers to achieve the required performance levels with the least energy consumption using unequal error protection on link wires against crosstalk induced errors.
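The selection rule described above, picking the leftmost configuration that meets the PSNR requirement once the configurations are ordered by power, can be sketched as follows. The function name `cheapest_config`, the configuration labels and all numbers in the usage example are hypothetical placeholders rather than data from Fig. 32 or Fig. 33.

```python
from typing import Optional, Sequence, Tuple

# Each entry: (configuration name, relative power, average PSNR in dB).
Result = Tuple[str, float, float]


def cheapest_config(results: Sequence[Result],
                    min_psnr_db: float) -> Optional[str]:
    """Return the lowest-power configuration that meets the PSNR target."""
    for name, _power, psnr_db in sorted(results, key=lambda r: r[1]):
        if psnr_db >= min_psnr_db:
            return name
    return None  # no configuration satisfies the requirement


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    measured = [
        ("config_c", 1.00, 38.0),
        ("config_a", 0.60, 31.5),
        ("config_b", 0.80, 35.2),
    ]
    print(cheapest_config(measured, 35.0))
```

Sorting by power first makes the result independent of the order in which the configurations were measured.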

#### CHAPTER V

#### CONCLUSION

With the advent of Network on Chip (NoC) designs where communicating processing elements (PEs) are placed on separate nodes, communication latency is critical to the total application execution time, and more and more new algorithms are communication limited rather than processing limited. Several recent works have proposed NoC designs with fast asynchronous bypass paths to reduce the average latency for communication between interacting nodes. Such designs are, however, plagued by intra-flit skew, which leads to violation of setup and hold constraints and therefore to packet errors. We have presented the TransSync and RecSync schemes to combat intra-flit skew on long link wires. The TransSync design skews signal transitions intelligently at the transmitter before they are transmitted to minimize intra-flit skew, while RecSync eliminates the accrued intra-flit skew at the receiver by waiting for the last arriving transition.

Chapter II presents the TransSync and RecSync schemes for combating intra-flit skew in flits that have to travel several hops on fast asynchronous bypass paths in low latency NoC designs. In Chapter III, a formal analysis of the bit error ratio (BER) performance of a host of crosstalk prevention and error correcting schemes has been carried out, and their performance and overheads have been compared to those of TransSync-RecSync. Chapter IV studies the benefits of unequal error protection (UEP) on link for an MPSoC video decoder design using a suite of H.264/AVC test video streams.

To summarize, this work demonstrates the following:

- TransSync and RecSync help in reducing errors resulting from crosstalk induced timing failures.
- True low latency communication leveraging asynchronous bypass on long paths in NoC designs is possible with much smaller overheads and router modifications when compared to most existing solutions.
- The BER performance in the presence of crosstalk and variations for several existing crosstalk prevention and error correcting schemes has been analyzed and compared to that of the TransSync and RecSync schemes.
- It has been shown that unequal error protection (UEP) against crosstalk induced errors on link wires can result in considerable power savings with an acceptable degradation in performance.
- UEP on link using the TransSync, TransSync 2 lines and RecSync schemes has been demonstrated for a video decoder on NoC with H.264 video test streams. For Akiyo test streams transmitted over 3mm long link wires, UEP can lead to as much as 20% power savings with 3dB of degradation in average PSNR.

## REFERENCES

- [1] K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson and K. Chan, "The case for a single-chip multiprocessor," in *Proc. of 7th Int. Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII)*, Cambridge, MA, Oct. 1996, pp. 2-11.
- [2] J. Kang, K. Rim, "On-chip multiprocessor design," in *Tech. Papers of 1995 Int. Symp. on VLSI Technology, Systems, and Applications*, Taipei, Taiwan, June 1995, pp. 230-234.
- [3] W. Wolf, A. A. Jerraya, G. Martin, "Multiprocessor system-on-chip (MPSoC) technology," in *IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems*, vol. 27, no. 10, pp. 1701-1713, 2008.
- [4] International Technology Roadmap for Semiconductors (ITRS). (2009, June).
   [Online]. Available: http://www.itrs.net/Links/2009ITRS/Home2009.htm/
- [5] P. Guerrier and A. Greiner, "A generic architecture for on-chip packet-switched interconnections," in *Proc. of Design, Automation and Test in Europe Conference*, Paris, France, 2000, pp. 250-256.
- [6] L. Benini, G. De Micheli, "Networks on chips: a new SoC paradigm," *Computer*, IEEE Computer Society, vol. 35, no. 1, pp. 70-78, Jan. 2002.
- [7] E. Bolotin, Z. Guz, I. Cidon, R. Ginosar, A. Kolodny, "The power of priority: NoC based distributed cache coherency," *First Int. Symp. on Networks-on-Chip (NOCS'07)*, Princeton, NJ, May 7-9, 2007, pp. 117-126.

- [8] A. Abousamra, R. Melhem, A. Jones, "NoC-aware cache design for chip multiprocessors," in Proc. of 19th Int. Conf. on Parallel Architectures and Compilation Techniques (PACT), Vienna, Austria, 2010, pp. 565-566.
- [9] W.J. Dally, B. Towles, "Route packets, not wires: on-chip interconnection networks," in *Proc. of 38th Design Automation Conference (DAC)*, Las Vegas, NV, Sep. 2001, pp. 684-689.
- [10] S. Kumar, A. Jantsch, J. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, A. Hemani, "A network on chip architecture and design methodology," *IEEE Computer Society Annual Symp. on VLSI*, Pittsburgh, PA, vol. 18, no. 11, 2002, pp. 105-112.
- [11] M. Mirza-Aghatabar, S. Koohi, S. Hessabi, M. Pedram, "An empirical investigation of mesh and torus NoC topologies under different routing algorithms and traffic models," 10th Euro-micro Conf. on Digital System Design Architectures, Methods and Tools (DSD), Lubeck, Germany, Aug. 2007, pp. 19-26.
- [12] L. Bononi, N. Concer, M. Grammatikakis, M. Coppola, R. Locatelli, "NoC topologies exploration based on mapping and simulation models," 10th Euromicro Conf. on Digital System Design Architectures, Methods and Tools (DSD), Lubeck, Germany, Aug. 2007, pp. 543-546.
- [13] D. Chapiro, "Globally-asynchronous locally-synchronous systems," Ph.D. dissertation, Dept. of Comp. Sci., Stanford University, Palo Alto, CA, 1984.
- [14] K. Yun and R. Donohue, "Pausible clocking: A first step toward heterogeneous systems," in Proc. of IEEE Int. Conf. on Computer Design: VLSI in Computers and Processors (ICCD), Austin, TX, 1996, pp. 118-127.

- [15] T. Meincke, A. Hemani, S. Kumar, P. Ellervee, J. Oberg, T. Olsson, P. Nilsson, D. Lindqvist, and H. Tenhunen, "Globally asynchronous locally synchronous architecture for large high-performance ASICs," in *Proc. of Int. Symp. on Circuits and Systems (ISCAS)*, Orlando, FL, 1999, pp. 512-515.
- [16] J. Muttersbach, T. Villiger, and W. Fichtner, "Practical design of globally-asynchronous locally-synchronous systems," in *Proc. of 6th Int. Symp. Advanced Research in Asynchronous Circuits and Systems (ASYNC)*, Eilat, Israel, 2000, pp. 52-59.
- [17] W.J. Dally, "Express Cubes: Improving the performance of K-ary N-cube interconnection networks," *IEEE Trans. on Computers*, vol. 40, no. 9, pp. 1016-1023, Sep. 1991.
- [18] U.Y. Ogras and R. Marculescu, "Application-specific network-on-chip architecture customization via long-range link insertion," in *Proc. of Int. Conf. on Computer-Aided Design (ICCAD)*, San Jose, CA, 2005, pp. 246-253.
- [19] T. Krishna, A. Kumar, L. Peh, J. Postman, P. Chiang, M. Erez, "Express virtual channels with capacitively driven global links," *42nd Annual Int. Symp. on Microarchitecture (MICRO-42)*, New York, NY, vol. 29, no. 4, 2009, pp. 48-61.
- [20] T. Jain, P. Gratz, A. Sprintson, and G. Choi, "Asynchronous bypass channels: improving performance for multi-synchronous NoCs," in *Int. Symp. on Networks-on-Chip (NOCS)*, Grenoble, France, 2010.
- [21] V. Soteriou, W. Hangsheng, L. Peh, "A statistical traffic model for on-chip interconnection networks," 14th IEEE Int. Symp. on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Monterey, CA, Sept. 2006, pp. 104-116.

- [22] K. Sankaralingam, R. Nagarajan, L. Haiming, K. Changkyu, H. Jaehyuk, D. Burger, S.W. Keckler, C.R. Moore, "Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture," *Proc. of 30th Annual Int. Symp. on Computer Architecture*, San Diego, CA, June 2003, pp. 422-433.
- [23] M.B. Taylor, J. Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, A. Agarwal, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. Johnson, J. Kim, "Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams," *Proc. of 31st Annual Int. Symp. on Computer Architecture*, Munich, Germany, June 2004, pp. 2-13.
- [24] M. Vachharajani, N. Vachharajani, D.A. Penry, J.A. Blome, D.I August, "Microarchitectural exploration with Liberty," *Proc. of 35th Annual IEEE/ACM Int. Symp. on Microarchitecture (MICRO-35)*, Istanbul, Turkey, 2002, pp. 271-282.
- [25] A.T. Tran, D.N. Truong, B.M. Baas, "A low-cost high-speed source-synchronous interconnection technique for GALS chip multiprocessors," in *IEEE Int. Symp. on Circuits and Systems (ISCAS)*, Taipei, Taiwan, May 2009, pp. 996-999.
- [26] A.T. Tran, D.N. Truong, B.M. Baas, "A GALS many-core heterogeneous DSP platform with source-synchronous on-chip interconnection network," in *3rd ACM/IEEE Int. Symp. on Networks-on-Chip (NoCS)*, Grenoble, France, May 2009, pp. 214-223.
- [27] A.T. Tran, D.N. Truong, B.M. Baas, "A reconfigurable source-synchronous on-chip network for GALS many-core platforms," in *IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems*, vol. 29, no. 6, pp. 897-910, June 2010.
- [28] D. Ludovici, A. Strano, G.N. Gaydadjiev and D. Bertozzi, "Mesochronous NoC technology for power-efficient GALS MPSoCs," in *Proc. of the Fifth Int. Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip (INA-OCMC)*, Heraklion, Greece, May 2011, no. 4, pp. 27-30.
- [29] P. Gratz, C. Kim, R. McDonald, S. Keckler, and D. Burger, "Implementation and evaluation of on-chip network architectures," in *Proc. of Int. Conf. on Computer Design (ICCD)*, San Jose, CA, 2006, pp. 1-1.
- [30] J. Kim, W. Dally, B. Towles, and A. Gupta, "Microarchitecture of a high radix router," in *Proc. of 32nd Int. Symp. on Computer Architecture (ISCA)*, Madison, WI, 2005, pp. 420-431.
- [31] J. Kim, J.D. Balfour, and W. J. Dally, "Flattened butterfly topology for on-chip networks," in Proc. of 40th Annual IEEE/ACM Int. Symp. on Microarchitecture (MICRO-40), Chicago, IL, 2007, pp. 172-182.
- [32] W. Hu, S. Lee, and N. Bagherzadeh, "Dmesh: a diagonally-linked mesh network-on-chip architecture," in *1st Int. Workshop on Network on Chip Architectures (NoCArc)*, Lake Como, Italy, 2008.
- [33] B. Grot, J. Hestness, S. Keckler, and O. Mutlu, "Express cube topologies for on-chip interconnects," in Proc. of 15th IEEE Int. Symp. on High Performance Computer Architecture (HPCA), Raleigh, NC, 2009, pp. 163-174.
- [34] C. Duan, A. Tirumala and S.P. Khatri, "Analysis and avoidance of crosstalk in on-chip buses," in *Hot Interconnects*, Palo Alto, CA, Aug 2001, pp. 133-138.
- [35] C.J. Alpert, A. Devgan and S.T. Quay, "Buffer insertion for noise and delay optimization," *IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems*, vol. 18, no. 11, pp. 1633-1645, Nov. 1999.
- [36] Z. Hai and D.F. Wong, "Global routing with crosstalk constraints," *IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems*, vol. 18, no. 11, pp. 1683-1688, Nov. 1999.
- [37] K. Hirose and H. Yasuura, "A bus delay reduction technique considering crosstalk," in Proc. of Design, Automation and Test in Europe Conf. and Exh. (DATE), Paris, France, 2000, pp. 441-445.
- [38] R. Arunachalam, E. Acar and S.R. Nassif, "Optimal shielding/spacing metrics for low power design," *IEEE Computer Society Annual Symp. on VLSI*, Tampa, FL, 2003, pp. 167-172.
- [39] H. Kaul, D. Sylvester and D. Blaauw, "Performance optimization of critical nets through active shielding," *IEEE Trans. on Circuits and Systems I: Regular Papers*, vol. 51, no. 12, pp. 2417-2435, Dec. 2004.
- [40] B. Victor and K. Keutzer, "Bus encoding to prevent crosstalk delay," in Proc. of Int. Conf. on Computer-Aided Design (ICCAD), San Jose, CA, 2001, pp. 57-57.
- [41] P. Pande, H. Zhu, A. Ganguly, and C. Grecu, "Crosstalk-aware energy reduction in NoC communication fabrics," in *Proc. of IEEE Int. SOC Conf.*, 2006, Austin, TX, pp. 225-228.
- [42] Z. Ying, L. Huawei and L. Xiaowei, "Reliable network-on-chip router for crosstalk and soft error tolerance," in 17th Asian Test Symp. (ATS), Sapporo, Japan, 2008, pp. 438-443.
- [43] L. Lin, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, "A crosstalk aware interconnect with variable cycle transmission," *Design, Automation and Test in Europe Conf. and Exh. (DATE)*, Paris, France, 2004, pp. 102-107.

- [44] H. Sankaran and S. Katkoori, "On-chip dynamic worst-case crosstalk pattern detection and elimination for bus-based macro-cell designs," *Quality of Electronic Design (ISQED)*, Santa Clara, CA, pp. 33-39, 2009.
- [45] Nanoscale Integration and Modeling (NIMO) Group, ASU. (cited 2010, June). Predictive Technology Model (PTM) [Online]. Available: http://ptm.asu.edu/
- [46] S.R. Sridhara, N.R. Shanbhag, "Coding for reliable on-chip buses: a class of fundamental bounds and practical codes," *IEEE Trans. on CAD of Integrated Circuits and Systems*, vol. 26, pp. 977-982, May 2007.
- [47] P. Pande, A. Ganguly, B. Feero, B. Belzer, C. Grecu, "Design of low power and reliable networks on chip through joint crosstalk avoidance and forward error correction coding," *IEEE Int. Symp. on Defect and Fault-Tolerance in VLSI Systems*, Arlington, VA, Oct. 2006, pp. 466-476.
- [48] A. Ganguly, P. Pande, B. Belzer, C. Grecu, "Addressing signal integrity in networks on chip interconnects through crosstalk-aware double error correction coding," *Proc. of the IEEE Computer Society Annual Symp. on VLSI*, Porto Alegre, Brazil, Mar. 2007, pp. 799-808.
- [49] S. Sridhara and N. Shanbhag, "Coding for system-on-chip networks: a unified framework," *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, vol. 13, no. 6, pp. 655-667, June 2005.
- [50] P. Huang, W. Fang, Y. Wang and W. Hwang, "Low power and reliable interconnection with self-corrected green coding scheme for network-on-chip," 2nd ACM/IEEE International Symp. on Networks-on-Chip (NoCS), Newcastle University, UK, Apr. 2008, pp. 77-83.
- [51] R. Hegde and N. R. Shanbhag, "Toward achieving energy-efficiency in presence of deep sub-micron noise," *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, vol. 8, no. 4, pp. 379-391, Aug. 2000.
- [52] B. Fu, "Crosstalk-aware multiple error control for reliable on-chip interconnects," Ph.D. dissertation, Dept. of Elec. and Comp. Eng., University of Rochester, New York, 2010.
- [53] International Telecommunication Union Telecommunication Standardization Sector (ITU-T). (cited 2010, Dec.). H.264: Advanced video coding for generic audio-visual services [Online]. Available: http://www.itu.int/rec/T-REC-H.264
- [54] W. Heinzelman, M. Budagavi and R. Talluri, "Unequal error protection of MPEG-4 compressed video," Proc. of Int. Conf. on Image Processing (ICIP), Kobe, Japan, 1999, vol. 2, pp. 530-534.
- [55] T. Stockhammer and M. Bystrom, "H.264/AVC data partitioning for mobile video communication," Int. Conf. on Image Processing (ICIP), Singapore, Oct. 2004, vol. 1, pp. 545-548.
- [56] Video Coding Experts Group (VCEG). (2008). H.264/AVC JVT JM-14.2 [Online]. Available: http://iphome.hhi.de/suehring/tml/download/
- [57] W. Zhenyu, J.M. Boyce, "An error concealment scheme for entire frame losses based on H.264/AVC," Proc. of IEEE International Symposium on Circuits and Systems (ISCAS), 2006, pp. 4-8.

- [58] R. Singhal, "Data integrity for on-chip interconnects," Ph.D. dissertation, Dept. of Elec. and Comp. Eng., Texas A&M University, College Station, 2007.
- [59] M.G. Johnson, E.L. Hudson, "A variable delay line PLL for CPU-coprocessor synchronization," *IEEE Journal of Solid-State Circuits*, vol. 23, no. 5, pp. 1218-1223, 1988.
- [60] C. Michael and M. Ismail, Statistical Modeling for Computer-aided Design of MOS VLSI Circuits, Boston: Kluwer Academic, 1993.
- [61] B.E. Stine, D.S. Boning, J.E. Chung, "Analysis and decomposition of spatial variation in integrated circuit processes and devices," in *IEEE Trans. on Semiconductor Manufacturing*, vol. 10, no. 1, pp. 24-41, 1997.
- [62] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, V. De, "Parameter variations and impact on circuits and microarchitecture," in *Proc. of Design Automation Conference (DAC)*, Anaheim, CA, June 2003, pp. 338-342.
- [63] S.R. Nassif, "Modeling and analysis of manufacturing variations," Proc. of IEEE Conf. on Custom Integrated Circuits, San Diego, CA, 2001, pp. 223-228.
- [64] K. Bernstein, K.M. Carrig, C.M. Durham, P.R. Hansen, D. Hogenmiller, E.J. Nowak, N.J. Rohrer, *High Speed CMOS Design Styles*, Boston: Kluwer Academic, 1998.
- [65] C. Yu, H.Y. Liu and C.J. Spanos, "Patterning tool characterization by causal variability decomposition," in *IEEE Trans. on Semiconductor Manufacturing*, vol. 9, no. 4, pp. 527-535, 1996.
- [66] A. Das, H. De, V. Misra, S. Venkatesan, S. Veeraraghavan, M. Foisy, "Effects of halo implant on hot carrier reliability of sub-quarter micron MOSFETs," in Proc. of 36th IEEE Int. Annual Reliability Physics Symposium, Reno, NV, Mar. 1998, pp. 189-193.
- [67] I. De, C.M. Osburn, "Impact of super-steep-retrograde channel doping profiles on the performance of scaled devices," in *IEEE Trans. on Electron Devices*, vol. 46, no. 8, pp. 1711-1717, Aug. 1999.
- [68] W. Poppe, L. Capodieci, J. Wu and A. Neureuther, "From poly line to transistor: building BSIM models for non-rectangular transistors," *Design and Process Integration for Microelectronic Manufacturing IV, Proc. of the SPIE*, vol. 6156, pp. 235-243, 2006.
- [69] U. Narasimha, B. Abraham, N. NS, "Statistical analysis of capacitance coupling effects on delay and noise," in 7th Int. Symp. on Quality Electronic Design (ISQED), San Jose, CA, Mar. 2006, pp. 795-800.
- [70] J. Chern, J. Huang, L. Arledge, P. Li and P. Yang, "Multilevel metal capacitance models for CAD design synthesis systems," in *Electron Device Letters*, vol. 13, no. 1, pp. 32-34, 1992.
- [71] S. Lakshminarayanan, P.J. Wright, J. Pallinti, "Electrical characterization of the copper CMP process and derivation of metal layout rules," in *IEEE Trans. on Semiconductor Manufacturing*, vol. 16, no. 4, pp. 668-676, 2003.
- [72] C. Ouyang, K. Ryu, L. Milor, W. Maly, G. Hill, Y. Peng, "An analytical model of multiple ILD thickness variation induced by interaction of layout pattern and CMP process," in *IEEE Trans. on Semiconductor Manufacturing*, vol. 13, no. 3, pp. 286-292, 2000.
- [73] C. Mingjing, A. Orailoglu, "Cost-effective IR-drop failure identification and yield recovery through a failure-adaptive test scheme," in *Design Automation & Test in Europe Conf. & Exh. (DATE)*, Dresden, Germany, Mar. 2010, pp. 63-68.
- [74] K.N. Tu, "Recent advances on electromigration in very-large-scale-integration of interconnects," in *Journal of Applied Physics*, vol. 94, no. 9, pp. 5451-5473, 2003.
- [75] P. Heydari, M. Pedram, "Ground bounce in digital VLSI circuits," in *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, vol. 11, no. 2, pp. 180-193, 2003.
- [76] L. Zhang, Y. Hu, C. Chen, G. Russell, "Statistical timing analysis in sequential circuit for on-chip global interconnect pipelining," in *Proc. of 41st Design Automation Conf. (DAC)*, San Diego, CA, 2004, pp. 904-907.
- [77] B. Halak, S. Shedabale, H. Ramakrishnan, A. Yakovlev, G. Russell, "The impact of variability on the reliability of long on-chip interconnect in the presence of crosstalk," in *Proc. of the 2008 Int. Workshop on System Level Interconnect Prediction (SLIP)*, Newcastle, United Kingdom, 2008, pp. 65-72.
- [78] D.C. Montgomery, Design and Analysis of Experiments, 5th edition, Danvers, MA: John Wiley & Sons, 2001.
- [79] K. Agarwal, M. Agarwal, D. Sylvester, D. Blaauw, "Statistical interconnect metrics for physical-design optimization," in *IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems*, vol. 25, no. 7, pp. 1273-1288, 2006.

## VITA

Name: Reeshav Kumar

Address: C/O Dr. Seong Gwan Choi, Computer Engineering Group, Department of Electrical and Computer Engineering, WERC 333C, MS 3259, Texas A&M University, College Station, TX 77843-3259.

Email Address: reeshav123@tamu.edu

Education: B.E., Electronics & Instrumentation Engineering,
 Birla Institute of Technology and Science, India, 2009.
 M.S., Electrical Engineering,
 Texas A&M University, 2011.

The typist for this thesis was the author.