Global clock distribution in the SiLago platform


DEGREE PROJECT IN ELECTRICAL ENGINEERING SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN – 2020


Jordi Altayó

KTH ROYAL INSTITUTE OF TECHNOLOGY


Author

Jordi Altayó, <jordiag@kth.se>

EECS School / Electrical Engineering Department
Division of Electronics and Embedded Systems
KTH Royal Institute of Technology

Examiner

Prof. Ahmed Hemani, <hemani@kth.se>

Supervisor

Dimitrios Stathis, <stathis@kth.se>


Abstract

The extreme evolution of Very Large Scale Integration (VLSI) design has followed Moore's law for the past decades, which predicts a doubling of the number of transistors that can be implemented on a chip every 18 months. However, tightly coupled with the evolution of the technology capabilities, the complexity of implementing such designs has also increased dramatically. Several solutions have been proposed to cope with this problem; one of them, the SiLago platform, is currently being developed by the VLSI design group at KTH.

The SiLago platform is a framework that enables an efficient VLSI design methodology by providing a set of tools and libraries capable of generating ready-to-manufacture Application Specific Integrated Circuit (ASIC) designs from a high-level description. The physical design of a SiLago design is achieved using pre-characterized, hardened, abuttable, micro-architectural blocks that are placed during the synthesis process.

This design methodology gives rise to a set of problems, one of them being the distribution of a valid clock signal that reaches all the sinks. Given the nature of the designs that the SiLago platform intends to tackle, a fully synchronous design style can be considered impractical and unachievable, so alternative approaches and methods had to be adopted.

This work proposes a methodology for distributing a valid clock signal through the global Network-on-Chip (NoC) of a SiLago design. By analysing the timing paths in each of the NoC edges, a set of rules is derived from standard Static Timing Analysis (STA) methods. Additionally, by using a previously developed GALS-related interface type, named Globally-Ratiochronous Locally-Synchronous (GRLS), the distribution methodology can cope with latency-insensitive paths and allows fine-grained frequency scaling in different SiLago regions.


Sammanfattning (Summary)

The extreme development of Very Large Scale Integration (VLSI) design has followed Moore's law over the past decades, which predicts that the number of transistors that fit on a chip doubles every 18 months. In close connection with the increased technological capability, the complexity of VLSI design has also increased dramatically. Several solutions have been proposed to handle this problem. One of them is currently being developed at KTH by the VLSI design group and goes by the name of the SiLago platform.

The SiLago platform is a framework that enables an efficient VLSI design methodology by providing tools and libraries capable of generating ready-to-manufacture Application Specific Integrated Circuit (ASIC) designs from a high-level description. The physical part of a SiLago design is achieved by using pre-characterised, hardened, abuttable, micro-architectural blocks that are placed during the synthesis process.

This design methodology causes a set of problems, one of which is the distribution of a valid clock signal to all end points. Given the designs that the SiLago platform intends to handle, a fully synchronous design can be considered impractical and unachievable. Alternative solution methods had to be used to overcome this problem.

This work proposes a methodology for distributing a valid clock signal through the global Network-on-Chip (NoC) in a SiLago design. By analysing the different paths between each NoC edge, a set of rules can be derived from standard Static Timing Analysis (STA) methods. By using a previously developed GALS-related interface, called Globally-Ratiochronous Locally-Synchronous (GRLS), the distribution methodology can cope with latency-insensitive paths and allow fine-grained frequency scaling in different SiLago regions.


List of Figures

1.1 Robert Noyce holding one of the first monolithic IC masks. [10]

1.2 Evolution of the number of logic gates integrated in a chip. [14]

1.3 Transistor count evolution for real-world designs

2.1 DRRA fabric.

2.2 DiMArch structure.

2.3 Structure and connections by abutment.

2.4 SiLago floorplan

3.1 Graph representation of a NoC.

3.2 Clock tree spanning through the NoC wires.

3.3 Concepts of global clock distribution.

3.4 Flop-to-flop synchronous communication.

3.5 Flop-to-flop asynchronous communication.

3.6 Communication between NIs.

3.7 GRLS transmitter

3.8 GRLS receiver

3.9 Type 1 path segmentation.

3.10 Type 2 path segmentation.

3.11 Wire block.

3.12 Buffer block.

3.13 Register block.

3.14 GRLS block.

3.15 Clock distribution example.


List of Algorithms

1 Transmitter regulator.

2 Type 1 path segmentation.

3 Type 2 path segmentation.


Chapter 1

Introduction

1.1 Historical background

The first two sections of the introduction aim to provide an understanding of the evolution of both ASICs and Electronic Design Automation (EDA) tools. The evolution of these two topics is tightly coupled, and a symbiotic relationship emerges: the evolution of the tools allows better designs to be realistically implemented, and the constant evolution of IC technology drives the need for better EDA tools.

1.1.1 Evolution of ASIC designs

The first Integrated Circuit (IC) appeared in 1958 or 1959, depending on which definition of an IC is used. Figure 1.1 shows Robert Noyce, co-founder of Intel®, holding masks for one of the first ICs. The initial concept of what is known as an Application Specific Integrated Circuit appeared in the early 1980s [13]. By then, the available semiconductor technology was capable of integrating a useful number of transistors inside a chip. The only business model available to semiconductor manufacturers at that time was to imagine what the market needed, design it, manufacture it and then sell it on the open market to multiple customers. The problem was that semiconductor companies had a deep understanding of the manufacturing process but lacked system-level knowledge. At this point, companies started to conceive their own designs for specific applications and needed a way to manufacture them. This is how the Application Specific Integrated Circuit industry commenced.

The initial ASICs used gate array technology. A gate array is a prefabricated silicon chip with most transistors having no predetermined function. These transistors can be connected by metal layers to form standard NAND or NOR logic gates. These gates can then be further interconnected into a complete circuit on the same or later metal layers. The creation of a circuit with a specified function is accomplished by adding these final layer or layers of metal interconnects to the chips late in the manufacturing process, allowing the function of the chip to be customized as desired. Although this was a very significant improvement compared to being forced to rely on semiconductor industries to come up with the designs, the gate array structure presented some intrinsic limitations. To overcome these restraints the cell-based ASIC appeared.


Figure 1.1: Robert Noyce holding one of the first monolithic IC masks. [10]

[Figure 1.2 plot. Axis labels: Technology nodes (nm); log2 of the number of components per integrated function. Tick ranges: 1959 to 1975; 1 to 16.]

Figure 1.2: Evolution of the number of logic gates integrated in a chip. [14]

In a cell-based ASIC design, companies are provided with libraries, or lib-cells, that contain the building blocks used in the design. These lib-cells can provide a wide variety of functionalities, from standard cells to Input Output (IO) blocks, including, but not limited to, memories and Intellectual Property (IP) blocks. Using these cells, designers implemented the desired functionalities, which were then tested, laid out and sent for manufacturing.

Semiconductor evolution keeps following Moore's law (Figure 1.2, Figure 1.3). In 1965, Gordon E. Moore published a paper titled Cramming More Components onto Integrated Circuits, in which he predicted that the number of transistors that could be integrated in a single chip would double at a regular rate, popularly quoted as every 18 months [14]. Although this was a heuristic prediction, the trend is still present as of 2020. The current technology node manufactures 7 nm gate-length transistors, and the next node of 5 nm has already been announced by the International Technology Roadmap for Semiconductors (ITRS) to appear in 2020.

This extreme evolution has been made possible by the parallel evolution of EDA tools, which is discussed in the next section.


[Figure 1.3 plot. Axes: Years (1970 to 2020); Transistor count (10^3 to 10^11).]

Figure 1.3: Transistor count evolution for real-world designs

1.1.2 History of EDA Tools

Before EDA tools existed, integrated circuits were designed and laid out by hand. Geometric Computer-Aided Design (CAD) software was used to generate the tapes that were later sent out to the manufacturing companies. As more powerful chips appeared, CAD tools had to provide the functionality needed in the design process, giving birth to what is known as Electronic Design Automation (EDA) tools.

The beginning of EDA tools can be traced back to the early 1980s [15], around the same time as the appearance of ASIC designs. For many years companies had been developing their own internal tools, but around these years EDA software began to be sold to third-party companies. The evolution of EDA tools has made possible the evolution of the semiconductor industry, which in turn facilitated the evolution of EDA tools by providing more powerful chips that could automate parts of the design process, such as Place and Route (PnR). Hence the symbiotic, chicken-and-egg relationship mentioned above.

Today, the EDA industry is completely independent of the manufacturing industry and is estimated to be a $5 billion industry by itself. Compared to the estimated value of the semiconductor industry, around $400 billion as of 2017, it is clear that the latter is much bigger. However, it should be noted that the EDA industry is dominated by a few very large companies, such as Synopsys® or Cadence®, which concentrates the market and limits its growth compared to the semiconductor industry.

New challenges emerge in the evolution of both ASICs and EDA, notably the emerging Artificial Intelligence (AI) paradigm, which will probably generate a new chicken-and-egg problem: new ASIC architectures will support the evolution of AI, which will probably later contribute to the design process in ways that have not yet been discovered.

1.2 Problem statement

Design a methodology to distribute the clock signal given a SiLago floorplan. The clock signal should travel through the global NoC blocks. The methodology should also include the verification of the data transmissions to ensure correct timing behaviour and reliable communication.


1.3 Project goals

The goals of the project are divided into main and secondary goals and ordered by priority.

Main goals

• Develop a methodology for distributing the global clock signal in a SiLago design.

• Test the developed methodology to ensure correct data transfers between all elements of the NoC.

• Design a test-bench generator that can create randomised SiLago floorplans so that the proposed clock distribution methodology can be tested.¹

Secondary goals

• Generate concepts of the SiLago blocks that will be required to implement the clock distribution methodology.

• Low level design of the mentioned blocks.

The main goals are the ones considered essential for the completion of the thesis work. They form the core of the work developed during the thesis and will serve as a starting point for future work on the topic. The secondary goals, on the other hand, are considered non-essential for the completion of the thesis and, even though they are of great importance, have been given a lower priority and have only been completed where time allowed. The discussion about the achieved goals and future work can be found in chapter 4.

1.4 Research methodology

The topic discussed in this thesis has not been studied in the past. Because the work targets a new design methodology that is still in development, it is the first of its kind. However, similar problems have been studied and solved in the past. This work has analysed the publications that touch on similar topics and based its development on the conclusions they present. The related works are presented in section 1.6.

1.5 Project boundaries

To keep the number of tasks feasible within the limited time-span of the project, specific boundaries have been imposed so that the scope of the project remains predictable.

The following assumptions are taken during the project:

• The NoC graph, with positions and lengths of all nodes and edges, is known.

• The clock frequency at which each edge operates is known, or at least its bounds.

• The clock distribution method should be able to provide a deterministic prediction of the latency between any two points in the NoC; this prediction can be in the form of upper and lower bounds.

¹ This goal was added during the development process, since it was noticed that designing a test-bench generator was essential for the accomplishment of the second goal.

1.6 Previous and related work

[8, 9] present the concept of the SiLago platform, a novel VLSI design methodology that tries to lower the engineering costs of developing modern ASIC designs by raising the level of abstraction from standard cells to micro-architectural blocks that are pre-synthesised and hardened in a one-time engineering effort. This eliminates the need to synthesise ad-hoc wires and thus reduces the development costs associated with the physical design aspect of IC design.

[4, 5, 3] present the concept of Globally-Ratiochronous Locally-Synchronous (GRLS), a GALS-related communication protocol that intends to provide maximum throughput and minimum latency between rationally-related clock frequency domains. It also enables latency-insensitive communication between regions.

[2, 1, 16] analyse distribution methods in traditional VLSI-style NoCs. These publications review the state-of-the-art problems that emerge in NoC-based designs when they are synthesised using traditional VLSI methods. The proposed solutions tackle the problem of the clock distribution method as well as the data communication between connected elements in the NoC. The solutions rely on GALS-style communication schemes using handshakes and synchroniser stages.

[18] discusses the clock distribution for the regional clock tree in the SiLago platform. The proposed solution relies on balancing all the local clock tree entry points inside each SiLago region. The balancing is achieved by means of a programmable delay line. The work proposes a method for calculating the delay needed at each local clock tree entry point and proves the scalability of the method for the chip sizes that the SiLago platform is intended to handle.


Chapter 2

SiLago platform

This chapter presents the environment in which the work of this thesis has been developed: the SiLago framework.

2.1 Introduction

SiLago, which stands for Silicon Large grain object, is a project currently being carried out by the Department of Electronics, part of the EECS School at KTH. The SiLago project aims to develop a new ASIC design methodology by introducing the concept of synchoricity, the spatial analogue of synchronicity. Thanks to a regular grid-based layout process, the SiLago framework raises the level of abstraction from the traditional standard-cell-based design to a micro-architectural-based design.

2.2 Building blocks

This section presents the main building blocks of the SiLago platform.

2.2.1 Dynamically Reconfigurable Resource Array

Dynamically Reconfigurable Resource Array (DRRA) is a coarse grain reconfigurable fabric targeted at the implementation of parallel Digital Signal Processing (DSP) applications, especially for data streaming. The fabric is composed of cells that are organised in two rows and a variable number of columns. Figure 2.1 shows a DRRA structure.

The DRRA cell is composed of four types of component: a register file, a sequencer, a Data-Path Unit (DPU) and two Switch Boxes (SBs). They can be seen in Figure 2.1.

Register file The DRRA register file is designed to work with DSP applications. For this reason it includes two read and two write ports to handle complex numbers, but they can be used for any arbitrary operation that requires two inputs. The register file contains 32 16-bit words by default. It also includes an Address Generation Unit (AGU) that allows streams of data with spatial and temporal programmability. Temporal programmability allows a time delay to be inserted at any point of the stream. Spatial programmability allows a variety of operating modes to be used, such as linear or circular buffers.

[Figure 2.1 diagram: array of DRRA cells, each containing a register file, a sequencer, a DPU and two SBs, with input buses and output buses.]

Figure 2.1: DRRA fabric.

The AGUs are Finite State Machines (FSMs) that can be configured depending on the application needs. This not only provides more efficient address generation but also reduces the cost of address transportation.

Data-Path Unit The DRRA Data-Path Unit (DPU) has four input words that can correspond to two complex numbers but, as with the register file, they can be used for any arbitrary four-operand operation. It supports the typical DSP operations, such as Multiply Accumulate (MAC), which is also commonly found in Neural Network (NN) hardware implementations, Finite Impulse Response (FIR), sum of difference, rounding, etc.

Sequencer The sequencer is mostly a configuration unit, but it can also serve as a simple sequencer that is meant to handle control of static DSP functionalities.

The principal functionality of the sequencer is to configure a vector operation by configuring the register files, connecting the DPU and register file via the SB, and enabling the desired mode of the DPU.

Sliding window interconnect The DRRA intra-regional interconnect structure is based on sliding-window nearest-neighbour connectivity. All DPUs and register files output to a bus that spans three columns in each direction. The horizontal output buses intersect the input buses as shown in Figure 2.1. SBs are found at each intersection, four in every column. These SBs can be configured to select as a source any DPU or register file in the window described above, three positions in each direction.
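The reach of the sliding window can be illustrated with a minimal Python sketch. The function name `reachable_columns` and its parameters are hypothetical, and two assumptions are made that the text does not state: the window includes the switch box's own column, and it is truncated at the edges of the fabric rather than wrapping around.

```python
def reachable_columns(column, num_columns, span=3):
    """Return the columns whose outputs an SB in `column` can select from.

    Assumptions (not taken from the thesis): the window contains the SB's
    own column and is clipped at the left and right edges of the fabric.
    """
    first = max(0, column - span)
    last = min(num_columns - 1, column + span)
    return list(range(first, last + 1))


# A fabric with 8 columns: a cell in the middle sees 3 columns each way,
# a cell at the edge only sees the columns that exist.
middle_window = reachable_columns(4, 8)
edge_window = reachable_columns(0, 8)
```

Under these assumptions a cell far from the edges can source data from up to seven columns, while edge cells see a smaller window.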


2.2.2 Distributed Memory Architecture

The DRRA is mainly a computational fabric with low latency read and write capabilities from the DPUs, but DSP applications require a larger memory to store packets, other than the off-chip Dynamic Random Access Memory (DRAM). Furthermore, the large degree of parallelism that the DRRA fabric provides would be wasted if a similar degree of parallel access to a large memory were not possible. The Distributed Memory Architecture (DiMArch) is a configurable memory fabric that accomplishes these objectives: it provides a large memory to the DRRA fabric, and it does so in a parallel-capable way that matches the capabilities and requirements imposed by the DRRA structure. Figure 2.2 shows a representation of the DiMArch structure.

[Figure 2.2 diagram: the DRRA cells (register file, sequencer, DPU and SBs) together with the DiMArch SRAM banks, a packet-switched control NoC and a circuit-switched data NoC.]

Figure 2.2: DiMArch structure.

Each cell that composes the DiMArch has a Static Random Access Memory (SRAM) whose dimension is decided at compile time but which physically takes the space of one or two DRRA columns. The current implementation contains SRAM banks of 2 KB. Each bank has spatial and temporal programmability capabilities similar to those of the DRRA register files. These memory banks are glued together by a circuit-switched NoC whose SBs can be programmed by configurable FSMs.

A packet-switched NoC is used during the SRAM bank configuration process. The packet-switched NoC is also used to configure the circuit-switched NoC used for the high bandwidth data transfers. The packet-switched NoC offers the convenience and flexibility needed for a low bandwidth and low utilisation communication channel, whereas the more efficient circuit-switched NoC is used for higher bandwidth communication. The overhead introduced by having to reconfigure the circuit-switched NoC is not present in the packet-switched NoC, but the latter introduces overheads in the routing decision that needs to take place at every switchbox of the NoC.

2.3 Structure and floor-planning

This section presents two of the principal aspects of the structure of the SiLago platform: the composition by abutment and the clock distribution, the latter being of high relevance for this thesis.

2.3.1 Composition by abutment

The concept of composition by abutment is the key enabler of the design methodology that the SiLago platform proposes. By constraining the blocks to have a size that is an integer multiple of the SiLago grid, all building blocks can be composed together in a similar way as LEGO® bricks are used to construct complex structures. When two compatible blocks are placed adjacent to each other, their wires connect together, creating a DRC-clean GDSII design. To achieve that, all blocks that are intended to be placed adjacently must have all their wires routed to the boundary of the block and placed in the correct position and metal layer. Figure 2.3 shows an example of several blocks that have been composed by abutment, with their wires connecting together.

[Figure 2.3 diagram. Labels: power stripes, inner power rings, connection by abutment, region instances.]

Figure 2.3: Structure and connections by abutment.

The composition by abutment removes the need to synthesise ad-hoc wires, since all the infrastructural elements that are required are already part of the building blocks that are used in the design process. Synthesising such wires is one of the major contributors to cost in modern ASIC designs. Furthermore, the existence of such wires makes it enormously difficult to estimate the power consumption of the interconnect, since their precise shape and length are not known until the physical synthesis is completed. Interconnect wires play a major role in the power consumption breakdown of modern ASICs [12], and thus being able to precisely estimate their power consumption is a major advantage that the SiLago platform offers.


2.3.2 Clock distribution

Clock distribution plays a central role in the SiLago methodology. A SiLago design divides the clock in 3 main hierarchical levels: local, regional and global. Figure 2.4 shows an example of a SiLago floorplan where the 3 levels of clock tree can be seen.

[Figure 2.4 diagram. Block labels: LCT, GRLS, RISC-V, PLL. Legend: Global Clock Tree wires; NoC switch SiLago block; region specific NI; SiLago block with GRLS interface; functional region instances; Local Clock Tree; Regional Clock Tree delay line.]

Figure 2.4: SiLago floorplan

Local clock tree

The local clock tree is found inside each SiLago block. This clock tree is generated using commercial EDA tools in the same way as the clock tree of a traditional ASIC design. The appropriate settings and constraints are loaded into the tool, which generates a clock tree and verifies that all timing paths meet the timing constraints. The tool iterates the design of the clock tree until timing closure is achieved. The clock tree can be seen in Figure 2.4 in the blocks as a green box labelled LCT.

Regional clock tree

The regional clock tree is the second level of the clock hierarchy in the SiLago platform. The regional clock tree addresses the distribution of the clock signal to an entire SiLago region. As opposed to the local clock tree, the regional clock tree is not generated using commercial EDA tools. A solution to the regional clock tree generation is proposed by Stathis et al. [18]. The proposed solution makes use of adjustable delay lines to balance the entry points of the clock signal into each local clock tree of the region of interest. The delay line of the regional clock tree can be seen in Figure 2.4 as a pink bubble in each block that belongs to a functional region.


Global clock tree

The global clock tree is the highest level of the clock hierarchy in the SiLago platform and is the focus of this thesis. The global clock tree is distributed through the global NoC and is responsible for delivering a clock signal to all functional regions of the design. The global clock tree can be seen in Figure 2.4 as the pink blocks that span between the regions. The details of the solution proposed in this thesis are presented in chapter 3.


Chapter 3

Global clock distribution

This chapter discusses the central topic of this thesis: the global clock distribution. A detailed description of the problem is given as well as the mathematical models that have been developed and used to solve the problem.

3.1 Problem details

As described in section 1.2, the global clock distribution methodology comprises the distribution of a valid clock signal to all clock sinks as well as the configuration of the switch-boxes that constitute the global NoC.

3.1.1 Problem modelling

In order to be able to propose a solution that can be formally analysed and ensured to be correct, a formal model of the problem input has been constructed. The input element to the problem, the global NoC, has been modelled as a graph. An example of a graph can be seen in Figure 3.1.

Figure 3.1: Graph representation of a NoC.


In this model:

• Nodes represent switch-boxes

• Edges represent NoC wires

• Nodes have coordinates that represent their physical location in the floor-plan

• Edges have weights that represent the distance, or delay, of the wires

Using this model the clock distribution problem can be formulated using standard graph theory methodologies and algorithms.

3.1.2 Problem partitioning

The problem has been divided into two main problems that can be solved separately: the clock distribution itself and the verification of the data transfer timing constraints.

Clock distribution

The first objective of the clock distribution method is to deliver a valid clock signal to all sinks of the design. A valid clock signal has to fulfil the requirements imposed by the technology, such as maximum and minimum slew, maximum allowed jitter, etc. Note that the timing constraints associated with the data transfers are analysed and treated separately in the next section.

Figure 3.2 shows a NoC graph where the clock has been distributed through some of its edges, the ones marked in pink.

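[Figure 3.2 diagram: NoC graph with the clock tree edges highlighted; two data paths are labelled 1 and 2.]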

Figure 3.2: Clock tree spanning through the NoC wires.

The clock distribution is achieved by constructing the Minimum Spanning Tree (MST) of the graph, that is, the tree of minimum total weight that connects all nodes. Minimising the length of the clock tree also minimises the power consumed by clock switching. This is mainly due to the reduced capacitance of the clock nets and the reduced number of clock buffers required to maintain the integrity of the clock signal.
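As an illustration of this step, the following minimal Python sketch builds the MST of a small NoC graph. Kruskal's algorithm is used here only as one standard way of computing an MST; the text does not state which algorithm the thesis uses. The `Edge` class, the function name and the node names of the 2x2 example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    u: str          # switch-box at one end of the NoC wire
    v: str          # switch-box at the other end
    weight: float   # wire length (or delay), in arbitrary units


def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm with a simple union-find structure."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the set representative, halving the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for edge in sorted(edges, key=lambda e: e.weight):
        root_u, root_v = find(edge.u), find(edge.v)
        if root_u != root_v:        # edge joins two separate components
            parent[root_u] = root_v
            tree.append(edge)
    return tree


# Hypothetical 2x2 mesh of switch-boxes.
nodes = ["sb00", "sb01", "sb10", "sb11"]
edges = [
    Edge("sb00", "sb01", 1.0),
    Edge("sb00", "sb10", 2.0),
    Edge("sb01", "sb11", 2.5),
    Edge("sb10", "sb11", 1.5),
]
clock_tree = minimum_spanning_tree(nodes, edges)
total_length = sum(e.weight for e in clock_tree)
```

In this example the clock tree uses three of the four edges; the edge between `sb01` and `sb11` carries data only, so the data path over it is not accompanied by the clock.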


Data transfers

The second and equally important objective of the clock distribution method is to ensure that all data transfers throughout the global NoC will be reliable. The details of the data transfer verification are explained in the following section.

3.2 Evaluated concepts

During the initial phase of the project a set of concepts was evaluated. A review of the literature showed that a small set of solutions had been proposed for problems similar to the one being solved in this thesis; however, all of them target ASIC designs based on traditional VLSI flows. All the solutions that could be found are based on mesochronous interfaces [11, 1, 6, 2].

Based on these existing solutions, one of the first concepts considered is shown in Figure 3.3a. Every switch-box on the global NoC comprises four GRLS interfaces, one at each entry point, so that all communication is done on a latency-insensitive basis. This is the simplest solution that can achieve the desired goal, but it introduces unnecessary overhead. Furthermore, additional considerations are needed for long paths with a delay greater than one clock cycle.

[Figure 3.3 diagrams. Labels in (a): NoC wires, switch fabric, GRLS interfaces. Labels in (b): R blocks, phase domain crossing.]

(a) Switchbox with GRLS interfaces in all directions.

(b) Phase domains with domain crossing points.

Figure 3.3: Concepts of global clock distribution.

A different concept that was considered can be seen in Figure 3.3b. Similar to the concepts of clock or power domains, a phase-alignment domain is achieved by balancing the skew of all clock sinks in a specific region of the NoC. When data needs to cross phase domains, a GRLS interface is used to provide latency-insensitive communication between both phase domains. This solution was discarded since the complexity of the implementation was considered to be too much for the scope of this thesis.

3.3 Data timing analysis

The data paths can be categorised in two types. Two paths in Figure 3.2 have been labelled accordingly:

• Type 1 paths consist of paths where clock and data travel through the same edge of the graph.

• Type 2 paths consist of paths where the clock and data do not travel through the same edge of the graph.


The two types of path are treated separately, since their timing behaviour is very different and different methods are used to ensure data correctness. Both methods are presented in the next sections.

3.3.1 Type 1 paths

Type 1 paths consist of paths where clock and data travel through the same edge of the NoC. This situation can be characterised as the well-known flop-to-flop communication commonly found in synchronous designs. An example is presented in Figure 3.4.

[Figure 3.4 diagram: flops A and B connected through two blocks labelled Wires/buffers.]

Figure 3.4: Flop-to-flop synchronous communication.

In a type 1 path the clock skew between registers A and B depends only on the delay introduced by the wires or buffers, represented as a pink box in Figure 3.4. The equations that govern the data transfers for a type 1 path are the setup and hold equations for the forward and backward paths¹. They are presented in (3.1)-(3.4).

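$$t_{CA} + t_{CQ} + t_{DAB} > t_{CB} + t_{hold} \tag{3.1}$$

$$t_{CB} + t_{CQ} + t_{DBA} > t_{CA} + t_{hold} \tag{3.2}$$

$$t_{CA} + t_{CQ} + t_{DAB} < T + t_{CB} - t_{setup} \tag{3.3}$$

$$t_{CB} + t_{CQ} + t_{DBA} < T + t_{CA} - t_{setup} \tag{3.4}$$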

Where:

• $t_{CA}$: clock arrival at flop A

• $t_{CB}$: clock arrival at flop B

• $t_{CQ}$: flop delay from C to Q

• $t_{DAB}$: data delay from flop A to B

• $t_{DBA}$: data delay from flop B to A

• $t_{setup}$: flop setup time

• $t_{hold}$: flop hold time

• $T$: clock period

¹ In this context, forward path means the path where the data travels in the same direction as the clock, and backward path the path where the data travels in the opposite direction to the clock.
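The four conditions (3.1)-(3.4) can be checked mechanically once the delays of a path are known. The following Python sketch is only an illustration: the names `Type1Path` and `check_type1` and the numeric values are hypothetical and not part of the SiLago tool flow. All quantities are assumed to be expressed in the same time unit.

```python
from dataclasses import dataclass


@dataclass
class Type1Path:
    """Timing parameters of a type 1 path (hypothetical container)."""
    t_ca: float     # clock arrival at flop A
    t_cb: float     # clock arrival at flop B
    t_cq: float     # flop delay from C to Q
    t_dab: float    # data delay from flop A to B
    t_dba: float    # data delay from flop B to A
    t_setup: float  # flop setup time
    t_hold: float   # flop hold time
    period: float   # clock period T


def check_type1(p: Type1Path) -> dict:
    """Evaluate conditions (3.1)-(3.4) and report which of them hold."""
    launch_fwd = p.t_ca + p.t_cq + p.t_dab  # data launched at A, captured at B
    launch_bwd = p.t_cb + p.t_cq + p.t_dba  # data launched at B, captured at A
    return {
        "hold_forward": launch_fwd > p.t_cb + p.t_hold,               # (3.1)
        "hold_backward": launch_bwd > p.t_ca + p.t_hold,              # (3.2)
        "setup_forward": launch_fwd < p.period + p.t_cb - p.t_setup,  # (3.3)
        "setup_backward": launch_bwd < p.period + p.t_ca - p.t_setup, # (3.4)
    }


if __name__ == "__main__":
    # Illustrative values in picoseconds, not measured data.
    path = Type1Path(t_ca=0, t_cb=300, t_cq=100, t_dab=300, t_dba=300,
                     t_setup=50, t_hold=50, period=1000)
    print(check_type1(path))
```

With these illustrative values, shortening the period to 700 makes only the backward setup condition (3.4) fail, because the backward path pays for the clock skew and the data delay at the same time.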


From Equation 3.4 the maximum length of the combinational path can be derived. By expressing the delay $t_{DBA}$ as proportional to the distance, the maximum length allowed by the backward setup condition can be written as Equation 3.5.

$$L < \frac{T - t_{setup} - t_{flop}}{2D} \tag{3.5}$$

Where:

• $D$: delay per unit of length

• $L$: length of the wires
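The step from (3.4) to (3.5) can be sketched as follows. The derivation rests on two assumptions that the text does not state explicitly: the clock and the data both accumulate a delay of $D L$ along the edge, and $t_{flop}$ in (3.5) denotes the clock-to-Q delay $t_{CQ}$ of (3.4).

```latex
% Assumptions: t_{CB} - t_{CA} = D L (clock travels from A to B),
%              t_{DBA} = D L (data travels back from B to A),
%              t_{flop} = t_{CQ}.
\begin{align*}
  t_{CB} + t_{CQ} + t_{DBA} &< T + t_{CA} - t_{setup} \\
  (t_{CB} - t_{CA}) + t_{DBA} &< T - t_{setup} - t_{CQ} \\
  2 D L &< T - t_{setup} - t_{flop} \\
  L &< \frac{T - t_{setup} - t_{flop}}{2D}
\end{align*}
```

Under these assumptions the factor 2 appears because, on the backward path, the clock skew and the data delay add up instead of cancelling.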

Although equations (3.1)-(3.4) are a necessary condition for correct timing, meeting them is not sufficient to ensure correct operation. If any combinational path has a delay that exceeds one clock cycle, that path has to be pipelined using registers. The segmentation of combinational paths is explained in section 3.5.

3.3.2 Type 2 paths

Type 2 paths consist of paths where clock and data signals travel through different edges of the NoC. As with type 1 paths, this situation can be thought of as two flops communicating, but in this case the clock paths that reach the two flops are completely unrelated in phase. A representation of a type 2 path is shown in Figure 3.5.

Figure 3.5: Flop-to-flop asynchronous communication.

In a type 2 path the clock skew between registers A and B depends on the delay introduced by each branch of the clock tree that reaches each register, and is independent of the delay in the data path between the registers. Because the clock skew is unrelated to the data delay, a synchronous communication scheme is impractical and hard to implement. In the best case, such a synchronous communication would only work if both branches of the clock tree were correctly balanced in terms of insertion delay. Moreover, any Process, Voltage and Temperature (PVT) variation that affects the clock skew would compromise the reliability of the data transfers.

In order to overcome the aforementioned issues a latency insensitive communication scheme has been used in type 2 paths. The working principles of the communication scheme are explained in section 3.4.


3.3.3 Latency calculations

Figure 3.6 illustrates a communication scheme between two resources, DRRA and DiMArch in this case. Both resources integrate a Network Interface (NI) that connects to an SB, which in the case of a type 2 path interfaces the global NoC through a GRLS interface. A type 1 path, on the other hand, would omit the GRLS interfaces at both ends.

Figure 3.6: Communication between NIs.

In order to properly schedule the computations that are being performed in the different resources, the latency, or worst-case latency, between such resources needs to be known. The calculations for the worst case latencies are presented in the following sections, separately for type 1 and type 2 paths.

Type 1 paths

To calculate the worst case latency for a type 1 path we consider the following:

• The transmitter produces an output at $t_0$.

• The data travels through $N$ register stages².

• The clock period for this path is $T_{clk}$.

The worst case latency can be written as:

$$t_{SB\,arrival} = t_0 + N\,T_{clk} \tag{3.6}$$

The worst case latency for a type 1 path is deterministic, since the exact path and the delay of every element in it are known.

Type 2 paths

To calculate the worst case latency for a transition between two SBs through the global NoC we can consider the following steps³:

• The transmitter produces an output at $t_0$.

• The GRLS transmitter First-In First-Out (FIFO) will read the output at the next transmitter clock cycle, i.e. the maximum latency is $\lceil N_R / N_T \rceil$.

• The maximum latency through the register stage of the GRLS receiver will be 3 receiver clock cycles, i.e. $3T_R$.

² The register stages have been inserted in the segmentation process. See section 3.5.

³ The GRLS notation defined in section 3.4 has been used in the calculations.

With this, the maximum latency can be written as follows.

$$t_{NI\text{-}to\text{-}NI,\,max} = \underbrace{2\left(1 + \frac{N_R}{N_T}\right) T_T + 3T_R}_{\text{GRLS transmission}} + t_{NoC} \tag{3.7}$$

It is important to note that the GRLS communication scheme relies on dataflow property 2 to sample data [3]. This means that to properly sample data at cycle $PC_i$ the correct strobe signal has to be sent in the previous periodicity cycle, i.e. $PC_{i-1}$. Effectively, this requires a dummy data transfer during the first Periodicity Cycle (PC) after power-up, because no valid strobe precedes that cycle and the data sampled in it cannot be trusted.

3.4 Globally ratiochronous locally synchronous interfaces

This section presents a basic explanation of the GRLS interfaces. For more details refer to the original paper by Chabloz and Hemani [3].

Part of the work in this thesis builds on the previous work on GALS-related interfaces by Chabloz and Hemani [3], who introduced the Globally-Ratiochronous Locally-Synchronous (GRLS) concept, a communication scheme for rationally related frequencies that operates on a latency-insensitive basis. The rationale behind the proposed interfaces is that in most modern ASIC designs, even though they operate globally in an asynchronous manner, the clocks used in the different clock domains are derived from the same Phase-Locked Loop (PLL), so in essence the frequencies are always rationally related. By exploiting this characteristic a set of properties is derived [3], and a more efficient implementation than traditional Globally Asynchronous Locally Synchronous (GALS) methods can be achieved.

3.4.1 Concept and notation

The main idea behind the GRLS concept is to take advantage of the rationally related nature of the transmitter and receiver clocks to guarantee metastability-free sampling points for the data. A strobe signal is used to signal the receiver that a new data output has been generated. The sampling of the strobe signal will allow the receiver to determine if the captured data is metastability free or not. In case that the data is determined as metastable the next sample is guaranteed to be metastability-free and thus can be safely read.

Different notations are used in the original paper [3] and have also been used in this thesis when referring to the GRLS. The basic notations are presented hereunder.

• clkT: transmitter clock with a frequency $f_T = 1/T_T$

• clkR: receiver clock with a frequency $f_R = 1/T_R$

• clkH: clock from which clkR and clkT can be generated, formally $f_H = \mathrm{LCM}(f_T, f_R) = 1/T_H$

• We define $N_T = f_H / f_T$ and $N_R = f_H / f_R$

• We define $P = \mathrm{LCM}(N_T, N_R)$
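As a small worked example of this notation, the sketch below computes the derived quantities from two clock frequencies. It assumes that both frequencies are given as positive integers in the same unit; the function name `grls_parameters` and the example frequencies are hypothetical.

```python
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def grls_parameters(f_t: int, f_r: int) -> dict:
    """Derive f_H, N_T, N_R and P from the transmitter and receiver frequencies."""
    if f_t <= 0 or f_r <= 0:
        raise ValueError("frequencies must be positive integers")
    f_h = lcm(f_t, f_r)  # f_H = LCM(f_T, f_R)
    n_t = f_h // f_t     # N_T = f_H / f_T
    n_r = f_h // f_r     # N_R = f_H / f_R
    p = lcm(n_t, n_r)    # P = LCM(N_T, N_R)
    return {"f_h": f_h, "n_t": n_t, "n_r": n_r, "p": p}


if __name__ == "__main__":
    # Illustrative case: a 200 MHz transmitter and a 300 MHz receiver.
    print(grls_parameters(200, 300))
```

For the illustrative pair of 200 MHz and 300 MHz this gives $f_H$ = 600 MHz, $N_T$ = 3, $N_R$ = 2 and $P$ = 6.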

3.4.2 Transmitter

The GRLS transmitter consists of three elements: a regulator, a strobe generator and a First-In First-Out (FIFO). A schematic representation of the GRLS transmitter is presented in Figure 3.7a. The algorithm for the regulator block is described in Algorithm 1.

Algorithm 1 Transmitter regulator.

if N_R ≤ N_T then
    send ← 1
else
    c ← N_R
    loop
        Wait for rising edge of clkT
        if c > N_R − N_T then
            send ← 1
            c ← c − (N_R − N_T)
        else
            send ← 0
            c ← c + N_T
        end if
    end loop
end if
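The behaviour of Algorithm 1 can be explored with a short simulation. The following Python sketch models each rising edge of clkT as one step of a generator; the function name `regulator` and the modelling of clock edges as generator steps are assumptions made for illustration, not the hardware implementation.

```python
from itertools import islice
from typing import Iterator


def regulator(n_t: int, n_r: int) -> Iterator[int]:
    """Yield the value of `send` at each rising edge of clkT (Algorithm 1)."""
    if n_r <= n_t:
        # The receiver is at least as fast as the transmitter: always send.
        while True:
            yield 1
    c = n_r
    while True:
        if c > n_r - n_t:
            c -= n_r - n_t  # a data item is sent in this transmitter cycle
            yield 1
        else:
            c += n_t        # this transmitter cycle is skipped
            yield 0


if __name__ == "__main__":
    # N_T = 2, N_R = 3: the transmitter clock is 1.5 times faster than the receiver.
    print(list(islice(regulator(2, 3), 6)))
```

For $N_T = 2$ and $N_R = 3$ the pattern of `send` repeats as 1, 1, 0, so two out of every three transmitter cycles carry data, which matches the ratio $N_T / N_R$ between the receiver and transmitter rates.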

The GRLS transmitter output signals contain the data bus, a flag to indicate that the data is valid, and finally, a strobe signal that toggles its value when a new data frame is being transmitted. This strobe signal is used by the GRLS receiver to properly sample the received data avoiding metastability [7] issues by exploiting the ratiochronous nature of the communication scheme. Figure 3.7b shows a timing diagram of a transmission in a fast transmitter situation where the transmit rate is higher than the receive rate.

3.4.3 Receiver

The GRLS receiver consists of two main blocks: a strobe analysis stage and a register stage. The purpose of the strobe analysis stage is to sample and analyse the strobe signal. The strobe analysis stage determines if the data is guaranteed to be metastability-free at the sampled point and signals it to the register stage. The register stage samples the data at multiple points and decides which sample is guaranteed to be metastability-free depending on the signals received from the strobe analysis stage. A schematic representation of the receiver stage can be seen in Figure 3.8.


Figure 3.7: GRLS transmitter. (a) Schematic. (b) Timing diagram.

3.4.4 Dataflow properties

The ratiochronous nature of the communication scheme provides a different set of properties named dataflow properties.

• Average rate: The number of data items $d$ output in a time $K T_R$ is always $d \le K + 1$.

• Periodicity: The flow of data is periodic with period PC (Periodicity Cycle).

• Maximal instantaneous data rate: The minimal time between two successive data outputs is $T_D > T_R / 2$.

• Minimal instantaneous data rate: The maximum amount of time between two successive data outputs is $T_T \lceil N_R / N_T \rceil$.

3.5 Path segmentation

One of the main points of interest in the clock distribution methodology proposed in this thesis is the path segmentation, which provides a way of handling long paths as well as ensuring error-free communication. Once again, due to the different nature of type 1 and type 2 paths, they are handled separately in the segmentation process.

Type 1 paths

The segmentation of type 1 paths is imposed by the maximum length of the path as described in (3.5). The objective of the type 1 path segmentation process is to insert the minimum number of register stages that keeps the length of every wire segment within the limit of (3.5). Figure 3.9 shows a type 1 path after the segmentation process. As can be observed, register stages (blue boxes) have been inserted. These register stages include a register bank for the data signals as well as buffers for both the clock and data signals. The register bank is clocked by the clock signal that arrives through the NoC wires. It should be noted that no GRLS interfaces are used in this path.

Figure 3.8: GRLS receiver. (a) Strobe analysis. (b) Register stage.

Given the nature of the SiLago platform, all elements of the physical design process are spatially discretised, so the lengths of both the NoC wires and the register stages are determined according to the grid size. Since the objective is to use the minimum number of registers, the segmentation process places the register stages as far apart as possible, using the maximum number of grid units of wire length that still fulfils (3.5). The register insertion process is described in Algorithm 2.


Figure 3.9: Type 1 path segmentation.

Algorithm 2 Type 1 path segmentation.

procedure type1_segmentize(edge)
    L_edge ← Length(edge)
    if L_edge > L_1 then
        for i in 1 : ⌈L_edge / L_max⌉ do
            insert register stage at point i × L_1
        end for
    end if
end procedure
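A minimal Python sketch of the register insertion of Algorithm 2 is given below. It is an illustration under stated assumptions rather than the thesis implementation: lengths are integer numbers of grid units, the bound $L_{max}$ in the loop is taken to be the same limit $L_1$, and a point that falls on or beyond the end of the edge is skipped on the assumption that no separate register stage is needed at the receiving switch-box. The function name `type1_register_points` is hypothetical.

```python
from typing import List


def type1_register_points(l_edge: int, l_1: int) -> List[int]:
    """Return register stage positions, in grid units from the start of the edge."""
    if l_1 <= 0:
        raise ValueError("l_1 must be a positive number of grid units")
    if l_edge <= l_1:
        return []  # the edge already satisfies the length limit
    stages = -(-l_edge // l_1)  # integer ceiling of l_edge / l_1
    # Assumption: points at or past the end of the edge are not instantiated.
    return [i * l_1 for i in range(1, stages + 1) if i * l_1 < l_edge]


if __name__ == "__main__":
    # Illustrative edge of 10 grid units with a limit of 4 grid units per segment.
    print(type1_register_points(10, 4))
```

For the illustrative edge of 10 grid units and a limit of 4, register stages are placed at 4 and 8 grid units, leaving segments of 4, 4 and 2 grid units, none of which exceeds the limit.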

Type 2 paths

The segmentation of type 2 paths is imposed by the maximum length of the path as constrained by the clock period. The objective of the type 2 segmentation process is to insert a GRLS interface as far as possible from the switch-boxes. Figure 3.10 shows a representation of a type 2 path where register stages have been inserted.

Figure 3.10: Type 2 path segmentation.

Two lengths, $L_1$ and $L_2$, have been marked. These correspond to the length between register stages and between GRLS stages, respectively. As seen in previous sections, the distance between register stages is constrained by (3.5), while the distance between GRLS stages is limited to the length at which the combinational delay reaches one clock cycle.

For this reason a type 2 path only needs to be segmented if the combinational delay is longer than one clock cycle. After a register stage has been added, the additional register stages are constrained by (3.5) as if it were a type 1 path.


Algorithm 3 Type 2 path segmentation.

procedure type2_segmentize(edge)
    L_edge ← Length(edge)
    if L_edge > L_2 then
        for i in 1 : ⌈(L_edge − L_2) / L_max⌉ do
            insert register stage at point i × L_1
        end for
        insert GRLS stage at point i × L_1
    end if
end procedure

3.6 Final clock distribution method

The final clock distribution method is summarised in Algorithm 4. As can be seen, it uses the segmentation methods described in section 3.5 to generate a clock tree that spans all elements of the NoC, and it also takes care of the data transfers between all switch-boxes of the NoC.

Algorithm 4 Clock distribution.
Require: NoC graph
Ensure: clock graph

clock_tree ← MST(NoC graph)
for edge ∈ NoC graph do
    if edge ∈ clock_tree then
        type1_segmentize(edge)
    else if edge ∉ clock_tree then
        type2_segmentize(edge)
    end if
end for
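The classification step of Algorithm 4 can be sketched in Python as follows. The text only states that a minimum spanning tree (MST) of the NoC graph is used; the choice of Kruskal's algorithm, the use of the edge length as the weight, and the names `minimum_spanning_tree` and `classify_edges` are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

# (switch-box u, switch-box v, length in grid units); hypothetical representation
Edge = Tuple[str, str, int]


def minimum_spanning_tree(nodes: List[str], edges: List[Edge]) -> List[Edge]:
    """Kruskal's algorithm with a simple union-find structure."""
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree: List[Edge] = []
    for edge in sorted(edges, key=lambda e: e[2]):
        root_u, root_v = find(edge[0]), find(edge[1])
        if root_u != root_v:  # the edge joins two components, so keep it
            parent[root_u] = root_v
            tree.append(edge)
    return tree


def classify_edges(nodes: List[str],
                   edges: List[Edge]) -> Tuple[List[Edge], List[Edge]]:
    """Split the NoC edges into type 1 (clock and data) and type 2 (data only)."""
    clock_tree = set(minimum_spanning_tree(nodes, edges))
    type1 = [e for e in edges if e in clock_tree]
    type2 = [e for e in edges if e not in clock_tree]
    return type1, type2


if __name__ == "__main__":
    # Illustrative NoC with four switch-boxes arranged in a ring.
    noc_nodes = ["A", "B", "C", "D"]
    noc_edges = [("A", "B", 1), ("B", "C", 2), ("C", "D", 1), ("D", "A", 3)]
    print(classify_edges(noc_nodes, noc_edges))
```

In the illustrative ring of four switch-boxes, the three shortest edges form the clock tree and become type 1 paths, while the remaining edge carries data only and becomes a type 2 path, which would then be handed to the type 2 segmentation.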

3.6.1 SiLago blocks

The implementation of the clock distribution methodology comprises not only the conceptual methods for distribution of the clock signal but also determining which blocks have to be used during the floor-planning process. One should remember that the concept behind the SiLago framework is to remove the standard cell level design and abstract all the functionality into pre-synthesised, hardened, abuttable blocks. The different blocks that have been considered during this thesis are presented in the following sections.

Wire blocks

The most basic block that has been considered is the wire block. The wire block, as its name suggests, consists exclusively of wires that connect each end of the block to the opposite side. Wire blocks are intended to be used over short distances or in between buffer or register blocks when the operating conditions do not require buffers to maintain the signal integrity characteristics required by the design kit specifications. Figure 3.11 shows a simple representation of a wire block.

Figure 3.11: Wire block.

It should be noted that the wire block has data wires going in both directions, 256 in each direction, but the clock wire goes in one direction only. This is due to the way the paths have been structured, as described in section 3.5. Since the block is symmetrical about the y-axis, it can be rotated by ±90 and 180 degrees to achieve the desired direction of the clock signal.

Buffer blocks

The next block that has been considered is a buffer block. If the operating conditions were ideal, the buffer block would behave equivalently to the wire block, but in a real-life design the buffer block complements the wire block by adding active buffers that help to drive the wire load. Figure 3.12 shows a simple representation of a buffer block.

Figure 3.12: Buffer block.

Similar to the wire block, the buffer block also has 256 buffered wires going in each direction but the clock wire in one direction only. Like the wire block, it can be rotated by ±90 and 180 degrees to achieve the desired direction of the clock signal. The sizing of the buffers and the trade-off between their strength, power consumption and area have not been considered in this thesis.

Register blocks

The third block that has been considered is a register block. The register block plays a crucial role in the segmentation method. By inserting register stages we can pipeline long paths that would otherwise have to be treated as multi-cycle paths, decreasing their maximum throughput. Figure 3.13 shows a simple representation of a register block.

Figure 3.13: Register block.

Similar to the previous blocks, the register block contains two 256-wire buses, one going in each direction, plus a clock wire going in one direction only. Additionally, the register block includes registers in each of the data lines. The registers are clocked by the clock signal that travels through the block. Like the other blocks, it can be rotated by ±90 and 180 degrees to adjust the direction of the clock and data. The balancing of the inner clock tree of the block has not been considered in this thesis.

GRLS blocks

The final block that has been considered in this thesis is the GRLS interface block. This block also plays a crucial role in the proposed distribution scheme, since it allows communication on a latency-insensitive basis.

Figure 3.14: GRLS block.

Similar to the previous blocks, the GRLS block contains two 256-wire buses, one going in each direction, but in this case the clock wire does not propagate to the next block: it only enters the GRLS block and clocks its elements. Each of the two buses is governed by a GRLS interface, which means that the additional strobe signal, as described in section 3.4, is also transmitted and received. Like the other blocks, it can be rotated by ±90 and 180 degrees to adjust the direction of the clock but, in this case, since the clock signal does not exit the block, the only important aspect is the entry point of the clock signal.


3.6.2 Experiments

Given the current state of development of the SiLago platform, the experiments have been carried out on an artificial test bench generator that has been developed as part of this project. The details of the test bench generator are presented in Appendix A. The test bench generator has helped to test different segmentation procedures and to verify their effectiveness.

The clock distribution method presented in the previous sections has been applied to multiple floor-plans generated by the synthetic generator, and the results have met the initial expectations of the thesis.

Figure 3.15 shows an example of the clock distribution method. In particular, Figure 3.15a shows the clock tree after being distributed through the NoC graph. We can observe that the clock edges (red colour) reach all switch-boxes of the NoC, which means that all SiLago regions will receive a clock signal.

Figure 3.15: Clock distribution example. (a) Before segmentation. (b) After segmentation.

Figure 3.15b shows the results of the segmentation process. One can observe the two types of paths described in previous sections: type 1 paths (the ones with red edges) and type 2 paths (the ones with yellow edges). Figure 3.16 presents the results of the clock distribution algorithm for a different set of test-cases generated using the test-bench generator. Different levels of “regularity” can be observed in the different test-cases, which is a result of the different constraints in the test-bench generator. As can be observed, the distribution algorithm successfully distributes the clock signal to all the switch-boxes of the NoC and inserts all the required register and GRLS stages according to the procedures described in previous sections (3.3, 3.5).

Figure 3.16: Multiple runs of the clock distribution method on different floor-plans (panels a to d).

Chapter 4

Conclusions

This chapter presents a summary of the results and analysis of this work as well as the remaining steps required to fully finalise the clock distribution methodology that have fallen out of the scope of this thesis.

4.1 Results

The results of this thesis include a global clock distribution methodology targeting the SiLago platform which allows a correct-by-construction design process, an intrinsic characteristic of the SiLago methodology. A test-bench generator was also developed during the thesis; although it is not an important part of the clock distribution method, it helped during the testing and validation phase of the development.

4.1.1 Test-bench generator

One of the contributions of this work is a test-bench generator that synthetically creates SiLago floor-plans which have been used to test the distribution methods. The test-bench generator presents a useful tool that can be used in the future development of the SiLago project.

The key characteristics that the test-bench generator can offer can be summarised into the following aspects.

Parameterised geometrical characteristics: A set of geometrical characteristics can be modified to influence the distribution and shape of the randomised SiLago resources, such as aspect ratio constraints, minimum and maximum dimensions of the resources, density of the resources, etc.

Generation of the NoC graph: During the generation of the randomised floor-plan a graph representing the NoC is generated. This graph can later be used during the clock distribution process to apply the segmentation rules and generate the clock tree.

The limitations and future improvements that can be implemented in the test-bench generator are discussed in the following sections.


4.1.2 Clock distribution methodology

The main contribution that this work presents is a clock distribution methodology that is capable of distributing a valid clock signal to all elements of a SiLago floor-plan. The clock distribution method not only analyses the distribution aspect of the problem but also the verification of the data transfers that occur through the global NoC.

4.1.3 SiLago blocks

This work has also analysed and conceptually conceived the different SiLago blocks that need to be used during the physical construction process of the SiLago flow. As stated in the goals of the project (section 1.3), this thesis has not focused on the low level design of each block, nor on their characterisation. The remaining steps are described in section 4.3. The main results of the block concepts are presented hereunder.

Block concepts: The blocks presented in subsection 3.6.1 can be used as a complete basis capable of generating the necessary physical representation of the global NoC using composition by abutment, one of the fundamental concepts upon which the SiLago platform is built.

Feasibility of the blocks: The low level physical design was not part of the goals of this thesis; however, different “rough” designs have been synthesised, laid out and routed as a proof-of-concept to ensure that the dimension requirements are not completely infeasible. This step has shown that even the most demanding block in terms of area, the GRLS block, can safely fit as a 1x1 tile on the SiLago grid.

4.2 Analysis and discussion

A brief analysis and discussion of the previously mentioned results is presented, together with the aspects that can be improved. Together with the future work (section 4.3), it presents some of the key aspects that can be worked on and improved after this thesis.

4.2.1 Test-bench generator

The test-bench generator played a key role in the development of this thesis; however, it is uncertain whether it will be needed in the future. The main reason for requiring a test-bench generator was that there were no existing designs available to test the concepts presented in this work, thus a synthetic test-bench generator was required.

The performance of the test-bench generator has been sufficient to allow the concepts of this thesis to be tested, but the way the geometrical constraints have been implemented can leave corner cases unexplored, thus covering only a subset of all the possible floor-plans that can be encountered in a real-life design.

4.2.2 Clock distribution methodology

The clock distribution method and timing verification process proposed in this thesis are based on fundamental concepts currently used in ASIC physical design and implemented by all modern EDA tools. Given that the low level implementation of the blocks that comprise the global NoC, and thus distribute the clock, will be done using commercial EDA tools, it is to be expected that the physical aspects of the implementation won't pose any problem to the proposed method. As mentioned in the future work section, the verification of real implementations using an STA process is part of the next steps that will continue after this thesis.

4.2.3 SiLago blocks

The SiLago blocks that have been proposed in this thesis can be used as a basis to generate any arbitrary clock distribution scheme. Nevertheless, a more detailed, low level implementation of the blocks will raise other design aspects that will have to be discussed. The configuration capabilities of the blocks have been left out of the scope of this thesis, and thus the potential problems and difficulties that could emerge from them could not be analysed.

Furthermore, implementation difficulties will arise from the physical design of the blocks, such as balancing the skew in the data lines or the timing violations that might occur during the low level implementation of the synchronous blocks. Some of these potential problems are described in the next section.

4.3 Future work

The work in this thesis has focused on the concepts and methods used in the clock distribution rather than the low level implementation of the blocks. For this reason one of the main elements of the future work is the detailed low level implementation of the blocks presented in subsection 3.6.1. A brief explanation of the remaining tasks is presented hereunder.

4.3.1 Low level block design

One of the most important tasks that remains to be completed is the detailed low level implementation of all the blocks described in subsection 3.6.1 that are needed for the clock distribution methodology. The low level design of the blocks will have to take into account the physical constraints, such as area, driving loads, or skew of data lines, in order to have a reliable implementation of all the blocks. The wires for both clock and signal lines will have to be correctly aligned for the abutment process to work correctly.

4.3.2 Sizing of buffers

Another important task, closely related to the aforementioned low level design, will be the sizing of all driving cells, mainly buffers, of the blocks. This part is closely related to the next part, the characterisation, since the sizing of the driving cells will depend on the characteristics of the blocks abutted next to the block of concern, thus presenting a somewhat “circular” problem: the characterisation of the block depends on its composition, and its composition depends on the characterisation. This problem can be solved using the traditional hierarchical bottom-up synthesis flow [17], where the constraints at the first iteration are estimated and after each synthesis run the constraints are updated with the result of the previous run. This process will eventually converge to a final value that will then be used.


4.3.3 Block characterisation

As mentioned before, the block characterisation process will heavily influence the sizing of the driving cells and is thus of major importance for the correct implementation of the blocks. The block characterisation will consist of a set of steps that will result in a full characterisation of the block in terms of power consumption, timing behaviour, and load and driving characteristics. The correct characterisation of the blocks is important not only for the correct sizing of the driving elements, but also to allow the high level synthesis tools to correctly make decisions at compile time that depend on the characteristics of the blocks, such as latency.

4.3.4 Block hardening

The final step of the design of the blocks will be the hardening of the block. The process itself is not of major complexity, but it represents the final stage of the design and implementation of the blocks. Prior to the hardening, all wires will have to be moved close to the boundary of the block so that they are abuttable with neighbouring blocks. Once this step is completed the block will become the lowest level element of the SiLago methodology, also known as the primitive, with which all future designs will be implemented.


References

[1] A. Alhussien, C. Wang, and N. Bagherzadeh. “A scalable delay insensitive asynchronous NoC with adaptive routing”. In: 2010 17th International Conference on Telecommunications. 2010, pp. 995–1002.

[2] T. Bjerregaard, M. B. Stensgaard, and J. Sparso. “A Scalable, Timing-Safe, Network-on-Chip Architecture with an Integrated Clock Distribution Method”. In: 2007 Design, Automation Test in Europe Conference Exhibition. 2007, pp. 1–6.

[3] J. Chabloz and A. Hemani. “A Flexible Communication Scheme For Rationally-related Clock Frequencies”. In: 2009 IEEE International Conference on Computer Design. Oct. 2009, pp. 109–116. doi: 10.1109/ICCD.2009.5413166.

[4] J. Chabloz and A. Hemani. “Low-Latency Maximal-Throughput Communication Interfaces for Rationally Related Clock Domains”. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems 22.3 (Mar. 2014), pp. 641–654. issn: 1557-9999. doi: 10.1109/TVLSI.2013.2252030.

[5] Jean-Michel Chabloz. “Globally-Ratiochronous, Locally-Synchronous Systems”. QC 20120229. PhD thesis. KTH, Electronic Systems, 2012, pp. xviii, 180. isbn: 978-91-7501-258-2.

[6] Jose Flich and Davide Bertozzi. Designing Network On-Chip Architectures in the Nanoscale Era. Chapman and Hall/CRC, 2010. isbn: 1439837104.

[7] R. Ginosar. “Metastability and Synchronizers: A Tutorial”. In: IEEE Design Test of Computers 28.5 (Sept. 2011), pp. 23–35. issn: 1558-1918. doi: 10.1109/MDT.2011.113.

[8] A. Hemani, S. M. A. H. Jafri, and S. Masoumian. “Synchoricity and NOCs could make billion gate custom hardware centric SOCs affordable”. In: 2017 Eleventh IEEE/ACM International Symposium on Networks-on-Chip (NOCS). Oct. 2017, pp. 1–10.

[9] Ahmed Hemani et al. “The SiLago Solution: Architecture and Design Methods for a Heterogeneous Dark Silicon Aware Coarse Grain Reconfigurable Fabric”. In: The Dark Side of Silicon: Energy Efficient Computing in the Dark Silicon Era. Ed. by Amir M. Rahmani et al. Cham: Springer International Publishing, 2017, pp. 47–94. isbn: 978-3-319-31596-6. doi: 10.1007/978-3-319-31596-6_3. url: https://doi.org/10.1007/978-3-319-31596-6_3.

[10] Intel-Free-Press. Robert Noyce with Motherboard 1959. Ed. by Intel-Free-Press. License: Creative Commons BY-SA. Aug. 29, 2013. url: flickr.com/photos/54450095@N05/8267615769.


[11] S. Kumar et al. “A network on chip architecture and design methodology”. In: Proceedings IEEE Computer Society Annual Symposium on VLSI. New Paradigms for VLSI Systems Design. ISVLSI 2002. Apr. 2002, pp. 117–124. doi: 10.1109/ISVLSI.2002.1016885.

[12] Dominik Langen, André Brinkmann, and U. Ruckert. “High level estimation of the area and power consumption of on-chip interconnects”. In: Feb. 2000, pp. 297–301. isbn: 0-7803-6598-4. doi: 10.1109/ASIC.2000.880753.

[13] Paul McLellan. A Brief History of ASIC. 2012. url: https://www.semiwiki.com/forum/content/1587-brief-history-asic-part-i.html.

[14] G.E. Moore. “Cramming More Components Onto Integrated Circuits”. In: Electronics 38.8 (Jan. 1965), pp. 114–117. issn: 0018-9219. doi: 10.1109/JPROC.1998.658762. url: http://ieeexplore.ieee.org/document/658762/.

[15] Daniel Nenni. A Brief History of EDA. 2012. url: https://www.semiwiki.com/forum/content/1547-brief-history-eda.html.

[16] Johnny Öberg. “Clocking Strategies for Networks-on-Chip”. In: Networks on Chip. Ed. by Axel Jantsch and Hannu Tenhunen. Boston, MA: Springer US, 2003, pp. 153–172. isbn: 978-0-306-48727-9. doi: 10.1007/0-306-48727-6_8. url: https://doi.org/10.1007/0-306-48727-6_8.

[17] M. Stan et al. “Teaching Top-Down ASIC/SoC Design vs Bottom-Up Custom VLSI”. In: 2007 IEEE International Conference on Microelectronic Systems Education (MSE’07). 2007, pp. 89–90.

[18] Dimitrios Stathis et al. Regional Clock Tree Generation by Abutment in Synchoros VLSI Design. 2019. arXiv: 1910.11253 [cs.AR].

Glossary

cell an abstract representation of a component within a schematic diagram or physical layout. 13, 14

clock a signal that oscillates between a high and a low state and is used like a metronome to coordinate actions of digital circuits. 3, 34, 35

clock cycle time between two rising edges of the clock signal. 29, 31

IP a reusable unit of logic, cell, or integrated circuit layout design that is the intellectual property of one party. 14, 49

lib-cell library cells, i.e. cells that define a specific logic function and are part of an IP library. 14

memory circuit capable of storing information temporarily or permanently. 14

sink endpoints of the clock tree. 3

standard cell low-level electronic logic functions such as AND, OR, INVERT, flip-flops, latches, and buffers. 14, 17, 36

Terms and abbreviations

AGU Address Generation Unit. 19, 20

AI Artificial Intelligence. 15

ASIC Application Specific Integrated Circuit. 3, 13–15, 17, 19, 22, 23, 27, 31, 42

CAD Computer-Aided Design. 15

CRV Constrained Random Verification. 51

DiMArch Distributed Memory Architecture. 21, 30

DPU Data-Path Unit. 19–21

DRAM Dynamic Random Access Memory. 21

DRRA Dynamically Reconfigurable Resource Array. 19–21, 30

DSP Digital Signal Processing. 19–21

EDA Electronic Design Automation. 13–15, 23, 42, 43

FIFO First-In First-Out. 31, 32

FIR Finite Impulse Response. 20

FSM Finite State Machine. 20, 21

GALS Globally Asynchronous Locally Synchronous. 3, 31

GRLS Globally-Ratiochronous Locally-Synchronous. 3, 17, 27, 30–32, 34, 35, 38

IC Integrated Circuit. 13, 17

IO Input Output. 14

IP Intellectual Property. 14, Glossary: IP

MAC Multiply Accumulate. 20

MST Minimum Spanning Tree. 26

NI Network Interface. 30

NN Neural Network. 20

NoC Network-on-Chip. 3, 15–17, 21, 22, 24–30, 34, 36, 39, 41–43, 51

PC Periodicity Cycle. 31, 33

PLL Phase-Locked Loop. 31

PnR Place and Route. 15

PVT Process, Voltage and Temperature. 29

SB Switch Box. 19–21, 30

SRAM Static Random Access Memory. 21

STA Static Timing Analysis. 3, 43

VLSI Very Large Scale Integration. 3, 17, 27
