
Institutionen för systemteknik

Department of Electrical Engineering

Master's thesis (Examensarbete)

System Level Exploration of RRAM for SRAM

Replacement

Master's thesis carried out in Electronics Systems at the Institute of Technology, Linköping University

by

Rabia Dogan

LiTH-ISY-EX--12/4628--SE

Linköping 2012

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet


System Level Exploration of RRAM for SRAM

Replacement

Master's thesis carried out in Electronics Systems at the Institute of Technology, Linköping

by

Rabia Dogan

LiTH-ISY-EX--12/4628--SE

Supervisors: J Jacob Wikner, ISY, Linköpings universitet
             Francky Catthoor, SSET, Imec
             Peter Debacker, SSET, Imec

Examiner: J Jacob Wikner, ISY, Linköpings universitet


Division, Department: Division of Electronics Systems, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2012-06-30

Language: English

Report category: Master's thesis (Examensarbete)

URL for electronic version: http://www.es.isy.liu.se

ISRN: LiTH-ISY-EX--12/4628--SE

Title: System Level Exploration of RRAM for SRAM Replacement

Author: Rabia Dogan

Abstract

Effective usage of the chip area plays an essential role in System-on-Chip (SOC) designs. Nowadays on-chip memories take up more than 50% of the total die area and are responsible for more than 40% of the total energy consumption. Cache memory alone occupies 30% of the on-chip area in the latest microprocessors.

This thesis project, "System Level Exploration of RRAM for SRAM Replacement", describes a Resistive Random Access Memory (RRAM) based memory organization for Coarse Grained Reconfigurable Array (CGRA) processors. The RRAM based memory organization offers benefits over the conventional Static-Random Access Memory (SRAM) based memory organization in terms of energy and area requirements.

Due to the ever-growing problems faced by conventional memories with Dynamic Voltage Scaling (DVS), emerging memory technologies have gained importance. RRAM is typically seen as a possible candidate to replace Non-volatile memory (NVM) as Flash approaches its scaling limits. The replacement of SRAM in the lowest layers of the memory hierarchies of embedded systems with RRAM is a very attractive research topic; RRAM technology offers reduced energy and area requirements, but it has limitations with regard to endurance and write latency. Because of the technological limitations and restrictions in solving the RRAM write related issues, it becomes beneficial to explore memory access schemes that tolerate the longer write times. Since the RRAM write time cannot realistically be reduced, we have to derive instruction memory and data memory access schemes that tolerate the longer write times. We present an instruction memory access scheme to cope with these problems.

In addition to the modified instruction memory architecture, we investigate the effect of the longer write times on the data memory. The experimental results provided show that the proposed architectural modifications can reduce the read energy consumption by a significant margin without any performance penalty.

Keywords: Resistive RAM (RRAM), Static RAM (SRAM), Non-volatile memory (NVM), Coarse Grained Reconfigurable Array (CGRA)


Abstract

Effective usage of the chip area plays an essential role in SOC designs. Nowadays on-chip memories take up more than 50% of the total die area and are responsible for more than 40% of the total energy consumption. Cache memory alone occupies 30% of the on-chip area in the latest microprocessors.

This thesis project, "System Level Exploration of RRAM for SRAM Replacement", describes an RRAM based memory organization for CGRA processors. The RRAM based memory organization offers benefits over the conventional SRAM based memory organization in terms of energy and area requirements.

Due to the ever-growing problems faced by conventional memories with DVS, emerging memory technologies have gained importance. RRAM is typically seen as a possible candidate to replace NVM as Flash approaches its scaling limits. The replacement of SRAM in the lowest layers of the memory hierarchies of embedded systems with RRAM is a very attractive research topic; RRAM technology offers reduced energy and area requirements, but it has limitations with regard to endurance and write latency.

Because of the technological limitations and restrictions in solving the RRAM write related issues, it becomes beneficial to explore memory access schemes that tolerate the longer write times. Since the RRAM write time cannot realistically be reduced, we have to derive instruction memory and data memory access schemes that tolerate the longer write times. We present an instruction memory access scheme to cope with these problems.

In addition to the modified instruction memory architecture, we investigate the effect of the longer write times on the data memory. The experimental results provided show that the proposed architectural modifications can reduce the read energy consumption by a significant margin without any performance penalty.


Acknowledgments

First of all I would like to thank my supervisor, Prof. Francky Catthoor, for introducing me to the exciting research field of RRAMs and tutoring me in memory systems during my master thesis. At the same time I would like to thank my supervisor and guru, Dr. Jacob Wikner, for tutoring me during my master studies and thesis work. I would like to present my sincerest gratitude to him for guiding me throughout the thesis with his excellent knowledge and patience, and for his endless support in personal matters as well. One simply could not wish for a friendlier supervisor.

I am also extremely thankful to Peter Debacker for his expert knowledge and supervision. I would also like to thank Praveen Raghavan, Matthias Hartmann and Manu Komalan for their valuable discussions during this thesis work. I should also thank all the professors and Ph.D. students at the Division of Electronics Systems, Department of Electrical Engineering, Linköping University, for their support during my studies at Linköping University.

I also thank my thesis opponent, Xin Xu, for her time and invaluable feedback on the thesis.

Special thanks to my beloved parents Nihal and Hüseyin Doğan and my lovely sisters and brother Özgür Eylem, Özlem and Mehmet Can, who have always believed in me; I thank them for their unconditional support, love and affection. It would not have been possible to complete the thesis project work without their support.

A very special thanks to my friends at KU Leuven, especially Simla Küçüksayan and Marissa Priyanka Chatterjee, and also all of my colleagues at Imec, especially Dhurv, Narasimha, Rajkumar, Jack, Kanishk, Namita and Yumiao, for helpful discussions and suggestions on many practical issues of design and documentation.

I would also like to thank Naeim Safari, Duygu Kıral, Esra İncesu, Jale Özdemir, Murat Görür, Partha Sarathy, Buket Kale, Merve Demirci, Emre Canayaz, Afshin Hemmati, Ezgi Özer and Emilie Peynaud for being my friends and inspiring me. I thank all people who are directly or indirectly involved in the completion of this thesis work, and whose names I might have forgotten to mention here.


Contents

1 Introduction
1.1 Motivation and Problem Formulation
1.2 Objective
1.3 Contribution
1.4 Structure of The Report

2 Basic Principles of Memory
2.1 Memory Fundamentals
2.1.1 Memory
2.1.2 Cache
2.2 Requirements For Embedded Memories in The Lower-Layer
2.2.1 Stability
2.2.2 Area
2.2.3 Latency
2.2.4 Reliability and Endurance
2.2.5 Energy Efficiency

3 Analysis of SRAM and RRAM
3.1 Analysis of SRAM
3.1.1 Static Random Access Memory Design
3.1.2 SRAM Operations
3.1.3 SRAM Timing Diagrams For VHDL Model
3.2 Analysis of RRAM
3.2.1 Resistive Random Access Memory Design
3.2.2 RRAM Operations
3.2.3 RRAM Memory Modelling in VHDL

4 RRAM Implementation For Memory Replacement
4.1 Related Work
4.2 ADRES Framework
4.2.1 Architecture for Dynamically Reconfigurable Embedded Systems
4.2.2 Dynamically Reconfigurable Embedded Systems Compiler
4.2.3 BoADRES
4.3 Base Architecture: SRAM Based Configuration Memory
4.4 RRAM Implementation for Configuration Memory
4.4.1 Problem and Solution
4.5 RRAM Implementation for Configuration Cache
4.5.1 Problem and Solution
4.5.2 Target Applications
4.5.3 Implementation
4.5.4 Very Wide Register
4.5.5 Modified RRAM Based Configuration Memory Organization with Very-Wide Line Size of Configuration Memory
4.5.6 Modified RRAM Based Configuration Memory Organization with Reduced Instruction Size
4.5.7 Modified RRAM Based Configuration Memory Organization with Bypass Multiplexer
4.5.8 Modified RRAM Based Configuration Memory Organization with Multiple Update
4.5.9 Modified RRAM Based Configuration Memory Organization with Two VWR

5 A Study on RRAM Based Data Memory
5.1 Impact of RRAM on Data Memory Organization

6 Results and Comparison
6.1 The Used Benchmarks
6.2 The Effect of RRAM on The Energy
6.3 Leakage Energy Contribution of the Memory
6.4 Results

7 Conclusion and Future Work
7.1 Conclusion
7.2 Future Work

Bibliography

A Appendix A - RRAM Model
A.1 VHDL Architecture Code for RRAM Behavioral Modelling
A.2 VHDL Entity Code for RRAM Behavioral Modelling

B Appendix B - VWR Modelling
B.1 VHDL Architecture Code For VWR Modelling
B.2 VHDL Entity Code For VWR Modelling

C Appendix C - VHDL Code For Control Unit
C.1 VHDL Architecture Code for Control Unit


List of Figures

1.1 Performance of volatile and non-volatile memories. [6]
1.2 Report organization.
2.1 A view of multilevel memory organization.
3.1 6T SRAM Cell. [18]
3.2 SRAM 6T Cell Model. [18]
3.3 Read and write operation.
3.4 6T SRAM Layout. [10]
3.5 SRAM Read Operation. [20]
3.6 SRAM Write Operation. [20]
3.7 RRAM Cell Modelling. [21]
3.8 RRAM Operations. [22]
3.9 RRAM Voltage-Current Curve. [22]
3.10 RRAM VHDL Model.
3.11 Behavioral model write operation.
3.12 Behavioral model read operation.
4.1 Example Architecture with 64 FUs. [29]
4.2 Rotating Register File. [29]
4.3 DRESC Tool Flow. [29]
4.4 BOADRES Multi Thread Processor Architecture. [29]
4.5 Cache Memory Structure.
4.6 The different building blocks of a CGA control unit.
4.7 Flow Graph.
4.8 State Machine Implementation.
4.9 Overall Implementation.
4.10 Very Wide Register. [31]
4.11 Very Wide Register VHDL Model.
4.12 Alternative Design 1.
4.13 Alternative Design 2.
4.14 Alternative Design 3.
4.15 Alternative Design 4.
5.1 Modified Data Memory Organization with VWR. [32]
6.1 RRAM Memory Array. [1]
6.2 Read energy consumption for 256 FFT on BOADRES.
6.3 Read energy consumption for 2K FFT on BoADRES.
6.4 Read energy consumption for PrecMtx2x on BoADRES.


List of Tables

1.1 Emerging NVMs [6] [4]
3.1 SRAM and RRAM Energy Comparison. [23]
3.2 Pin description for behavioral design of RRAM
4.1 Explanation of the states
5.1 Impact of RRAM on CGA Duty Cycles
5.2 Impact of long write latency on the CGA Duty Cycles in SRAM Based Data Memory Organization
5.3 Impact of long write latency on the CGA Duty Cycles in Modified RRAM Based Data Memory Organization
6.1 List of Kernels and their Functionality
6.2 The Number of Read and Write Cycles for Each Design Alternative for 256 FFT on BOADRES
6.3 The Number of Read and Write Cycles for Each Design Alternative for 2K FFT on BOADRES
6.4 The Number of Read and Write Cycles for Each Design Alternative for PrecMtx2x on BOADRES
6.5 The Number of Read and Write Cycles for Each Design Alternative for cpstIQCFO on BoADRES


Acronyms

SOC System-on-Chip

NVM Non-volatile memory

eNVM Embedded NVM

RAM Random access memory

ROM Read only memory

PCRAM Phase-change RAM

MRAM Magnetic RAM

CBRAM Conductive-bridge RAM

RRAM Resistive Random Access Memory

FeRAM Ferroelectric RAM

SRAM Static-Random Access Memory

ADRES Architecture for Dynamically Reconfigurable Embedded Systems

BOA Baseband On ADRES

BOADRES ADRES for the BOA

RTL Register Transfer Level

LRS Low resistive state

HRS High resistive state

BL Bit line

WL Word line

VLIW Very long instruction word

CGRA Coarse Grained Reconfigurable Array

FFT Fast Fourier Transform

DSP Digital Signal Processor

ASIC Application Specific Integrated Circuit

DRESC Dynamically Reconfigurable Embedded Systems Compiler

FU Functional Unit


VWR Very Wide Register

RRF Rotating Register File

IMO Instruction Memory Organization

CGA Coarse Grained Array

DVS Dynamic Voltage Scaling

L1 Level 1

L2 Level 2

L3 Level 3

MOS Metal Oxide Semiconductor

NMOS n-channel Metal Oxide Semiconductor Field Effect Transistor

FPGA Field Programmable Gate Array

BS-CMO Base SRAM Configuration Memory Organization

MRCMO Modified RRAM Based Configuration Memory Organization

RI-MRCMO Modified RRAM Based Configuration Memory Organization with Reduced Instruction Size

B-MRCMO Bypass - Modified RRAM Based Configuration Memory Organization

MU-MRCMO Modified RRAM Based Configuration Memory Organization with Multiple Update

MV-MRCMO Multiple VWR - Modified RRAM Based Configuration Memory Organization

XML Extensible Mark-up Language

DMQ Data Memory Queue


Chapter 1

Introduction

Embedded memories increasingly influence System-on-Chip (SOC) designs in terms of area, energy consumption, performance and manufacturing yield. The continuous progression of communication technologies, coupled with the continuous change in the range of applications of embedded systems, has led to an ever increasing demand for high performance, high throughput and flexibility from these systems. Earlier approaches such as digital signal processors, general purpose processors or Application Specific Integrated Circuits (ASICs) no longer fit the requirements: they either fail to deliver the required effectiveness or do not provide the required adaptability. [1]

Coarse Grained Reconfigurable Arrays (CGRAs) have shown their effectiveness in overcoming this shortcoming. CGRAs take advantage of the data flow dominance and exploit more parallelism. However, they are typically only programmable at the assembly level, and programming them is complicated for several reasons. Very long instruction word (VLIW) architectures provide more flexibility, but these architectures are intrinsically more power consuming. An optimized version of the Architecture for Dynamically Reconfigurable Embedded Systems (ADRES) architecture template, which combines the advantages of the VLIW and CGRA approaches while mitigating their drawbacks, is proposed in [2]. It takes advantage of the Dynamically Reconfigurable Embedded Systems Compiler (DRESC) framework.

The Instruction Memory Organization (IMO) of ADRES has a Coarse Grained Array (CGA) control unit that consists of two levels, the configuration memory and the configuration cache, as in many of the commercially available embedded systems today. The execution of the CGA is controlled by the CGA control unit, and an ADRES core employs as many CGA control units as there are CGA units. [3]

The data memory organization of the ADRES template consists of two scalar memories connected to a 32-bit scalar path. For multi-threaded architectures such as the ADRES for the BOA (BOADRES) processor, there is also an additional dedicated data memory and a mixed-width data path with scratch-pad memories attached to the 256-bit wide vector data path.


This project targets the design of Resistive Random Access Memory (RRAM) based configuration memory and data memory organizations for the target processor.

1.1

Motivation and Problem Formulation

Embedded NVM (eNVM) is generally integrated on a chip to store programs and data for embedded applications. eNVM requires high-voltage devices, has a limited retention time, and cannot achieve high-speed operation because of long write times. High performance embedded applications obviously require fast-access embedded Non-volatile memory (NVM). [4]

Owing to standard memories approaching their ultimate scaling limits, attention to alternative NVMs is increasing. Initially, emerging memories were based on Magnetic RAM (MRAM) or Ferroelectric RAM (FeRAM) materials. In addition to this, current NVM technology represents a fundamental change from the standard charge-based memory devices towards materials that can electrically switch their resistance, so-called resistive switching materials. [5]

Table 1.1 benchmarks emerging standalone NVMs in terms of access and write times, write current, area, endurance, and the resistance ratio (R ratio), i.e., the ratio between the high resistance state and the low resistance state. As seen in the table, MRAM is a promising NVM in terms of write time, area and endurance, but due to its limited R ratio it would be difficult to achieve a good sensing yield. Additionally, MRAM is more costly due to the additional mask requirements. Similarly, Phase-change RAM (PCRAM) requires extra circuits, such as a current cell regulator, in order to achieve a high R ratio. Although Conductive-bridge RAM (CBRAM) has a high endurance and a small write current, it suffers from long write and read times.

Table 1.1: Emerging NVMs [6] [4]

Memory          Flash      MRAM       PCRAM      FeRAM      RRAM
Write current   5 µA       0.87 mA    1.2 mA     N/A        25 µA
Write time      200 ms     10 ns      300 ns     10 ns      5 ns
Access time     15 ns      8 ns       12 ns      8 ns       8.5 ns
Endurance       10^5       10^15      10^6       N/A        10^10
Cell Area       Small      Medium     Medium     Medium     Medium
R Ratio         >10        2          10         N/A        >10

However, none of these new memories has fully succeeded, although each of them meets some of the important memory requirements in some sense. RRAM is one of the more promising candidates for next-generation NVMs due to a number of advantages such as a large R ratio, easy integration, fast read access times and relatively small read energy consumption. The non-volatile nature of RRAM offers a multifaceted solution to the ever increasing problem of on-chip leakage energy and can also be used to considerably reduce the area and power consumption of embedded memories. [7]

However, one of the weaknesses of current RRAM technologies is the relatively low endurance, which is about 10^10 write cycles. This is not sufficient for use in high-performance caches, but with special regularization, provided partly by the technology and partly by adaptations in the architecture, it can become sufficient. The integration of RRAM based memories into the traditional memory hierarchy therefore brings out new architectural difficulties. In contrast to the fast access during the read operation, the write access of RRAM technology faces a number of problems due to long latency, high energy consumption and low endurance. These issues result in challenges for designing simple RRAM based memory architectures.

In addition to this, RRAM is typically seen as a possible candidate to replace non-volatile memory as Flash approaches its scaling limits. However, because of the given features, another very appealing application is Static-Random Access Memory (SRAM) replacement in the lower layers of the data or instruction memory hierarchies of embedded systems. For that application, however, not density but performance and reduced energy consumption form the main requirements. Figure 1.1 provides a comparison between embedded volatile and non-volatile memories in terms of performance requirements.

[Figure 1.1: Performance of volatile and non-volatile memories. [6] The figure plots access time (ns) against Vddmin (V) for ROM, SRAM, DRAM, MRAM, RRAM, PCRAM and Flash: SRAM provides high-speed operation, the emerging NVMs target low Vddmin, and Flash serves power-off data storage.]


1.2

Objective

In previous work, the reading process of an embedded RRAM has been analyzed, revealing that the read speed of RRAM can even match that of an SRAM L1 cache memory (< 1 ns). [8]

Given these very promising results, it is the objective of this thesis to replace the SRAM with the RRAM. In more detail, the aim of this thesis is to study the RRAM based on the number of write and read cycles and latency measurements, and to implement the RRAM behaviour in the Register Transfer Level (RTL) design of the Imec BOADRES processor by designing the peripheral circuits.

The primary goal is to apply the RRAM to the lowest instruction and data memory layers of the multi-threaded BOADRES processor at 40 nm and 850 MHz, and thus to minimize the power consumption and area.

1.3

Contribution

In this thesis, we propose RRAM based configuration memory organizations that address the above mentioned challenges associated with the write behaviour of the RRAM. We rethink the conventional SRAM based memory organization and replace it by an RRAM based alternative. Since the main focus is on the system level changes, we will not study technology and circuit related matters.

The main bottleneck of the RRAM is the long write time. First, the reference design and the BOADRES processor architecture in RTL are analysed. The main contributions of the thesis are:

• Based on the read and the write access times a behavioral model of the RRAM is designed.

• A memory controller is designed which controls the read and the write operations to the instruction memory.

• Several instruction memory architectures have been proposed in order to find an architecture that provides an energy-efficient solution.

• In the last part of the thesis the data memory has been analysed and the effect of replacing the SRAM based data memory with RRAM is investigated. To the best of our knowledge, this is the first work that analyses such an RRAM based hybrid memory organization for CGRAs.

1.4

Structure of The Report

The rest of the report is organized as follows; Figure 1.2 shows the report organization.

• Chapter 2 briefly introduces the basic principles of the memory and discusses the main requirements for an embedded memory to replace the SRAM layer.


• In Chapter 3 the reference design is analysed and the RRAM behavioral model is presented.

• Chapter 4 contains the implementation of the RRAM to replace the configuration memory in the BOADRES processor. The different RRAM based memory organizations are proposed as alternatives, and the various components involved and the motivation behind their use are explained. We also analyse the write access patterns in the embedded application benchmarks and discuss the insights that led to the proposed hybrid architectures.

• Chapter 5 contains a study on the RRAM based data memory architecture.

• Chapter 6 is dedicated to the results and the comparison; it shows the architectural evaluation results of the proposed methodologies and compares them against a baseline SRAM architecture.

• Finally, conclusions and future work are presented in Chapter 7.

[Figure 1.2: Report organization. Chapter 2: Memory Basics; Chapter 3: Reference Design (SRAM) and RRAM Behavioral Model; Chapter 4: Implementation for the Configuration Memory; Chapter 5: A Study on the Data Memory; Chapter 6: Results & Comparison; Chapter 7: Conclusions & Future Work.]


Chapter 2

Basic Principles of Memory

2.1

Memory Fundamentals

2.1.1

Memory

Memory is a physical device which stores instructions and data with or without the need for external stimuli. This functionality typically requires at least two structural elements and/or mechanisms. Semiconductor memories can be classified as volatile and non-volatile memories with respect to type of data access. [9]

Non-volatile memory keeps the stored information even when the memory is not powered. The best known Non-volatile memories (NVMs) are Read only memory (ROM) and Flash memory. Early-stage NVMs were non-volatile Static-Random Access Memory (SRAM), Ferroelectric RAM (FeRAM) and Magnetic RAM (MRAM), whereas emerging non-volatile memories include Conductive-bridge RAM (CBRAM), Resistive Random Access Memory (RRAM), Racetrack and Millipede memories. [10]

2.1.2

Cache

A cache memory is used to reduce the average time to access memory, as it is smaller and faster. The cache memory stores the most frequently used data from the main memory. When the processor is going to read or write an instruction or data in the main memory, it first checks the same memory location in the cache memory so that the read or write operation can be done immediately, since this is much faster than reading from or writing to the main memory.

Each memory location has its own unique address, called the physical address. Each location in the cache is also unique within the cache unit and has a tag which maps it to an address in main memory.

When the processor requires a read or write operation to main memory, firstly it checks whether the data in that location is already available in the cache.

If there are three levels of cache hierarchy in the memory organization, the average effective memory access time is computed as follows:

$$ t_{\mathrm{eff}} = t_{L1} + (1 - h_{L1})\,t_{L2} + (1 - h_{L2})\,t_{L3} + (1 - h_{L3})\,t_{\mathrm{Main}} \qquad (2.1) $$

where t is the access time of the respective memory level and h is the hit rate.
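As a purely illustrative example (the hit rates and access times below are assumed values, not figures from this thesis), take t_L1 = 1 ns, t_L2 = 4 ns, t_L3 = 10 ns, t_Main = 100 ns, h_L1 = 0.90, h_L2 = 0.95 and h_L3 = 0.99. Equation (2.1) then gives

$$ t_{\mathrm{eff}} = 1 + 0.10 \cdot 4 + 0.05 \cdot 10 + 0.01 \cdot 100 = 1 + 0.4 + 0.5 + 1.0 = 2.9\ \mathrm{ns}, $$

showing that even modest miss rates in the lower levels dominate the average access time.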

Advanced processors have several layers of memory. The nearest layer is called the Level 1 (L1) cache and is commonly directly coupled to the central processing unit's pipeline.

Therefore the L1 cache has the shortest access time and provides the highest performance. To back up the relatively small L1 cache, a larger Level 2 (L2) cache is often used on-chip. The L2 cache is not directly coupled to the central processing unit and thus has relatively lower speed and performance, which causes a larger latency compared to the L1 cache. Occasionally a Level 3 (L3) cache is used as well, which has the lowest performance due to its location on the chip. A view of a multilevel memory organization is shown in Figure 2.1. [11]

Figure 2.1: A view of multilevel memory organization (CPU register file, Level-1 cache, Level-2 cache, Level-3 cache, main memory, secondary memory).

2.2

Requirements For Embedded Memories in The

Lower-Layer

Embedded memories have to fulfil several requirements which are essential for embedded applications, such as stability, short latency, reliability, area and power efficiency.

2.2.1

Stability

Attenuation of the power supply and threshold voltages to satisfy power constraints and performance requirements results in a significant degradation of memory cell data stability with the scaling of CMOS technology. Leakage currents in memory cells increase the power consumption and might create data read hazards. In order to avoid data read failures with minimum power consumption, the cell design, sizing and layout of the circuit are gaining more importance. [12]

2.2.2

Area

Recently, the area occupied by embedded memories has grown to more than half of the total area of a typical processor. [13] Thus area reduction is one of the hottest topics in memory design. In order to improve packing density, different kinds of memory block architectures and cell configurations have been presented. [14]

2.2.3

Latency

In embedded systems, especially those with heavy data computations, the latency of memory access is the major bottleneck in the system's performance. There are several causes of long latency, depending on how the memory is designed.

An efficient memory allocation and mapping approach, such as mapping arrays to memories and scheduling memory access operations, has to be proposed to resolve this challenge. [15]

2.2.4

Reliability and Endurance

Dramatic decreases in device dimensions and power supply have significantly reduced noise margins and challenged the reliability of processors. The primary reason behind this challenge is the soft error rate. Embedded memories are heavily impacted by it, and therefore the failure rate of embedded memories is becoming dominant.

Reliability is thus one of the most important issues for embedded memories, and there are different approaches to resolving this problem. Embedded non-volatile memories are good candidates to reduce this failure rate because of their resilience to environmental disturbances. [16]

2.2.5

Energy Efficiency

With the increasing effects of scaling, high leakage currents result in high energy consumption. In order to overcome this, new CMOS devices or emerging non-volatile memory devices can be used. Several circuit design techniques for reduced leakage and high-performance memory design have also been presented. [17]


Chapter 3

Analysis of SRAM and

RRAM

3.1

Analysis of SRAM

3.1.1

Static Random Access Memory Design

SRAM is a random access memory which can store data as long as the power is on. A cross-coupled inverter view of a six-transistor Metal Oxide Semiconductor (MOS) SRAM cell is shown in Figure 3.1. Each bit is stored on four transistors, and two additional access transistors are used to control access to the memory cell during read and write operations.

Generally, the 6T SRAM cell is used in many commercial chips due to its low leakage current and short operation latency, which is typically < 1 ns.


Figure 3.1: 6T SRAM Cell. [18]

An n-channel Metal Oxide Semiconductor Field Effect Transistor (NMOS) is used as the access transistor due to its higher mobility and lower on-resistance. Memory cells that use fewer than six MOS transistors are available as well. As the cost of processing a silicon wafer is partially fixed, using scaled cells and thus packing more bits on one wafer reduces the cost per bit of memory. However, due to aggressive CMOS scaling there are potentially several issues which can limit the progress of scaling.

One of the key factors limiting SRAM scaling is the reduced noise margin of 6T SRAM bit-cells at sub-threshold voltages. In many memory designs, a lower supply voltage has been used to reduce dynamic power consumption. Also, for static leakage reduction, source-body biasing has been utilised.

The analysis shows that under these conditions the noise margin of the SRAM cell is reduced as well; hence the range of applicability of scaling is limited by the noise margin requirements for safe read and write operations. [13]

3.1.2

SRAM Operations

An SRAM cell has three different operations as follows:


Figure 3.2: SRAM 6T Cell Model. [18]

• For the idle state, the word line is not asserted, the access transistors N2 and N4 disconnect the cell from the bit lines. The two cross coupled inverters will continue to reinforce each other as long as they are connected to the supply.

• The read cycle is started by asserting the word line, enabling both access transistors N2 and N4. In the second step the values stored in A and A_b are transferred to the bit lines. Assume that the memory content is '1'. In this case the bit line 'bit' will remain at '1' and 'bit_b' will be discharged through N4 and N3. On the 'bit' side, the transistors P1 and N2 pull the bit line towards Vdd.

If the content of the memory was a '0', the reverse would happen: the 'bit' line would be pulled towards 0 and 'bit_b' towards 1. The 'bit' and 'bit_b' lines then have a small voltage difference, which is sensed by the sense amplifier to determine whether the data is '1' or '0'.

• The write cycle begins by applying the value to be written to the bit lines. Then the word line is made high and the value to be stored is latched in.


Figure 3.3: Read and write operation. [19]

The input-drivers of the bit lines are designed to be much stronger than the relatively weak transistors in the cell itself, so that they can easily override the previous state of the cross-coupled inverters.

The voltage changes on the bit lines, the word line and the memory cell nodes are illustrated in Figure 3.3a and Figure 3.3b, respectively.

A 6T SRAM layout is shown in Figure 3.4.

Figure 3.4: 6T SRAM Layout. [10]

3.1.3

SRAM Timing Diagrams For VHDL Model

Figure 3.5 shows a simplified read operation from an SRAM cell. [20] Before the read operation starts, which corresponds to the clock transition from low to high, the address that is going to be read must be applied to the address input, the chip must be selected and the write enable signal must go low.

Each of these signals has its own setup and hold times, which must be considered during the design.

On the rising edge of the clock signal, the read operation begins. After a certain amount of time, which can vary between devices depending on the access time, the data appears at the output.

In order to introduce a delay before the data appears at the output, an asynchronous output enable signal might be used.

Figure 3.5: SRAM Read Operation. [20]

Figure 3.6 shows the write operation. In order to write data to a memory cell, the write address and the data must be applied to the address and data inputs before the clock transition starts, the chip must be selected and the write enable signal must be high.

When the clock signal goes from low to high, the address and input data are latched and the write operation begins. As with the read operation, each of these signals must remain valid for a specified amount of time.
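To make the described timing concrete, a minimal synchronous SRAM model is sketched below in VHDL. This is an illustrative sketch only, not the reference RTL used in this thesis; the generic names ADDR_WIDTH and DATA_WIDTH and the single-cycle behaviour are assumptions for the example.

-- Minimal synchronous SRAM sketch (illustrative only, not the reference design).
-- Address and data are sampled on the rising clock edge; a read completes in
-- one cycle, matching the timing of Figures 3.5 and 3.6.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sram_model is
  generic (
    ADDR_WIDTH : integer := 8;
    DATA_WIDTH : integer := 32
  );
  port (
    clk  : in  std_logic;
    cs   : in  std_logic;                                  -- chip select
    we   : in  std_logic;                                  -- write enable
    addr : in  std_logic_vector(ADDR_WIDTH-1 downto 0);
    din  : in  std_logic_vector(DATA_WIDTH-1 downto 0);
    dout : out std_logic_vector(DATA_WIDTH-1 downto 0)
  );
end entity sram_model;

architecture behavioral of sram_model is
  type mem_t is array (0 to 2**ADDR_WIDTH-1) of std_logic_vector(DATA_WIDTH-1 downto 0);
  signal mem : mem_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if cs = '1' then
        if we = '1' then
          -- write: address and data must be stable around the clock edge
          mem(to_integer(unsigned(addr))) <= din;
        else
          -- read: data appears at the output after the access time
          dout <= mem(to_integer(unsigned(addr)));
        end if;
      end if;
    end if;
  end process;
end architecture behavioral;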


3.2

Analysis of RRAM

3.2.1

Resistive Random Access Memory Design

Resistive memory is based on a switching mechanism, controlled by current and/or voltage, between two distinct resistive states, depending on the material of the memory element. Figure 3.7 (a) represents the 1-transistor 1-resistor (1T1R) RRAM cell schematic with an NMOS device, and Figure 3.7 (b) shows the physical implementation of the 1T1R memory cell. Resistive memory models can be classified based on the storage mechanism, such as filament based, interface based and programmable metallization cells. They can also be classified based on the electrical switching property into unipolar and bipolar switching cells. [21]

(a) 1T1R Cell Model (b) HfO2 based RRAM

Figure 3.7: RRAM Cell Modelling [21]

3.2.2

RRAM Operations

A forming procedure is needed to initialize an RRAM cell before regular read or write operations. The forming procedure sets the RRAM cell to the Low resistive state (LRS).

A read operation can be performed with a sense amplifier: the Word line (WL) of the memory element to be read is grounded and the Bit line (BL) is set to a particular voltage by the sense amplifier. A write operation consists of either a set or a reset operation. As seen in Figure 3.8 (a) and (b), in the reset operation the RRAM device goes from the LRS to the High resistive state (HRS) by applying the reset voltage to the source of the NMOS device and 0 V to the BL. In the set operation, the RRAM device changes from the HRS back to the LRS by applying the set voltage to the BL and 0 V to the source of the NMOS device. [22] [21]


(a) RRAM Reset Operation (b) RRAM Set Operation

Figure 3.8: RRAM Operations [22]

Figure 3.9: RRAM Voltage-Current Curve. [22]

Table 3.1 shows an SRAM and RRAM energy comparison based on the width of a single word. As seen in the table, wide word access is more energy efficient. Thus we will try to keep the single word size as wide as possible in the next stages of the design.


Table 3.1: SRAM and RRAM Energy Comparison. [23]

                               64 Bit SRAM                          64 Bit RRAM
Read Energy                    32-bit word      512-bit wide        32-bit word      512-bit wide
Consumption                    read access      word read access    read access      word read access

Read Energy Per Access         4.39 pJ          32.41 pJ            0.74 pJ          2.20 pJ
Read Energy Per Accessed Bit   0.13 pJ          0.063 pJ            0.023 pJ         0.004 pJ
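As a quick consistency check (the division below is ours, not part of the cited data), the per-bit figures in Table 3.1 follow from dividing each per-access energy by the accessed word width:

$$ \frac{4.39}{32} \approx 0.137\ \mathrm{pJ}, \quad \frac{32.41}{512} \approx 0.063\ \mathrm{pJ}, \quad \frac{0.74}{32} \approx 0.023\ \mathrm{pJ}, \quad \frac{2.20}{512} \approx 0.004\ \mathrm{pJ}, $$

which matches the last row of the table to within rounding.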

3.2.3

RRAM Memory Modelling in VHDL

In this section we present a behavioral model of the RRAM, written in VHDL. To achieve this, no signal sizes are fixed in the description; unconstrained ports and the use of array attributes allow easy re-use of this memory.

Conceptually, the RRAM's address is used as an index into the memory array. The address port is modelled as a std_logic_vector signal with a predefined ADDR_WIDTH generic.

This memory model exhibits idle, read and write behaviours, as in dynamic memories. Figure 3.10 shows the block diagram of the implemented behavioral model of the parameterized RRAM.

Figure 3.10: RRAM VHDL Model.

In previous thesis work, the access and write times have been determined from simulations of the physical mechanisms and from material measurements, and the behavioral model has been generated based on these parameters. The memory access time is determined to be < 1 ns, while the write time is between 5 and 10 ns. [24] The important constraint for the behavioral design of the RRAM is that the address inputs must be stable at the rising edge of the memory enable input. A read operation can be done in one clock cycle due to the short access time of the RRAM; however, a write operation takes more than one clock cycle, depending on the operating frequency of the processor. Taking this into consideration, the memory model has been designed to allow for the resulting stalls.
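For orientation only (an illustrative calculation, not a figure taken from [24]), at the 850 MHz target clock of Section 1.2 the clock period is roughly 1.18 ns, so a write occupies about

$$ \left\lceil \frac{5\ \mathrm{ns}}{1.18\ \mathrm{ns}} \right\rceil = 5 \quad \text{to} \quad \left\lceil \frac{10\ \mathrm{ns}}{1.18\ \mathrm{ns}} \right\rceil = 9 $$

clock cycles, whereas the sub-nanosecond read fits within a single cycle.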

Table 3.2 lists the RRAM port description. The address input is used to select the memory location; the number of addresses depends on the memory size.

Table 3.2: Pin description for behavioral design of RRAM

Port     Type                                           Description
clk      in  std_logic                                  System clock
cs       in  std_logic                                  Chip select signal
we       in  std_logic                                  Write enable signal
enable   in  std_logic                                  Enable signal
addr     in  std_logic_vector(ADDR_WIDTH-1 downto 0)    Address data
din      in  std_logic_vector(DATA_WIDTH-1 downto 0)    Input data
dout     out std_logic_vector(DATA_WIDTH-1 downto 0)    Output data

During a write operation:

• The input data is applied at the data input pin.

• This data is transferred into the suitable signal and stored in the selected memory location.

• The write access is only possible when the CS pin is high.
• The WE pin should be set high for a write operation.
• The address and the input data have to be stable during the setup and hold times.

During a read operation:

• Data from the selected memory location appears at the data output once the access is complete.

• The read access is only possible when the CS pin is high.
• The WE pin should be set low for a read operation.
• The address has to be stable during the setup and hold times.

The full RRAM behavioral model in VHDL is presented in Appendix A; a condensed sketch is shown below.
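The sketch below condenses the idea: it follows the port list of Table 3.2 and the single-cycle-read / multi-cycle-write behaviour described above. The generic WRITE_CYCLES and the extra busy handshake signal are illustrative assumptions and are not part of the model in Appendix A.

-- Condensed, illustrative RRAM behavioral model (see Appendix A for the real one).
-- Reads complete in one clock cycle; writes are held for WRITE_CYCLES cycles to
-- mimic the 5-10 ns RRAM write time at the processor clock frequency.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity rram_model is
  generic (
    ADDR_WIDTH   : integer := 8;
    DATA_WIDTH   : integer := 32;
    WRITE_CYCLES : integer := 5          -- assumed number of stall cycles per write
  );
  port (
    clk    : in  std_logic;
    cs     : in  std_logic;
    we     : in  std_logic;
    enable : in  std_logic;
    addr   : in  std_logic_vector(ADDR_WIDTH-1 downto 0);
    din    : in  std_logic_vector(DATA_WIDTH-1 downto 0);
    dout   : out std_logic_vector(DATA_WIDTH-1 downto 0);
    busy   : out std_logic               -- extra handshake signal, not in Table 3.2
  );
end entity rram_model;

architecture behavioral of rram_model is
  type mem_t is array (0 to 2**ADDR_WIDTH-1) of std_logic_vector(DATA_WIDTH-1 downto 0);
  signal mem      : mem_t;
  signal wait_cnt : integer range 0 to WRITE_CYCLES := 0;
  signal wr_addr  : std_logic_vector(ADDR_WIDTH-1 downto 0);
  signal wr_data  : std_logic_vector(DATA_WIDTH-1 downto 0);
begin
  busy <= '1' when wait_cnt > 0 else '0';

  process (clk)
  begin
    if rising_edge(clk) then
      if wait_cnt > 0 then
        -- an RRAM write is in progress: stall until the write time has elapsed
        wait_cnt <= wait_cnt - 1;
        if wait_cnt = 1 then
          mem(to_integer(unsigned(wr_addr))) <= wr_data;
        end if;
      elsif cs = '1' and enable = '1' then
        if we = '1' then
          -- latch address/data and start the multi-cycle write
          wr_addr  <= addr;
          wr_data  <= din;
          wait_cnt <= WRITE_CYCLES;
        else
          -- read completes in a single cycle thanks to the short RRAM access time
          dout <= mem(to_integer(unsigned(addr)));
        end if;
      end if;
    end if;
  end process;
end architecture behavioral;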


The VHDL code is simulated with a testbench in order to verify the functionality of the designed memory model. The testbench is based on consecutive read and write operations to the memory. All simulations have been completed using ModelSim.

Figure 3.11: Behavioral model write operation.

As seen in Figure 3.12, a read operation completes within a single clock cycle, whereas a write operation takes more than one clock cycle. Figure 3.11 and Figure 3.12 show the write and read operations, respectively.

Figure 3.12: Behavioral model read operation.


Chapter 4

RRAM Implementation For

Memory Replacement

In this chapter the Resistive Random Access Memory (RRAM) implementation for the configuration memory, the configuration cache and the data memory is analysed. The chapter presents a detailed discussion of the architecture of the RRAM based memory organizations.

Section 4.1 presents the related work. Section 4.2 provides an introduction to the Architecture for Dynamically Reconfigurable Embedded Systems (ADRES) architecture template. The base architecture with Static-Random Access Memory (SRAM) is explained in Section 4.3. Furthermore, in the following sections we present RRAM based configuration memory, configuration cache and data memory architectures that adapt the RRAM and the Very Wide Register (VWR) to the Coarse Grained Reconfigurable Array (CGRA).

4.1

Related Work

Flash based devices have several advantages, such as being non-volatile, but unfortunately they also lack many of the features which can only be achieved with SRAM based memory, since flash based devices require a time-consuming erasure process. Magnetic RAM (MRAM) is a good candidate to fill this gap. An exploration of MRAM based memories to control complex experiments, especially dynamic reconfiguration and multi-context Field Programmable Gate Arrays (FPGAs), is presented in [25]. A 120 nm technology node was used to design the MRAM layout.

Also, in [26] a reconfigurable hybrid cache architecture is presented that couples Non-volatile memory (NVM) with SRAM in the last-level cache. The reconfigurable hybrid cache organization is explored in terms of circuit and architectural issues. This unified architecture achieves 63%, 48% and 25% better energy efficiency than a non-reconfigurable SRAM based cache, a non-reconfigurable hybrid cache, and a reconfigurable SRAM based cache, respectively. In addition, the proposed architecture provides high endurance compared to the base architecture.


Similarly, another work in the field of hybrid cache architectures is presented in [27]. A bandwidth-aware reconfigurable cache hierarchy is proposed, consisting of the hybrid cache hierarchy, the reconfiguration method, and the prediction mechanism. In this bandwidth-aware reconfigurable cache architecture, the cache organization is dynamically reconfigured at each level, adapting to the demands of different applications. The architecture resulted in throughput improvements for the multi-threaded benchmarks as well as performance improvements for the multiprogrammed applications.

In contrast to the hybrid architectures presented above, our architecture is a software-controlled configuration memory, leading to lower energy and area overhead. In all the above hybrid architecture proposals the frequently accessed lower levels are still SRAM based, while in our proposed architecture both of the most frequently accessed lowest levels are fully RRAM based.

4.2

ADRES Framework

This section gives an introduction to the ADRES architecture template and is based on [28], [29] and [30].

With the increasing number of wireless standards, Digital Signal Processor (DSP) and Application Specific Integrated Circuit (ASIC) based solutions no longer fulfil the requirements for designing high-performance devices. The design of an ASIC brings high engineering cost and slow time-to-market for a given application. On the other hand, DSP based solutions are more flexible but cannot provide the required speed and efficiency.

A CGRA is a good candidate to obtain high efficiency in terms of power, high performance and reduced chip area in typical communication applications thanks to its coarse-grained operations. CGRAs consist of an array of basic functional units which can execute both word-level and sub-word-level operations.

ADRES, in combination with the Dynamically Reconfigurable Embedded Systems Compiler (DRESC), couples a Very long instruction word (VLIW) processor and a coarse-grained reconfigurable array. Since DRESC handles the switching between these two execution modes, programming the processor is simplified. By coupling the VLIW and the CGRA, ADRES takes advantage of instruction level parallelism as well as loop level parallelism.

4.2.1

Architecture for Dynamically Reconfigurable Embedded Systems

A typical ADRES architecture involves routing resources, storage resources and computational resources. Figure 4.1 shows an ADRES architecture with 64 Functional Units (FUs), register files, multiplexers and wires. As seen in the figure, due to the close connectivity between the FUs and the small, distributed Register Files (RFs), the ADRES processor offers power-efficient solutions.

The code to be executed in VLIW mode is compiled by DRESC, and in this way instruction level parallelism is exploited.


Figure 4.1: Example Architecture with 64 FUs. [29]

If the number of FUs is k, in the best case k instructions can be issued per cycle. Instructions for the VLIW FUs are fetched from main memory and decoded by a dedicated unit. The code which composes the loop body is executed on the CGRA FUs; thus in CGRA mode loop level parallelism is exploited. In CGRA mode, all the FUs receive their configurations from the configuration memory. The switch from CGRA to VLIW mode is handled by means of a loop stop signal, and the switch from VLIW to CGRA mode is done by issuing a special operation.

Basic building blocks in the ADRES template are briefly explained in the following subsections.

• Functional Units

The core element in the ADRES architecture is the functional unit. The instruction set for the FUs is defined by the compiler and can be extended with user-defined instructions. FUs can be specified in terms of their functionality and intended purpose in the architecture, such as memory load and store FUs, arithmetic and logic FUs, etc.

• Register Files

Register files are used to store intermediate data in the ADRES architecture. There are two types of register files:

Predicate register files.
Data register files.

Calculating RF addresses in each loop iteration can result in substantial overhead. In order to overcome this issue, the ADRES architecture employs rotating register files. Figure 4.2 shows the Rotating Register File (RRF) architecture.


In this structure, the physical address is calculated from the virtual address and the loop iteration counter, as sketched below.
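A minimal sketch of this address translation in VHDL is given below; the entity name, the RRF_DEPTH generic and the port names are illustrative assumptions, not taken from the ADRES RTL.

-- Illustrative rotating-register-file address translation: the physical address
-- is the virtual address offset by the loop iteration counter, modulo the depth.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity rrf_addr is
  generic (
    RRF_DEPTH  : integer := 8;   -- assumed number of rotating registers
    ADDR_WIDTH : integer := 3
  );
  port (
    virt_addr  : in  std_logic_vector(ADDR_WIDTH-1 downto 0);
    iter_count : in  std_logic_vector(ADDR_WIDTH-1 downto 0);
    phys_addr  : out std_logic_vector(ADDR_WIDTH-1 downto 0)
  );
end entity rrf_addr;

architecture rtl of rrf_addr is
begin
  -- one small adder per register-file port, as described for the RRF;
  -- the sum is widened first so it does not wrap before the modulo
  phys_addr <= std_logic_vector(
                 resize((resize(unsigned(virt_addr), ADDR_WIDTH + 1)
                         + unsigned(iter_count)) mod RRF_DEPTH,
                        ADDR_WIDTH));
end architecture rtl;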

CHAPTER 2. ADRES AND DRESC DESIGN FRAMEWORK 17 3.1. ADRES ARCHITECTURE TEMPLATE DESCRIPTION 51

Table 3.2: Instruction set only for VLIW FUs Op Description Op Description JUMP unconditional branch JSR subroutine call

RTS return from subroutine BEQ branch if == BNE branch if ! = BGT branch if > BGT U branch if >, unsigned BGE branch if≥ BGE U branch if≥, unsigned BLT branch if < BLT U branch if <, unsigned BLE branch if≤ BLE U branch if≤, unsigned

iteration 1 iteration 2 iteration 3 def def def use use use life-time of var.

Figure 3.5: Overlapped life-time of a variable

RRF + virtual address iter. counter physical adress Figure 3.6: Rotating register file

register renaming method. Each physical RF address is calculated by adding a virtual RF address and a value from the iteration counter (Figure 3.6). Hence, the different iterations of the same variable are actually assigned to different physical registers to avoid name clashes. In ADRES, the RRF approach is adopted. It increases hardware cost by requiring an additional counter and adder for each port of the register file. However, since the distributed RF is very small. An 8-entry RRF only needs 3-bit adders and counters. Hence the extra hardware costs are limited.

3.1.5 Routing Networks

The routing networks consist of a data network and a predicate network. The data network routes the normal data among FUs and RFs, while the predicate network directs 1-bit predicate signals. These two networks do not necessarily have the same topology and can not overlap because of different data widths. Figure 2.4: Rotating Register File

When switching between VLIW mode and CGRA mode, data needs to be passed be-tween the different views. Therefore the compiler identifies live-in and live-out variables2 that serve as communication channels between the VLIW and the CGRA.

In VLIW mode only the non-rotating part of the data register files should be used, whereas in CGRA mode both parts are accessible. From the VLIW mode perspective, values stored in the rotating part of the data RFs are not live over the execution of kernels in CGRA mode.

Memories

In each ADRES instance there should be at least a global data memory that is connected to the VLIW FUs. Therefore those FUs need to have a bidirectional port of the memory width, e.g. 32 bit. In multithreaded architectures this global memory facilitates inter-thread communication. Each partition in such a multiinter-threaded architecture might also have an additional dedicated data memory.

Although initially not supported, in later versions of the DRESC design framework and the ADRES template, support for distributed memory blocks was added. The BoADRES (see section 5.1 for details) architecture for example comprises a mixed width datapath with scratchpad memories attached to the 256-bit wide vector data path.

Furthermore a queuing mechanism called Direct Memory Queue (DMQ) was added to the architecture. This memory queue is designed to reduce stall cycles by hiding memory latency and bank confilict generated by data memory operations of multiple load/store units.

Constant Memories

Operands to operations can either be register operands or immediate operands. For the VLIW FUs those immediate operands are part of the instruction and are passed to the FU from the decoding unit. For the CGRA FUs this is not possible because only the

(A live-in variable is written before entering the loop and read inside the loop, whereas a live-out variable is written inside the loop and read by a successor block.)

Figure 4.2: Rotating register file [29].

• Memories

An ADRES architecture contains a global memory that is connected to the VLIW unit. In order to support immediate operands in CGRA mode, configuration memories are used to feed the source ports of the CGRA FUs. Detailed information about the memory architecture is given in the BOADRES section, as our main focus is to explore the memory organization on ADRES for the BOA (BOADRES) processor.

4.2.2 Dynamically Reconfigurable Embedded Systems Compiler

The DRESC compiler allows C code to be compiled onto the ADRES architecture, using a scheduling algorithm that maps loops onto the available FUs in an efficient way. The DRESC tool flow is shown in Figure 4.3. As can be seen in the figure, the tool flow performs the following operations.

• Source-level transformations can be applied to improve pipelining.

• As a second step, the C code is passed to the OpenIMPACT compiler, which performs several optimizations such as control-flow reduction and inlining of functions.

• The generated Lcode is transferred to the DRESC compiler, where it is split into two parts, for the VLIW and CGRA modes, to exploit instruction-level parallelism and loop-level parallelism respectively.


Figure 2.7: DRESC tool flow. Inputs: C code and XML architecture description. Front-end: source-level transformations and OpenIMPACT (producing Lcode). DRESC: dataflow analysis and optimizations, modulo scheduling (CGRA) and ILP scheduling with register allocation (VLIW). Back-ends: architecture parser and abstraction, compiled-code simulator, xml2vhdl and synthesis to a gate-level design, binutils for binary files, Esterel and ModelSim simulation, and power calculation.

support for intrinsic operations and code constraints, source level transformations and the integrated simulation framework.

2.2.1 Toolflow

As depicted in Figure 2.7, the design framework basically needs two input sources to compile an application for a specific ADRES instance: the high-level C code files of the application and the XML architecture description file.

The parts of the framework that are not relevant for this thesis are grayed out in the figure and will not be discussed any further.

In order to make loops pipelineable and to improve mapping efficiency, some manual source-level transformations might be necessary.


4.2.3 BoADRES

BOADRES is a scalable baseband processor template for Gbps 4G radios. The BOADRES processor is built around an ADRES core, a new-generation power-efficient, high-performance and flexible processor architecture designed to address the data processing challenges of future mobile terminals.

ADRES couples a VLIW processor with a Coarse Grained Array (CGA) accelerator through a shared central register file. The number and type of memories in the processor are listed below.

• 2 scalar and 1 global scalar memories
• 4 vector (scratchpad) memories
• 2 levels of configuration memory


4.3 Base Architecture: SRAM-Based Configuration Memory

The ADRES architecture template consists of an array of basic components: FUs, RFs and routing resources. It tightly combines a VLIW processor and a CGRA in the same fabric. Generally, loops are executed in the reconfigurable array, while the rest of the code is executed in the VLIW processor. The data communication between the VLIW processor and the reconfigurable array is handled via the shared RFs and shared memory access.

ADRES is an adaptable template specified by an Extensible Markup Language (XML)-based architecture specification language. The reconfigurable nature of the processor provides a customizable number of FUs, RFs and connections.

This template is integrated into the DRESC compiler design framework, which constructs the architecture based on the XML description. The execution of the CGA is controlled by the CGA control unit, whereas the VLIW processor has a standard fetch, decode, execute pipeline that receives its instructions from the instruction cache. The configuration memory is loaded via the configuration memory interface after system reset.

The ADRES central control unit controls all CGA control units. The CGA control units provide configurations and control signals to particular units of the CGA array. The original memory organization of the CGA control unit is taken as the reference for our subsequent implementations and test benches.

Figure 4.5: Cache Memory Structure.


The configuration cache is built from an array of n D-latch registers and 2 flip-flop registers. In the current architecture the cache depth is 16; thus, in the rest of this report, the CGA control unit is always assumed to have a cache depth of 16. This configuration memory organization is shown in Figure 4.5.

A configuration memory line is a single instruction wide, which is 135 bits in this version. The cache is loaded during the first iteration of the loop and used in all later iterations. Since flip-flops and D-latches are used instead of memory macros, this structure consumes substantially less energy during read operations.

This SRAM-based memory organization will be referred to as the Base SRAM Configuration Memory Organization (BS-CMO) later on.
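For reference, the following VHDL fragment is an illustrative model of such a register-based configuration cache with the parameters given above (depth 16, 135-bit lines). The actual BS-CMO combines D-latch registers with two flip-flop registers; this simplified flip-flop-only sketch, with assumed entity and port names, only mirrors the intended write-once, read-many behaviour.

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of a register-based configuration cache.
entity config_cache is
  generic (
    DEPTH      : integer := 16;   -- cache depth used in this architecture
    LINE_WIDTH : integer := 135   -- one configuration line = one instruction
  );
  port (
    clk     : in  std_logic;
    we      : in  std_logic;
    wr_addr : in  integer range 0 to DEPTH - 1;
    wr_data : in  std_logic_vector(LINE_WIDTH - 1 downto 0);
    rd_addr : in  integer range 0 to DEPTH - 1;
    rd_data : out std_logic_vector(LINE_WIDTH - 1 downto 0)
  );
end entity;

architecture rtl of config_cache is
  type cache_t is array (0 to DEPTH - 1) of std_logic_vector(LINE_WIDTH - 1 downto 0);
  signal cache : cache_t;
begin
  -- The cache is filled during the first loop iteration and only read in
  -- the later iterations, so a single write port is sufficient.
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        cache(wr_addr) <= wr_data;
      end if;
    end if;
  end process;

  -- Combinational read path, mirroring the low-energy latch-based read.
  rd_data <= cache(rd_addr);
end architecture;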

4.4 RRAM Implementation for Configuration Memory

As mentioned above, an ADRES core contains several CGA control units, depending on the number of CGA units. Each control unit consists of a configuration memory, a configuration cache and control logic. The configuration memory is loaded after the system is started. The input signals to the configuration memory are generated by the configuration memory interface block.

After system reset, CGA configurations are sent to the configuration memory from an L2 memory, provided the memory is not being read and its power is not down. The different building blocks of a CGA control unit are shown in Figure 4.6.


4.4.1 Problem and Solution

In the current control logic, each configuration is written to the memory in one clock cycle, thanks to the short write time of the SRAM model. The write time of RRAM is, as explained above, 5-10 ns.

The BOADRES configuration unit requires a number of signals to perform read and write operations successfully. In the base design both read and write operations occur without any delay between two consecutive operations. This approach is simple, but it is not compatible with the RRAM architecture due to the long time required to write to RRAM.

This is solved by creating a finite state machine that introduces a specific number of delay cycles after every single write operation, while read operations are performed without delay. This is achieved by introducing a stall mechanism and designing a memory controller, without a noticeable performance degradation since the required bandwidth is not very high. This solution provides high reliability and compatibility for the subsequent implementations.

During the implementation the write latency is treated as a parameter, and a memory controller is designed to manage read and write operations to the memory based on the number of delay cycles. The controller is designed in a full digital design flow and implemented in VHDL.

Finally, after implementation, a 256-point Fast Fourier Transform (FFT) is used as the benchmark. The controller is a finite state machine whose behaviour depends on the previous state and the current input. The operations are carried out in four states: idle, read, write and wait. The flow graph and the state diagram of the controller are shown in Figure 4.7 and Figure 4.8 respectively.

The write speed is much lower than the system clock, so the number of delay cycles is defined as a variable. The configuration memory controller provides the required signals, the input data and the write address until the end of the write operation.

Considering that the read speed of RRAM can even match that of SRAM, the read operation is intrinsically performed without any delay. The memory controller sets the read address and the control signals fed by the arbiter block and sends the requested data to the output.
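The following VHDL fragment is a minimal sketch of such a controller FSM, assuming a generic WRITE_DELAY_CYCLES parameter for the stall length. The entity, port and signal names are illustrative and do not come from the BOADRES RTL; for simplicity, reads are modelled here as a separate single-cycle state rather than being served concurrently with a pending write.

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of the four-state write controller: idle, write, wait, read.
entity rram_cmem_ctrl is
  generic (
    WRITE_DELAY_CYCLES : integer := 5  -- assumed stall length per write
  );
  port (
    clk, rst  : in  std_logic;
    wr_req    : in  std_logic;  -- write request (cf. Arbiter_we in Figure 4.8)
    rd_req    : in  std_logic;  -- read request (cf. Config_re in Figure 4.8)
    mem_we    : out std_logic;  -- write enable towards the memory macro
    mem_en    : out std_logic;  -- memory enable
    mem_ready : out std_logic   -- '0' gives back-pressure while a write completes
  );
end entity;

architecture rtl of rram_cmem_ctrl is
  type state_t is (S_IDLE, S_WRITE, S_WAIT, S_READ);
  signal state     : state_t := S_IDLE;
  signal delay_cnt : integer range 0 to WRITE_DELAY_CYCLES := 0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state     <= S_IDLE;
        delay_cnt <= 0;
      else
        case state is
          when S_IDLE =>                -- memory ready for a new request
            if wr_req = '1' then
              state <= S_WRITE;
            elsif rd_req = '1' then
              state <= S_READ;
            end if;
          when S_WRITE =>               -- launch the write, then always stall
            delay_cnt <= WRITE_DELAY_CYCLES;
            state     <= S_WAIT;
          when S_WAIT =>                -- hold signals until the RRAM write time elapses
            if delay_cnt = 0 then
              state <= S_IDLE;
            else
              delay_cnt <= delay_cnt - 1;
            end if;
          when S_READ =>                -- reads are served without extra delay
            state <= S_IDLE;
        end case;
      end if;
    end if;
  end process;

  -- Output decoding, following the per-state annotations of Figure 4.8.
  mem_we    <= '1' when state = S_WRITE or state = S_WAIT else '0';
  mem_en    <= '1' when state /= S_IDLE else '0';
  mem_ready <= '1' when state = S_IDLE else '0';
end architecture;

The value of WRITE_DELAY_CYCLES follows from the ratio of the RRAM write time to the clock period; it is kept as a generic here because that ratio is architecture-dependent.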


Figure 4.7: Flow graph of the configuration memory controller (memory bank arbiter, FSM, configuration memory and I$/configuration-output processes; input from the configuration memory interface, configuration output to the functional units).

Figure 4.8: State machine implementation (states IDLE, WRITE, WAIT(n) and READ; transitions driven by the Memory_busy and Memory_ready conditions and the Arbiter_we and Config_re requests; per-state values of mem_we, mem_en and mem_ready). Each state is explained in Table 4.1.

Table 4.1: Explanation of the states

Idle state: the memory is ready for read or write operations; address, input and output are disabled; pull-up and pull-down are disabled; the next state is determined.

Write state: the memory begins the write operation; the address and input data are captured; pull-up and pull-down are enabled; the next state is set to wait.

Wait state: back-pressure is given to the bus; pull-up and pull-down are enabled; the counter is checked and, based on its value, the next state is set.

Read state: the memory begins the read operation; pull-up and pull-down are enabled; the address is captured; the next state is set.
