Energy monitoring of the Cortex-M4 core, embedded in the Atmel SAM G55 microcontroller


Faculty of Technology and Society Computer Engineering

Engineering Degree Thesis

15 credits


Zeid Bekli

William Ouda

Exam: Bachelor of Science in Engineering
Examiner: Olle Lindeberg
Subject Area: Computer Engineering
Supervisor: Tommy Andersson
Date of final seminar: 2017-06-01


Abstract

The technology in cellular phones, portable computing systems, and intelligent and connected devices is evolving at a high pace, and in many cases these devices are required to operate in a low-power environment. A problem that continues to emerge is the power consumption of microcontrollers and DSP devices. Over time, this issue has become important to solve in order to maximize battery life. To ease the choice of power-efficient microcontrollers, controlled experiments were performed with the Cortex-M4. This microcontroller was chosen because of its upgraded hardware, which has led to an appreciable change in both power and speed efficiency compared to its predecessors.

The conclusion presents important points, along with advantages and difficulties to consider when implementing a DSP application. A comparison of different optimizations with the Floating Point Unit (FPU), Fixed-point and software Floating-point shows that there are major differences in power consumption between these three options. Depending on which option and optimization is used, the power


Acknowledgements

We would like to show our appreciation to our supervisor Tommy Andersson for taking the time to guide and support us during this thesis.

We would also like to thank Magnus Krampell for all the support and encouragement during our three years of studies at Malmö University.


Acronyms

ADC          Analog to Digital Converter
ADP          Atmel Data Protocol
CMSIS-DAP    Cortex Microcontroller Software Interface Standard-Debug Access Port
CMSIS_DSP    Cortex Microcontroller Software Interface Standard-Digital Signal Processing
DMIPS        Dhrystone Million Instructions Per Second
DP           Double Precision
DSC          Digital Signal Controllers
DSP          Digital Signal Processing
DSP device   Digital Signal Processor
DWT          Data Watchpoint and Trace
EDBG         Atmel Embedded Debugger
FFT          Fast Fourier Transform
FIR          Finite Impulse Response
FPO          Floating Point Operations
FPU          Floating Point Unit
HP           Half Precision
GCC          GNU Compiler Collection
IIR          Infinite Impulse Response
IoT          Internet of Things
JTAG         Joint Test Action Group
MAC          Multiplier–Accumulator unit
MCU          Microcontroller Unit
MSB          Most Significant Bit
Opt          Optimization
SIMD         Single Instruction Multiple Data
SP           Single Precision


1 Introduction

Signal processing is used in almost every technology that we rely on today, such as cellphones, computers, smart watches and automotive control systems, to name a few [1].

“Signal processing is at the heart of our modern world, powering today’s entertainment and tomorrow’s technology” [1].

One part of signal processing is Digital Signal Processing (DSP), which refers to a set of algorithms that are used to process digital signals. Some of these algorithms improve the signal by means of filters such as Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters. Another widely used algorithm is the Fast Fourier Transform (FFT) [2].

A DSP device¹ is a microprocessor specialized in processing real-time signals and algorithms for DSP. The advantage of a DSP device over a general-purpose processor is that it processes the same algorithms more efficiently, because signal processing is its main task [3][4]. One of the most common issues is that the increased complexity of DSP blocks has led to an ever-increasing power consumption challenge. The DSP blocks consist of a network of adders and multipliers, and it is typically these networks that have a major influence on the power consumption of a DSP device [5][6].

There are processors manufactured with DSP architecture layers that help the general-purpose processor execute DSP algorithms more efficiently; these can be characterized as Digital Signal Controllers (DSC). This means that there might not be any reason to buy a separate DSP device and a general-purpose processor to get the desired results. The cost of the products being constructed can be reduced significantly by replacing two processors with one high-performance processor with a DSP extension [7]. One of these processors is the Cortex-M4, a member of the ARM Cortex-M processor family. The Cortex-M4 is a powerful member of the Cortex-M family and is used worldwide in a range of digital signal control embedded market segments. Robotics is one of these segments and has a critical role in healthcare, for example in precision surgery and assisted living. Other important segments are automotive control systems, smartwatches and medical instruments [8].

¹ Both digital signal processing and digital signal processor use the acronym DSP; therefore, from here on the digital signal processor is referred to as a DSP device.


1.1 Problem domain

The technology in cellular phones, portable computing systems, and intelligent and connected devices is evolving at a high pace, and in many cases these devices are required to operate in a low-power environment [7].

A problem that continues to emerge is the power consumption of microprocessors and DSP devices. Over time, this issue has become important to solve in order to maximize battery life [7][9]. For mobile devices, where higher performance and better power efficiency are crucial to success, one option to consider is the Cortex-A72 processor. It uses the power-optimizing ARM big.LITTLE™ processing technology, which combines high-performance ARM CPU cores with more power-efficient ARM CPU cores to give good performance at a significantly lower average power [10].

The Cortex-M4 processor differs from the Cortex-A72 processor in that it is built on a single high-performance core [11], which means that it cannot use the big.LITTLE processing technology. Yet the Cortex-M4 processor is used in areas such as robotics and healthcare. This is because of the upgraded hardware, which has led to an appreciable change in both power and speed efficiency compared to its predecessors. New technologies and hardware accelerators were implemented in the Cortex-M4 CPU, such as single-cycle multiply, hardware division, bit field instructions and the added DSP functions. These have been an important factor in making the Cortex-M4 a high-performance processor [12].

How does power consumption relate to these new hardware technologies and are there any significant changes in power consumption when using DSP algorithms?

1.2 Research Questions

The aim of this research is to investigate the power consumption of a Cortex-M4 DSC, and to compare DSP algorithms executed with the hardware Floating Point Unit (FPU) featured in the Cortex-M4 DSC enabled and disabled. When implementing DSP algorithms such as FIR and IIR filters, will there be any noticeable tradeoff between speed and power consumption? This will give better insight into the advantages and disadvantages of the Cortex-M4 DSC.


Main question

How does power efficiency vary in the Cortex-M4 DSC when the FPU is enabled compared to when it is disabled, and is speed related in any way?

Sub questions

RQ1: How does power consumption vary when using the same algorithms with and without the hardware floating-point unit? (Floating-point operation)

RQ2: How does power consumption vary when using an algorithm with the same functionality as in RQ1? (Fixed-point operation)

RQ3: What is the dependency between speed and power consumption?

1.3 Limitations

The aim of this thesis is to measure the power consumption of the Cortex-M4 DSC along with the embedded FPU, which can be enabled optionally. The research on the sub questions in section 1.2 will be based on the points below.

● DSP execution computed with FPU, and GNU compiler optimization -O0 and -O1

● DSP execution computed with software Floating-point, and GNU compiler optimization -O0 and -O1

● DSP execution computed with Fixed-point, and GNU compiler optimization -O0 and -O1

The device used in this thesis is the SAM G55 with the Cortex-M4 core. The wake-up time will not be included in the measurements, because it can differ between the MCUs that use the Cortex-M4 core. The power consumption of the ADC and DAC will also not be included in the measurements, because the main focus is on the power consumption of the Cortex-M4 DSC when executing the DSP algorithms.

Optimization levels -O2 and -O3 make debugging harder and give incorrect DWT cycle counts, and will therefore not be used.


2. Theoretical background

The aim of this chapter is to review the areas that are important to understand in order to follow the remaining chapters of this thesis. Each subsection below should give a sufficient understanding of each term.

2.1 Digital Signal Processing

Signals are patterns of variations that represent information. There are all kinds of signals such as speech signals, audio signals, video or image signals, radar signals, just to name a few [4][13].

Many signals originate as continuous-time signals, and speech signals are one of these. It can sometimes be desirable to obtain a discrete-time representation of the signal, and one way to do this is by sampling at equally spaced points in time. The result is a discrete-time representation of the signal that can be processed digitally [16].

2.1.1 The FIR filter

In FIR filters, each value in the output sequence is a weighted sum of a finite number of samples of the input sequence, which is basically a feed-forward difference equation. A general FIR filter is specified by the following equation [4][13].

$$y[n] = \sum_{k=0}^{M} b_k \, x[n-k]$$ (Eq. 1)

The output signal y[n] depends on the input signal x[n], the filter order M and the filter coefficients b_k, which are the values of the impulse response.

This can be illustrated with a block diagram. A third-order FIR filter is shown in figure 1 below, and its equation is:

$$y[n] = b_0 x[n] + b_1 x[n-1] + b_2 x[n-2] + b_3 x[n-3]$$ (Eq. 2)


Figure 1. Third-order FIR filter

In a third-order FIR filter (figure 1), the input signal passes through three signal delays (unit delays). The input and its delayed versions are multiplied by the filter coefficients (b0, b1, b2, b3), and the products are then added to generate the output y[n] [13].
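The weighted sum in Eq. 1 can be sketched on a host PC as follows. This is a minimal illustration in Python, not the firmware used in the thesis; the function name `fir_filter` and the moving-average coefficients are assumptions chosen only to make the output easy to check by hand.

```python
def fir_filter(b, x):
    """Direct-form FIR: y[n] = sum over k of b[k] * x[n - k].

    Samples before the start of the input (n - k < 0) are treated as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):
            if n - k >= 0:
                acc += bk * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


# Hypothetical third-order filter (four taps b0..b3): a moving average.
b = [0.25, 0.25, 0.25, 0.25]
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
step = [1.0, 1.0, 1.0, 1.0, 1.0]

impulse_response = fir_filter(b, impulse)  # reproduces the coefficients
step_response = fir_filter(b, step)        # ramps up to 1.0
print(impulse_response)
print(step_response)
```

Feeding a unit impulse returns the coefficients themselves, which is why the coefficients are also called the impulse response. Each output sample costs M + 1 multiply-accumulate operations, which is the workload a MAC unit is built for.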

2.1.2 The IIR filter

IIR filters are feedback systems, in that previous output values of the system are reused. The difference between an FIR filter and an IIR filter is the ability of the IIR filter to combine previous output values with the input signal to compute a new output. The general IIR difference equation is [4][13]:

$$y[n] = \sum_{i=1}^{N} a_i \, y[n-i] + \sum_{k=0}^{M} b_k \, x[n-k]$$ (Eq. 3)

The terms in the equation are the feedback coefficients (a_i), the feedback filter order (N), the feedforward coefficients (b_k), the feedforward filter order (M), the input signal x[n] and the output signal y[n]. A closer look at the equation shows that if all coefficients a_i were zero, the equation would reduce to that of an FIR filter [4][13].

A block diagram of a first-order IIR filter and its corresponding equation can be seen in figure 2 and equation 4.

$$y[n] = a_1 y[n-1] + b_0 x[n] + b_1 x[n-1]$$ (Eq. 4)


Figure 2A. First-order IIR filter in Direct Form I

Figure 2B. First-order IIR filter in Direct Form II

The IIR filter in figure 2A is in Direct Form I; there is also Direct Form II, which can be seen in figure 2B. The difference between the two forms is that the unit delays in Direct Form II can be combined, because the signal fed to the unit delays in figure 2B is the same [4][13]. An IIR filter becomes unstable if one or more of its poles lie outside the unit circle [4].
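The first-order difference equation and the stability condition can be tried out with a short host-side sketch. It is written in Python for illustration only; the function names and coefficient values are assumptions, and the stability test relies on the sign convention of the equation above, for which the transfer function is H(z) = (b0 + b1 z^-1) / (1 - a1 z^-1) with a single pole at z = a1.

```python
def iir_first_order(a1, b0, b1, x):
    """First-order IIR: y[n] = a1*y[n-1] + b0*x[n] + b1*x[n-1].

    Initial conditions are zero (x[-1] = y[-1] = 0).
    """
    y = []
    x_prev = 0.0
    y_prev = 0.0
    for xn in x:
        yn = a1 * y_prev + b0 * xn + b1 * x_prev
        y.append(yn)
        x_prev, y_prev = xn, yn  # shift the two unit delays
    return y


def is_stable(a1):
    """The single pole sits at z = a1; stable when it is inside the unit circle."""
    return abs(a1) < 1.0


impulse = [1.0, 0.0, 0.0, 0.0]

# Hypothetical coefficients: the feedback makes the impulse response decay forever.
decaying = iir_first_order(0.5, 1.0, 0.0, impulse)

# With a1 = 0 the feedback vanishes and the filter behaves like an FIR filter.
fir_like = iir_first_order(0.0, 0.5, 0.5, impulse)

print(decaying, fir_like, is_stable(0.5), is_stable(1.5))
```

With a1 = 0.5 the impulse response halves at every step and never becomes exactly zero in ideal arithmetic, which is where the name infinite impulse response comes from. With |a1| >= 1 the output would no longer decay.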


2.2 GNU Compiler Collection (GCC)

GCC is one of the most widely used compilers today. It is free software, and volunteers can contribute to improving its functionality. GCC is an optimizing compiler from the GNU project, which also provides object file tools such as the assembler and linker [14].

There are five options for code optimization with GCC [15][16].

● -Os - Optimizes for space usage (size) rather than speed.

● -O0 - Optimization is disabled, which makes debugging easier but gives slower code than the three options below.

● -O1 - Also suitable for debugging; this option improves both speed and space usage (size).

● -O2 - Full optimization, except for optimizations that can increase space usage (size).

● -O3 - Does the same as -O2, but also enables optimizations that increase space usage (size) in exchange for speed.

Optimization levels -O2 and -O3 produce fast code, but make debugging harder [16].

2.3 Fixed-point

Fixed-point arithmetic means that a fixed number of digits is used before and after the decimal point. This implies that the resolution depends on the number of bits. For example, if 8 bits are used for the fraction, the resolution will be 2^-8 [4][17][18].

The maximum number in Fixed-point arithmetic is limited by the number of bits available. For example, with 32 bits it is possible to use anywhere from 0 up to 32 bits as fraction bits, and the scaling is predetermined by the user. Fixed-point arithmetic is most commonly used when an FPU is not available in the hardware [2][17].
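The relation between the number of fraction bits and the resolution can be illustrated with a small conversion sketch. It is a host-side Python example under stated assumptions: the helper names `to_fixed` and `from_fixed` and the test value 0.3 are hypothetical and do not come from the thesis or from any library.

```python
def to_fixed(value, frac_bits):
    """Scale a real number by 2**frac_bits and round to the nearest integer."""
    return int(round(value * (1 << frac_bits)))


def from_fixed(raw, frac_bits):
    """Undo the scaling to recover the value that the integer represents."""
    return raw / (1 << frac_bits)


FRAC_BITS = 8                       # 8 fraction bits, as in the example above
resolution = 2.0 ** -FRAC_BITS      # smallest step: 2^-8 = 0.00390625

raw = to_fixed(0.3, FRAC_BITS)      # 0.3 * 256 = 76.8, stored as 77
approx = from_fixed(raw, FRAC_BITS) # 77 / 256 = 0.30078125
error = abs(approx - 0.3)           # quantization error, at most resolution / 2

print(raw, approx, error, resolution)
```

The stored integer is all the processor sees; the scaling factor lives only in the programmer's head, which is why the text says the scaling is predetermined by the user. Adding fraction bits shrinks the error but also shrinks the largest number that fits in the word.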

There are four ways to store and represent integers (converting decimal numbers into binary): unsigned integer, offset binary, sign and magnitude, and two’s complement [2][17].

The unsigned integer format is quite straightforward compared to the other three formats. It can go from zero up to the maximum positive number allowed by the number of bits. One noticeable disadvantage of unsigned integers is that negative numbers cannot be represented.


The offset binary format works in a similar way to the unsigned integer format. The difference lies in the shifted offset, which allows both positive and negative numbers to be represented.

The sign and magnitude format is another way to represent negative and positive integers. The most significant bit (MSB) is zero for positive numbers and one for negative numbers; this is called the sign bit. The remaining bits function as a standard binary number. This in turn means that there are two ways to represent zero, which wastes a bit pattern [2][17].

The two’s complement format is the one most used by engineers, because it is less complex to implement in hardware than the other three formats [2][17]. This is illustrated in the table below.

Bits   Decimal
011     3
010     2
001     1
000     0
111    -1
110    -2
101    -3
100    -4

Table 1. Illustration of Two’s complement format
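The mapping in Table 1 can be reproduced with a few lines of code. This is a minimal Python sketch for illustration; the function name `twos_complement_value` is an assumption.

```python
def twos_complement_value(bits):
    """Interpret a string of '0'/'1' characters as a two's complement integer."""
    width = len(bits)
    raw = int(bits, 2)            # value when read as an unsigned integer
    if raw >= 1 << (width - 1):   # MSB set: the pattern encodes a negative number
        raw -= 1 << width         # subtract 2**width to wrap into the negative range
    return raw


patterns = ["011", "010", "001", "000", "111", "110", "101", "100"]
table = {p: twos_complement_value(p) for p in patterns}
print(table)
```

Unlike sign and magnitude, every bit pattern maps to a distinct value, so there is only one zero and the range is slightly asymmetric (-4 to 3 for three bits).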

To represent fractional numbers in sign and magnitude format, it is possible to trick the CPU into thinking that it is dealing with integers. As seen in figure 3, the sign (S) bit is in the 16th position, while the decimal point is placed between the 5th and 6th bits [17][18].

Figure 3. Fraction representation with sign and magnitude format

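The equation below calculates the intended value for figure 3 [17]. The factor S is +1 when the sign bit is 0 and -1 when the sign bit is 1.

$$Sum = S \cdot \left(\text{integer(decimal)} + \text{fraction(decimal)}\right) \cdot 2^{-5}$$ (Eq. 5)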
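The decoding can be sketched as follows. This Python example rests on an assumption about the equation above, since figure 3 is not reproduced here: the 15 bits below the sign bit are read as one unsigned integer and then scaled by 2^-5. The function name and the example words are hypothetical.

```python
def decode_sign_magnitude(word, frac_bits=5, width=16):
    """Decode a sign-and-magnitude fixed-point word.

    Assumed layout: bit 15 is the sign bit, bits 14..0 are the magnitude,
    and the lowest `frac_bits` bits of the magnitude form the fraction.
    """
    sign = -1 if (word >> (width - 1)) & 1 else 1
    magnitude = word & ((1 << (width - 1)) - 1)  # mask off the sign bit
    return sign * magnitude * 2.0 ** -frac_bits


positive = decode_sign_magnitude(0b0000000001101000)  # 104 / 32
negative = decode_sign_magnitude(0b1000000001101000)  # same magnitude, sign bit set
negative_zero = decode_sign_magnitude(0b1000000000000000)

print(positive, negative, negative_zero)
```

The last call shows the wasted bit pattern mentioned earlier: a word with only the sign bit set still decodes to zero.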


2.4 Floating-point

Floating-point means that the decimal point floats depending on the given value, unlike the Fixed-point representation, where the decimal point always stays in the same place. This makes the Floating-point representation more dynamic and efficient. There are three precisions used with Floating-point: Half Precision (HP) uses 16 bits, Single Precision (SP) is the more common one and uses 32 bits, and Double Precision (DP) uses 64 bits [4][19].

The IEEE standard representation for Floating-point is divided into three parts [4][19].

● The sign bit

● The exponent

● The mantissa

Precision     Sign bit   Exponent   Mantissa
HP floating   1 bit      5 bits     10 bits
SP floating   1 bit      8 bits     23 bits
DP floating   1 bit      11 bits    52 bits

Table 2 - Basic floating formats

The sign bit decides the polarity of the value, where 0 represents positive numbers and 1 represents negative numbers [4][19].

The mantissa is the “fraction part” (the part after the separator). For example, the number 7 can be represented as 1.75 * 4 = (1 + 1/2 + 1/4) * 2^2. The mantissa is the fraction part, 2^-1 + 2^-2 (0.75), and the stored exponent is 2 + bias [4][19].

For SP floating point, a normal exponent lies in the range of 1 up to 254. This is best explained by studying the mathematical equation of the SP value seen in Eq. 6.

$$value = (-1)^{sign} \cdot 2^{(exponent - 127)} \cdot \left(1 + \frac{1}{2} Fraction[22] + \frac{1}{4} Fraction[21] + \dots + \frac{1}{2^{23}} Fraction[0]\right)$$ (Eq. 6)

The exponent is stored in the offset binary format (explained above in section 2.3), where the exponent is shifted by a bias. The bias shown in the equation above is 127 [4][19].
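The single-precision layout can be checked by splitting a value into its three fields and rebuilding it from the sum over the fraction bits. The sketch below is a host-side Python illustration that uses the standard `struct` module; the function names are assumptions, and only normalized numbers (stored exponent 1 to 254) are handled.

```python
import struct


def decode_single(x):
    """Split a number, stored as IEEE 754 single precision, into its fields."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the 32 bits
    sign = raw >> 31
    exponent = (raw >> 23) & 0xFF   # stored exponent, including the bias of 127
    fraction = raw & 0x7FFFFF       # the 23 mantissa bits, Fraction[22]..Fraction[0]
    return sign, exponent, fraction


def value_from_fields(sign, exponent, fraction):
    """Rebuild the value of a normalized single-precision number."""
    mantissa = 1.0                                # implicit leading one
    for i in range(23):
        bit = (fraction >> (22 - i)) & 1          # Fraction[22] is the first bit
        mantissa += bit * 2.0 ** -(i + 1)         # weights 1/2, 1/4, ..., 1/2^23
    return (-1) ** sign * 2.0 ** (exponent - 127) * mantissa


fields = decode_single(7.0)
rebuilt = value_from_fields(*fields)
print(fields, rebuilt)
```

For the number 7 the fields come out as sign 0, stored exponent 129 (2 + 127) and a fraction with only the two top bits set, which matches the 1.75 * 2^2 example in the text.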


An FPU is a hardware unit that can be added to a processor to perform Floating-point arithmetic operations in fewer cycles than software Floating-point. Most FPUs support the IEEE standard [4][19].

2.5 Energy monitoring in MCU

CMOS technology is used in MCUs, and it has two kinds of power dissipation: static and dynamic. Static dissipation is the power leakage that occurs during steady state, while dynamic dissipation occurs when switching states [20].

There are different ways to monitor energy efficiency and power characteristics. One way to measure energy efficiency is to look at the work that is done with a limited amount of energy. The measurement unit can then be Dhrystone Million Instructions Per Second (DMIPS)/μW or CoreMark (benchmark score)/μW. Both are benchmarks for embedded systems. Power measurement, on the other hand, is based on three factors: the active current, which is measured in μA/MHz; the sleep mode current, which is measured in μA since the clock should be stopped; and energy efficiency. For energy efficiency, the execution time is taken into consideration: if the MCU has a long execution time, the overall power consumption will suffer [19].

The Cortex-M3 and Cortex-M4 have a number of power features such as sleep mode, wait mode and backup mode. The Cortex-M4 should be able to run at under 200 μA/MHz, while some other Cortex-M processors are able to run at under 100 μA/MHz [19].
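How active current and execution time combine into energy can be sketched with a short calculation. The Python example below is illustrative only: the 3.3 V supply, the cycle counts and the 22 mA figure are invented placeholders, not measurements from this thesis. Only the 200 μA/MHz bound and the 120 MHz clock are taken from the text.

```python
def active_current_ua(ua_per_mhz, f_mhz):
    """Active current in microamperes from a uA/MHz figure and a clock in MHz."""
    return ua_per_mhz * f_mhz


def energy_uj(cycles, f_mhz, current_ua, supply_v):
    """Energy in microjoules: E = V * I * t, with t = cycles / f_clk."""
    t_s = cycles / (f_mhz * 1e6)             # execution time in seconds
    power_w = current_ua * 1e-6 * supply_v   # average power in watts
    return power_w * t_s * 1e6


F_MHZ = 120.0      # SAM G55 maximum clock, from the text
SUPPLY_V = 3.3     # ASSUMPTION: placeholder supply voltage

i_fast = active_current_ua(200.0, F_MHZ)  # 200 uA/MHz bound gives 24 000 uA
i_slow = 22000.0                          # ASSUMPTION: slightly lower current

# ASSUMPTION: hypothetical cycle counts for two implementations of one task.
e_fast = energy_uj(120_000, F_MHZ, i_fast, SUPPLY_V)  # short run, higher current
e_slow = energy_uj(480_000, F_MHZ, i_slow, SUPPLY_V)  # long run, lower current

print(e_fast, e_slow)
```

With these placeholder numbers the slower variant draws less current yet uses several times more energy, because it runs four times longer. This is the sense in which a long execution time makes the overall consumption suffer.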

2.6 DSP device

A DSP device is a processor that is specialized in DSP algorithms, which leads to fast arithmetic calculations [3][21].

The Harvard structure or the improved Harvard structure is generally used in a DSP device, which means that data and instructions/program are kept in separate memories. There are at least four buses in the DSP device: the program data bus, the program address bus, the data bus and the data address bus. This separation allows faster and independent access during a cycle [3].

A DSP device usually possesses several processing units, whose main purpose is to enhance the speed of the device [3]. One of these units is the FPU. Approximately one third of the DSP devices on the market have an FPU, and over one half of the FPU non-users are planning to change. This is due to the high cost of the hardware [2][21]. The pipelines are structured differently than in general-purpose processors, which allows the DSP device to execute multiple instructions simultaneously [3].


2.7 The Cortex-M4

The Cortex-M microcontrollers were presented in 2004. The Cortex-M4 and Cortex-M7 processors support DSP instructions. The Cortex-M4 can be used in demanding areas where memory protection and Floating-point for SP and HP calculations are mandatory. The key features of the Cortex-M4 are DSP, SIMD (Single Instruction Multiple Data), a MAC (Multiply-Accumulate) unit, debug, Harvard architecture, 32-bit performance and an optional FPU [22].

2.7.1 The SAM G55 DSC

The SAM G55 is a microcontroller based on the Cortex-M4 core and is intended for low-power applications [8]. The SAM G55 DSC is a development board with this controller.

Figure 4A. SAM G55 features

Figure 4B. The SAM G55 DSC

The key features of the SAM G55 DSC are the Atmel Embedded Debugger (EDBG), the Atmel Data Protocol (ADP), a current measurement header, a 120 MHz clock, an Analog to Digital Converter (ADC) module, Serial Wire Debug (SWD) and Joint Test Action Group (JTAG) interfaces, and Data Watchpoint and Trace (DWT) [23][24].

The EDBG is intended for onboard debugging, and one of its functions is to stream data from the MCU to the host PC. The EDBG uses the ADP when streaming the data. The DWT is a debugging unit that provides data tracing and counters for the processor. SWD is an alternative to the JTAG interface for debugging [23][24].


2.8 Atmel power debugger

The Atmel Power Debugger (figure 5) is a development tool intended for debugging and programming ARM Cortex-M based Atmel SAM microcontrollers and Atmel AVR microcontrollers. The controllers need to have a JTAG or SWD interface [25]. JTAG, also referred to as boundary scan, is defined by IEEE 1149.1 as a method for testing functionality on circuit boards [26], while the SWD interface is a subset of the JTAG interface. The SWD interface uses the TCK and TMS pins for connection, and these two pins can also be found on the JTAG 10-pin connector [27]. The Power Debugger has two separate channels for measuring current and is ARM CMSIS-DAP (Cortex Microcontroller Software Interface Standard-Debug Access Port) compatible, which means that it works with Atmel Studio 7.0 or later [25]. CMSIS-DAP is an interface that provides access for debugging [28]. A key benefit of the debugger is that it streams measurements and data to the Atmel Data Visualizer for real-time analysis [25].

Figure 5. The Power Debugger

Channel A of the Power Debugger provides high-accuracy measurements of low currents in the range of 100 mA - 500 μA. The resolution is around 3 μA and the accuracy is no worse than 3% [25].

2.9 Interrupt latency

Interrupt latency is the number of clock cycles a processor needs to react to an interrupt signal, on entry and on exit. The interrupt latency is around twelve cycles on entry and ten cycles on exit. If the FPU is enabled, an increase of up to seventeen cycles is possible on entry and on exit [29].
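To put these cycle counts into perspective, they can be converted to time at a given clock frequency. The Python sketch below assumes the 120 MHz maximum clock of the SAM G55 mentioned in section 2.7.1; the function name is hypothetical, and the FPU figures use the worst case of seventeen additional cycles.

```python
def latency_ns(cycles, f_mhz):
    """Convert a cycle count to nanoseconds at a clock frequency given in MHz."""
    return cycles / f_mhz * 1000.0


F_MHZ = 120.0            # ASSUMPTION: core running at the SAM G55 maximum clock

ENTRY_CYCLES = 12        # interrupt entry without FPU context
EXIT_CYCLES = 10         # interrupt exit without FPU context
FPU_EXTRA_CYCLES = 17    # worst-case increase when the FPU is enabled

entry_ns = latency_ns(ENTRY_CYCLES, F_MHZ)
exit_ns = latency_ns(EXIT_CYCLES, F_MHZ)
entry_fpu_ns = latency_ns(ENTRY_CYCLES + FPU_EXTRA_CYCLES, F_MHZ)
exit_fpu_ns = latency_ns(EXIT_CYCLES + FPU_EXTRA_CYCLES, F_MHZ)

print(entry_ns, exit_ns, entry_fpu_ns, exit_fpu_ns)
```

Under this assumption the entry latency is roughly 100 ns without the FPU and up to about 242 ns with it, so the extra cycles matter mainly for code that takes interrupts very frequently.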


3. Related work

This section presents relevant information from previous work that is closely related to this thesis. The subsections below point out important features of the Cortex-M4 and the mathematics of DSP, as well as the speed-efficient Floating-point unit.

3.1 Martin Trevor - The Designer's Guide to the Cortex-M Processor Family - Chapter 8

The main focus of Trevor’s [12] book is to understand the DSP functions that are embedded within the Cortex-M4 and the Cortex-M7. The combination of these functions with a traditional MCU can be referred to as a DSC. Martin Trevor explains the key features that are added to the M4 and M7 to support DSP usage. The enhancements are SIMD instructions, an FPU and an improved MAC unit compared to the M3.

Trevor then uses the ARM CMSIS-DSP (Cortex Microcontroller Software Interface Standard - Digital Signal Processing) software library to show how to access the functions that are added in the M4 and M7.

Through experiments he explains the difference between the FPU and software Floating-point, and he also explains how to enable and disable the FPU.

The SIMD instructions are also explained. Code examples show how efficient SIMD is with DSP algorithms, and exercises on how to optimize DSP algorithms are presented in chronological order.

The CMSIS-DSP software library is then explained in more detail. Part of this concerns the conversion functions and their ability to convert between Floating-point and Fixed-point.

The most relevant points concern the SIMD, FPU and MAC functions that are embedded in the Cortex-M4. All of these are relevant to this thesis, since they will be encountered when answering the research questions in section 1.2. Studying this book has given a better understanding of what a DSC represents, and also of how speed and power efficiency play a significant role in modern processors such as the M4.


3.2 Li Tan, Jean Jiang - Digital Signal Processing - chapter 7, chapter 8, and chapter 9

Digital Signal Processing offers electrical engineers and computer engineers an introduction to the use of mathematics in the subject of DSP. Tan, et al. [4] take advantage of the availability of powerful computers and software environments such as MATLAB to perform extensive computation and create “laboratories”, which in turn gives engineering students a broader perspective on the effects that can be obtained by filtering signals.

In chapter 7, Tan, et al. illustrate the concept of FIR filters with figures, mathematical equations and block diagrams. The intention of these basic illustrations is to give engineers an understanding of how FIR filters can be implemented in projects or “laboratories”.

Chapter 8 of Digital Signal Processing is much like chapter 7. It is about IIR filters and how they can be implemented in projects and “laboratories”, and it is explained through block diagrams, mathematical representations and figures. To keep it simple, Tan, et al. explain a simple first-order IIR filter and a second-order filter, and to sum up the points presented in the subsections, they present a few examples of IIR filters.

In chapter 9, Tan, et al. bring up hardware and software for DSP devices. They explain the architectural differences between a DSP device and a traditional MCU, such as the Harvard and Von Neumann architectures. This is followed by the hardware units that exist in the most common DSP devices, such as the MAC unit. They show how important a MAC unit is through a visual representation of how the MAC function is executed.

Fixed-point and Floating-point are both covered in much detail in this book. Tan, et al. bring up the differences between the two and how they are implemented in DSP devices. The FPU and MATLAB are both essential to this thesis. To answer RQ1 and RQ2, MATLAB is required for generating filter coefficients and a signal with noise, and following this guide has made the workflow of MATLAB easier to understand. The examples of FIR and IIR filters given by Tan, et al. are implemented with both Fixed-point and Floating-point, which is an important part to understand for this thesis.


3.3 Joseph Yiu - The Definitive Guide to ARM® Cortex®-M3 and Cortex®-M4 Processors - chapter 9, 13, 21 and 22

Joseph Yiu [19] sheds light on the Cortex-M3 and the Cortex-M4 through examples and guidelines. The chapters in focus are 9, 13, 21 and 22, because they are closely related to this thesis.

Chapter 9 is divided into two major sections. The first section, which is the one in focus here, is about low-power systems and low-power features in the Cortex-M family. Joseph Yiu brings up important questions such as “what does low power mean in microcontrollers?” and then explains that one typical way to measure energy efficiency is in the form of DMIPS/μW or CoreMark/μW, which is basically how much processing is done with limited energy. Yiu later states that power has traditionally been measured in μA/MHz, based on active current and sleep mode current, but that this is now inadequate because energy efficiency is equally important. The end of the section is about how to utilize the low-power features in application software, which is illustrated through charts and tables.

The second important chapter that needs to be reviewed is chapter 13. It is based on Floating Point Operations (FPO). Yiu introduces software Floating-point, the FPU and their usage in the Cortex-M4, with examples such as how to convert a value to SP in the IEEE-754 standard, along with HP and DP. He later points out that for MCUs without an FPU, the arithmetic calculations are carried out by run-time library functions.

This brings us to chapter 21 (ARM Cortex-M4 and DSP Applications) and chapter 22 (Using the ARM CMSIS-DSP Library), which are about the DSP functions in the Cortex-M4 processor and how it compares to DSP devices. Yiu starts by explaining the term DSP and its use on an MCU, which is the key feature that makes the Cortex-M4 a DSC. This is illustrated by showing the architecture layers added to the Cortex-M4. Yiu also states that using the Cortex-M4, which is a DSC, removes the limitation of having an MCU and a DSP device separately, which leads to lower power consumption and lower overall system cost. The signal processing algorithms in the CMSIS-DSP library are optimized for the Cortex-M4. Yiu brings up some examples and guides the reader through common algorithms from the CMSIS-DSP library, such as the FIR filter, the IIR filter and the FFT.

The guidelines that Yiu introduces in the book, covering FPO, the CMSIS-DSP library, SIMD and the Cortex-M4 as a DSC, are important to this thesis. Understanding in which form energy and power efficiency are measured sets the foundation for the benchmark that will be applied to answer the research questions in section 1.2.


3.4 Savita Rani - Area and Speed Efficient Floating Point Unit

Savita Rani [30] explains what the FPU is and the advantages of using an FPU compared to Fixed-point arithmetic. Rani states that the FPU is a key element in areas where real-time computations are required, such as signal processing, and then mentions that an FPU is required for numbers that are very large or very small, even if Fixed-point arithmetic can be faster.

One point that Rani brings up is that multiplication is not as common as addition, but is very important, even essential, for MCUs and DSP devices where DSP applications are involved. Rani discusses multiplication techniques and methods in more detail, such as Integer Multiplication Methods, Truncated Multipliers and Logarithmic Multipliers. The performance of these three multiplication methods is then investigated by using simulations to analyze the output of the multiplication techniques with a full FPU, followed by a discussion of which multiplication technique provides better results.

In this thesis both the FPU and Fixed-point arithmetic are used. It is therefore important to consider what Rani says about how very large and very small numbers can be an issue with Fixed-point arithmetic, especially since very small numbers are used in both the FIR and the IIR filter. For the IIR filter, small changes in the coefficients can make it very unstable, and this is probably one of the reasons that Rani recommends the FPU over Fixed-point arithmetic when dealing with very small numbers.


3.5 Alexandre Aminot, et al. - Floating Point Units Efficiency in Multi-Core Processor

Alexandre Aminot, et al. [31] explain speed-up extensions in multi-core processors. A multi-core processor with an FPU in every core is what Aminot, et al. call SMP, while a processor that only has an FPU in some of the cores is what they call FAMP. The paper's research question is "how to efficiently exploit floating point units in multi-core processors?" and whether there are any advantages of having an FPU in all the cores.

The method used in Aminot, et al.'s research is based on controlled experiments. Three energy management systems are compared on FAMP processors, and the results are then compared with SMP processors, which only make use of one energy management system. The three energy management systems are the application level, the scheduler event level, and the hardware level, which SMP uses. It is stated that the code is not modified for the experiments and that different benchmarks are used to estimate the power consumption and the performance.

The first energy management system is the application level. The results that Aminot, et al. obtained show that for applications that mainly use integers, or make minimal use of floating point, the FAMP processor is the better choice, because the speed-up does not balance the power cost. Aminot, et al. mention an application that mostly uses floating point, and for this application the SMP consumes less energy and is faster than the FAMP. The second energy management system is the scheduler level, where the system switches cores depending on the event that occurs in the application. This lowered the energy consumption but increased the execution time, because more time is spent in the core without FPU. It did not decrease the energy consumption for applications that depend more on the FPU, because more time is spent switching cores and less time is spent on the cores without FPU. Aminot, et al. recommend that applications that need floating point should be executed completely on a core that has an FPU. The third energy management system, which was tested with both an SMP- and a FAMP- processor, is the hardware (instruction) level. The hardware level is an aggressive technique that quickly powers up the FPU in the core(s). This technique is application dependent, and the power-up time is 1000 cycles. At the hardware level the energy consumption is reduced for longer applications compared to running the same applications at the scheduler level. Aminot, et al.'s conclusion is that an FPU is not necessary in every core, because of the power leakage that the FPU causes in the FAMP processors.


4. Method

The workflow that is used in this thesis is presented in this section. It can be seen as a top-down framework that consists of two main categories: literature study and controlled experiments. Controlled experiments are then divided into three subcategories: SAMG55 with FIR- and IIR- filter, Power measuring system and In-system debugging. This structure is presented in figure 6 [32].

Figure 6. Research workflow

4.1 Literature study

Several studies were reviewed during this thesis, and the most relevant ones can be found in section 3. The review gave an overview of the problem domain stated in section 1.1 and of the techniques (observe, formulate and evaluate) used to identify the sub questions.


4.2 Research problem

The questions in section 1.2 are acquired through studies. To reach a conclusion on the research problem three steps have to be followed in a chronological order to ease the workflow.

● Evaluating the background

● Deciding the problem domain

● Setting the limitations

Evaluating the background is the first step in this thesis, and its advantage is that it gives an overview of the area. The following step is to decide the problem domain within the area that was evaluated in the first step. The last step is to set the limitations in order to focus on the specific problem at hand. These steps are done through iteration of the literature study, which can be seen in figure 6. The studies most related to this thesis can be seen in section 3.

4.3 Controlled experiments

“Science classifies knowledge. Experimental science classifies knowledge derived from observation” Denning P.J [33].

To get a basis in an area it is important to acquire an understanding of the fundamental components and relationships in that area. Experiments provide the data necessary to better evaluate, predict, understand, control and improve a development process and product [32]. This is a well-known concept, where basically everything is held constant except for one variable [32][33]. The DSP functionality can be seen as the “variable” in the Cortex-M4 DSC.

In this thesis experiments are performed with an apparatus. The data (section 5.4) obtained from the apparatus is then analyzed and used to answer the research questions. An apparatus can belong to one of two categories, system apparatus or simulator apparatus [33]. The apparatus in this case is a system apparatus, consisting of the SAMG55, In-system debugging, and the Power measuring system.

4.3.1 SAMG55 with FIR- and IIR- filter

The system apparatus from Atmel is the SAMG55 development board. The SAMG55 is used to execute DSP algorithms such as the FIR- and IIR-filter. During the execution of the algorithms, the current measurement header and the Cortex debugger header are connected to a Power measuring system (section 4.3.3). The schematics for this setup can be seen in figure 11.


4.3.2 In-system debugging (DWT)

Measuring speed with in-system debugging is efficient. This is done by marking a section of code with a start- and stop- counter, which allows the user to see the number of cycles needed to execute the marked code. In this thesis in-system debugging is performed with Atmel Studio to obtain a result for sub question three in section 1.3 [34].

4.3.3 Power measuring system

The apparatus used in this thesis is the Atmel power debugger (section 2.8) which is a device used to measure power consumption. The power debugger allows the user to follow the power consumption of the FIR- and IIR- filter in a real-time application and analyze the efficiency of the Cortex-M4 device in its present state [25].

Atmel data visualizer is a program that is compatible with the Atmel power debugger which offers a graph plotter, oscilloscope and other indicators that will help in interpreting the data [25].

The power debugger and the data visualizer are used to obtain results for sub questions one, two and three in section 1.3. The measurements are done using common FIR and IIR algorithms.


5. Results and Analysis

The data and results presented in this section are based on Atmel SAM G55 DSC with the Cortex-M4 Core and FPU.

5.1 Algorithms and filter design

The algorithms that are used to accomplish RQ1, RQ2, and RQ3 are FIR-filter and IIR-filter. With such filters, there are some key parameters that need to be considered.

● The sampling rate

● The number of taps

● The pass-band

● The stop-band

In this thesis, the tool used to calculate the FIR- and IIR- filter coefficients is the Filter Designer tool, a graphical tool from MATLAB for designing and analyzing filters. The benefit of this tool is its easy-to-use GUI, which enables the user to design digital FIR- and IIR- filters by setting the specific parameters (sampling rate, pass-band and stop-band) listed above.

The two filters mentioned above have been designed as low pass filters with the following parameters.

| Filter | Apass | Astop | Fpass | Fstop | Sampling rate | Number of taps received |
|---|---|---|---|---|---|---|
| FIR | 1 dB | 40 dB | 1100 Hz | 2000 Hz | 14000 Hz | 23 |
| IIR | 1 dB | 40 dB | 1100 Hz | 2000 Hz | 14000 Hz | 5 |

Table 3. The parameters used in the FIR- and IIR- filter.

The Filter Designer tool gives a magnitude response overview of the design that was created. This makes it possible to evaluate whether the design meets the sought specifications, and in this case the requirements were met.


Figure 7A. FIR filter magnitude response

Figure 7B. IIR filter magnitude response

At this point, the FIR- and IIR- filter coefficients have been created, and they are then implemented in Atmel Studio. The last step is to generate a sinusoidal signal with some interference. This signal is also created in MATLAB and is the input from which the IIR- and FIR- filters filter out the interference. The interference signals are sinusoids with frequencies of 2500 Hz and 4500 Hz and can be seen in the FFT spectrum in figure 8.

Figure 8. FFT generated by MATLAB with Frequency 800, 2500, 4500 Hz
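As an illustration of the test signal described above, the following is a minimal Python sketch rather than the MATLAB script used in the thesis. The sampling rate comes from Table 3 and the three frequencies from Figure 8; the amplitudes, the function name `make_test_signal` and the number of samples are assumptions made only for this example.

```python
import math

FS_HZ = 14000                    # sampling rate from Table 3
SIGNAL_HZ = 800                  # wanted tone (Figure 8)
INTERFERENCE_HZ = (2500, 4500)   # interfering tones (Figure 8)


def make_test_signal(n_samples, signal_amp=1.0, interference_amp=0.5):
    """Return n_samples of the wanted tone plus the interfering tones.

    The amplitudes are hypothetical; the thesis does not state them.
    """
    samples = []
    for n in range(n_samples):
        t = n / FS_HZ
        value = signal_amp * math.sin(2 * math.pi * SIGNAL_HZ * t)
        for f in INTERFERENCE_HZ:
            value += interference_amp * math.sin(2 * math.pi * f * t)
        samples.append(value)
    return samples


signal = make_test_signal(140)   # 140 samples = 10 ms at 14 kHz
```

Both interfering tones lie above the 2000 Hz stop-band edge in Table 3, so a filter that meets the specification should attenuate them while passing the 800 Hz tone.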

The implementation of the IIR-filter with Fixed-point was a special case. It was created as a double section filter, which in turn means that the number of taps is 3. A double section filter is intended to function in the same manner as a single section filter; the difference is that the output of the first section becomes the input of the second section, as illustrated in the figure below.


Figure 9. Double section IIR filter
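The cascade idea, where the output of the first section becomes the input of the second, can be sketched in a few lines of Python. This is a floating-point illustration under stated assumptions, not the Fixed-point implementation used in the thesis: the direct form I structure, the helper names `biquad` and `double_section`, and the coefficient values are all hypothetical.

```python
def biquad(x, b, a):
    """Filter sequence x with one second-order section (direct form I).

    b = (b0, b1, b2) are the feed-forward coefficients and a = (a1, a2)
    the feedback coefficients, with a0 normalised to 1.
    """
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out


def double_section(x, section1, section2):
    """Feed the output of the first section into the second section."""
    return biquad(biquad(x, *section1), *section2)


# Hypothetical, stable placeholder coefficients (NOT the thesis values).
SECTION_1 = ((0.2, 0.4, 0.2), (-0.5, 0.2))
SECTION_2 = ((0.3, 0.6, 0.3), (-0.3, 0.1))

impulse = [1.0] + [0.0] * 7
response = double_section(impulse, SECTION_1, SECTION_2)
```

Splitting a higher-order filter into second-order sections keeps each section's coefficients in a smaller numeric range, which is a common reason to prefer the cascade when coefficient precision is limited.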

The reason for using this implementation is the low accuracy of the Fixed-point coefficients, which makes the IIR filter unstable, as mentioned by Savita Rani [30]. The filter coefficients were tested on another development board that is based on the Cortex-M4 core, because the SAMG55 lacks a Digital-Analog Converter (DAC).

5.2 Energy monitoring

To monitor the energy in the SAM G55 DSC, two tools were used: the Power debugger and the Data visualizer. The first tool is the Power debugger, which can use both the JTAG- and SWD- interface to target the SAM G55 DSC. The main focus is on the SWD (programming and debugging) interface together with the two current sensing channels (power measurement) that are on the Atmel debugger (Figure 10).

Figure 10. Logical Construction of the Power Debugger [25]

The benefit of the Cortex-M4 is its capability to collect data at cycle-by-cycle resolution with the data watchpoint and trace unit (DWT), which is then shown in Atmel Studio. This made it possible to identify some particular energy consuming spots in the embedded system.

The second tool is the Data visualizer, which is based on the ADP. The intent of the ADP is to transfer data from a target MCU to the user’s PC. This is done through the Cortex debug header that can be found on the SAM G55. In this project, the data was transferred from the DSC to the PC through the Power debugger. Figure 11 shows the pairing for the SAMG55 MCU.


Figure 11. The wiring diagram [25]

A great benefit is that the Data visualizer can be integrated with the GNU C/C++ compiler and debugger, which makes it easier to monitor the embedded system. It is also important for the monitoring tools to have the same interface as the embedded system, otherwise they will not be compatible with each other unless a new interface is implemented. This was not an issue in this project, since the SAM G55 has the same interface as the Atmel Studio Data visualizer, which is the ADP mentioned above. In short, the ADP is very important in this project because a large set of data is transferred from the DSC to the host PC.

To measure the power consumption accurately with the data visualizer, the following three areas are important to monitor.

● The Active Mode

● The Standby Mode

● The Sample Area (Active Mode + Standby Mode)

The Active mode is the part of the Sample area where the interrupt code is executed, while the Standby mode is the time during which no code is executed.

The Sample area is important because it makes it possible to monitor the overall power consumption in a sample. These three areas are illustrated in figure 12 below.


Figure 12. The power measuring areas

The approach below for measuring the three areas (Active mode, Standby mode, and Sample area) is chosen in order to disregard the capacitors that sit between the MCU voltage supply headers and the current measurement headers. The Active mode is monitored in five steps. The first step is to increase the sample frequency so that almost no time is spent in Standby mode (only the interrupt latency time, section 5.4). The second step is to record the average current and the average power from the data visualizer. The third step is in-system debugging, which records the number of cycles it takes to execute the Active mode. The fourth step is to convert the number of cycles into time (Eq. 7). The fifth step is to multiply the time in Active mode by the average power to get the energy spent in Active mode (Eq. 8).

$$\text{DWT cycles} \cdot \frac{1}{\text{MCU clock frequency}} = \text{Time in Active mode} \qquad \text{(Eq. 7)}$$

$$\text{Time in Active mode} \cdot \text{Average power in Active mode} = \text{Total energy in Active mode} \qquad \text{(Eq. 8)}$$

The power consumption in the Standby mode area is monitored in four steps. The first step is to set the device in sleep mode. The second step is to record the average current and the average power from the data visualizer. The third step is to get the time spent in Standby mode (Eq. 9). The last step is to multiply the time in Standby mode by the average power to get the energy spent in Standby mode (Eq. 10).

$$\frac{1}{\text{Sample frequency}} - \text{Time in Active mode} = \text{Time in Standby mode} \qquad \text{(Eq. 9)}$$

$$\text{Time in Standby mode} \cdot \text{Average power in Standby mode} = \text{Total energy in Standby mode} \qquad \text{(Eq. 10)}$$

The total energy in the Sample area is computed by adding the energy from the Standby mode and the Active mode (Eq. 11). The calculation of the average power in the Sample area is shown in equation 12.

$$\text{Total energy in Active mode} + \text{Total energy in Standby mode} = \text{Total energy in Sample area} \qquad \text{(Eq. 11)}$$

$$\frac{\text{Total energy in Sample area}}{\text{Sample time}} = \text{Average power} \qquad \text{(Eq. 12)}$$
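The chain from Eq. 7 to Eq. 12 can be written out as a short Python sketch. It is only an illustration of the arithmetic, using the FIR filter with FPU at -O0 as input (read from Table 4B as 2015 DWT cycles and 106.1 mW in Active mode) together with the 14000 Hz sample frequency. Two inputs are assumptions that this section does not state directly: a 120 MHz MCU clock, which is consistent with 2015 cycles corresponding to 16.79 μs, and a standby power of about 38.2 mW, obtained here from the 11.46 mA standby current and an assumed supply of roughly 3.34 V. The function names are hypothetical.

```python
def active_time_s(dwt_cycles, mcu_clock_hz):
    """Eq. 7: time spent in Active mode."""
    return dwt_cycles / mcu_clock_hz


def energy_j(time_s, avg_power_w):
    """Eq. 8 and Eq. 10: energy = time * average power."""
    return time_s * avg_power_w


def standby_time_s(sample_freq_hz, t_active_s):
    """Eq. 9: the rest of the sample period is spent in Standby mode."""
    return 1.0 / sample_freq_hz - t_active_s


def sample_area(dwt_cycles, mcu_clock_hz, sample_freq_hz, p_active_w, p_standby_w):
    """Eq. 11 and Eq. 12: total energy (J) and average power (W) per sample."""
    t_active = active_time_s(dwt_cycles, mcu_clock_hz)
    t_standby = standby_time_s(sample_freq_hz, t_active)
    e_total = energy_j(t_active, p_active_w) + energy_j(t_standby, p_standby_w)
    # Dividing by the sample time equals multiplying by the sample frequency.
    return e_total, e_total * sample_freq_hz


# FIR with FPU at -O0: 2015 cycles and 106.1 mW in Active mode.
# Assumed: 120 MHz clock and 38.2 mW standby power (see the text above).
e_total, p_avg = sample_area(2015, 120e6, 14000, 0.1061, 0.0382)
print(f"energy = {e_total * 1e6:.2f} uJ, average power = {p_avg * 1e3:.1f} mW")
```

Under these assumptions the sketch lands close to the Sample row reported for this configuration (3.87 μJ and 54.2 mW), which suggests that the assumed clock and standby power are in line with the measurements.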

5.3 Enabling the FPU

There are different ways to enable the FPU depending on the MCU.

The SAM G55 is an MCU made by Atmel, and these steps were necessary:

1. Make sure that the symbol ARM_MATH_CM4 = true is defined in the compiler settings.

2. Add two flags to both the compiler and the linker:

● -mfloat-abi=hard

● -mfpu=fpv4-sp-d16

3. Include the arm_math header in main.c.

4. Call the fpu_enable() function in main.


5.4 Results of the controlled experiments

The conclusive results obtained with experimentation in a controlled environment can be seen below.

The Active current, the Sleep mode, and the energy efficiency that are mentioned in section 2.5 can also be seen in the subsections below.

The Standby mode average current:

• 11.18mA FPU disabled

• 11.46mA FPU enabled

For the results in tables 4A-4C, it is important to consider the accuracy of the power debugger that is explained in chapter 2.8 and the interrupt latency in chapter 2.9. The latency time (entry plus exit) was measured and is around:

• optimization -O0 FPU disabled 407 ns

• optimization -O0 FPU enabled 407 ns

• optimization -O1 FPU disabled 266 ns

• optimization -O1 FPU enabled 340 ns

During the “latency time” the MCU is not put into sleep mode, which makes the measurements of the Active mode more accurate. The current measured during the “latency time” is between 24.48 mA and 25.21 mA, depending on the optimization and on whether the FPU is enabled or disabled. The effect of the latency is at worst ~0.88%, which can be calculated with the equations below.

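$$\frac{(\text{active mode time} + \text{latency time}) \cdot \text{average current} - \text{latency current} \cdot \text{latency time}}{\text{active mode time}} = \text{X current} \qquad \text{(Eq. 13)}$$

$$\left(1 - \frac{\text{average current}}{\text{X current}}\right) \cdot 100 = \text{error in \%} \qquad \text{(Eq. 14)}$$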
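Eq. 13 and Eq. 14 can be tried out with a small Python sketch. The inputs are one illustrative combination taken from this section: the IIR filter with FPU at -O1 (6.108 μs in Active mode, 28.6 mA average current), the 340 ns latency for -O1 with FPU enabled, and 24.48 mA as the latency current. Choosing the lower bound of the quoted latency-current range is an assumption, since the text does not say which current belongs to which configuration, and the function names are hypothetical.

```python
def corrected_current(active_time, latency_time, avg_current, latency_current):
    """Eq. 13: remove the latency contribution from the measured average."""
    total_charge = (active_time + latency_time) * avg_current
    latency_charge = latency_current * latency_time
    return (total_charge - latency_charge) / active_time


def latency_error_percent(avg_current, x_current):
    """Eq. 14: relative effect of the latency, in percent."""
    return (1.0 - avg_current / x_current) * 100.0


# Times in microseconds and currents in mA; only the ratios matter here.
# 24.48 mA is the lower bound of the quoted range (assumed for this case).
x = corrected_current(6.108, 0.340, 28.6, 24.48)
err = latency_error_percent(28.6, x)
```

Because the latency current is lower than the Active mode average, the corrected current comes out slightly higher than the measured average, and the error stays below one percent for this combination.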


Software Floating-point

| Filter | Opt. | Area | DWT cycles | Avg. Current (mA) | Avg. Power (mW) | Time (μs) | Energy (μJ) |
|---|---|---|---|---|---|---|---|
| FIR | -O0 | Active | 6694 | 32.4 | 108.1 | 55.78 | 6.03 |
| FIR | -O0 | Sample | N/A | N/A | 92.6 | 71.43 | 6.61 |
| FIR | -O1 | Active | 4815 | 29.0 | 96.9 | 40.13 | 3.89 |
| FIR | -O1 | Sample | N/A | N/A | 70.8 | 71.43 | 5.06 |
| IIR | -O0 | Active | 2961 | 30.6 | 102.2 | 24.68 | 2.52 |
| IIR | -O0 | Sample | N/A | N/A | 59.7 | 71.43 | 4.27 |
| IIR | -O1 | Active | 1854 | 28.8 | 96.0 | 15.45 | 1.48 |
| IIR | -O1 | Sample | N/A | N/A | 50.1 | 71.43 | 3.57 |

Table 4A. Complex composition of the conclusive results for software Floating-point

FPU

| Filter | Opt. | Area | DWT cycles | Avg. Current (mA) | Avg. Power (mW) | Time (μs) | Energy (μJ) |
|---|---|---|---|---|---|---|---|
| FIR | -O0 | Active | 2015 | 31.8 | 106.1 | 16.79 | 1.78 |
| FIR | -O0 | Sample | N/A | N/A | 54.2 | 71.43 | 3.87 |
| FIR | -O1 | Active | 1682 | 27.1 | 90.5 | 14.02 | 1.27 |
| FIR | -O1 | Sample | N/A | N/A | 48.5 | 71.43 | 3.47 |
| IIR | -O0 | Active | 1077 | 30.4 | 101.5 | 8.975 | 0.91 |
| IIR | -O0 | Sample | N/A | N/A | 46.2 | 71.43 | 3.30 |
| IIR | -O1 | Active | 733 | 28.6 | 95.3 | 6.108 | 0.58 |
| IIR | -O1 | Sample | N/A | N/A | 43.1 | 71.43 | 3.08 |

Table 4B. Complex composition of the conclusive results for FPU


Fixed-point

| Filter | Opt. | Area | DWT cycles | Avg. Current (mA) | Avg. Power (mW) | Time (μs) | Energy (μJ) |
|---|---|---|---|---|---|---|---|
| FIR | -O0 | Active | 2140 | 31.4 | 104.9 | 17.83 | 1.87 |
| FIR | -O0 | Sample | N/A | N/A | 54.2 | 71.43 | 3.87 |
| FIR | -O1 | Active | 1748 | 26.9 | 89.9 | 14.57 | 1.31 |
| FIR | -O1 | Sample | N/A | N/A | 48.1 | 71.43 | 3.43 |
| IIR | -O0 | Active | 1557 | 30.9 | 103.1 | 12.98 | 1.33 |
| IIR | -O0 | Sample | N/A | N/A | 49.3 | 71.43 | 3.52 |
| IIR | -O1 | Active | 1023 | 28.0 | 93.6 | 8.525 | 0.79 |
| IIR | -O1 | Sample | N/A | N/A | 44.1 | 71.43 | 3.15 |

Table 4C. Complex composition of the conclusive results for Fixed-point

5.5 The power consumption when using optimization -O0 and -O1 with Software Floating-Point, Fixed-Point and FPU

The measuring units that are presented in this project are based on the energy consumption and the number of cycles executed. These measurements are for well documented algorithms such as the FIR filter and the IIR filter.

The results shown in tables 4A-C are achieved with the Atmel data visualizer, Atmel power debugger and the DWT unit.

A benefit of the SAM G55 DSC is that it has three low power modes: backup, wait and sleep. In sleep mode, when used correctly, the core clock is stopped while all other functions are able to keep running [35]. In this experiment, sleep mode is implemented and used to reduce the power consumption.

In the subsections below the main focus will be on the average power in the Sample area, and the execution time in the Active mode with optimization -O0 and -O1.


5.5.1 The power consumption in FIR filter

The power consumption varies between the two optimizations -O0 and -O1. The values are based on the FIR filter with Software Floating-Point, FPU and Fixed-Point.

Software Floating-Point:

• Active mode time difference: ~32.6%

• Sample area average power difference: ~26.7%

FPU enabled:

Fixed-Point:

For the three options (Fixed-Point, FPU and Software Floating-Point) mentioned above it is clear that optimization -O1 reduces the power consumption by 10% to 25% compared to -O0 and reduces the execution time by 16% to 29%.
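The percentages in this subsection can be checked with a small Python sketch. The text does not state how the differences were computed, so the formulas below are an inference from the reported numbers: the "difference" bullets are consistent with a difference taken relative to the mean of the two values, while the summary ranges are consistent with a reduction taken relative to the -O0 value. The function names are hypothetical, and the inputs are the software Floating-point FIR values read from Table 4A.

```python
def percent_difference(a, b):
    """Difference between a and b relative to their mean, in percent."""
    return abs(a - b) / ((a + b) / 2.0) * 100.0


def percent_reduction(before, after):
    """Reduction from 'before' to 'after' relative to 'before', in percent."""
    return (before - after) / before * 100.0


# Software Floating-point FIR, -O0 versus -O1.
time_diff = percent_difference(55.78, 40.13)   # Active mode time in microseconds
power_diff = percent_difference(92.6, 70.8)    # Sample area average power in mW
power_cut = percent_reduction(92.6, 70.8)      # falls inside the 10% to 25% range
```

The two measures answer slightly different questions, so it helps to state which one is meant whenever a percentage is quoted.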

5.5.2 The power consumption in IIR filter

The power consumption also varies for the IIR filter depending on the optimization (-O0 or -O1) chosen.

Software Floating-Point:

• Active mode time difference: ~46%

• Sample area average power difference: ~ 17.5%

FPU enabled:

• Active mode time difference: ~38%

Fixed-Point:

• Sample area average power difference: ~ 11.1%

Much like section 5.5.1 it was more beneficial to use optimization -O1 where the power consumption is reduced by 6% to 17% compared to -O0 and the execution time is reduced by 30% to 38%.


5.6 Comprehensive analysis

Following the workflow of chapter 5, a pattern was noticed in the FIR- and IIR-filter values for -O0 and -O1, which can be traced back to subsections 5.5.1 and 5.5.2. The pattern can be seen in the Active mode time and the Sample area power consumption, where the measured values are lower with -O1 than with -O0. The charts below show that the execution time and the power consumption are tightly related to each other. The FPU ensures faster Floating-point calculations, which in turn decreases the time in Active mode and increases the time in Standby mode. Since the DSC is not executing any code in Standby mode, this results in an overall reduced power consumption.

Chart 1. FIR Sample area power consumption(Y-axis), execution time in Active mode(X-axis)

Chart 1 shows the total energy for the FIR filter. An interesting point is that with optimization -O1 the FIR filter is executed in less time and with less total energy consumption in the Sample area.

When comparing the three options (FPU, software Floating-point and Fixed-point) in chart 1, it becomes clear that when using Floating-point it is less energy expensive to enable the FPU. When comparing Fixed-point with the FPU, there are no major differences in the total accumulated energy consumption in the Sample area once the accuracy of the power debugger and the interrupt latency are taken into consideration.

When relating the energy consumption to the execution time in the chart above, a longer execution time uses more energy. However, this does not apply when comparing the FPU with Fixed-point at the same optimization.

[Chart 1 axes: Energy (nJ) in Sample area (Y-axis), Time (μs) in Active mode (X-axis); data labels shown: 3870, 3470, 3870, 3430]

Fixed-point takes more time to execute the code than FPU, yet the energy consumption is approximately the same.

Chart 2. IIR Sample area power consumption(Y-axis), execution time in Active mode(X-axis)

The results for the IIR filter in chart 2 show that the FPU, software Floating-point and Fixed-point all execute faster and consume less energy with -O1. When comparing these three options (FPU, software Floating-point and Fixed-point) with each other at -O0, it is clear that the FPU consumes less energy than the other two options and also executes faster.

With -O1 the FPU executes faster than the other two options. However, when looking at the total energy consumption, the difference between Fixed-Point and FPU is indistinguishable once the measurement error is taken into account.

[Chart 2 axes: Energy (nJ) in Sample area (Y-axis), Time (μs) in Active mode (X-axis)]

Chart 3. FIR Sample area, average power consumption

Chart 3 shows that the power consumption in -O0 with software Floating-point is 71% higher than the power consumption with the FPU, while in -O1 it is 46% higher. Optimization -O1 benefited software Floating-point more than the FPU and Fixed-point, by around 15 mW, but software Floating-point is still inferior to both the FPU and Fixed-Point.

Chart 4. IIR Sample area, average power consumption

For the IIR filter it has been a challenge to distinguish which of the two options (FPU and Fixed-Point) has the higher power consumption with -O1, because of the small measurement error that exists. However, even in this case software Floating-point has proven to be inferior to the other two options.

[Chart 3 (FIR), average power in mW at Opt -O0 / Opt -O1: Software Floating-point 92.6 / 70.8, FPU 54.2 / 48.5, Fixed-point 54.2 / 48.1]

[Chart 4 (IIR), average power in mW at Opt -O0 / Opt -O1: Software Floating-point 59.7 / 50.1, FPU 46.2 / 43.1, Fixed-point 49.3 / 44.1]

6. Discussion

There are no silver bullets when searching for a power optimal microprocessor, because the field is new and there are aspects that are not known. Microprocessors are being used in areas where power consumption is a factor of high importance. This is a problem that needs to be handled in order to utilize the devices' fullest potential. By looking into the power consumption of a Cortex-M4 with controlled experiments, a few key points were noted, and these are discussed in the subsections below.

6.1 Method discussion

The method used in this thesis can be seen as successful; with experimentation, results were bound to be achieved. Regardless of how the system architecture is built, it is always rewarding to perform experiments. In the workflow seen in figure 6, the steps from the literature study down to the controlled experiments can be seen as the preparation phase, where as much data as possible was collected in order to process it and then apply it in the controlled experiments. The controlled experiments were performed with two DSP algorithms using software Floating-point, Fixed-point and lastly the FPU, in order to see the power consumption of each of these options (FPU, Fixed-point and software Floating-point) individually and also compared to each other. A limitation encountered in this thesis that is worth mentioning is that the databases contain more “guides” than related work, especially in areas like power consumption with DSP in DSCs.

6.2 Discussion of the data measurement

When performing controlled experiments there has to be some form of data collection and processing. The subsections below will discuss the data that is observed and then collected with the power measuring system apparatus.

Joseph Yiu mentions that the power consumption for many Cortex-M microcontrollers is around 200 μA/MHz and that there are a few that can go under 100 μA/MHz. The results in tables 4A-C show that the power consumption was at worst around 242 μA/MHz with -O1.
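The μA/MHz figure is the Active mode current divided by the core clock, which can be sketched in Python as follows. The 29.0 mA input is the highest -O1 Active mode current in the tables (software Floating-point FIR). The 120 MHz clock is an assumption inferred from the tables, where 2015 cycles correspond to 16.79 μs, and the function name is hypothetical.

```python
def current_per_mhz(avg_current_ma, clock_hz):
    """Return the Active mode current normalised to the clock, in uA/MHz."""
    return (avg_current_ma * 1000.0) / (clock_hz / 1e6)


# Software Floating-point FIR at -O1: 29.0 mA in Active mode.
# The 120 MHz clock is an assumption inferred from the tables.
worst_o1 = current_per_mhz(29.0, 120e6)
```

With these inputs the result rounds to the 242 μA/MHz quoted above, so the comparison with Yiu's figures appears to be based on the Active mode current rather than on the Sample area average.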

As mentioned in section 2.2 the in-system debugger is constrained when using -O2 and -O3, this leads to incorrect counting of DWT cycles which are used to calculate the energy consumption.


6.2.1 The power consumption when executing the FIR filter

Running the FIR filter algorithm on the SAM G55 DSC worked as expected. The main points to discuss are:

• Optimization -O1 performed better than -O0 in terms of energy consumption and execution time

• Optimization -O1 performed with less average current than -O0 in Active mode

• Floating-point calculations were performed faster with FPU than software Floating-point

• Lower power consumption when using FPU compared to using software Floating-point

• FPU executed the filter calculations faster than Fixed-point

• The average power during sleep mode is higher when enabling the FPU

• The average accumulated power consumption in the Sample area for the FPU and Fixed-Point are indistinguishable

• Speculations about -O2 and -O3 based on results from -O0 and -O1

The performance of -O1 has proven to be in favor of the energy consumption: the higher optimization level resulted in faster code execution, which in turn led to lower total energy use in the Sample area. Optimization -O0 is slower than -O1 by ~20%-60%, and it consumes ~7%-31% more energy. For IoT applications where battery life is essential, even a small amount of saved energy is a win.

Optimization -O1 ran with less average current during Active mode than -O0, which may be due to “smarter” memory allocation and register handling. Martin Trevor mentions that the FPU is created and dedicated to execute Floating-point calculations as fast as integer calculations [12]. The FPU is a hardware component that does not depend on any software libraries to perform Floating-point calculations, while software Floating-point needs software libraries for its calculations. The measurement values in tables 4A-C show that the Floating-point calculations were indeed executed faster with the FPU, because software libraries were not needed to perform these demanding DSP algorithms.

Joseph Yiu mentions that the speed of execution is critical to lowering the overall power consumption [19]. The measured power consumption in tables 4A-C shows a big difference between the FPU and software Floating-point. This is because the FPU ensures maximum throughput, which means that less time is spent in Active mode and more time in Standby mode, where the DSC is in sleep mode. This in turn makes the overall power consumption lower than with software Floating-point, which is in line with what Yiu mentions and can also be seen in chart 3.

The statement Martin Trevor mentions above is not taking the use of Fixed-point into consideration, when using Fixed-point the conversion to the original value will take a
