FPGA Co-Processing in Software-Defined Radios

Degree project in the field of technology Electrical Engineering and the main field of study Information and Communication Technology, second cycle, 30 credits

Stockholm, Sweden 2019

LEON FERNANDEZ

Abstract

The Internet of Things holds great promise for the future. In the smart cities of tomorrow, wireless connectivity of everyday objects is deemed essential in ensuring efficient and sustainable use of vital, yet limited resources such as water, electricity and food. However, radio communication at the required scale does not come easily. Bandwidth is yet another limited resource that must be used efficiently so that wireless infrastructure for different IoT applications can coexist. Keeping up with the digitalization of modern society is difficult for wireless researchers and developers. The Software-Defined Radio (SDR) is a technology that allows swift prototyping and development of wireless systems by moving traditional hardware-based radio building blocks into the software domain. For developers looking to be on the bleeding edge of wireless technology, and thus keep up with the rapid digitalization, the SDR is a must. Many SDR systems consist of a radio peripheral that handles tasks such as amplification, AD/DA conversion and resampling that are common to all wireless communication systems. The application-specific work is done in software at the baseband or an intermediate frequency by a host PC connected to the peripheral. That may include PHY-related processing such as the use of a specific modulation scheme as well as higher-layer tasks such as switching. While this setup does provide great flexibility and ease of use, it is not without its drawbacks. Many communication protocols specify a so-called round-trip time, and devices wishing to adhere to the protocol must be able to respond to any transmission within that time. The link between the host and the peripheral is a major cause of latency and limits the use of many software-defined radio systems to proof-of-concept implementations and early prototyping, since it prevents the round-trip time requirement from being met. Overcoming the latency in the link would allow the flexibility of SDRs to be brought into field applications.

This thesis aims to offload the link between the host PC and the radio peripheral in a typical SDR system. Selected parts of IEEE 802.15.4, a wireless standard designed for IoT applications, were implemented by using unused programmable logic aboard the peripheral as a co-processor in order to reduce the amount of data that gets sent on the link. Frame success rate and round-trip time measurements were made and compared to measurements from a reference design without any co-processing in the radio peripheral. The co-processing greatly reduced traffic on the link while achieving a frame success rate similar to that of the reference design. In terms of round-trip time, the co-processing actually caused the latency to increase. Furthermore, the measurements from the co-processing system showed a counter-intuitive behavior where the round-trip time decreased as the rate of the generated test frames increased. This unusual behavior is most likely due to internal buffer mechanisms of the operating system on the host PC. Further investigation is required in order to bring down the response time to a level more suitable for field applications.

Keywords

Internet of Things, Software-Defined Radio, USRP, GNU Radio, IEEE 802.15.4

Summary

The Internet of Things (IoT) promises great things in the near future. In the smart cities of tomorrow, wireless connectivity of everyday objects is an important component of efficient and sustainable use of limited resources such as water, electricity and food. Unfortunately, radio communication at the required scale is a tough challenge. Bandwidth is yet another limited resource that must be used efficiently so that wireless infrastructure for different IoT applications can coexist. Keeping pace with the digitalization of modern society is difficult for researchers and developers of wireless systems. The Software-Defined Radio (SDR) is a technology that enables agile development of wireless systems. The foundation of the technology is to move traditional hardware-based radio building blocks into the software domain. For developers who want to be at the forefront of wireless systems, and thereby keep pace with the rapid digitalization, SDR is a must. Many SDR systems consist of an external radio module that handles what is common to most wireless systems, for example amplification, AD/DA conversion and resampling. Application-specific functionality is handled by software at baseband or at an intermediate frequency, with the software running on a PC. An SDR system consisting of a PC with an external radio module gives the user great flexibility, but it has its shortcomings. Many communication protocols specify a so-called Round-Trip Time (RTT). Devices that aim to follow the protocol must be able to respond to all messages within the time specified as the RTT. The link between the PC and the radio module is a major contributor to delays and limits the use of SDR to conceptual tests and early prototypes, since the delays often mean a violation of the protocol's RTT. If the delay problem can be avoided, SDR could be used in field applications with all the flexibility that SDR entails, and thereby become a powerful development tool for researchers and developers in the field.

This work aims to offload the link between the PC and the radio module in a typical SDR system. Selected parts of IEEE 802.15.4, a standard for wireless communication within the IoT, were implemented using programmable logic on the USRP so that most of the samples are consumed before the link. The number of successfully received frames and the RTT were measured and compared with a reference design in which all computations are handled by the PC. Using the programmable logic led to greatly reduced amounts of data on the link with no appreciable change in the number of successfully received frames compared to the reference design. However, the delays in the system became larger when the programmable logic was used. Moreover, the system showed an unexpected behavior where the delay decreased as the load from the wireless traffic increased. This peculiar behavior is most likely due to internal buffer mechanisms in the operating system of the PC. Further investigation is required before the delays can be reduced to a level suitable for field applications.

Keywords

Internet of Things, Software-Defined Radio, USRP, GNU Radio, IEEE 802.15.4

Acknowledgements

I would like to thank my supervisor, Dr. Peng Wang, for his great enthusiasm and support throughout this project, for always being in a good mood and for being so happy to share his knowledge and thoughts about digital communications. I would also like to thank my examiner, Prof. Marina Petrova, for letting me be a little part of the research at Radio Systems Lab at KTH.

I would like to thank Thomas for urging me to pursue an engineering degree when I had no idea what to do with my life. So far, it has been one of the best decisions I have made and I am sure it will stay that way.

Lastly, I would like to thank Jessica for all the love and support and for putting up with my endless babbling and ranting about technology.

Chapter 1

Introduction

What is the Internet of Things (IoT)? The Internet is already made up of things, is it not? According to the International Telecommunication Union (ITU) there have traditionally been two dimensions to communications, namely TIME and PLACE [1]. Ensuring that a message arrives within a specified time interval or ensuring good message quality over long distances are typical examples that address the TIME and PLACE aspects of communications. The Internet today is indeed made up of things but it is lacking a third dimension, the THING dimension, also envisioned by the ITU [1]. The THING dimension is meant to address more application-specific aspects of a certain system such as power consumption requirements, deployment cost and co-existence in densely populated areas where the electromagnetic (EM) spectrum is a limiting factor.

In other words, what differentiates today's Internet from the IoT is the diversity of the connected devices. Managing the heterogeneity both within and between networks in the IoT is a major challenge for researchers and developers all over the world, and meeting it is key to building the smart and sustainable cities of tomorrow.

1.1 Motivation

Telecom giants are predicting a huge increase in device connectivity and diversity over the next few years. Cisco estimates that the number of networked devices will increase from 18 billion in 2017 to 28.5 billion in 2022 and that the share of machine-to-machine (M2M) communications will increase from 34 % to 51 % [2]. The majority of these devices will be smart versions of basic household appliances such as white goods, entertainment systems and lighting installations. Huawei predicts that by 2025, each person will own 5 smart devices and all new vehicles will have Internet connectivity [3]. Furthermore, Ericsson is expecting that the number of short-range wireless IoT devices will increase from 7.5 billion in 2018 to 17.8 billion in 2024 [4]. The wide variety and sheer number of new devices will reshape the information and communication technology (ICT) landscape. Therefore, new technologies and standards are needed in order to build a secure, reliable and scalable infrastructure.

The software-defined radio (SDR) [5] has been proven to be a powerful tool for engineers in keeping up with the ever-increasing demands of the dawning IoT. The SDR can replace slow and costly electronic hardware development processes in proof-of-concept implementations and early prototyping stages.

Given the new challenges related to the IoT, a tool for more efficient design space exploration, such as the SDR, is much needed. A core idea behind the SDR is implementing as much of the radio functionality as possible in source code, thereby reducing the amount of hardware that needs to be developed in order to have a functional device. A common SDR setup consists of a peripheral radio device (PRD) connected to a host PC via, for example, Ethernet or PCI.

When operating as a receiver, the PRD converts the received signal at the antennas to samples and sends them over the cable to the host PC where they can be processed in the host's general purpose processor (GPP) or graphics processing unit (GPU). When operating as a transmitter the process is reversed: the samples are generated in the host and then sent to the PRD for transmission.

Processing or generating samples in software using GPPs or GPUs provides a great deal of flexibility compared to developing application specific hardware and it opens up the field of digital radio research to a wider community.

1.2 Problem

While the host-based SDR setup provides great flexibility and prototyping possibilities for wireless communication systems, the link between the host and the device is a bottleneck. It introduces a latency that is intolerable in most commercially used communication protocols, which prevents interoperability with commercial off-the-shelf (COTS) devices. How can the link between the host PC and the PRD be offloaded so as to achieve a COTS-compatible SDR system?

1.3 Purpose

This thesis presents a proof-of-concept that the link between the host PC and the PRD can be offloaded by executing certain parts of the receiver chain on the actual PRD. With a modular block-based design, the user will be able to create such a receiver without having to know any details about the PRD hardware, thereby maintaining the advantage of flexibility that the SDR offers.

1.4 Goal

The main goal of this thesis is to migrate the execution of certain functional blocks in an SDR transceiver from the host PC to the PRD. By utilizing otherwise unused programmable logic resources on the PRD for co-processing, it is thought that the latency of the host-PRD link will be lowered, perhaps even to levels that allow for compliance with certain COTS devices.

The SDR is a powerful development tool, but it still has a lot of unused potential. In order to tackle the challenges in research and development posed by the IoT, it is important to keep improving the tools used within the field.

1.5 Ethical Considerations

When it comes to more social and societal aspects of IoT and SDR technology, developers and researchers have some responsibilities that must not be ignored. With the advent of the IoT, a lot of personal information may become available online. Power consumption of household devices, data from medical IoT devices or metadata from smart appliances are some examples of information that should be kept private by all means. Developers and researchers therefore have the responsibility to protect this data and make the connections secure through, for example, cryptography. Furthermore, SDR systems will inherit any weaknesses in the language(s) and libraries used for implementation. This may make SDR implementations of standards and protocols more vulnerable targets compared to hardware implementations, which are usually more difficult to crack. For that reason, it is important that developers and researchers are familiar with known possible flaws or exploits in the underlying software framework, and how to avoid the associated pitfalls.

1.6 Delimitations

Both SDRs and the IoT are very broad fields of study and the intersection of the two is, naturally, a big topic as well. IoT devices are often characterized by low-rate, low-power radio requirements and for that purpose there are many standards and protocols available. This thesis focuses solely on the IEEE 802.15.4 standard [6]. More specifically, it focuses on a receiver for the 2.4 GHz Offset Quadrature Phase Shift Keying (OQPSK) [7] PHY as proposed by the standard. The reason is that it is the most commonly used PHY among COTS devices, most notably through the Zigbee [8] protocol. Moreover, this thesis emphasizes graphical programming of SDR devices. In theory and during design stages, radio chains and signal processing (SP) systems are often visualized by block diagrams. Graphical programs are visualized in a similar manner and therefore, a smooth design-implementation transition can be achieved, something that is often sought after within electronic systems design. While more traditional means of programming SDR devices may give better performance and/or finer control, they are not investigated in this thesis in favor of more easy-to-use graphical frameworks that fit better into the idea of fast prototyping and early implementation. Lastly, since some software in an SDR system often runs on top of a traditional operating system (OS), such as Linux/UNIX variants, the final performance of the system may be affected by the choice of OS and what (if any) configurations have been made. The implications of such OS-related parameters will not be investigated in this thesis and a stock Ubuntu system will be used.

1.7 Outline

This thesis is organized into six chapters as follows. This chapter has served as an introduction to the general area of the usage of SDR technology within the IoT and defined the scope of the thesis. Chapter 2 gives some technical background for low-power wireless communications and the underlying software stack for the targeted SDR technology. Methods for testing the implementation are presented in Chapter 3 while the implementation itself is studied in detail in Chapter 4. Chapter 5 presents the test results and in the last chapter, Chapter 6, the results are discussed together with conclusions and possible future work.

Chapter 2

Background

2.1 Wireless Communications in the ISM-bands

The Industrial, Scientific and Medical (ISM) radio bands have over the last 20 years become immensely popular for wireless communications. The ISM bands were originally defined to be a “buffer zone” for non-communication radiation from applications such as industrial heating, microwave ovens and medical treatments. Because of their history, radio communications in the ISM bands require no license. At the time of writing the ISM bands are defined by the International Telecommunication Union in the final acts of the World Radiocommunication Conference 2015 [9].

While there are many frequencies that belong to the ISM bands, ranging from as low as 6.78 MHz all the way up to 245 GHz, the 2.45 GHz band has become the most commonly used band with technologies such as IEEE 802.11/Wi-Fi, Bluetooth and IEEE 802.15.4/Zigbee [10][11][?][8][6].

The unlicensed status of the ISM bands is essentially what makes the IoT possible since there are virtually no limits on which applications the bands are used for.

2.2 The IEEE 802.15.4 Standard

The IEEE 802.15.4 standard is being developed for the purpose of low-energy and low-rate wireless communications [6]. Since many of the devices that make up the IoT are expected to be basic household appliances, simple sensors and lighting installations, IEEE 802.15.4 is suited to be one of the main communication standards in the IoT. There are many ways of implementing IEEE 802.15.4: it specifies a number of different physical (PHY) layers that can be used over a range of frequencies and it describes in very broad terms what the expected behavior of the medium access control (MAC) layer is.

As mentioned in Section 1.6, the focus of this thesis is the implementation of one particular PHY that IEEE 802.15.4 proposes. It operates in the 2.4 GHz band and uses OQPSK modulation with a half-sine pulse-shaping filter. Furthermore, the PHY specification employs a direct-sequence spread spectrum (DSSS) technique for robustness (step 3 below). Step by step, the signal is generated as follows:

1. The binary stream to be transmitted is divided into chunks of four bits.

2. Each chunk is mapped to a symbol according to Table 2.1.

3. Each symbol is mapped to a chip sequence according to Table 2.1.

4. Starting with the least significant chip in the sequence corresponding to the chunk of least significant bits, the chip stream is modulated using OQPSK with even-numbered chips going on the I-channel and odd-numbered chips going on the Q-channel.

5. A half-sine pulse-shaping filter is applied to the chips on the respective channels.
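Steps 1 to 3 above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this thesis; all function names are hypothetical. Only the chip sequence for symbol 0 is copied from Table 2.1. The remaining fifteen sequences are generated under an assumption that is an observation about the rows of Table 2.1 rather than something stated in the text: symbols 1 to 7 are symbol 0 rotated right by four chips per step, and symbols 8 to 15 repeat symbols 0 to 7 with every odd-indexed chip inverted.

```python
# Chip sequence for symbol 0, copied from Table 2.1 (c0 is the leftmost chip).
SYMBOL_0 = "11011001110000110101001000101110"


def chip_table():
    """Generate all 16 chip sequences from symbol 0 (assumed pattern, see text)."""
    table = []
    for k in range(8):
        shift = 4 * k
        # Rotate symbol 0 right by 4*k chips.
        table.append(SYMBOL_0[32 - shift:] + SYMBOL_0[:32 - shift])
    for k in range(8):
        # Invert every odd-indexed chip of symbols 0..7 to get symbols 8..15.
        table.append("".join(
            c if i % 2 == 0 else "10"[int(c)]
            for i, c in enumerate(table[k])))
    return table


def bits_to_symbols(bits):
    """Steps 1 and 2: group bits into chunks b0 b1 b2 b3, b0 least significant."""
    if len(bits) % 4:
        raise ValueError("number of bits must be a multiple of four")
    return [bits[i] + 2 * bits[i + 1] + 4 * bits[i + 2] + 8 * bits[i + 3]
            for i in range(0, len(bits), 4)]


def spread(bits):
    """Step 3: replace each symbol by its 32-chip sequence."""
    table = chip_table()
    return "".join(table[s] for s in bits_to_symbols(bits))
```

For example, `spread([1, 0, 0, 0])` returns the chip sequence listed for symbol 1, since the chunk 1000 has its least significant bit set.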

IEEE 802.15.4 specifies a bit rate of 250 kbit/s. With four bits and 32 chips per symbol, this translates into a chip rate of 2 Mchip/s. Since the chips alternate between the I- and Q-channels, each channel carries 1 Mchip/s, which ultimately gives a constellation symbol duration of Ts = 1 µs for the pulses on the respective channels. Note that what the standard refers to as “symbols” is basically the index of the chip sequences and thereby more related to the DSSS technique than to the OQPSK constellation. To maintain a distinction throughout this thesis, the term “symbol” will be used to describe the “chip sequence index” while the term “constellation symbol” will be used as the name for the points in a constellation diagram. The waveform representing the symbol 0 can be seen in Figure 2.1. There is a relationship between OQPSK and another modulation scheme called minimum shift keying (MSK) that is important for understanding the receiver design. This relationship is addressed in Section 2.6.
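The rate figures above follow from simple arithmetic, spelled out here as a short Python sketch. The variable names are chosen for illustration only; the numbers are the ones given in the text (250 kbit/s, four bits per chunk, 32 chips per sequence).

```python
BIT_RATE = 250e3        # bit/s, as specified by the standard
BITS_PER_SYMBOL = 4     # one four-bit chunk per symbol
CHIPS_PER_SYMBOL = 32   # length of each chip sequence in Table 2.1

symbol_rate = BIT_RATE / BITS_PER_SYMBOL      # symbols per second
chip_rate = symbol_rate * CHIPS_PER_SYMBOL    # chips per second, I and Q together
chips_per_channel = chip_rate / 2             # even chips on I, odd chips on Q
pulse_duration = 1 / chips_per_channel        # seconds per half-sine pulse (Ts)
```

The offset between the two channels is half of `pulse_duration`, which is the Ts/2 shift between the I- and Q-waveforms.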

Table 2.1: Mapping table between the binary data and the chip sequences.

Chunk (b₀b₁b₂b₃)   Symbol   Chip Sequence (c₀c₁...c₃₁)
0000               0        11011001110000110101001000101110
1000               1        11101101100111000011010100100010
0100               2        00101110110110011100001101010010
1100               3        00100010111011011001110000110101
0010               4        01010010001011101101100111000011
1010               5        00110101001000101110110110011100
0110               6        11000011010100100010111011011001
1110               7        10011100001101010010001011101101
0001               8        10001100100101100000011101111011
1001               9        10111000110010010110000001110111
0101               10       01111011100011001001011000000111
1101               11       01110111101110001100100101100000
0011               12       00000111011110111000110010010110
1011               13       01100000011101111011100011001001
0111               14       10010110000001110111101110001100
1111               15       11001001011000000111011110111000

The structure of a transmitted frame can be seen in Figure 2.2. The segments are used as follows:

• The synchronization header (SHR) is used to detect the presence of a frame and to aid the receiver in carrier and timing recovery. Timing and carrier recovery are discussed in Sections 2.6.1 and 4.1.1.

• The PHY protocol data unit (PPDU) contains the actual usable PHY data.

• The preamble is a repeated transmission of the zero symbol. Due to the DSSS technique, the preamble is a repeated sequence of OQPSK constellation symbols that is easily discernible from the background noise.

Figure 2.1: A time-domain view of the waveform that represents the symbol 0. [Plot: the I- and Q-channel waveforms versus t, with the Q-channel offset by Ts/2.]

Segment    Part of   Content / length
Preamble   SHR       0x00 0x00 0x00 0x00
SFD        SHR       0xA7
PHR        PPDU      1 B
MHR+MSDU   PSDU      1 B–125 B
MFR        PSDU      2 B

Figure 2.2: Frame structure for the IEEE 802.15.4 2.4 GHz OQPSK PHY.

• The start-of-frame delimiter (SFD), the symbol 7 followed by the symbol 10 (the byte 0xA7 in Figure 2.2), marks the end of the SHR and the start of the usable data.

• The PHY header (PHR), one byte that indicates the number of bytes in the PHY payload.

• The PHY service data unit (PSDU), the PHY payload. Limited to 127 bytes.

• The MAC header (MHR) and MAC service data unit (MSDU) contain information about the frame type, such as beacon (BCN) or acknowledgement (ACK), as well as the data being sent to higher layers.

• The MAC footer (MFR) is a two-byte error-detecting code calculated from the bytes in the MHR and MSDU. The algorithm used for the calculation is commonly known as CRC-CCITT [12].
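The frame layout in Figure 2.2 can be sketched as follows. This is an illustrative sketch with hypothetical function names, not code from the thesis. The PSDU is treated as an opaque byte string that already contains the MHR, MSDU and MFR. The low-nibble-first symbol order is an assumption inferred from the delimiter byte 0xA7 being sent as the symbol 7 followed by the symbol 10.

```python
PREAMBLE = bytes(4)     # four 0x00 bytes, i.e. eight zero symbols
SFD = bytes([0xA7])     # delimiter: symbol 7 followed by symbol 10
MAX_PSDU_LEN = 127      # PHY payload limit in bytes


def build_frame(psdu: bytes) -> bytes:
    """Prepend the SHR (preamble + delimiter) and the PHR (length byte) to a PSDU."""
    if len(psdu) > MAX_PSDU_LEN:
        raise ValueError("PSDU is limited to 127 bytes")
    return PREAMBLE + SFD + bytes([len(psdu)]) + psdu


def bytes_to_symbols(data: bytes) -> list:
    """Split each byte into two 4-bit symbols, low nibble first (assumed order)."""
    symbols = []
    for byte in data:
        symbols += [byte & 0x0F, byte >> 4]
    return symbols
```

With this ordering, `bytes_to_symbols(SFD)` yields the symbols 7 and 10, which agrees with the description of the delimiter above.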

2.2.1 Cyclic Redundancy Check

The IEEE 802.15.4 standard specifies the use of a frame check sequence (FCS) in order to detect errors in the received frame. The algorithm used with the OQPSK PHY, CRC-CCITT [12], is a 16-bit cyclic redundancy check (CRC) defined by the generator polynomial

\[ G_{16}(x) = x^{16} + x^{12} + x^{5} + 1. \]

By calculating the remainder polynomial

\[ R(x) = r_0 x^{15} + r_1 x^{14} + \dots + r_{14} x + r_{15} = M(x)\, x^{16} \bmod G_{16}(x), \quad r_i \in \mathbb{Z}_2, \]

where

\[ M(x) = b_0 x^{p-1} + b_1 x^{p-2} + \dots + b_{p-2} x + b_{p-1}, \quad b_i \in \mathbb{Z}_2, \]

and $b_0, \dots, b_{p-1}$ are the bits in the MHR+MSDU segment in Figure 2.2, a 16-bit word, $r_0, \dots, r_{15}$, is obtained to form the MFR segment in Figure 2.2. Upon frame reception, the CRC is calculated over the MHR+MSDU segment and compared with the MFR. If there is a match, the frame is most likely intact and can be sent for further processing. Otherwise, some bits in the received MHR+MSDU segment may have been misinterpreted and the frame is discarded.
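The remainder calculation can be sketched as a bitwise shift register in Python. This is a minimal sketch with hypothetical function names that works directly on a list of bits, so it makes no assumption about how bytes are mapped to bits. It assumes a register that starts at zero and no final inversion, which is what the plain remainder in the formula above implies.

```python
GENERATOR = 0x1021  # x^12 + x^5 + 1; the x^16 term is implicit in the 16-bit register


def crc16(bits):
    """Return [r0, ..., r15], the coefficients of M(x) * x^16 mod G16(x).

    `bits` is [b0, ..., b_{p-1}], where b0 is the coefficient of the highest power.
    """
    reg = 0
    for b in bits:
        feedback = ((reg >> 15) & 1) ^ b
        reg = (reg << 1) & 0xFFFF
        if feedback:
            reg ^= GENERATOR
    return [(reg >> (15 - i)) & 1 for i in range(16)]


def frame_is_intact(bits, mfr):
    """Receiver-side check: recompute the CRC and compare it with the received MFR."""
    return crc16(bits) == list(mfr)
```

A useful property for checking such a sketch is that running `crc16` over the message bits followed by their own CRC bits gives an all-zero remainder.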

2.3 USRP and UHD

The Universal Software Radio Peripheral (USRP) [13] is a family of SDR devices. The USRP and its associated driver, the USRP Hardware Driver (UHD) [14], are developed by Ettus Research [15]. Most USRP models can operate over frequencies ranging from 6 GHz down to a few MHz or even DC, with analog bandwidths of 40 MHz to 160 MHz, both depending on the model [16].

During operation, the user chooses a center frequency at which the device will operate. In receiving mode, the spectrum around the center frequency gets shifted down to an intermediate frequency (IF) and an analog-to-digital converter (ADC) subsequently converts the signals into the digital domain. In transmitting mode, a digital-to-analog converter (DAC) is fed with samples coming in from the host and the output of the converter is shifted up to the center frequency and then output at the antenna. A general overview of the USRP architecture can be seen in Figure 2.3. The blue box denotes the parts of the system that are implemented using programmable logic. The two core blocks in the digital domain are the digital down-converter (DDC) and the digital up-converter (DUC), as they allow the user to select what sample rate the software will have access to. Section 2.5 will discuss this in more detail.

Figure 2.3: An overview of the general USRP architecture. The original image can be found in [17]. The blue box was added by the author to point out the parts of the USRP X310 system that the FPGA handles.

UHD provides an abstraction of a generic USRP device, allowing the host PC to seamlessly communicate with different USRP models. It is a C/C++ library that provides functions for basic tasks such as setting the center frequency and streaming samples to and from the device.

2.4 GNU Radio

GNU Radio [18] is a toolkit for general signal processing (SP) applications. It is free and open source software (FOSS) written mostly in C/C++ and Python. GNU Radio further abstracts away the PRD by managing, among other things, the sample streams between the client and the device. GNU Radio also provides the USRP/UHD developer with more general SP utilities such as filters and different modulation/demodulation schemes. Note that GNU Radio is not limited to SDR development on the USRP platform.

Furthermore, GNU Radio allows the user to graphically construct SP applications. Typical SP modules are represented by blocks with ports for inputs and outputs. An application is made by dropping the desired blocks onto the workspace and then connecting them so that the desired functionality is achieved. This way, prototypes for radio architectures can be made swiftly and intuitively.

Figure 2.4 shows the software stack of the host PC-USRP system. The developer implements radio functionality through GNU Radio Companion (GRC), a graphical programming environment on the host PC. GRC then interacts with the GNU Radio library that implements typical radio and SP algorithms. Samples coming from GNU Radio to the USRP or vice-versa are handled by UHD, which sends them over the host-PRD link. The link between the host PC and the USRP PRD is typically Ethernet or PCI. In the USRP domain, incoming and outgoing samples are handled by firmware running on a microprocessor and fed into or out of the digital circuitry that handles the processing at the IF.

2.5 The Field-Programmable Gate Array

Since the mid-1980s, programmable logic integrated circuits (ICs) have been an important component for computer scientists and electrical engineers. While a number of different such technologies exist, the field-programmable gate array (FPGA) [19] is probably the most prominent. An FPGA consists of a large number of basic digital building blocks such as registers, flip-flops, adders and look-up tables (LUTs). A reconfigurable interconnect allows these blocks to be connected in many different ways so as to realize a digital circuit made by a designer. The designer typically uses a so-called hardware description language (HDL) such as VHDL [20] or Verilog [21] to describe the desired digital circuit.

Figure 2.4: A conceptual illustration of the UHD/GNU Radio stack. [Diagram: GRC, GNU Radio and UHD on the host PC; firmware and USRP HDL on the USRP; the two sides connected by PCI or Ethernet.]

The HDL is then processed by an external program that outputs a low-level description of the digital circuit expressed in terms of the available blocks on the FPGA, a process called digital synthesis. The low-level description can then be used to program the interconnect on the FPGA, thereby realizing the circuit inside the FPGA. The circuit can then be interacted with using some of the pins of the FPGA IC package. This workflow provides a fast way of prototyping and developing digital circuits purely by writing source code. While the performance may not be as good as that of an application-specific integrated circuit (ASIC), it is often good enough for many applications and for prototyping purposes. For a better understanding of FPGA fundamentals the reader is referred to [22] or [23].
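To make the idea of a look-up table as a programmable building block more concrete, the following Python sketch models a single LUT. It is purely conceptual: the class and function names are hypothetical, the choice of four inputs is an assumption made for brevity, and real synthesis tools and devices are far more involved.

```python
class LUT4:
    """A 4-input look-up table: 16 stored bits can realize any Boolean function."""

    def __init__(self, truth_table):
        if len(truth_table) != 16:
            raise ValueError("a 4-input LUT needs 16 entries")
        self.truth_table = list(truth_table)

    def __call__(self, a, b, c, d):
        # The four inputs form the address into the stored truth table.
        return self.truth_table[a | (b << 1) | (c << 2) | (d << 3)]


def program(function):
    """Fill a LUT with the truth table of a Python function of four bits."""
    return LUT4([function(i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1)
                 for i in range(16)])


# The same hardware block becomes a parity gate or an AND gate
# depending only on the bits it is programmed with.
xor4 = program(lambda a, b, c, d: a ^ b ^ c ^ d)
and4 = program(lambda a, b, c, d: a & b & c & d)
```

Reprogramming the stored bits changes the function without changing the block itself, which is the property that makes the circuit reconfigurable.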

2.5.1 Xilinx Vivado

The USRP platform is largely based on FPGA technology from one of the major FPGA vendors, Xilinx. For programming their FPGAs, Xilinx provides a tool suite called Vivado [24]. Vivado also gives the user some control over how the HDL code is synthesized and realized aboard the FPGA, allowing fine-grained tuning of the digital circuit that is generated from the HDL code. Additionally, Vivado lets the user generate HDL code from higher-level languages such as C/C++ or by graphical means such as MATLAB Simulink [25] [26] or through the Vivado graphical user interface (GUI). Depending on the user's preference, Vivado can be run entirely through scripting [27], entirely through the GUI or through a combination of the two.

2.5.2 RF Network-on-Chip

As mentioned in Section 2.3, at the IF the signals exit or enter the digital domain (depending on whether the PRD is transmitting or receiving). The digital domain in most USRP devices is actually the FPGA. A lot of core processing such as interpolation/decimation and IQ-imbalance correction is done inside the FPGA by circuits generated from HDL code. But even though these are complex circuits, a lot of the resources on the FPGA still go unused. That means a lot of potential digital processing circuitry exists aboard the USRP. By utilizing the unused digital blocks on the USRP's FPGA, some of the processing can be moved away from the host, thus offloading the link between the host PC and the PRD. RF Network-on-Chip (RFNoC) provides a means for the developer to use the FPGA aboard most USRP devices not only for core processing, but for customized digital signal processing (DSP) as well. In other words, RFNoC acts as a glue between the Vivado tool suite, Ettus' USRP/UHD platform and GNU Radio. A conceptual illustration of the software stack and how RFNoC fits in can be seen in Figure 2.5. In terms of the USRP hardware architecture, this can be thought of as inserting custom DSP blocks anywhere between the ADC/DAC blocks and the DDC/DUC blocks in Figure 2.3. The user can also replace or remove any DDCs or DUCs, if the user wants to interface with the ADCs or DACs more directly.

Figure 2.5: A conceptual illustration of how RFNoC fits into the UHD/GNU Radio stack. [Diagram: the stack of Figure 2.4 (GRC, GNU Radio, UHD, firmware, USRP HDL) extended with RFNoC, Vivado and custom HDL.]

2.6 OQPSK viewed as Minimum Shift Keying

OQPSK with a half-sine pulse-shaping filter generates signals that have a constant envelope. This simplifies the design of RF power amplifiers and is therefore a desirable property. It does have some drawbacks, nonetheless. Under common impairments such as carrier mismatches, the offset between the I- and Q-channels makes the signal more difficult to recover as compared to regular QPSK.
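The constant-envelope property can be checked numerically with a short Python sketch. The function names and the choice of eight samples per chip period are assumptions made for illustration. Each chip becomes a half-sine pulse lasting two chip periods, with even-indexed chips on the I-channel and odd-indexed chips on the Q-channel, so that the Q-channel lags the I-channel by one chip period.

```python
import math


def oqpsk_half_sine(chips, sps=8):
    """Return (I, Q) sample lists for a chip string, `sps` samples per chip period."""
    n = (len(chips) + 1) * sps
    i_ch, q_ch = [0.0] * n, [0.0] * n
    for k, chip in enumerate(chips):
        level = 1.0 if chip == "1" else -1.0
        channel = i_ch if k % 2 == 0 else q_ch
        # Half-sine pulse starting at chip k and lasting two chip periods.
        for m in range(2 * sps):
            channel[k * sps + m] += level * math.sin(math.pi * m / (2 * sps))
    return i_ch, q_ch


def envelope(i_ch, q_ch):
    """Magnitude of the complex baseband signal I + jQ at each sample."""
    return [math.hypot(i, q) for i, q in zip(i_ch, q_ch)]
```

Once both channels are active, one pulse follows a sine while the overlapping pulse on the other channel follows the matching cosine, so the magnitude stays at one. Only the first and last chip periods, where a single channel is active, deviate from this.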

Due to the Q-channel being offset by one half constellation symbol period, an OQPSK signal must be sampled at twice the constellation symbol rate. By studying Figure 2.6 it should be clear that if the sampling is done whenever the I- or Q-channel peaks, the phase difference between subsequent samples is always θ = ±90°. In [28] it is shown mathematically that viewing an OQPSK signal with a half-sine pulse shape as a series (θ_m), m ∈ N, of ±90° phase shifts is how one would treat an MSK signal and that the two are equivalent. Tracking these phase shifts rather than the peak amplitudes of the pulses on the I- and Q-channels is a fundamental concept in the receiver.

Figure 2.6: A visualization of the phase shift between subsequent samples when sampling is done with optimal timing. [Plot: the I- and Q-channel waveforms versus t, with the samples r_m to r_m+7 taken at the pulse peaks and the phase shifts θ_m to θ_m+6 between them.]

An important note is that one cannot directly decode the sequence of phase shifts (θ_m), m ∈ N, and obtain the chip sequences in Table 2.1. An intermediate coding step is required. By looking at Figure 2.6 it becomes clear that the sample sequence r_m, r_m+1, r_m+2, r_m+3 = (1 + 0j), (0 + 1j), (−1 + 0j), (0 + 1j) gives the binary chip sequence 1101, while the phase shifts in between the samples, θ_m, θ_m+1, θ_m+2 = 90°, 90°, −90°, would give the binary “pseudo-chip” sequence 110. One can move between the two interpretations, chips and pseudo-chips, by using a trellis diagram. In Figure 2.7 the accumulated phases shown in the nodes correspond to the phases of the complex baseband sample sequence (r_m), m ∈ N (the chips), and the paths in between correspond to the sequence of phase shifts (θ_m), m ∈ N (the pseudo-chips).
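The relation between chips and pseudo-chips can be sketched in Python by forming the ideal peak samples and looking at the phase step between neighbours. The function names are hypothetical and the samples are noise-free, so this only illustrates the mapping and is not a receiver.

```python
def chips_to_samples(chips):
    """Ideal samples at the pulse peaks: even chips on I, odd chips on Q."""
    samples = []
    for k, chip in enumerate(chips):
        level = 1.0 if chip == "1" else -1.0
        samples.append(complex(level, 0.0) if k % 2 == 0 else complex(0.0, level))
    return samples


def pseudo_chips(samples):
    """Map each phase step between consecutive samples to a pseudo-chip.

    A step of +90 degrees gives 1 and a step of -90 degrees gives 0.
    """
    out = []
    for prev, cur in zip(samples, samples[1:]):
        rotation = cur * prev.conjugate()  # imaginary part is +1 or -1
        out.append("1" if rotation.imag > 0 else "0")
    return "".join(out)
```

Applied to the chips 1101, this gives the pseudo-chips 110, as in the example above. A sequence of n chips yields n − 1 pseudo-chips because the phase step into the first sample depends on whatever came before it.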

[Figure 2.7: a trellis whose nodes are the accumulated phases 0°, 90°, 180° and 270°, connected by +90° and −90° transitions.]

Figure 2.7: The trellis for moving between the OQPSK and MSK interpretations of the signal.


In [29] and [30], an OQPSK receiver that uses this MSK-OQPSK equivalence is discussed in more detail. The core concept of the architecture is to extract the phase difference between subsequent samples instead of tracking the I- and Q-parts individually. The design is further simplified in [30], which eliminates the need for a trellis decoder by simply identifying which pseudo-chip sequences correspond to which symbol. Since there are only sixteen symbols, this is a feasible task. Table 2.2 presents the mapping between the symbols and pseudo-chip sequences. Note that the first pseudo-chip in each sequence is always unknown since there is no way of knowing what the previous sample was; the shift came either from a noise sample or from the last sample of another sequence, which could be either 0 + 1j or 0 − 1j. This is no major problem since the pseudo-chip sequences are still easily distinguishable from each other even when the first pseudo-chip is ignored, and a simple table look-up can be used to decode the pseudo-chip sequences directly into bit chunks.

Table 2.2: Mapping table between the binary data and the pseudo-chip sequences.

Chunk (b0b1b2b3)  Symbol  Pseudo-Chip Sequence (c'_0 c'_1 ... c'_31)
0000              0       x1100000011101111010111001101100
1000              1       x1001110000001110111101011100110
0100              2       x1101100111000000111011110101110
1100              3       x1100110110011100000011101111010
0010              4       x0101110011011001110000001110111
1010              5       x1111010111001101100111000000111
0110              6       x1110111101011100110110011100000
1110              7       x0000111011110101110011011001110
0001              8       x0011111100010000101000110010011
1001              9       x0110001111110001000010100011001
0101              10      x0010011000111111000100001010001
1101              11      x0011001001100011111100010000101
0011              12      x1010001100100110001111110001000
1011              13      x0000101000110010011000111111000
0111              14      x0001000010100011001001100011111
1111              15      x1111000100001010001100100110001

2.6.1 Common Impairments

An important aspect of successfully receiving a transmission is being able to recover the signal from three common impairments:

• Carrier phase offset (CPO) - Caused by mismatches in phase of the respective oscillators that the sender and receiver use to move the signal between the IF and the passband, or vice versa.

• Carrier frequency offset (CFO) - Caused by mismatches in frequency of the respective oscillators that the sender and receiver use to move the signal between the IF and the passband, or vice versa.

• Sample timing offset (STO) - Caused by mismatches in phase of the respective clocks that the sender and receiver use to move the signal from the IF to the baseband (BB), or vice versa.

The following sections present the mathematical models for these impairments.

Carrier Phase Offset

Consider the complex sample at the IF in the transmitter, S_n. It gets upconverted to the passband by an oscillator with frequency ω_TX and is then sent over the air to the receiver. The passband signal can be written as follows:

$$S_{n,\mathrm{pass}} = e^{j\omega_{TX} t} S_n, \qquad nT_{\mathrm{pass}} \le t < (n+1)T_{\mathrm{pass}},$$

where T_pass is the duration of a sample at the IF. When the signal arrives at the receiver antenna it is downconverted to the IF by an oscillator with frequency ω_RX = ω_TX. However, the receiver oscillator also has a phase offset φ with respect to the transmitter oscillator. The sample, s_n, at the IF in the receiver therefore becomes:

$$s_n = \mathrm{LPF}\left\{ e^{j(\omega_{RX} t + \phi)} e^{j\omega_{TX} t} S_n \right\} = e^{j\phi} S_n, \qquad nT_{\mathrm{pass}} \le t < (n+1)T_{\mathrm{pass}}. \tag{2.1}$$

The resulting impairment on the signal is that the received sample at the IF, s_n, is a rotated version of the transmitted sample at the IF, S_n. Note that any time delays between the transmitted and the received signal, such as propagation delay, can be incorporated into φ as well, which is why this model works without assuming any time delay between the two devices.

Carrier Frequency Offset

The CFO case is very similar to the CPO impairment. Consider a case where ω_RX ≠ ω_TX. For simplicity it can be assumed that φ = 0. Equation 2.1 then becomes:

$$s_n = \mathrm{LPF}\left\{ e^{j\omega_{RX} t} e^{j\omega_{TX} t} S_n \right\} = e^{j(\omega_{RX} - \omega_{TX}) t} S_n = e^{j\omega_\Delta t} S_n, \qquad nT_{\mathrm{pass}} \le t < (n+1)T_{\mathrm{pass}}.$$

The resulting impairment on the signal is that the received sample at the IF, sn, is a rotated version of Sn that is changing over time. Again, any time delays can be accounted for by adjusting the value of φ.
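The difference between the two carrier impairments can be made concrete with a small numerical sketch. It applies only the final expressions of the two models (a fixed rotation e^{jφ} for CPO, a growing rotation e^{jω_Δ t} for CFO) and assumes that t is evaluated at n·T_pass. The function names, the 10 kHz offset and the 45° phase are illustrative assumptions, not values from the thesis.

```python
import cmath
import math

def apply_cpo(samples, phi):
    """Rotate every sample by the same angle phi (radians)."""
    return [cmath.exp(1j * phi) * s for s in samples]

def apply_cfo(samples, omega_delta, t_pass):
    """Rotate sample n by omega_delta * n * t_pass (t taken as n * T_pass)."""
    return [cmath.exp(1j * omega_delta * n * t_pass) * s
            for n, s in enumerate(samples)]

T_PASS = 1 / 200e6                 # assumed 200 MHz sample rate at the IF
OMEGA_DELTA = 2 * math.pi * 10e3   # assumed 10 kHz frequency offset

sent = [1 + 0j] * 4
cpo = apply_cpo(sent, math.pi / 4)
cfo = apply_cfo(sent, OMEGA_DELTA, T_PASS)

# CPO gives the same rotation for every sample, CFO a rotation that grows with n.
print([round(math.degrees(cmath.phase(s)), 3) for s in cpo])
print([round(math.degrees(cmath.phase(s)), 6) for s in cfo])
```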

Sample Timing Offset

Consider the pulse-shaped chip in Figure 2.8a. No assumption needs to be made about whether it is a chip on the I- or Q-channel. It consists of six samples at the IF, s_n, s_n+1, ..., s_n+5. However, only one of these samples is needed in order to obtain the BB version of the chip, r_m. Throughout this thesis, the problem of choosing one BB-sample, r_m, out of K IF-samples, s_n, s_n+1, ..., s_n+(K−1), so as to maximize the SNR will be referred to as the STO problem. It should not be confused with the problem of choosing an optimal sampling instant for the IF-samples. While that is also an important problem in some applications, it is not dealt with in this thesis since the USRP platform provides no easy means of controlling the sampling clock that is driving the ADC. Furthermore, if K is large, there is little need for adjusting the ADC sampling clock. Figure 2.8b illustrates how the IF-signal may look different in the digital domain depending on the phase difference between the transmitter's and receiver's respective sampling clocks. Note that high-amplitude IF-samples that make a good choice for r_m are still available in both cases, even with a relatively small number of samples per chip, K = 6.

When downconverting from the IF to the BB, an offset p has to be found when selecting every K-th sample so that r_m = s_{Km+p} has an amplitude that is as high as possible, thus maximizing the SNR. In the case of Figure 2.8a, p = 3 would make a good choice that extracts a high-amplitude sample from every chip to make up a noise-resistant BB-signal, while p = 0 would yield a BB-signal of low-amplitude samples that can easily be misinterpreted under noisy conditions.
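The STO problem as defined here can be sketched in a few lines of Python. The sketch builds half-sine chips with K = 6 samples each and searches exhaustively for the offset p with the highest mean amplitude. The exhaustive search is a simplification chosen for illustration; it is not how the receiver in this thesis finds p, and all function names and the clock_phase parameter are hypothetical.

```python
import math

K = 6  # IF-samples per chip, as in Figure 2.8

def half_sine_chips(chips, k=K, clock_phase=0.0):
    """Half-sine pulses sampled k times per chip.

    clock_phase (0 <= clock_phase < 1) shifts the sampling instants by a
    fraction of one sample period, mimicking a different sampling clock phase.
    """
    samples = []
    for chip in chips:
        sign = 1.0 if chip else -1.0
        for i in range(k):
            samples.append(sign * math.sin(math.pi * (i + clock_phase) / k))
    return samples

def mean_amplitude(samples, p, k=K):
    """Mean amplitude of the BB-samples r_m = s_{Km+p}."""
    picked = samples[p::k]
    return sum(abs(x) for x in picked) / len(picked)

def best_offset(samples, k=K):
    """Offset p in 0..k-1 that maximizes the mean BB-sample amplitude."""
    return max(range(k), key=lambda p: mean_amplitude(samples, p, k))

chips = [1, 0, 1, 1]
aligned = half_sine_chips(chips)
shifted = half_sine_chips(chips, clock_phase=0.25)
print(best_offset(aligned), best_offset(shifted))
```

For the aligned case the search returns p = 3, the pulse peak, while p = 0 would select the zero crossings.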

[Figure 2.8, two panels plotted against time t, each showing the IF-samples s_n, ..., s_n+5 of one pulse-shaped chip: (a) A chip consisting of six samples at the IF. (b) The same chip, but a different sampling clock phase.]

Figure 2.8: Two discrete-time representations of the same chip, but with different sampling clock phases.


2.7 Related Work

This section brings up some notable research projects that the work behind this thesis relates to in some way.

2.7.1 RFNoC

In the context of this thesis, RFNoC is one of the main development tools. However, it is also a research project in its own right. Largely based on the VITA-49 standard [31] for packetizing and handling RF samples and the AXI bus architecture [32], RFNoC has been an integral part of the work done at Ettus Research since their third generation of devices. In [33], Malsbury and Ettus introduce RFNoC as a core architectural component in the then-new USRP systems. The technology is then extended to a more general development framework and described in [34] by Pendlum and Braun.

2.7.2 Wime

Wime is a research project that has made a number of SDR-based implementations using GNU Radio [35]. Perhaps most relevant in the context of SDR for IoT is their implementation of and testbed for IEEE 802.15.4 [36] [37]. It uses a mix of existing GNU Radio libraries and custom libraries to build a full network stack with IEEE 802.15.4 as the bottom layer, a networking layer called Rime [38] in the middle and UDP sockets for the top layer, serving as an interface to some hypothetical application. Additionally, the Wime project provides a number of utilities for testing and debugging IEEE 802.15.4-based systems. Examples include randomized PDU sources for generating dummy traffic and integration with tools for packet dissection.

2.7.3 Other SDR setups

While the host-PRD setup offers great flexibility and ease-of-use at a relatively low cost, there are SDR setups available for users who are willing to sacrifice these perks to get higher performance. In [39][40], the architecture for a system-on-chip (SoC) implementation of the SDR concept is presented. The system, called RFSoC, mainly targets 5G base station applications, which require a lot of flexibility, and is said to be able to provide the required performance.

Another more tightly integrated implementation is a system called KUAR [41]. While not as SoC-based as RFSoC, it is still FPGA-based and achieves a tight CPU-DSP integration by embedding processors on the FPGA (called soft-processors). Signal processing tasks can be shared by pure DSP logic and FPGA-instantiated CPUs. Higher-level tasks, such as networking stacks, can be delegated to a processor external to the FPGA.


Chapter 3

Method

To test the performance of the RFNoC-based IEEE 802.15.4 implementation, the system will be studied in a quantitative sense. Two tests will be carried out in order to measure the most important performance indicators of the system.

• Rate of successful reception, how many PDUs can be successfully passed to the MAC layer.

• Round-trip time (RTT), the time it takes for the system to reply to a specific transmission.

Two tests were designed to measure the two indicators. The Frame Error Rate test is described in Section 3.1 and used to test the rate of successful reception and the Loopback Test for measuring the RTT is described in Section 3.2. A description of the software and hardware used during development and testing can be found in Section 3.3 and Section 3.4, respectively. Lastly, some results from the HDL simulation that is part of the HDL development workflow are presented in order to shed some light on the internal workings of the FPGA implementation. Section 3.5 briefly explains how those results were obtained.

3.1 Frame Error Rate Test

Perhaps the most fundamental property of a wireless receiver is the rate of successfully receiving a transmitted frame. In this test a frame will be considered as successfully received if the error detecting code in the MAC footer (see Figure 2.2) detects no error. For completeness, a number of different cases will be investigated. The parameters to be varied over the cases are the inter-frame spacing (IFS), the delay between two frames, and the PSDU size in bytes. Each case consists of 1000 transmitted frames and all combinations of the following parameter values will be investigated:

• PSDU sizes 16 B, 60 B and 127 B.

• IFS delays ranging from 100 ms down to 10 ms in decrements of 10 ms.
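The resulting test matrix can be enumerated with a short sketch. The constant names are hypothetical; the values are the ones listed above.

```python
from itertools import product

PSDU_SIZES_BYTES = (16, 60, 127)
IFS_DELAYS_MS = tuple(range(100, 0, -10))   # 100, 90, ..., 10
FRAMES_PER_CASE = 1000

# Every combination of PSDU size and IFS delay is one test case.
cases = list(product(PSDU_SIZES_BYTES, IFS_DELAYS_MS))
print(len(cases))                     # 30
print(len(cases) * FRAMES_PER_CASE)   # 30000
```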

3.2 Loopback Test

In order to successfully implement the IEEE 802.15.4 standard, the system must fulfill certain timing requirements. That is, it must be able to respond to certain transmissions within a specific time limit. Figure 3.1 depicts the scenario used in the test. All devices in the test, the device under test (DUT), the request generator and the monitor, consist of a host-PRD pair. The testbed presented in [36] is used in the generator to generate requests and in the monitor to timestamp detected frames and store them in .pcap-format [42]. The timestamps in the .pcap-file in the host part of the monitor are then compared, and the time delay between a request frame and the corresponding response frame gives an estimate of the RTT of the DUT. In the DUT, frames are received and the type is checked. If the type matches a predetermined type, a 34-byte response is transmitted.

The same cases of IFS and PSDU sizes as in Section 3.1 will be tested. A total of 1000 requests will be generated and the first 500 request-response pairs detected by the monitor will be used for further analysis.
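The RTT estimation step can be sketched as follows. The sketch assumes that the capture has already been parsed into (timestamp, kind) tuples; the parsing of the .pcap-file itself, the tuple layout and the function name are all hypothetical and not taken from the testbed. A request that is never answered is simply dropped when the next request arrives.

```python
def round_trip_times(frames, limit=500):
    """Estimate RTTs from frames given as (timestamp_s, kind) in capture order.

    kind is "request" or "response". Each response is paired with the most
    recent unanswered request; at most `limit` pairs are returned.
    """
    rtts = []
    pending = None  # timestamp of the latest request without a response
    for timestamp, kind in frames:
        if kind == "request":
            pending = timestamp
        elif kind == "response" and pending is not None:
            rtts.append(timestamp - pending)
            pending = None
            if len(rtts) == limit:
                break
    return rtts

# Made-up timestamps for illustration; the request at 0.100 s gets no response.
capture = [
    (0.000, "request"), (0.004, "response"),
    (0.100, "request"),
    (0.200, "request"), (0.203, "response"),
]
print(round_trip_times(capture))
```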

[Figure 3.1: the Generator sends a Request to the DUT, the DUT answers with a Response, and the Monitor observes both.]

Figure 3.1: The Loopback Test consists of a request generator device, a DUT that responds to the requests and a monitor device that sniffs the traffic and timestamps all detected frames.

3.3 Software Setup

The software that was used during development and testing is listed in Table 3.1. The GNU Radio block diagrams that implement the tests are found as examples in the repository for this thesis, called rfnoc-zluudgbee [43].

3.4 Hardware Setup

Table 3.2 lists the hardware that was used for the tests. Note that the Monitor device in the Loopback Test was implemented on a different host PC than in all the other cases.

3.5 HDL Simulation

In the typical HDL/FPGA development workflow, simulations play an important part. It is often very hard to debug an FPGA design since most signals are internal to the IC package and thus cannot be studied with external measurement tools. Therefore, many HDL development tools include some means of simulating the code so that hard-to-reach internal signals can be easily studied in, for example, a GUI. Vivado has an integrated simulator that can be invoked either via the build system that RFNoC provides or by creating a standalone Vivado project. The RFNoC-based simulation flow uses SystemVerilog and provides a means to co-simulate multiple RFNoC components. However, for this project, the "standalone" approach was chosen because of the author's familiarity with the "pure" Vivado design flow and with writing VHDL testbenches.

To obtain simulation results for this project, input stimuli for the testbenches are generated using MATLAB and written to file. The file is then read by the VHDL testbench before starting the simulation. The output signals that result from the stimuli are then analyzed by the testbench after the simulation is concluded and any internal signals that the user chose to record can be studied in the Vivado GUI.

In Section 5.0.3, a handful of important internal signals are discussed and the note made in Section 5.0.4 is based on observations made during simulations.

Table 3.1: A listing of the software that was used during development and testing.

Name             Branch       Commit   Link
Ubuntu 18.04     -            -        releases.ubuntu.com/
Vivado 2017.4    -            -        xilinx.com/support/download.html
Wireshark        master-2.4   cf801a2  github.com/wireshark/wireshark
uhd              rfnoc-devel  eec24d7  github.com/ettusresearch/uhd
gnuradio         maint-3.7    9e04b27  github.com/gnuradio/gnuradio
gr-ettus         master       e0d2b91  github.com/ettusresearch/uhd
gr-foo           maint-3.7    a2d8670  github.com/bastibl/gr-foo
gr-ieee802-15.4  maint-3.7    d3d9402  github.com/bastibl/ieee02-15-4
rfnoc-zluudgbee  master       2afa110  github.com/zluudg/rfnoc-zluudgbee

Table 3.2: A listing of the hardware that was used during development and testing.

Component      Name                                        Note
CPU            Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz    -
NIC            Intel(R) 82599ES                            1GE mode
CPU (Monitor)  Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz    -
NIC (Monitor)  RTL8111/8168/8411 PCIe GE Controller
SDR            USRP X310                                   -
FPGA           xc7k410tffg900-2                            Part of X310


Chapter 4

Implementation

This chapter studies the two main parts of IEEE 802.15.4 that were implemented directly on the FPGA inside the PRD in order to offload the PRD-host link. The first and most complex part is a PHY receiver block, which was based on the architecture described in [29] and [30]. The second part is a CRC-CCITT [12] block for error checking.

RFNoC allows input and output word widths of 32 bits and they are read from or written to the RFNoC crossbar using AXI4-Stream (AXI4-S) [32] interfaces. An AXI4-S interface is either a slave (input interface) or a master (output interface) and consists of the following four signals:

• valid - A 1-bit wide signal driven by a master and read by a slave. If this signal is asserted on a rising edge of the slave's clock, the slave should sample the data signal.

• data - A 32-bit wide signal that is driven by a master and read by a slave. This signal represents the actual data.

• ready - A 1-bit wide signal driven by a slave and read by a master. When the master sees this signal and valid asserted during a rising edge of its clock, a successful transaction has been carried out.

• last - A 1-bit wide signal driven by a master and read by a slave. If a successful transaction has occurred and this signal is asserted, it marks the end of an AXI burst.
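The interplay of the four signals can be illustrated with a cycle-by-cycle software model. This is a behavioral sketch only, not HDL and not part of RFNoC; the function name and the assumption that the master always has data available while words remain are made up for illustration.

```python
def transfer_burst(words, ready_per_cycle):
    """Model one master sending `words` as a single AXI4-Stream burst.

    ready_per_cycle holds the slave's ready signal at each rising clock edge.
    Returns (received_words, cycles_used, burst_complete).
    """
    received = []
    index = 0
    cycles = 0
    for ready in ready_per_cycle:
        if index == len(words):
            break                        # nothing left to send, valid stays low
        cycles += 1
        valid = True                     # assumed: data is always available
        data = words[index]
        last = index == len(words) - 1   # asserted together with the final word
        if valid and ready:              # successful transaction
            received.append(data)
            index += 1
            if last:
                return received, cycles, True
        # Otherwise the master holds valid and data until the slave is ready.
    return received, cycles, False

# The slave stalls in the second cycle, so three words take four cycles.
print(transfer_burst([0x11, 0x22, 0x33], [True, False, True, True, True]))
```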

The concept of an AXI burst is important since RFNoC packetizes each AXI burst in a VITA-49 [31] packet, also called a CHDR packet, which is what the RFNoC crossbar handles in order to pass data between different blocks on the crossbar.

The AXI4-S is also the main means of communication between the internal blocks of the receiver.

4.1 IEEE 802.15.4 Receiver

Figure 4.1 shows an overview of the entire receiver. The receiver takes an endless stream of complex samples, s_n, as its input and outputs a PHY-level PSDU, as shown in Figure 2.2, upon detection of a frame. The input complex samples are 32-bit words with the upper and lower half being Q0.15 fixed-point representations of the I- and Q-parts, respectively. The output PSDU is implemented as an AXI burst of 32-bit words where the lower eight bits are used to carry information. The first word in the burst corresponds to the first byte in the PSDU and so on. The following subsections individually describe the parts of the receiver as they appear in Figure 4.1.
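The input word format can be sketched in Python as follows. The sketch assumes that each 16-bit half is a signed two's complement value with 15 fractional bits, which is one common reading of Q0.15; the function names are hypothetical.

```python
def _to_signed16(value):
    """Interpret a 16-bit field as a two's complement integer."""
    return value - 0x10000 if value & 0x8000 else value

def unpack_sample(word):
    """Split a 32-bit word into a complex sample (I in the upper half, Q in the lower)."""
    i_raw = (word >> 16) & 0xFFFF
    q_raw = word & 0xFFFF
    return complex(_to_signed16(i_raw) / 32768, _to_signed16(q_raw) / 32768)

def pack_sample(sample):
    """Inverse of unpack_sample, saturating at the edges of the Q0.15 range."""
    def to_field(x):
        scaled = max(-32768, min(32767, round(x * 32768)))
        return scaled & 0xFFFF
    return (to_field(sample.real) << 16) | to_field(sample.imag)

print(unpack_sample(0x4000C000))           # (0.5-0.5j)
print(hex(pack_sample(complex(0.5, -0.5))))  # 0x4000c000
```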


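[Figure 4.1: block diagram of the receiver chain. The input samples s_n pass through the Chip Synchronizer, the Preamble Detector, the Chip Sequence Demapper and the Frame Packager, which outputs the PSDU. A "Clear Detection" signal is also shown.]

Figure 4.1: An overview of the entire receiver.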

4.1.1 Chip Synchronizer

The chip synchronizer’s main purpose is to deal with the impairments discussed in Section 2.6.1. It is the most complex part of the receiver and the performance of the receiver is largely dependent on the performance of the synchronizer. In the targeted system, the USRP X310, the sample rate at the IF is 200 MHz. The IEEE 802.15.4 standard specifies a bit rate of 250 kbit/s and 32 chips per 4 bits. This translates into a chip rate of

$$\frac{32}{4} \cdot 250\ \text{kbit/s} = 2\ \text{MHz}.$$

Given that two chips are transmitted in parallel on the I- and Q-channels (albeit with an offset) the pulse-shaped chips have a duration

$$T_s = \frac{2}{2\ \text{MHz}} = 1\ \mu\text{s}.$$

In terms of IF-samples in the X310 system, the chips have a duration of 200 samples. To downconvert the signal to the BB, one might therefore expect that every 200th sample on each of the I- and Q-channels should be selected. In the case of OQPSK, however, this is not sufficient. Since the I- and Q-channels are offset by T_s/2, or 100 IF-samples, every 100th sample must be selected in order to reconstruct the BB signal. From Figure 2.6 it should be clear why: every 200th IF-sample corresponds to an even-indexed chip (the I-channel) and every 200th IF-sample corresponds to an odd-indexed chip (the Q-channel), with an offset of 100 IF-samples between the two.
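The arithmetic above can be reproduced in a few lines. The constant names are hypothetical; the numbers are the ones stated in this section.

```python
IF_SAMPLE_RATE_HZ = 200_000_000   # sample rate at the IF in the USRP X310
BIT_RATE_BPS = 250_000            # IEEE 802.15.4 bit rate
CHIPS_PER_SYMBOL = 32
BITS_PER_SYMBOL = 4

chip_rate_hz = CHIPS_PER_SYMBOL * BIT_RATE_BPS // BITS_PER_SYMBOL
# Chips alternate between the I- and Q-channel, so each channel carries
# every second chip and a pulse lasts two chip periods.
pulse_rate_hz = chip_rate_hz // 2
samples_per_pulse = IF_SAMPLE_RATE_HZ // pulse_rate_hz
decimation = samples_per_pulse // 2   # I and Q are offset by half a pulse

print(chip_rate_hz)        # 2000000
print(samples_per_pulse)   # 200
print(decimation)          # 100
```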

Figure 4.2 illustrates the FPGA implementation of the chip synchronizer. IF-samples, s_n, enter the decimator and a mod 100-counter inside selects every 100th sample for output, r_m = s_{100m+p}. Additionally, the decimator always outputs the complex conjugate of the IF-sample that is exactly 100 samples behind, denoted r'_m. That is, r'_m = s_{100(m−1)+p}. Lastly, the decimator has a 4-level input u. Depending on the state of u, when the mod 100-counter wraps around, the offset p may be incremented by 1, 2 or 3 steps, allowing the decimator to gradually adjust its offset until the optimal value for p is found. Due to this gradual adjusting of the offset, the samples used as BB-samples may not always be equally spaced, which is why the notation r'_m is used to refer to the IF-sample that is exactly 100 IF-samples behind r_m, while r_m−1 may in fact be 99, 98 or 97 IF-samples behind r_m. When the synchronizer is tracking the phase difference θ_m between BB-samples, it is actually tracking the phase difference between r_m and r'_m. However, when it has locked on to the optimal value for p, r_m−1 = r'_m will be true. The gradual adjustment of p is made possible by the SHR field in Figure 2.2, which contains no data. It therefore does not matter if chips get misinterpreted in the SHR while p is being adjusted.


As mentioned in Section 2.6, the receiver tracks phase differences between BB-samples. The product of r_m and (r'_m)^* is calculated and its phase extracted, thereby obtaining the phase difference, θ_m, between r_m and r'_m, which then gets sent to the subsequent block. Under CPO and CFO, the output from the phase extracting block becomes:

$$\arg\{Z r_m (W r'_m)^*\} = \arg\{Z s_{100m+p} (W s_{100(m-1)+p})^*\} = \arg\{Z\} + \arg\{s_{100m+p}\} - \arg\{W\} - \arg\{s_{100(m-1)+p}\} \tag{4.1}$$

where the impairments due to CPO and CFO, Z and W, are:

$$Z = e^{j(\omega_\Delta (100m+p) T_{\mathrm{pass}} + \phi)} \tag{4.2}$$

and

$$W = e^{j(\omega_\Delta (100(m-1)+p) T_{\mathrm{pass}} + \phi)} \tag{4.3}$$

Substituting 4.2 and 4.3 into 4.1 yields:

$$(\omega_\Delta (100m+p) T_{\mathrm{pass}} + \phi) + \arg\{s_{100m+p}\} - \arg\{s_{100(m-1)+p}\} - (\omega_\Delta (100(m-1)+p) T_{\mathrm{pass}} + \phi). \tag{4.4}$$

As mentioned, θ_m is defined as the phase difference between r_m and r'_m in the synchronizer, and for a non-optimal choice of p, θ_m will be accompanied by an error that depends on p, δ_m(p). With this in mind, 4.4 can be written as:

$$\theta_m + \delta_m(p) + \omega_\Delta (100m+p) T_{\mathrm{pass}} + \phi - \omega_\Delta (100(m-1)+p) T_{\mathrm{pass}} - \phi = \theta_m + \delta_m(p) + 100\,\omega_\Delta T_{\mathrm{pass}} = \theta_m + \delta_m(p) + \omega_O \tag{4.5}$$

Thus, the output of the phase extractor is the sum of the pseudo-chip θ_m, an error δ_m(p) which is zero for an optimal choice of p, and a constant error ω_O due to CFO.
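The result in 4.5 can be checked numerically. The sketch below builds ideal BB-samples (so δ_m(p) = 0), applies the rotation from 4.2 and verifies that the phase extractor output differs from θ_m by the constant 100·ω_Δ·T_pass, independently of φ and p. The function name and the 10 kHz offset, the phase 0.7 rad and p = 3 are illustrative assumptions.

```python
import cmath
import math

T_PASS = 1 / 200e6   # IF sample period for the 200 MHz rate stated in the text
STEP = 100           # IF-samples between subsequent BB-samples

def phase_extractor_output(thetas, omega_delta, phi, p=0):
    """Return arg{r_m (r'_m)^*} for ideal BB-samples impaired by CPO and CFO.

    thetas are the ideal phase shifts theta_m in radians (+-pi/2 here).
    """
    outputs = []
    phase = 0.0   # accumulated ideal phase of the current BB-sample
    prev = cmath.exp(1j * (omega_delta * p * T_PASS + phi))  # impaired sample for m = 0
    for m, theta in enumerate(thetas, start=1):
        phase += theta
        rotation = omega_delta * (STEP * m + p) * T_PASS + phi   # arg{Z} in (4.2)
        cur = cmath.exp(1j * (phase + rotation))
        outputs.append(cmath.phase(cur * prev.conjugate()))
        prev = cur
    return outputs

thetas = [math.pi / 2, math.pi / 2, -math.pi / 2, math.pi / 2, -math.pi / 2]
omega_delta = 2 * math.pi * 10e3          # assumed 10 kHz frequency offset
outputs = phase_extractor_output(thetas, omega_delta, phi=0.7, p=3)

omega_o = STEP * omega_delta * T_PASS     # the constant error in (4.5)
residuals = [out - theta for out, theta in zip(outputs, thetas)]
print(omega_o, residuals)
```

Every residual equals ω_O up to rounding, which is the DC component that the IIR filter is meant to remove.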

This all gets fed into an infinite impulse response (IIR) filter which attempts to filter out the DC component ω_O and outputs an estimate of θ_m + δ_m(p), which then gets fed to the preamble detector. The IIR filter can be toggled off if CFO is not too much of a problem. Additionally, the chip synchronizer also contains a feedback loop, the dotted line in Figure 4.2. For an optimal choice of p, δ_m(p) = 0. The feedback loop calculates the magnitude of this error:

$$|\delta_m(p)| = \left|\, 90^\circ - |\theta_m| \,\right|$$

and then feeds the error into a moving average (MA) filter to smooth out the error signal, that is:

$$e_m(p) = \frac{1}{L} \sum_{i=0}^{L-1} |\delta_{m-i}(p)|$$

where L can be set by the user to 4, 8 or 16. e_m(p) is then compared to a user-settable threshold. Depending on how much the threshold is exceeded, u_m(p) is raised to signal that an increment of 1, 2 or 3 in p will be performed. As long as e_m(p) does not exceed the threshold, no increment is performed. Whenever an increment happens, the feedback loop is reset by clearing out the history in the MA filter.
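A software model of this feedback loop could look as follows. It is a sketch under stated assumptions, not the FPGA implementation: the class name is hypothetical, the loop is assumed to wait for a full window of L values before comparing, and the grading of the exceedance into steps of 1, 2 or 3 (here at one, two and three times the threshold) is invented for illustration since the text does not specify it.

```python
from collections import deque

class FeedbackLoop:
    """Moving-average error tracker that requests offset increments."""

    def __init__(self, length=8, threshold_deg=10.0):
        if length not in (4, 8, 16):
            raise ValueError("L must be 4, 8 or 16")
        self.history = deque(maxlen=length)
        self.threshold = threshold_deg

    def update(self, theta_deg):
        """Feed one phase estimate in degrees; return the shift request u (0..3)."""
        self.history.append(abs(90.0 - abs(theta_deg)))   # |delta_m(p)|
        if len(self.history) < self.history.maxlen:
            return 0   # assumed: no decision until the window is full
        error = sum(self.history) / len(self.history)     # e_m(p)
        if error <= self.threshold:
            return 0
        # Hypothetical grading of how much the threshold is exceeded.
        if error > 3 * self.threshold:
            u = 3
        elif error > 2 * self.threshold:
            u = 2
        else:
            u = 1
        self.history.clear()   # reset the MA filter after an increment
        return u

loop = FeedbackLoop(length=4, threshold_deg=10.0)
print([loop.update(t) for t in [45, 45, 45, 45, 75, 75, 75, 75, 90, -90, 90, -90]])
```

In the example, a large error first requests a step of 3, a smaller error then requests a step of 1, and once the input is a clean series of ±90° shifts no further increments are requested.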

In Section 2.6, it was discussed how the OQPSK signal can be viewed as a series of ±90° phase shifts when IF-to-BB downsampling is done with optimal offset p. By utilizing this fact in a feedback loop, the chip synchronizer gradually adjusts its offset until what it outputs is a series of ±90° phase shifts.


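[Figure 4.2: block diagram of the chip synchronizer. The IF-samples s_n enter the Decimator, which outputs r_m and (r'_m)^*. The Multiplier forms r_m(r'_m)^*, the Phase block extracts θ_m + δ_m(p) + ω_O and the Single-Pole IIR passes θ_m + δ_m(p) on to the Preamble Detector. In the feedback path, the Moving Average block produces e_m(p) and the Threshold block produces u_m(p).]

Figure 4.2: The general structure of the Chip Synchronizer.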

4.1.2 Preamble Detector

The preamble detector receives a sequence of phase angles as input from the synchronizer. The sign bit of every incoming phase angle is stored in a shift register. Only the sign bit is needed because when the synchronizer has locked on, the incoming phase angles should always be close to ±90°. The Hamming distance (HD) [44] between the contents of the shift register and the contents of a read-only memory (ROM) is calculated. The ROM contains the last two bytes of the Preamble field and the SFD field seen in Figure 2.2, encoded as pseudo-chips. When the sign bits in the shift register match the contents of the ROM (the HD between the two is below a certain threshold), a frame has been detected and the detector changes its internal state from SCANNING to FRAME FOUND. In the FRAME FOUND state, the detector buffers 32 incoming sign bits and outputs them as a single word to the demapper block. The detector state can be reset to SCANNING by a signal that is driven by a downstream block, the packager (shown as a dotted line in Figure 4.1). In the SCANNING state, no output is generated. As mentioned in Section 2.6, the pseudo-chips marked with x in Table 2.2 cannot be known and are therefore ignored when calculating the HD.
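The detector's behavior can be modelled in software as shown below. This is a behavioral sketch, not the HDL: the class and method names are hypothetical, the reference pattern in the example is a made-up 8-bit value rather than the actual ROM contents, a positive phase is assumed to map to bit 1, and the first buffered bit is assumed to end up as the most significant bit of the output word.

```python
SCANNING, FRAME_FOUND = "SCANNING", "FRAME FOUND"

def hamming_distance(a, b, mask):
    """Number of differing bits between a and b, counting only bits set in mask."""
    return bin((a ^ b) & mask).count("1")

class PreambleDetector:
    def __init__(self, pattern, width, mask, threshold, word_len=32):
        self.pattern, self.width, self.mask = pattern, width, mask
        self.threshold, self.word_len = threshold, word_len
        self.reset()

    def reset(self):
        """Return to SCANNING, as triggered by the packager."""
        self.state = SCANNING
        self.shift = 0
        self.seen = 0
        self.buffer = []

    def push(self, phase_deg):
        """Feed one phase angle; returns a word when one is complete, else None."""
        bit = 1 if phase_deg > 0 else 0
        if self.state == SCANNING:
            self.shift = ((self.shift << 1) | bit) & ((1 << self.width) - 1)
            self.seen += 1
            full = self.seen >= self.width
            if full and hamming_distance(self.shift, self.pattern, self.mask) < self.threshold:
                self.state = FRAME_FOUND
            return None
        self.buffer.append(bit)
        if len(self.buffer) == self.word_len:
            word = int("".join(str(b) for b in self.buffer), 2)
            self.buffer = []
            return word
        return None

# Hypothetical 8-bit pattern; the received copy has its last bit flipped.
detector = PreambleDetector(pattern=0b10110010, width=8, mask=0xFF, threshold=2)
for bit in [1, 0, 1, 1, 0, 0, 1, 1]:
    detector.push(90 if bit else -90)
print(detector.state)
```

The mask argument plays the role of ignoring the unknown pseudo-chips: a cleared mask bit removes that position from the distance.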

4.1.3 Chip Sequence Demapper

The demapper takes as its input 32-bit words that represent the pseudo-chip sequences seen in Table 2.2 and outputs the corresponding bit chunks. This is not as simple as an exact table look-up, however. Even though a suitable value for p has been found, δ_m(p) may still be nonzero in practical scenarios and the constant error due to CFO, ω_O, may not be completely removed by the IIR filter. This, along with the presence of noise, results in occasional misinterpreted pseudo-chips in the 32-bit input word, although the pseudo-chip sequences are still discernible from one another. Therefore, the demapper calculates the HD between the input word and every pseudo-chip sequence (the pseudo-chips marked x are ignored) and outputs the bit chunk whose pseudo-chip sequence yields the shortest HD. Lastly, if the lowest HD is above a certain threshold, a flag will be placed on the output chunk to indicate that the demapping was a poor one.
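The minimum-distance demapping can be sketched with the sequences from Table 2.2. The sketch works on strings of '0' and '1' for readability, whereas the FPGA block works on 32-bit words; the function names and the value of the "poor demapping" threshold are hypothetical.

```python
# Pseudo-chip sequences from Table 2.2, indexed by symbol; 'x' marks the unknown chip.
PSEUDO_CHIPS = [
    "x1100000011101111010111001101100", "x1001110000001110111101011100110",
    "x1101100111000000111011110101110", "x1100110110011100000011101111010",
    "x0101110011011001110000001110111", "x1111010111001101100111000000111",
    "x1110111101011100110110011100000", "x0000111011110101110011011001110",
    "x0011111100010000101000110010011", "x0110001111110001000010100011001",
    "x0010011000111111000100001010001", "x0011001001100011111100010000101",
    "x1010001100100110001111110001000", "x0000101000110010011000111111000",
    "x0001000010100011001001100011111", "x1111000100001010001100100110001",
]

def distance(word, reference):
    """Hamming distance over the last 31 positions; the first one is ignored."""
    return sum(a != b for a, b in zip(word[1:], reference[1:]))

def demap(word, poor_threshold=8):
    """Return (chunk b0b1b2b3, distance, poor_flag) for a 32-character bit string."""
    distances = [distance(word, ref) for ref in PSEUDO_CHIPS]
    symbol = min(range(len(PSEUDO_CHIPS)), key=distances.__getitem__)
    chunk = format(symbol, "04b")[::-1]   # b0 is the least significant bit
    return chunk, distances[symbol], distances[symbol] > poor_threshold

# Symbol 5 with one misinterpreted pseudo-chip at position 10.
clean = "0" + PSEUDO_CHIPS[5][1:]
noisy = clean[:10] + ("1" if clean[10] == "0" else "0") + clean[11:]
print(demap(noisy))
```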

4.1.4 Frame Packager

Up until this point, no real effort has been made to neatly package the data stream. The frame packager's job is to take the input bit chunks, which arrive at a very low rate compared to the FPGA clock rate, and assemble them into bytes that are buffered and then quickly bursted out to create a CHDR packet containing the PSDU, as seen in Figure 2.2. The packager is essentially the finite state machine (FSM) seen in Figure 4.3. Starting in the IDLE state, transitions in the FSM are triggered by the arrival of a 4-bit chunk. First, the lower chunk of the PHR arrives, followed by the upper chunk. The value in the PHR represents the number of bytes following and gets stored as an internal variable. Then every payload byte is processed, lower chunk first, and the internal variable is decremented to keep track of how many bytes remain of the payload. The assembled payload bytes that are ready for output get stored in an output buffer. When no bytes remain to be processed, the FSM signals to reset the state of the preamble detector, flushes the output buffer to produce an AXI burst that RFNoC receives and then returns to its IDLE state.
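The packager FSM can be modelled in software roughly as follows. This is a behavioral sketch, not the HDL: the class name is hypothetical, the state names describe which chunk is expected next, and the reset signal to the preamble detector is only indicated by a comment.

```python
class FramePackager:
    """Assembles 4-bit chunks (lower chunk first) into the bytes of one PSDU."""

    def __init__(self):
        self.state = "IDLE"
        self.low = 0          # lower chunk of the byte being assembled
        self.remaining = 0    # payload bytes left, the variable X in Figure 4.3
        self.buffer = []

    def push(self, chunk):
        """Feed one 4-bit chunk; returns the PSDU as a list of bytes when complete."""
        chunk &= 0xF
        if self.state == "IDLE":                 # lower chunk of the PHR
            self.low = chunk
            self.state = "PHR_UPPER"
        elif self.state == "PHR_UPPER":          # upper chunk of the PHR
            self.remaining = (chunk << 4) | self.low
            self.buffer = []
            self.state = "PAYLOAD_LOWER" if self.remaining > 0 else "IDLE"
        elif self.state == "PAYLOAD_LOWER":
            self.low = chunk
            self.state = "PAYLOAD_UPPER"
        else:                                    # PAYLOAD_UPPER
            self.buffer.append((chunk << 4) | self.low)
            self.remaining -= 1
            if self.remaining == 0:
                # Here the hardware would also reset the preamble detector.
                self.state = "IDLE"
                return self.buffer               # flushed as one burst
            self.state = "PAYLOAD_LOWER"
        return None

packager = FramePackager()
# PHR = 2, followed by the payload bytes 0xAB and 0xCD, lower chunks first.
results = [packager.push(c) for c in [0x2, 0x0, 0xB, 0xA, 0xD, 0xC]]
print(results[-1])   # [171, 205]
```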

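[Figure 4.3: state diagram with the states IDLE, PHR LOWER, PHR UPPER, PAYLOAD LOWER and PAYLOAD UPPER. The labelled transitions are the action X ← PHR, the condition X > 0 with the action X−−, and the condition X = 0.]

Figure 4.3: The FSM used to restructure incoming bit chunks. Conditionals for transitions are written in blue where applicable. Actions taken upon transitions are written in red.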

4.2 CRC-16

As mentioned in Section 2.2.1, the IEEE 802.15.4 standard specifies the usage of a CRC in order to detect misinterpreted frames. The CRC block implemented with RFNoC expects AXI bursts as input, with each word in the burst being a payload byte from the PSDU field in Figure 2.2. If the CRC is valid, the output is a similar burst but with the MHR field removed. If the CRC is not valid, the block generates no output.

Many CRCs and related algorithms can often be efficiently implemented in digital hardware using simple bitwise operations [45]. In order to be able to handle the quick AXI-bursts, however, this implementation uses the technique presented in [46] to implement the CRC with byte-wise operations and table look-ups instead.
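A byte-wise, table-driven CRC of the kind described here can be sketched as follows. This is a generic textbook formulation and not necessarily the exact technique of [46] or the RFNoC block. The parameters are assumptions: the generator polynomial x^16 + x^12 + x^5 + 1 in its bit-reversed form 0x8408, an initial value of zero, and least-significant-bit-first processing with the FCS appended low byte first. All function names are hypothetical.

```python
POLY_REFLECTED = 0x8408   # x^16 + x^12 + x^5 + 1, bit-reversed (assumed parameters)

def _make_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ POLY_REFLECTED if crc & 1 else crc >> 1
        table.append(crc)
    return table

CRC_TABLE = _make_table()

def crc16(data, crc=0x0000):
    """Byte-wise CRC: one table look-up, one shift and two XORs per byte."""
    for byte in data:
        crc = (crc >> 8) ^ CRC_TABLE[(crc ^ byte) & 0xFF]
    return crc

def crc16_bitwise(data, crc=0x0000):
    """Reference implementation using bitwise operations only."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ POLY_REFLECTED if crc & 1 else crc >> 1
    return crc

def frame_is_valid(psdu):
    """With these parameters, a frame that includes its FCS has a CRC of zero."""
    return len(psdu) >= 2 and crc16(psdu) == 0

payload = bytes([0x01, 0x02, 0x03, 0x04])
fcs = crc16(payload)
frame = payload + bytes([fcs & 0xFF, fcs >> 8])   # FCS appended low byte first
print(frame_is_valid(frame))   # True
```

The table costs 256 entries of 16 bits but lets the block consume one byte per step, which suits input that arrives as a quick burst of bytes.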


Chapter 5

Results

This chapter presents the results from the tests described in Sections 3.1 and 3.2. The tests were done with two setups: firstly, a pure software implementation that was taken from the Wime project [35] to serve as a reference and secondly the FPGA-software hybrid system that was developed for this thesis. Throughout this section these two implementations are referred to as the ”software implementation” and the ”hybrid implementation” respectively.

This chapter is concluded by some results from HDL simulations during the development phase. The simulation results were included mainly to help the reader gain some insights into what is going on (roughly) aboard the FPGA.

5.0.1 Frame Error Rate Test

Figures 5.1, 5.2 and 5.3 show the test results for the Frame Error Rate Test for PSDU sizes of 16, 60 and 127 bytes, respectively. In terms of frame success rate, the pure software implementation provided by Wime [35] and the hybrid version developed for this thesis perform similarly. While the Wime implementation does seem to be a little bit better in some cases, it is hard to say for sure since the measurements were made in a regular office space and no effort was made to shield the tests from interference such as Wi-Fi or Bluetooth. The tests were made over a period of roughly 6-8 hours and the presence of interfering sources likely changed during that time. Therefore, some cases may have been subjected to more interference than others. Nonetheless, the measured frame success rate performance is still acceptable and shows no surprises.

5.0.2 Loopback Test

Loopback testing showed some unexpected behavior. Figures 5.4, 5.5 and 5.6 show, rather counter-intuitively, that as the packet rate increases, the RTT of the hybrid system decreases. One possible explanation is that the delays are caused by oversized buffers somewhere in the system. In the "pure" software implementation, there is a steady flow of traffic (4 MS/s across the Ethernet link), which keeps the buffers at optimal fill rates. On the other hand, the hybrid implementation consists of bursty, low-rate transfers and it is likely that somewhere in the system, data is buffered until a certain fill rate has been achieved before it is passed on, thereby causing latency. The fact that the latency decreases when the wireless traffic increases supports such a scenario. The cause could be the GNU Radio runtime environment, the UHD drivers, the USRP firmware or the Linux OS itself; it is hard to say what part of the system is causing this behavior. However, it is most likely not within the FPGA itself. The reason for this is discussed in Section 5.0.4. Increasing packet size was noted to have a similar, but less dramatic, effect and the figures suggest this as well.


[Figure 5.1: plot of the number of successfully received frames (axis from 950 to 1,000) against the inter-frame spacing in ms (axis from 20 to 100), with one curve each for Software and FPGA.]

Figure 5.1: The number of successfully received frames out of 1000 for different IFS times and a PSDU size of 16 bytes.

[Figure 5.2: plot of the number of successfully received frames (axis from 950 to 1,000) against the inter-frame spacing in ms (axis from 20 to 100), with one curve each for Software and FPGA.]

Figure 5.2: The number of successfully received frames out of 1000 for different IFS times and a PSDU size of 60 bytes.

5.0.3 HDL Simulation Insights

The Vivado simulations provide some insights into the nature of some of the internal signals depicted in Figure 4.2 as well as in other parts of the system. In Figure 5.7, the FPGA implementations of the error signal e_m(p) and the shift signal u_m(p) can be seen. Note how the shift signal causes large offset steps in the beginning and then smaller steps that occur less frequently until it settles on an offset that yields an error within the threshold limit. The error exhibits the
