DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017

Software based memory correction for a miniature satellite in low-Earth orbit

JOHAN SJÖBLOM AND JOHN WIKMAN

Bachelor in Computer Science
Date: June 4, 2017
Supervisor: Roberto Guanciale
Examiner: Örjan Ekeberg
Swedish title: Mjukvarustyrd rättning av minnesfel för en miniatyrsatellit i låg omloppsbana

KTH ROYAL INSTITUTE OF TECHNOLOGY


Abstract

The harsh radiation environment of space is known to cause bit flips in computer memory. The conventional way to combat this is through error detection and correction (EDAC) circuitry, but for low-budget space missions software EDAC can be used. One such mission is the KTH project Miniature Student Satellite (MIST), which aims to send a 3U CubeSat into low-Earth orbit.

To ensure a high level of data reliability on board MIST, this thesis investigates the performance of different types of EDAC algorithms. First, a prediction of the bit flip susceptibility of DRAM memory in the planned trajectory is made. After that, data reliability models of Hamming and Reed-Solomon (RS) codes are proposed, and their respective running times on the MIST onboard computer are approximated. Finally, the performance of the different codes is discussed with regards to data reliability, memory overhead, and CPU usage.


Sammanfattning (Swedish Abstract)

The radiation environment of space is known to cause bit flips in computer memory. This is usually counteracted by installing error-correcting hardware on the satellite, but for low-cost satellites the correction can instead be handled in software. One example of such a satellite is the KTH project Miniature Student Satellite (MIST), whose goal is to launch a 3U CubeSat into low-Earth orbit.

This thesis investigates how different error correction algorithms can be used to protect the data on board the satellite from corruption. First, an estimate is made of how susceptible DRAM memories are to radiation in the planned orbit. Thereafter, data corruption models for Hamming and Reed-Solomon (RS) codes are proposed, together with an estimate of their respective running times on the satellite's onboard computer. Finally, the proposed codes are discussed with respect to protection against data corruption, memory usage, and CPU usage.


Contents

1 Introduction
  1.1 Problem Statement
2 Technical Background
  2.1 Environmental Effects on Memory Reliability
    2.1.1 Radiation Environment
  2.2 Detection and Correction of Memory Upsets
    2.2.1 Hamming Codes
    2.2.2 Reed-Solomon Codes
  2.3 The MIST Mission
    2.3.1 Reference Orbit
    2.3.2 Hardware
    2.3.3 Payload
3 Method
  3.1 Memory Upset Rate Prediction
    3.1.1 Space Environment System
    3.1.2 Fine-tuning the Predicted Upset Rate
  3.2 Error Correction
  3.3 Average Algorithm Running Time
    3.3.1 Measuring Average Running Time
    3.3.2 Hamming Codes
    3.3.3 Reed-Solomon Codes
  3.4 Modelling Data Loss
    3.4.1 Measuring the Expected Data Loss
    3.4.2 Assumptions
    3.4.3 Hamming Code Data Loss Model
    3.4.4 Reed-Solomon Code Data Loss Model
4 Result
  4.1 Predicting the Upset Rate
  4.2 Running Time of Chosen Algorithms
  4.3 Expected Data Loss of Chosen Algorithms
    4.3.1 Plotting Expected Data Loss
    4.3.2 Comparing Data Loss to Parity Overhead
5 Discussion and Concluding Remarks
  5.1 Analysis of Result
  5.2 Reliability
    5.2.1 Bit Flip Susceptibility
    5.2.2 Running Time
    5.2.3 Model Correctness
  5.3 Improvements
  5.4 Concluding Remarks

List of Abbreviations

COTS . . . Commercial off-the-shelf
ECC . . . Error-correcting code
EDAC . . . Error detection and correction
ESA . . . European Space Agency
GCR . . . Galactic cosmic ray
ISIS . . . Innovative Solutions in Space
LEO . . . Low-Earth orbit
MBU . . . Multiple-bit upset
MIPS . . . Million instructions per second
MIST . . . Miniature Student Satellite
MTTDL . . . Mean time to data loss
OBC . . . On-board computer
RS . . . Reed-Solomon
SAA . . . South Atlantic Anomaly
SECDED . . . Single-error correcting and double-error detecting
SEE . . . Single-event effect
SEU . . . Single-event upset
SPENVIS . . . Space Environment Information System

List of Figures

2.1 Geographic map of all upsets registered on Alsat-1 during a 9 year period. [2]
2.2 Geographic map of all upsets registered on ADEOS-II. [10]
2.3 World map of the AP-8 MAX integral proton flux >10 MeV at 500 km altitude. [18]
2.4 Quality of predictions for proton induced upsets for low Earth orbits. These upsets are produced in the South Atlantic Anomaly. [14]
2.5 Quality of space prediction for heavy ion induced upsets for low Earth orbit satellites. [14]
2.6 Daily SEU rates on Alsat-1. [2]
2.7 Layout and size of the data and the parity in a Reed-Solomon code.
4.1 Expected number of blocks lost for each error correcting code with respect to scrubbing time. (block size = 1 byte)
4.2 Expected number of blocks lost for each error correcting code with respect to scrubbing time. (block size = 64 bytes)
4.3 Expected number of blocks lost for each error correcting code with respect to scrubbing time. (block size = 256 bytes)
4.4 Expected number of blocks lost for each error correcting code with respect to CPU usage. (block size = 1 byte)
4.5 Expected number of blocks lost for each error correcting code with respect to CPU usage. (block size = 64 bytes) The curves for RS(66,64) and Hamming(72,64) overlap.
4.6 Expected number of blocks lost for each error correcting code with respect to CPU usage. (block size = 256 bytes) The curves for RS(66,64) and Hamming(72,64) overlap.

List of Tables

2.1 OBC hardware specification.
4.1 SEU/bit-sec
4.2 Collected and adjusted running times for error correction codes over 10 MB of data.
4.3 Adjusted running times for error correction codes over 30 MB of data.
4.4 20% CPU usage. Block size 64 bytes. One orbit of operation.

Chapter 1

Introduction

This thesis is a part of the KTH project Miniature Student Satellite (MIST). MIST is a student project carried out at KTH Space Center with the goal of building a real satellite and launching it into space. It is supervised by Christer Fuglesang and managed by Sven Grahn, but all engineering work is carried out by several interdisciplinary student teams. The MIST project started in the spring of 2015 and, since a new set of students is recruited each semester, it is at the time of writing the fifth student team at work.¹

¹ Source: MIST internal document M510

The main purpose of the satellite is to conduct eight different research experiments. The data generated by the experiments will be sent to a ground station on Earth. To facilitate this communication, as well as overall control, the satellite has been equipped with an On-Board Computer (OBC). The OBC also has to reliably store the generated data while the satellite is not in range of the ground station, and that reliable storage of data is the subject of this thesis.

The harsh radiation environment of space can cause single-event upsets (SEU), i.e. bit flips, in the memory. A statistical analysis of SEU data from the satellites Alsat-1, FASat-Bravo, ThaiPutt and UoSAT-12 suggests that an SEU rate of somewhere between 0.2×10⁻⁶ and 1.3×10⁻⁶ SEU per bit and day is to be expected in the (commonly used) memory architecture SRAM [20][2]. Moreover, the main memory of MIST consists of DRAM, which could be even more susceptible to SEU, which is why the phenomenon certainly cannot be ignored if high data reliability is required [13].

Many spacecraft are equipped with hardware protection against SEU in the form of triple redundancy or error correcting codes, such as Hamming(12,8), but since MIST is a low-budget project only a small part of the memory is resistant to SEU [6]. The actual program code, as well as flight parameters and other mission-vital data, will have hardware protection, but the larger DRAM storage is a commercial off-the-shelf (COTS) product and thus highly susceptible to SEU.

In order to protect the data from corruption, a software implemented Error Detection and Correction (EDAC) algorithm can be used. Algorithms of interest to this thesis are Reed-Solomon and Hamming codes. Both have previously been investigated as candidates for software based EDAC [17].

In a broader perspective, enabling the usage of COTS components as cost-effective alternatives in spacecraft could also play an important role in making space available for a larger audience.

1.1 Problem Statement


Chapter 2

Technical Background

The aim of this technical background is threefold. Section 2.1 provides a scientific background for the SEU rate models proposed in chapter 3. Section 2.2 introduces the technical basis of the EDAC algorithms which will be used. Finally, section 2.3 presents important information specific to the MIST mission.

2.1 Environmental Effects on Memory Reliability

Single-event effects (SEE) are a collection of radiation effects in electronic devices. They are caused when highly energetic particles strike sensitive nodes in microelectronic circuits. Depending on different factors, the result can either be permanent damage to the circuit's functionality or a change of its logic state. The focus of this paper is on non-destructive SEE which cause bit flips in memory circuitry, commonly referred to as SEU, soft errors or simply upsets. [6]

The first confirmed occurrence of SEU was published in the seventies, when a satellite during 17 years of operation observed a total of four memory errors [6][4]. That might sound like an insignificant number, but the large increases in memory size, clock speed, and circuitry complexity, as well as greatly decreased architecture sizes, have resulted in a much larger SEU frequency in today's hardware [6]. Another consequence of smaller and more complex circuitry is susceptibility to single-event multiple-bit upsets (MBU). A single particle may diffuse its charge in a closely spaced junction, or hit multiple sensitive regions due to its angle, and thus affect multiple bits [6].

2.1.1 Radiation Environment

The MIST reference orbit is a low-Earth orbit (LEO) with an approximate altitude of 640 km (see section 2.3.1 for details). In LEO, predominantly two types of radiation induce SEU: galactic cosmic rays (GCRs) and protons trapped in Earth's radiation belt [6]. Empirical data shows that protons trapped in the South Atlantic Anomaly (SAA) region are the dominating source of interference [6][5][21][2]. Figures 2.1 and 2.2 visualize the geographical distribution of upsets recorded on the LEO satellites Alsat-1 and ADEOS-II.


Figure 2.2: Geographic map of all upsets registered on ADEOS-II. [10]

The de facto standard for modeling the trapped proton population is NASA AP-8 (which builds on the earlier models AP-1, AP-5, AP-6, and AP-7). It consists of maps that contain omnidirectional proton fluxes, both for solar maximum (AP8MAX) and solar minimum (AP8MIN). Figure 2.3 shows a world map of the AP8MAX at 500 km altitude.

For GCR a commonly used model is CREME96 [14][19].


Predicting the SEU rate for space missions has historically proven to be hard. Petersen’s summary [14] of 126 comparisons of predicted and observed SEU rates shows that, while it is possible to make good predictions, there are a lot of sources of error. The most prominent source of differences between predictions and results seems to be the usage of generic parts (especially COTS parts), and Petersen suggests that calculations, when data for specific parts are not available, should allow for a factor ten underestimate. Another important factor which can cause deviation is incorrect shielding values.

Petersen divides the result of his study into two different populations, one for heavy ion induced upsets (used interchangeably with GCR) and one for proton induced upsets. He also presents results from LEO satellites separately and concludes that they constitute a disproportionately large part of the cases with a big difference between prediction and observation. Further, the NASA AP-8 model, while working correctly for elliptic orbits, seems to underpredict the proton population in the SAA region by a factor 1.4. Figure 2.4 shows the prediction quality for proton induced upsets. The GCR predictions, on the other hand, reveal a mean overestimation by a factor 2, as visible in figure 2.5. [14]


Figure 2.5: Quality of space prediction for heavy ion induced upsets for low Earth orbit satellites. [14]

An event which can temporarily increase the SEU rate is a large solar flare. While the effect of solar flares is somewhat mitigated in LEO (presumably due to Earth's magnetic field), the increased proton flux of a solar flare can greatly increase the number of SEU. As can be seen in figure 2.6, and further reinforced by Campbell et al. [5], the solar events large enough to cause a noticeable increase in SEU are rare. Furthermore, most of the effect tapers off in a day, and after a week the SEU rate is back to normal. [5][2]


2.2 Detection and Correction of Memory Upsets

To increase the reliability of data storage and data transfers, a common technique is to use EDAC algorithms from the coding theory field. This paper will focus on Hamming codes and Reed-Solomon codes. They work by splitting up the data into chunks and calculating a number of check bits for each chunk; the number of check bits and how they are calculated depends on which code is used. A method called scrubbing is then employed to read the stored chunk of data, compare it with the check bits and, if an error was detected, try to correct it. It is essential that the scrubbing frequency is not too low, or errors will accumulate until they are no longer correctable.
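The scrubbing cycle is easy to express in code. The sketch below is a minimal illustration, assuming a hypothetical codec object with encode/decode methods; it is not the MIST implementation, and a real scrubber would plug in one of the Hamming or Reed-Solomon codecs discussed below.

```python
import time

def scrub_forever(memory, codec, cycle_time):
    """Periodically re-read every stored chunk, check it against its
    check bits, and write back a corrected copy when an upset is found.

    `memory` is a mutable list of (data, check_bits) pairs; `codec` is
    any object with encode(data) -> check_bits and
    decode(data, check_bits) -> (corrected_data, ok)."""
    while True:
        for i, (data, check) in enumerate(memory):
            corrected, ok = codec.decode(data, check)
            if not ok:
                print(f"uncorrectable error in chunk {i}")   # flag data loss
            elif corrected != data:
                # Correctable upset: repair before more errors accumulate.
                memory[i] = (corrected, codec.encode(corrected))
        time.sleep(cycle_time)  # the cycle time between scrubbings
```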

Other EDAC codes, which will not be investigated in this paper, include Golay codes [46], Euclidean geometry low-density parity-check codes [47] and two-dimensional error codes [48].

2.2.1 Hamming Codes

Hamming codes were invented by R. W. Hamming in 1950 and are a family of linear codes that can either detect two errors or correct one error. This property makes Hamming codes perfect codes: each possible sequence of bits (with the same size as the codewords) is either a codeword or can be corrected to a codeword.

Hamming codes are denoted Hamming(n,k), where n is the number of bits in each codeword and 2^k is the number of codewords in the code. The number of parity bits in a code is then n − k. In a Hamming code, n is always one less than a power of two, such that n + 1 = 2^r. Given r, k is then given by k = n − r. [3]

Hamming codes have been extended to include both single error correction and double error detection (SECDED). This is done by adding an extra parity bit to the codeword. All Hamming codes presented throughout the thesis will be of this type. They will be denoted Hamming(n,k) even though they are not perfect codes.
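The relation between data bits and check bits can be verified with a short calculation. The sketch below uses the standard construction for (possibly shortened) SECDED Hamming codes, where r check bits must satisfy 2^r ≥ k + r + 1 and one extra parity bit is added; the function name is ours, but the arithmetic reproduces the Hamming(13,8) and Hamming(72,64) codes investigated later.

```python
def secded_params(k):
    """Return (n, k) for a (possibly shortened) SECDED Hamming code
    protecting k data bits."""
    r = 1
    while 2 ** r < k + r + 1:    # r check bits must cover k + r + 1 positions
        r += 1
    return k + r + 1, k          # the +1 is the extra SECDED parity bit

for k in (8, 64):
    n, kk = secded_params(k)
    print(f"Hamming({n},{kk})")  # -> Hamming(13,8) and Hamming(72,64)
```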


Hamming(72,64) keeps the data bits and the parity bits separate and is therefore called a systematic code. [17][15] This allows for data and parity to be stored at different locations.

Applications of Hamming codes include areas such as ECC memories. [7]

2.2.2 Reed-Solomon Codes

Reed-Solomon codes are error detecting and correcting linear block codes. A code is denoted RS(n,k), where n is the word length and k is the amount of actual data. Each word consists of symbols, which in turn consist of s bits each. For example, RS(255,223) consists of 255 s-bit symbols, of which 223 represent the actual data being sent. In the case that s = 8, each symbol can be represented by a byte. [15]

A Reed-Solomon code can correct up to t errors where 2t = n − k. In the case of RS(255,223) t is given by: 2t = 255 − 223 = 32 ⇒ t = 16. Thus, RS(255,223) can correct 16 errors (corrupted symbols) and the parity of the code word is 32 symbols. [15]

Figure 2.7: Layout and size of the data and the parity in a Reed-Solomon code.

Given a symbol size s (in bits), the maximum codeword length for a Reed-Solomon code is 2^s − 1. In the case of bytes (symbol size in bits, s = 8), the maximum length for a Reed-Solomon code is 2^8 − 1 = 256 − 1 = 255 [15], but there exist extensions to 256 byte codes [22]. The amount of computation needed for encoding and decoding Reed-Solomon codes scales with the number of parity symbols in a code word. However, encoding Reed-Solomon codes requires less computation than decoding them. [15]
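As a quick check of these relations, the snippet below computes t = (n − k)/2 and the parity-to-data ratio for the Reed-Solomon codes that chapter 3 goes on to investigate; all numbers follow directly from the definitions above.

```python
for n, k in [(255, 253), (255, 251), (196, 192), (66, 64), (28, 24)]:
    t = (n - k) // 2              # correctable symbol errors, since 2t = n - k
    overhead = 100 * (n - k) / k  # parity symbols relative to data symbols
    print(f"RS({n},{k}): corrects {t} symbol errors, "
          f"{overhead:.2f}% parity overhead")
```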


2.3 The MIST Mission

This section contains important information specific to the MIST mission.

2.3.1 Reference Orbit

The orbital element sets that are used for analysis of the MIST mission are expressed in the commonly used data format Two-Line Elements (TLE), defined by the U.S. Space Command. A TLE encodes a list of orbital elements of an Earth-orbiting object, starting at a reference point in time called the epoch. For MIST, this time is set to 00:00 UT on 21 June 2017. The TLE description of the MIST reference orbit is:

MIST Ref 1
1 99999U 17040A   17172.00000000  .00002669  00000-0  00000-0 0  0010
2 99999 097.9430 250.6332 0010000 000.0000 000.0000 14.75896000 00000

These orbital elements correspond to a 636.8-650.8 km altitude sun-synchronous orbit with a nodal period (the time from one northbound equator crossing to the next) of 97.63 min. In calculations, an orbital altitude of 640 km can be assumed.¹
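The quoted altitude can be sanity-checked from the mean motion field of the TLE (14.75896 revolutions per day) using the two-body relation a = (μ(T/2π)²)^(1/3). The sketch below is such a back-of-the-envelope check; note that this Keplerian period differs slightly from the nodal period quoted above, and the spherical Earth radius is an approximation.

```python
import math

MU = 398600.4418   # Earth's gravitational parameter, km^3/s^2
R_EARTH = 6371.0   # mean Earth radius, km (spherical approximation)

mean_motion = 14.75896               # rev/day, from line 2 of the TLE
period = 86400.0 / mean_motion       # seconds per revolution
a = (MU * (period / (2 * math.pi)) ** 2) ** (1 / 3)  # semi-major axis, km

print(f"period:   {period / 60:.2f} min")   # ~97.6 min
print(f"altitude: {a - R_EARTH:.0f} km")    # ~650 km, consistent with ~640 km
```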

2.3.2 Hardware

The OBC of the satellite is provided by Innovative Solutions in Space (ISIS). The specifications of the OBC are listed in table 2.1.

Component      Information
CPU            32-bit ARM9, 400 MHz (AT91SAM9G20)
RAM            32 MB DRAM
Code storage   1 MB NOR Flash

Table 2.1: OBC hardware specification.

The main tasks that the OBC will perform throughout the lifetime of the satellite are the following:

• Control the attitude of the satellite,
• Handle all communication with the ground station,
• Gather housekeeping data,
• Handle control and communication for all experiments.

The CPU resources required by each task are unknown at the time of writing. However, all of the listed tasks will have priority over the EDAC task.

2.3.3 Payload

The payload of the MIST satellite consists of eight different experiments which will periodically run and collect data. Even though it is not yet known exactly what data the experiments will generate, a short summary of their purpose is provided to give an idea:²

• CubeProp will test the precision control of a complete propulsion system including four individually controllable thrusters, a propellant tank and a feed system, heaters and sensors.

• LEGS will assess the lifetime and over time performance of their piezoelectric micro motors in the space environment.

• CUBES will study the space radiation in a low-Earth orbit using special detectors comprising a silicon photomultiplier and scintillators (scintillating materials which produce a flash of light when struck by radioactive particles).

• Ratex-J will, similar to CUBES, detect space radiation, but its measurements are performed using a solid state detector based anti-coincidence system with ceramic channel electron multipliers and a multichannel plate.

• Morebac will transport freeze-dried micro-organisms into orbit, revive them via media addition and finally measure their growth characteristics in orbit. The development of the bacteria will be tracked through measurements of temperature, light transmittance, and pressure.

• SEUD will test the performance of their self-healing/fault-tolerant computer system in the hostile environment of space.


• SiC will flight test and collect long-term measurements from integrated circuits built with silicon carbide (SiC).

• Camera will capture images of Earth and compress them using a new learning based method which adapts to the input data. The type of camera that will be used is still under discussion.


Chapter 3

Method

This chapter presents the four steps which were taken in order to arrive at the result presented in chapter 4. Section 3.1 provides a basis for the SEU rate prediction, considering both the planned trajectory and the memory type. Section 3.2 proposes suitable EDAC algorithms. Section 3.3 explains how the running times of the proposed algorithms are estimated. Finally, section 3.4 proposes reliability models for the chosen algorithms.

3.1 Memory Upset Rate Prediction

Many earlier attempts to model the probability of uncorrectable errors (often measured in Mean Time To Data Loss (MTTDL)) work on the assumption that upsets occur with a Poisson distribution. While this might be a reasonable assumption for higher altitude orbits, it is not very accurate for the LEO of MIST, considering the heavy proton influx in the SAA region. Unfortunately, it is very hard to make an accurate model which takes shifting environmental aspects into account. To mitigate this problem, the MIST trajectory was split into two parts: one critical part which corresponds to the SAA region, and one calm part which corresponds to everywhere except inside the SAA. The upsets in these respective parts were then assumed to adhere to a Poisson distribution.


3.1.1 Space Environment System

To get a first overall estimate of the upset rate, a software package called the Space Environment Information System (SPENVIS)¹ was used. SPENVIS is provided by the European Space Agency (ESA) and can be used to calculate a broad variety of effects on spacecraft. Both the CREME96 and NASA AP8MAX models are implemented and readily accessible through a web interface. Required input is orbital data, aluminum equivalent shielding, and a memory device from a library provided in the SPENVIS software. For orbital data, the numbers presented in section 2.3.1 were used. The shielding was approximated to 0.2 cm by the MIST thermal team. The memory chosen was D424100V, a DRAM memory with an architecture similar to the one used in MIST.

3.1.2 Fine-tuning the Predicted Upset Rate

SPENVIS outputs a prediction of the upset rate split into direct ionization (i.e. GCR) induced upset rate and proton induced upset rate. These numbers were then adjusted in accordance with the findings of Petersen [14], i.e. the GCR rate was multiplied by a factor of 0.5 and the proton rate by a factor of 1.4.

Finally, the proton upset rate was normalized for time spent inside and outside of the SAA. That time was calculated (based on the orbital data presented in section 2.3.1) using a software package called Systems Tool Kit (STK), developed by Analytical Graphics, Inc. The GCR upset rate was assumed to be constant regardless of whether the satellite is inside the SAA or not, probably resulting in a slight overestimate of the upset rate in the SAA (due to environmental shielding effects), and conversely a slight underestimate outside of the SAA.

3.2 Error Correction

The error correcting codes chosen to be investigated are Hamming codes and Reed-Solomon codes.

Hamming codes are more commonly found in hardware such as error correcting code memory (ECC memory). However, Hamming codes have been investigated as a candidate for software based EDAC. [17]


Reed-Solomon codes are commonly found in software. Examples of this are compact discs and the Voyager missions.

The reason why these error codes were chosen is to see whether there are advantages to using, in software, an error correcting code that is commonly found in hardware, compared to codes more commonly found in software. Reed-Solomon and Hamming codes are well established and were therefore chosen to be investigated.

The different variants of Hamming and Reed-Solomon codes were selected based on how long they take to decode and how much overhead the parity of the code adds. The Hamming codes that will be investigated are Hamming(13,8) and Hamming(72,64). The Reed-Solomon codes that will be investigated are RS(255,253), RS(255,251), RS(196,192), RS(66,64) and RS(28,24).

3.3 Average Algorithm Running Time

An important factor in the reliability models presented in section 3.4 is the cycle time, i.e. the time between two consecutive scrubbings. The cycle time depends on the CPU overhead of the EDAC algorithm used, as well as how much CPU resources are available to the scrubbing algorithm. Since it is predicted that only a few upsets will occur each orbit (see section 4.1), the performance overhead of the algorithms is well approximated by their decoding speed.

Encoding speed will not be measured for two main reasons:

1. Data gathered from an experiment will never be modified after being stored in a block. Thus a block only has to be encoded once.

2. The time interval when the satellite is not able to send data to the ground station is many times greater than the expected scrubbing time (see plots in 4.3.1). A block is thus expected to be decoded many times while only being encoded once.

3.3.1 Measuring Average Running Time


Since the expected number of SEUs is less than 5 over 1400 seconds (calculated using the result in section 4.1), the decoding algorithm is mostly expected to run when no SEUs have occurred. Thus, the time is measured for each decoding algorithm in the optimal case when no codewords have been corrupted. Each decoding algorithm will be run 4 times over 10 MB of data and the resulting time will be the average of all 4 runs.
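The measurement procedure can be summarized in a few lines. The harness below is a sketch of the described setup (four runs over the same 10 MB of error-free encoded data, averaged), with decode_all standing in for whichever decoder is under test; it is not the benchmark code actually used in the thesis.

```python
import time

def average_decode_time(decode_all, encoded_data, runs=4):
    """Return the mean wall-clock time of `decode_all` over `encoded_data`.

    `encoded_data` holds 10 MB of codewords with no injected errors, so
    the decoder runs in its optimal, error-free path."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        decode_all(encoded_data)          # decode every codeword once
        total += time.perf_counter() - start
    return total / runs
```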

The computer that performs the tests runs at 4800 million instructions per second (MIPS). To get an estimate of the running times on the OBC, the measured times have to be scaled up.

Let $RT_{ref}$ be the measured running time, $MIPS_{ref}$ the MIPS of the computer that the tests are performed on, and $MIPS_{OBC}$ the MIPS of the OBC. The scaled-up running time is then given by:

$$RT_{OBC} = RT_{ref} \cdot \frac{MIPS_{ref}}{MIPS_{OBC}}$$

Locating the MIPS value of the OBC has been unsuccessful. It will therefore be assumed that the OBC operates at 400 MIPS, based on the fact that it has a clock speed of 400 MHz. This simplifies the formula to:

$$RT_{OBC} = RT_{ref} \cdot \frac{4800}{400} = 12 \cdot RT_{ref}$$

3.3.2 Hamming Codes

For Hamming codes, the average time for decoding all codewords is measured. The decoding algorithm used is a parity check for each codeword.²

3.3.3 Reed-Solomon Codes

For Reed-Solomon codes, the average time for decoding all codewords is measured. The algorithm used to decode a codeword is a combination of Berlekamp-Peterson and Berlekamp-Massey, implemented by Henry Minsky.³


3.4 Modelling Data Loss

The goal of this section is to propose models for calculating the expected data loss for each of the EDAC algorithms investigated.

3.4.1 Measuring the Expected Data Loss

A problem when modeling the data loss is that it is, as explained in section 2.3.3, currently not known exactly what data will be gathered by the experiments. This makes it hard to find a single, meaningful way to measure the data loss; if an uncorrectable error (UE) occurs in a word, how much data has to be thrown away? The answer will vary greatly between the experiments.

To account for this lack of information, three different "block sizes" are introduced. First, for independent data, there is a block size of 1 byte. Secondly, for typical experiment measurements, there is a block size of 64 bytes. Finally, there is a block size of 256 bytes for larger data. If a UE occurs in a codeword, it is assumed that all blocks spanned by that codeword are rendered useless. If, for example, a UE occurs in a Hamming(13,8) codeword and the block size is 64 bytes, then all 64 bytes are considered corrupt. Thus, it is natural to measure the expected data loss as a number of lost blocks, E_B.

3.4.2 Assumptions

In order to make the calculations feasible, a few assumptions regarding the circumstances around the upsets have to be made. The proposed models build on the following five assumptions:

1. Upsets occur with a Poisson distribution.
2. Upsets are statistically independent.
3. Consecutive bit flips do not correct previous bit flips.
4. Bits storing zeroes and bits storing ones have identical upset rates.
5. All UEs occur in separate blocks.


The first assumption is valid only if the environmental aspects do not change. After splitting the orbit into two parts, one inside the SAA and one outside, this assumption is believed to be justified.

The second assumption is perhaps the biggest. It basically says that one particle can only flip one bit, i.e. that no MBUs occur. While this is technically not the case, MBUs constitute only approximately 1-7% of all upsets [21][9], and as discussed by Shirvani and Saxena [17], software implemented EDAC can be designed to store the data so that physically adjacent bits belong to separate code words. Thus, an MBU can simply be viewed as several SEUs, as the sketch below illustrates.
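The interleaved layout is straightforward: if codeword j takes its i-th symbol from physical position i·D + j, any burst of up to D physically adjacent upsets touches D different codewords once each. The depth D and the symbol-level granularity in the sketch below are illustrative assumptions, not the exact scheme of Shirvani and Saxena.

```python
def interleave(symbols, depth):
    """Split a flat memory region into `depth` codeword streams so that
    physically adjacent symbols belong to different codewords."""
    return [symbols[i::depth] for i in range(depth)]

mem = list(range(12))        # 12 physically adjacent memory symbols
print(interleave(mem, 4))    # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
# An MBU hitting adjacent symbols 0..3 corrupts one symbol in each of
# the four codewords, so each codeword sees it as a single SEU.
```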

The third assumption has a conservative effect on the result.

The fourth assumption is somewhat supported by the results of Harboe-Sorensen, Müller, and Fraenkel [8], who radiation tested 15 different DRAM memories and found that all of the tested devices except one showed a close to 50/50 distribution of 1-0 and 0-1 transitions for both proton and heavy ion induced upsets.

The fifth assumption is also conservative, and considering the very low number of predicted UEs, the risk of more than one error in a block is low. Note that the overestimation of lost blocks grows with the block size.

3.4.3 Hamming Code Data Loss Model

The model used for the Hamming codes is based on a model proposed by Allen et al. [1]. With the following symbols:

• N_w: Number of words in the memory.
• N_c: Number of cycles.
• P_1B: Probability of a bit flip in a single bit during a single cycle.
• P_UE: Probability of a UE in a single word during a single cycle.
• N_B/w: Number of bits per word, including check bits.
• T_c: Cycle time, i.e. the time between two consecutive scrubbings.
• R_BF: SEU rate per bit in the relevant environment.
• E_w: Number of words with UEs.

they found that the statistical average number of words with UEs can be expressed as:

$$E_w = N_w \cdot N_c \cdot P_{UE} \tag{1}$$

where

$$P_{UE} = 1 - (1 - P_{1B})^{N_{B/w}} - N_{B/w} \cdot P_{1B}(1 - P_{1B})^{N_{B/w}-1} \tag{2}$$

and

$$P_{1B} = 1 - e^{-R_{BF} \cdot T_c} \tag{3}$$

They also found that $P_{UE}$ and $P_{1B}$ can be bracketed between the bounds

$$\tfrac{1}{2} N_{B/w}(N_{B/w}-1)\,P_{1B}^{2}\left[1 - \tfrac{2}{3}(N_{B/w}-2)\,P_{1B}\right] \;\le\; P_{UE} \;\le\; \tfrac{1}{2} N_{B/w}(N_{B/w}-1)\,P_{1B}^{2}$$

and

$$R_{BF} \cdot T_c - \tfrac{1}{2}(R_{BF} \cdot T_c)^{2} \;\le\; P_{1B} \;\le\; R_{BF} \cdot T_c$$

so that in cases where equations (2) and (3) are differences between nearly equal numbers, the bounds can be used as good estimates to avoid carrying out very high-precision arithmetic.

Finally, the blocks introduced in subsection 3.4.1 have to be incorporated in the model. Three more symbols are introduced:

• N_B/Block: Number of bits per block.
• N_D/word: Number of data bits per word.
• E_B: Expected number of blocks lost.

Because of assumption five (presented in the previous subsection), the expected number of blocks lost is:

$$E_B = E_w \cdot \left\lceil \frac{N_{D/word}}{N_{B/Block}} \right\rceil$$
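Equations (1)-(3) and the block expression translate directly into code. The sketch below implements them as written above; the function and argument names are ours, and any concrete inputs (word geometry, upset rates) have to be supplied from sections 2.3.2 and 4.1.

```python
import math

def hamming_expected_blocks_lost(n_words, n_cycles, bits_per_word,
                                 data_bits_per_word, bits_per_block,
                                 seu_rate, cycle_time):
    """Expected number of blocks lost for a SECDED Hamming code,
    following equations (1)-(3) and the block expression above."""
    p_1b = 1.0 - math.exp(-seu_rate * cycle_time)                  # eq. (3)
    p_ue = (1.0 - (1.0 - p_1b) ** bits_per_word
            - bits_per_word * p_1b
              * (1.0 - p_1b) ** (bits_per_word - 1))               # eq. (2)
    e_w = n_words * n_cycles * p_ue                                # eq. (1)
    return e_w * math.ceil(data_bits_per_word / bits_per_block)
```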

3.4.4 Reed-Solomon Code Data Loss Model

The Reed-Solomon data loss model used was based on the one proposed by McEliece and Swanson [12]. They calculate the probability of a codeword not being able to be corrected to the original codeword. That probability will be used as the data loss model for Reed-Solomon codes. Variables used in the model:


• q: Number of different possible values of a symbol. In the presented model, a symbol is represented by a byte, so q = 2^8 = 256.
• b: Number of bits in a codeword. Since a symbol is represented as a byte in the model, b = 8n.
• u: Number of upsets.
• r: Number of parity symbols in a codeword. Given by r = n − k.
• d: The minimum number of symbols that need to change in order to transform a codeword into another valid codeword. Given by d = n − k + 1.
• t: Number of symbol errors that can be corrected. Given by t = r/2.
• T_c: Cycle time, as described in section 3.4.3.
• N_c: Number of scrubbing cycles. An example is the number of cycles for an orbit: N_c = T_Orbit / T_c.
• N_B: Number of symbols in a block.

The data loss model for Reed-Solomon codes is constructed using a more general formula presented by McEliece and Swanson [12]:

$$P_E(u) = \begin{cases} 0 & \text{for } u \le d-t-1 \\[4pt] (q-1)^{-r} \displaystyle\sum_{s=d-u}^{t} \binom{n}{s}(q-1)^s & \text{for } d-t \le u \le d-1 \\[4pt] (q-1)^{-r} \displaystyle\sum_{s=0}^{t} \binom{n}{s}(q-1)^s & \text{for } u \ge d \end{cases}$$

where $P_E(u)$ is an upper bound on the probability that a Reed-Solomon code cannot correct u upsets. The upper bound will be used as if it were the actual probability of a Reed-Solomon code not correcting u upsets.

Using the assumptions in section 3.4.2 regarding the independence of events, the probability that no upsets occur in a bit during a period of $T_c$ seconds is:

$$p_{NS} = (1 - p_{ss})^{T_c}$$

where $p_{ss}$ is the probability that an upset occurs in a bit during a single second. The probability that an upset occurs in a bit during the cycle time $T_c$ can then be expressed as $p_S = 1 - p_{NS}$. Using the assumption that upsets which occur more than once do not revert the bit back to its original state, it is then clear that $p_S$ also represents the probability of a single upset in a bit.

Since the probabilities are independent, the probability that u upsets occur in b bits is given by a binomial distribution:

$$P_U(u) = \binom{b}{u} \cdot p_S^{\,u} \cdot p_{NS}^{\,b-u}$$

The probability that a Reed-Solomon code will encounter a UE during the cycle time $T_c$ is then given by:

$$P_{RS} = \sum_{u=0}^{b} P_E(u) P_U(u)$$

The expected number of times that a codeword will be corrupted during $N_c$ cycles is given by:

$$E_{RS} = N_c P_{RS}$$

The expected number of blocks lost for a Reed-Solomon code RS(n,k) is given by taking the product of the expected number of corrupt Reed-Solomon codewords and the number of blocks protected by a single Reed-Solomon codeword. It is assumed that a codeword always spans the minimum number of blocks possible. The expected number of corrupt codewords can be expressed as the product of the number of codewords in memory and the number of times a codeword is expected to be corrupted: $E_{RS} \cdot \frac{M}{k}$. The expected number of blocks lost is then:

$$E_B = E_{RS} \cdot \frac{M}{k} \cdot \left\lceil \frac{k}{N_B} \right\rceil$$
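Chained together, the Reed-Solomon model also fits in a short function. The sketch below implements P_E, P_U, P_RS, E_RS and the final block count as defined in this section; the parameter names are ours, and memory_symbols corresponds to M.

```python
import math

def rs_expected_blocks_lost(n, k, n_cycles, memory_symbols,
                            symbols_per_block, p_ss, cycle_time, q=256):
    """Expected number of blocks lost for RS(n,k); p_ss is the per-bit,
    per-second upset probability and b = 8n the bits per codeword."""
    r, d, t, b = n - k, n - k + 1, (n - k) // 2, 8 * n

    def p_e(u):  # upper bound on decoder failure given u upsets
        if u <= d - t - 1:
            return 0.0
        lo = d - u if u <= d - 1 else 0
        total = sum(math.comb(n, s) * (q - 1) ** s for s in range(lo, t + 1))
        return total / (q - 1) ** r

    p_ns = (1.0 - p_ss) ** cycle_time          # no upset in a bit over T_c
    p_s = 1.0 - p_ns                           # at least one upset in a bit
    p_rs = sum(p_e(u) * math.comb(b, u) * p_s ** u * p_ns ** (b - u)
               for u in range(b + 1))          # UE probability per cycle
    e_rs = n_cycles * p_rs                     # corruptions per codeword
    return e_rs * (memory_symbols / k) * math.ceil(k / symbols_per_block)
```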

Chapter 4

Result

4.1 Predicting the Upset Rate

In table 4.1, both the direct result of the calculations using SPENVIS and the Petersen [14] adjusted result are presented.

Source                     SPENVIS       Petersen adjusted
Direct ionization (GCR)    3.2985E-11    1.6493E-11  (R_P-GCR)
Proton induced             9.2586E-12    1.2962E-11  (R_P-proton)

Table 4.1: SEU/bit-sec

The orbital calculations carried out using STK found that the MIST satellite will spend 13.6 % of its time inside of the SAA and that the longest pass through the SAA will take 1400 seconds. This means that the upset rate in the critical region, i.e. inside the SAA, is:

$$R_{crit} = R_{P\text{-}GCR} + R_{P\text{-}proton} \cdot \frac{100}{O_{SAA}} = 1.1180 \cdot 10^{-10}$$

where $R_{crit}$ is the predicted (critical) upset rate and $O_{SAA}$ the percentage of time spent inside the SAA.

The upset rate in the calm region, i.e. outside of the SAA, is simply the Petersen adjusted GCR upset rate:

$$R_{calm} = R_{P\text{-}GCR} = 1.6493 \cdot 10^{-11}$$
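These figures follow mechanically from Table 4.1 and the STK results, as the snippet below shows; every input value comes from this chapter and section 3.1.2.

```python
# Raw SPENVIS predictions, SEU per bit-second (Table 4.1)
gcr_raw, proton_raw = 3.2985e-11, 9.2586e-12

# Petersen adjustments (section 3.1.2): GCR x 0.5, protons x 1.4
r_gcr, r_proton = 0.5 * gcr_raw, 1.4 * proton_raw

o_saa = 13.6                                   # % of time spent inside the SAA
r_crit = r_gcr + r_proton * (100.0 / o_saa)    # upset rate inside the SAA
r_calm = r_gcr                                 # upset rate outside the SAA

print(f"R_crit = {r_crit:.4e} SEU/bit-sec")    # ~1.1180e-10
print(f"R_calm = {r_calm:.4e} SEU/bit-sec")    # ~1.6493e-11
```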


4.2 Running Time of Chosen Algorithms

The running time for each code is listed in table 4.2. The time is measured and adjusted according to section 3.3.1.

Code              Run time    Run time (adj.)    Data coverage
Hamming(72,64)    0.17 s      2.04 s             4.90 MB/s
Hamming(13,8)     0.10 s      1.20 s             8.33 MB/s
RS(255,253)       0.11 s      1.32 s             7.58 MB/s
RS(255,251)       0.22 s      2.64 s             3.79 MB/s
RS(196,192)       0.21 s      2.52 s             3.97 MB/s
RS(66,64)         0.10 s      1.20 s             8.33 MB/s
RS(28,24)         0.20 s      2.40 s             4.17 MB/s

Table 4.2: Collected and adjusted running times for error correction codes over 10 MB of data.

In table 4.3, the running times are further adjusted as if each code covered 30 MB of data instead of 10 MB.

Code              Run time (adj.)
Hamming(72,64)    6.12 s
Hamming(13,8)     3.60 s
RS(255,253)       3.96 s
RS(255,251)       7.92 s
RS(196,192)       7.56 s
RS(66,64)         3.60 s
RS(28,24)         7.20 s

Table 4.3: Adjusted running times for error correction codes over 30 MB of data.

4.3 Expected Data Loss of Chosen Algorithms

The expected data loss in the following calculations is based on the orbit that spends the longest time inside the SAA, that is, 1400 seconds inside of the SAA and 4458 seconds outside.


4.3.1 Plotting Expected Data Loss

Each code is plotted using its corresponding data loss model from section 3.4. The data loss is presented in different plots according to scrubbing time and CPU usage.

The y-axis on all plots corresponds to the expected number of corrupt blocks ($E_B$). When presented in terms of scrubbing time, the x-axis corresponds to $T_c$ in the models. To represent the x-axis in terms of CPU usage, the following conversion is used:

$$\mathrm{CPU\ usage} = 100 \cdot \frac{\mathrm{Running\ Time}}{T_c}$$

The block sizes chosen to be investigated are 1, 64 and 256 bytes. Since it is assumed that two corrupted codewords never affect the same block and that each codeword covers as few blocks as possible, there is no point in investigating block sizes larger than 253 bytes, because the codeword that protects the most data protects 253 bytes. However, the biggest block size of 256 bytes is chosen so that all block sizes are a power of 2.
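Since CPU usage = 100 · (running time)/T_c, a fixed CPU budget maps each code's adjusted running time to a concrete scrubbing cycle time. The snippet below does this for the 30 MB running times in Table 4.3 at the 20% budget used in section 4.3.2; it is a direct rearrangement of the conversion above, not code from the thesis.

```python
# Adjusted running times over 30 MB of data (Table 4.3), in seconds
run_times = {
    "Hamming(72,64)": 6.12, "Hamming(13,8)": 3.60, "RS(255,253)": 3.96,
    "RS(255,251)": 7.92, "RS(196,192)": 7.56, "RS(66,64)": 3.60,
    "RS(28,24)": 7.20,
}

cpu_budget = 20.0  # percent of CPU time granted to the EDAC task
for code, rt in run_times.items():
    t_c = 100.0 * rt / cpu_budget  # cycle time that yields exactly that usage
    print(f"{code}: T_c = {t_c:.1f} s")  # e.g. Hamming(72,64) -> 30.6 s
```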


Figure 4.1: Expected number of blocks lost for each error correcting code with respect to scrubbing time. (block size = 1 byte)

Figure 4.2: Expected number of blocks lost for each error correcting code with respect to scrubbing time. (block size = 64 bytes)

Figure 4.3: Expected number of blocks lost for each error correcting code with respect to scrubbing time. (block size = 256 bytes)

Figure 4.4: Expected number of blocks lost for each error correcting code with respect to CPU usage. (block size = 1 byte)

Figure 4.5: Expected number of blocks lost for each error correcting code with respect to CPU usage. (block size = 64 bytes) The curves for RS(66,64) and Hamming(72,64) overlap.

Figure 4.6: Expected number of blocks lost for each error correcting code with respect to CPU usage. (block size = 256 bytes) The curves for RS(66,64) and Hamming(72,64) overlap.

4.3.2 Comparing Data Loss to Parity Overhead

To get a measure of how well an error correcting code performs, the expected number of lost blocks is compared to how much parity overhead is used to protect the data. Each error correcting code is given 20% CPU time to run its decoding algorithm and the block size is set to 64 bytes.

Parity overhead is measured in terms of the ratio of parity bits to data bits. For example, Hamming(13,8) has 5 parity bits and 8 data bits. So the parity overhead for Hamming(13,8) is 5/8.


Code              Expected blocks lost    Parity overhead
Hamming(72,64)    6.3 · 10^-7             12.5 %
Hamming(13,8)     9.2 · 10^-8             62.5 %
RS(255,253)       8.4 · 10^-5             0.79 %
RS(255,251)       4.5 · 10^-14            1.59 %
RS(196,192)       4.0 · 10^-14            2.08 %
RS(66,64)         6.6 · 10^-7             3.13 %
RS(28,24)         5.2 · 10^-18            16.67 %

Table 4.4: 20% CPU usage. Block size 64 bytes. One orbit of operation.


Chapter 5

Discussion and Concluding Remarks

5.1 Analysis of Result

Many interesting conclusions can be drawn from the simulation data presented in chapter 4. Perhaps most importantly, it seems clear that regardless of which EDAC algorithm is chosen, there is a great increase in reliability compared to the case where no EDAC algorithm is used.

Interestingly, the most important factor for the result seems to be which EDAC code is used. Varying the block size and CPU utilization seems to have little impact on the performance relative to the differences between the codes. This is a convenient result from an implementation perspective, since there is no clear gain in using different codes for different data.

Another interesting result is the very poor performance of the Hamming codes, and Hamming(13,8) in particular, relative to many of the RS codes, especially when considering their parity overhead.

Generally, the best candidates for MIST seem to be RS(28,24), RS(196,192) and RS(255,251). If the 16.67% parity overhead of RS(28,24) is acceptable, that seems to be the best pick considering its very high reliability. If lower overhead is required, then either RS(196,192) or RS(255,251) could be used. Both provide similar data protection with respect to CPU time used, the difference being that RS(255,251) has less parity overhead than RS(196,192), while RS(196,192) provides better protection with respect to scrubbing time.


5.2 Reliability

An important fact to keep in mind when using the result of this thesis is that a lot of assumptions were made to make calculations feasible. Most of them were conservative and probably quite insignificant to the result, but a few were not and these are the focus of this section.

5.2.1 Bit Flip Susceptibility

The goal of the upset rate prediction that was carried out in section 4.1 was to get a value as close as possible to the real value.

The perhaps biggest assumption of the calculations and models in this thesis is that all upsets are assumed to be SEUs. While this seems to be a common assumption for reliability models in space (see for example [11][16][1]), it could skew the result heavily unless mitigated by storing data in a way such that physically adjacent bits belong to different codewords (as briefly discussed in section 3.4.2).

When it comes to the reliability of the calculated upset rate, it is believed to depend on the correctness of two main factors:

1. The SPENVIS upset rate calculation was based on a generic memory device.
2. The upset rate is assumed to adhere to a Poisson distribution.

Since the goal of the upset rate prediction was to get a value as close as possible to the real one, these factors were not accounted for and could skew the result in either direction. If Petersen's conclusions are to be believed, (1) alone could lead to a factor 10 difference in predicted upset rate. How far from reality the Poisson assumption is, is hard to say. For the most impactful part of the orbit, the SAA region, the upset rate is believed to be higher closer to its center, so the proton upset rate is probably higher there than predicted. However, since magnetic shielding could reduce GCR induced upsets, that effect should be somewhat mitigated.

5.2.2 Running Time

The running times were measured on a laptop and then scaled based on the MIPS values of the laptop and the OBC. The actual MIPS of the OBC is not relevant beyond giving some context in terms of CPU usage; the MIPS of the OBC and the laptop only change a common scaling factor for all running times. A code that is twice as fast as another will still be twice as fast even if the MIPS of the OBC or the laptop is altered.

Also, the time measurement of the Hamming codes is based on an abstract description of how an algorithm could be implemented, while the time measurement of the Reed-Solomon codes is based on a program that has been developed over many years. The result in table 4.2 shows that the fastest Hamming code and the fastest Reed-Solomon code have the same running time. This is contrary to what is described by Shirvani and Saxena [17], where the decoding time for a Hamming code is more than 7 times faster than for the Reed-Solomon code.

However, the time to decode the Hamming code as presented by Shirvani and Saxena [17] is almost identical to the clock speed, meaning that their implementation decodes almost 1 byte every clock cycle. This seems very unrealistic for a software-only implementation (not using hardware accelerated instructions created for this purpose) and leaves a bit of skepticism toward their presented measurements.

5.2.3 Model Correctness

When it comes to the probability models, the Hamming model is believed to be accurate while the Reed-Solomon model is based on a worst case. As also stated by McEliece and Swanson [12], the probability that a Reed-Solomon codeword cannot correct u upsets is difficult to calculate.

5.3 Improvements

Since time was a limiting resource in this thesis project, not all interesting cases could be investigated. To give further assistance to the MIST mission, some interesting areas would be:

• … as the reliability proportions between different codes.

• Investigate if separate scrubbing times inside and outside of the SAA can be used to save CPU resources.

• Study the program size overhead of the algorithms.

• Discuss the differences in error detection between the codes and model the probability of the algorithms failing to detect errors.
• Finally, it would be interesting to study a larger set of codes than the ones presented in this thesis, especially codes from other families, such as Euclidean geometry low-density parity-check codes and two-dimensional error codes.

5.4 Concluding Remarks

The result of this thesis clearly distinguishes three main competitors among the different codes. The highest reliability is provided by RS(28,24), which has 16.67% parity overhead. If less overhead is desired, the second best option is either RS(196,192) or RS(255,251). Both provide high reliability with a parity overhead of about 2%.


Bibliography

[1] Gregory R. Allen et al. "Single-Event Upset (SEU) of Embedded Error Detect and Correct Enabled Block Random Access Memory (Block RAM) Within the Xilinx XQR5VFX130". In: IEEE Transactions on Nuclear Science 57, No. 6 (2010), pp. 3426-3431.

[2] Y. Bentoutou. "A real time EDAC system for applications on-board Earth observation small satellites". In: IEEE Transactions on Aerospace and Electronic Systems 48, No. 1 (2012), pp. 648-657.

[3] Norman L. Biggs. Discrete Mathematics. Second edition. Oxford University Press, 2004. ISBN: 978-0-19-850717-8.

[4] D. Binder, E. C. Smith, and A. B. Holman. "Satellite anomalies from galactic cosmic rays". In: IEEE Transactions on Nuclear Science 22, No. 6 (1975), pp. 2675-2680.

[5] A. Campbell, P. McDonald, and K. Ray. "Single event upsets in space". In: IEEE Transactions on Nuclear Science 39, No. 6 (1992), pp. 1828-1835.

[6] Paul E. Dodd and Lloyd W. Massengill. "Basic Mechanisms and Modeling of Single-Event Upset in Digital Microelectronics". In: IEEE Transactions on Nuclear Science 50, No. 3 (2003), pp. 583-602.

[7] Kiyohiro Furutani, Kazutami Arimoto, et al. "A built-in Hamming code ECC circuit for DRAMs". In: IEEE Journal of Solid-State Circuits 24, No. 1 (1989), pp. 50-56.

[8] R. Harboe-Sorensen, R. Müller, and S. Fraenkel. "Heavy ion, proton and Co-60 radiation evaluation of 16 Mbit DRAM memories for space application". In: 1995 IEEE Radiation Effects Data Workshop, Madison, Wisconsin, USA. 1995.

[9] B. G. Henson, P. T. McDonald, and W. J. Stapor. "SDRAM space radiation effects measurements and analysis". In: 1999 IEEE Radiation Effects Data Workshop, Norfolk, Virginia, USA, July 12-16, 1999. 1999, pp. 15-24. ISBN: 0-7803-5660-8.

[10] Yugo Kimoto et al. "Space radiation environment and its effects on satellites: analysis of the first data from TEDA on board ADEOS-II". In: IEEE Transactions on Nuclear Science 52, No. 5 (2005), pp. 1574-1578.

[11] Yubo Li, Brent Nelson, and Michael Wirthlin. "Reliability models for SEC/DED memory with scrubbing in FPGA-based designs". In: IEEE Transactions on Nuclear Science 60, No. 4 (2013), pp. 2720-2727.

[12] R. J. McEliece and L. Swanson. "On the Decoder Error Probability for Reed-Solomon Codes". In: NASA TDA Progress Report 42-84 (1985), pp. 66-72.

[13] P. J. McNulty et al. "Test of SEU algorithms against preliminary CRRES satellite data". In: IEEE Transactions on Nuclear Science 38, No. 6 (1991), pp. 1642-1646.

[14] E. L. Petersen. "Predictions and observations of SEU rates in space". In: IEEE Transactions on Nuclear Science 44, No. 6 (1997), pp. 2174-2187.

[15] Martyn Riley and Iain Richardson. An introduction to Reed-Solomon codes: principles, architecture and implementation. 1998. URL: https://www.cs.cmu.edu/~guyb/realworld/reedsolomon/reed_solomon_codes.html (visited on 05/04/2017).

[16] Abdallah M. Saleh, Juan J. Serrano, and Janak H. Patel. "Reliability of scrubbing recovery-techniques for memory systems". In: IEEE Transactions on Reliability 39, No. 1 (1990), pp. 114-122.

[17] Philip P. Shirvani and Nirmal Saxena. "Software-implemented EDAC protection against SEUs". In: IEEE Transactions on Reliability 49, No. 3 (2000), pp. 273-284.

[19] Allan J. Tylka et al. "Single event upsets caused by solar energetic heavy ions". In: IEEE Transactions on Nuclear Science 43, No. 6 (1996), pp. 2758-2766.

[20] C. I. Underwood and M. K. Oldfield. "Observations on the reliability of COTS-device-based solid state data recorders operating in low-Earth orbit". In: Fifth European Conference on Radiation and its Effects on Components and Systems, Fontevraud, France, September 13-17, 1999. 1999, pp. 387-394. ISBN: 0-7803-5726-4.

[21] Craig I. Underwood. "The single-event-effect behaviour of commercial-off-the-shelf memory devices. A decade in low-Earth orbit". In: Fourth European Conference on Radiation and its Effects on Components and Systems, Cannes, France, September 15-19, 1997. 1997, pp. 251-259. ISBN: 0-7803-4071-X.
