
MASTER'S THESIS

Towards self-learning sensors

Matching pursuit with dictionary learning on FPGA

Kim Albertsson

2013

Master of Science in Engineering Technology

Engineering Physics and Electrical Engineering

Luleå University of Technology

Master’s Thesis

Towards self-learning sensors: Matching pursuit with dictionary learning for FPGA

Author:

Kim Albertsson

Supervisor: Fredrik Sandin


Abstract

A general approach for sparse signal decomposition is matching pursuit with an over-complete dictionary of learned features. It has been demonstrated that this gives efficient codes when applied to images and acoustic signals. This thesis presents an implementation of matching pursuit with Hebbian learning that is capable of on-line signal processing. The target application is analysis of acoustic emission and vibration signals, for example for condition monitoring of bearings in rotating machinery. The implementation is done in C and VHDL. Bottlenecks in the algorithm are identified and these parts are implemented in VHDL to enable on-line processing of high-frequency acoustic emission signals using an FPGA. The C implementation can handle data in both fixed and floating point format and is verified to be functioning in both cases. The VHDL components are verified in simulation. The maximum processing speed possible for the designed system is expected to be on the order of one million samples per second, 1 Msps, using a modern programmable system-on-chip FPGA.


Preface

My first and foremost thanks goes to my family. They are wonderfully helpful and supportive people that I would give my everything for. From the bottom of my heart, thank you for being who you are.

My supervisors Fredrik Sandin and Jens Eliasson have my thanks for their steady guidance and thought-provoking advice. My friend and fellow student Joakim Nilsson has my thanks for his support and hard work developing the hardware necessary for running my algorithm.

I would also like to thank the good people at SKF UTC. You have all been a tremendously fun group to spend the past four months with!


Contents

1 Introduction
  1.1 Problem description
  1.2 Goals
  1.3 Scope
  1.4 Related work
  1.5 Thesis structure
2 Theory
  2.1 Signal decomposition
  2.2 FPGAs and ASICs
  2.3 Matching pursuit
  2.4 Dictionary learning
  2.5 Algorithmic complexity
    2.5.1 Optimisations
    2.5.2 Optimised complexity
3 Implementation
  3.1 mpLib
    3.1.1 Design philosophy and overview
    3.1.2 Data representation
    3.1.3 Configuration
    3.1.4 mpInterface
    3.1.5 mpLearn
    3.1.6 mpReconstruct
    3.1.7 Other modules
  3.2 VHDL
    3.2.1 Atom
    3.2.2 Max-comparator
    3.2.3 Candidate-event generator
    3.2.4 Sweeper
    3.2.5 Data representation
4 Simulation results
  4.1 Simulation: Basic feature detection
  4.2 Simulation: Extended feature detection
  4.3 Simulation: VHDL implementation
  4.4 Benchmarking of mpLib
  4.5 Profiling of mpLib
5 Discussion
  5.1 mpLib - Implementation
  5.2 mpLib - Simulation
  5.3 VHDL - Implementation
  5.4 VHDL - Simulation
  5.5 VHDL - Bandwidth
  5.6 Future work
  5.7 Conclusion

1 Introduction

Condition monitoring typically requires tuning of signal processing models to a specific application and manual processing of the observed signals. It would therefore be helpful to have self-learning machines that can adapt and tune themselves to the system they are applied to, thus simplifying automated diagnostics and reducing costs. For high-frequency applications an efficient representation is desired, which reduces bandwidth requirements.

This is a field that industry is currently researching. For example, the international company SKF, which among other things manufactures bearings, mechatronics products and lubrication systems, is interested in the monitoring of bearings in, for example, windmills [1]. The company has recently established a number of University Technology Centres to promote research in areas of interest for SKF. This work was carried out in cooperation with the SKF UTC at LTU.

Imagine a windmill farm with hundreds of mills, all with complex gearboxes controlling the rotors. To ensure the safe operation of the mills and to avoid breakdowns, sensors are installed near key bearings of the gearboxes. If a bearing starts to degrade, repairs can be scheduled pre-emptively. The vibration and acoustic emission signals in the gearboxes can be of high frequency, and if these signals are to be sampled the raw data rate can limit the number of sensors that can be used simultaneously. If instead a sensor that generates a sparse representation of the signal is used, more bearings can be monitored and a more complete picture of the health of the gearbox can be obtained. Liu et al. [2] show an approach for vibration signals and conclude that the method is effective for fault diagnosis.

Acoustic emission signals are high-frequency vibrations, typically in the range of 100 kHz to 1 MHz, and a reason for studying them is that these signals can often be detected long before low-frequency vibrations indicating a fault appear. Acoustic emission can occur when lubrication is degraded and can indicate, for example, contamination of the lubricant. These issues can slowly degrade the machinery, eventually necessitating repair. Lees et al. [3] provide an investigation to this effect, demonstrating that acoustic emission data can be used to extract relevant information on ball bearing health.

If the problem is quickly identified, for example via acoustic emission, permanent damage with associated economic consequences can be avoided and the impact on the remaining useful life of the component can be minimised. Another potential use for acoustic emission analysis is to improve automated lubrication systems, so that lubricant is supplied when needed and excess lubrication, which is costly and has a negative environmental impact, can be avoided.

This thesis aims to further the development of a self-learning sensor front-end that delivers, instead of raw data, features that describe the signal more compactly as a feature vector, and that can be applied to device condition monitoring and fault detection.

Using signals of features instead of raw samples enables new applications where pattern recognition is a central part. While this particular implementation is geared towards high-frequency applications with the sampling of acoustic emission as its prime target, the method is more general and might allow a higher level of information processing, for example identifying events such as a cry for help on a city street, allowing relevant services a faster response time. Other possible applications include distributed earthquake detectors in, for example, mines, allowing the source of the quake to be pinpointed. This requires accurate timing information, which the sensor platform and accompanying software described here and in Nilsson [4] might be able to provide.


the original author is attributed. At the time of writing this report the source code is only available upon request from ketost@gmail.com.

1.1 Problem description

The current literature is rich in off-line solutions for signal decomposition using sparse coding and dictionary learning; Bruckstein et al. [6] give an overview, as do Smith and Lewicki [7], Haykin [8] and Pati et al. [9]. However, little has been published on on-line solutions that can handle sampling rates up to the MHz range in real time. Special care is needed when dealing with such high-frequency signals to ensure that the available processing time will suffice.

This text describes an implementation of the matching pursuit algorithm with Hebbian dictionary learning in a fashion suitable for deployment on an FPGA and possible future synthesis as an ASIC. Several algorithms were considered; as all are computationally expensive, matching pursuit was chosen because it is straightforward, parallelisable and has been proven to work in general signal environments.

1.2 Goals

The primary objective for this project is to implement the matching pursuit algorithm with dictionary learning in a framework suitable for deployment on an embedded platform. The implementation should also utilise any benefits from implementation on an FPGA. The performance target is to be able to process acoustic emission waveforms, typically in the range of 100 kHz to 1 MHz. This can be expressed as a series of guiding questions that the thesis should answer.

• Is the matching pursuit algorithm feasible in an on-line setting?
• How can the algorithm be adapted for an FPGA, and what optimisations are possible?
• What parts of the algorithm will benefit from implementation on an FPGA?
• Can matching pursuit be combined with dictionary learning when using an FPGA?
• Can sampling rates of 1 MHz be processed by such a system?

1.3 Scope

The work will lean on previous literature for implementation and algorithm details. Existing algorithms will be optimised and implemented. Several approaches to the sparse signal decomposition problem will be briefly studied, but the focus will be on matching pursuit with Hebbian gradient ascent learning as proposed by Smith and Lewicki [7].

The FPGA part will be implemented in VHDL and verified in a simulator. If there is enough time, it will be synthesised on an FPGA that is used in a parallel project [4]. That project is conducted by a fellow student at LTU and aims to create a sensor platform providing an ADC and sufficient processing power for the algorithm presented here.

A library implementation of matching pursuit with dictionary learning is to be developed as a proof-of-concept of a self-learning condition monitoring and fault detection system that uses sparse representation for reduced bandwidth requirements.


1.4 Related work

Previous authors have described matching pursuit as a general approach for sparse coding. The method originates from Mallat and Zhang [10]. It was further demonstrated by Olshausen [11] to work well for succinct representations of image signals. Smith and Lewicki [12] later showed that matching pursuit coupled with a gradient ascent learning algorithm for the dictionary could reproduce the neuron responses of the cochlea. The dictionary was initialised with Gaussian noise and then trained in a sound environment designed to mimic that of the mammalian ear. This lends some credibility to the idea that the learning method can be used in a general signal environment that allows for sparse coding.

Smith and Lewicki also showed that matching pursuit makes it possible to reduce the bandwidth requirements ten times at a 20 dB signal-to-noise ratio, an example demonstrating that the method is efficient for signal compression.

The fact that trained humans are able to detect faults by hearing motivated Liu et al. [2] to investigate how this might be mimicked by a machine. A method similar to matching pursuit, MAP, is used for sparse coding of a vibration signal from a machine, and using a classifier they are able to distinguish a faulty machine from a healthy one.

Campo et al. [13] outlines a general “analogue-to-feature converter” and verifies the feasibility of the approach using simulations on real-world data. The converter transforms an analogue signal to a feature vector. It is partly based on the work presented herein.

Devices that optimise adaptive signal decomposition with dedicated custom-designed hardware are demonstrated in the literature. Yang and Petersen [14] proposed an FPGA-based optimisation of the least-squares problem, useful in orthogonal matching pursuit. Anguita et al. [15] implement a support vector machine with learning using an FPGA.

1.5 Thesis structure

The thesis is divided into four main parts besides this introduction: theory, implementation, simulation results and discussion.

Theory

The theory section covers the theoretical background that is necessary for the discussion in later sections. An overview of signal decomposition is given, followed by an explanation of the matching pursuit algorithm and dictionary learning. An overview of FPGAs and ASICs is also given, as is an analysis of the matching pursuit running time and the optimisations possible with customisable hardware.

Implementation

The implementation section covers how mpLib, a C library developed in the project, is designed, and presents a high-level description of its workings. It is followed by a similar presentation of the VHDL implementation.


Simulation results

This section presents simulation results used to verify the C library. The focus lies on the ability to detect and adapt to features present in the signal. This is followed by verification of the VHDL part. Lastly, mpLib is benchmarked and profiled.

Discussion

The thesis is concluded with a discussion of the results and concluding remarks concerning the questions posed in the introduction. Possible continuations of the project are discussed.

2 Theory

The theoretical background needed for this thesis is covered in this section. First, a general overview of the concept of signal decomposition and its relevance is presented, together with a brief comparative study of algorithms for signal decomposition and sparse representations. The concepts of FPGA and ASIC are then explained, followed by an in-depth description of the matching pursuit algorithm and dictionary learning. Lastly, an investigation of the algorithmic complexity of matching pursuit is performed, which also covers possible optimisations when customised hardware is available.

2.1 Signal decomposition

A signal can be expanded in a given basis; this process is called signal decomposition. Depending on the basis in which the signal is expanded, the decomposition will have different properties. One interesting property is the sparseness of the representation. Assume that v is a vector encoding the expansion of a signal in a particular basis. One measure of sparseness is the number of non-zero elements of v, which is described by the L0-norm. Thus the sparsity s of a signal can be expressed as

\[ s = \|v\|_0, \tag{1} \]

where $\|\cdot\|_0$ is the L0-norm. For example, $v = (0, 3, 0, -1, 0)$ has sparsity $s = 2$. The L0-norm will not be used further in this text, but it is still helpful to have a basic grasp of what sparsity means and how it can be quantified.

Olshausen [11] explains that a sparse vector of events is desirable as it allows for a simple description of structure in a signal. This reduces the data rate necessary for signal representation and it can allow for an easier interpretation of the signal.

Examples of methods for signal decomposition and sparse representation include matching pursuit, orthogonal matching pursuit, MAP, principal component analysis, and the Fourier and wavelet transforms. These approaches will be briefly discussed. Smith and Lewicki [7] provide a good overview of matching pursuit and MAP, Haykin [8] presents a chapter on principal component analysis and Pati et al. [9] introduce orthogonal matching pursuit. Furthermore, Bruckstein et al. [6] and Elad [16] provide introductory material for readers interested in the field.

Matching pursuit

Matching pursuit is a greedy approach that uses an over-complete, that is non-orthogonal, dictionary consisting of atoms and matches the features described by the atoms to locations in the signal. The matching is performed iteratively, where the atom with the largest component is matched first and constitutes an event. The remaining residual, or error, from one iteration is carried over to the next and is used for matching the next event.

Orthogonal matching pursuit

Orthogonal matching pursuit improves the convergence of matching pursuit by rescaling the amplitudes of the matched events, ensuring that, in each iteration, the matched atoms are orthogonal to the remaining residual. Each iteration is more computationally expensive than in the original matching pursuit and requires the solution of a least-squares problem.


MAP

MAP is a probabilistic approach to the signal decomposition problem where the a posteriori probability for an event is maximised. It is thus a more global algorithm than the greedy matching pursuit but more computationally expensive. In order to reduce the complexity of the algorithm it can be initialised with events extracted by matching pursuit.

Principal component analysis

Principal Component Analysis (PCA) involves the eigenvectors and eigenvalues of the correlation matrix of the signal. These correspond to features of the signal, and by selecting only the largest components, the principal components, an approximation to the input can be obtained.

Fourier and Wavelet transforms

The Fourier and wavelet transforms define an infinite basis in which the signal is to be decomposed. In the Fourier case the basis consists of sinusoids and in the wavelet case it consists of wavelets (any waveform with finite temporal extent that is scaled into a set of orthogonal functions). These decompositions are not generally sparse.

The matching pursuit algorithm is shown by Mallat and Zhang [10] to be a satisfactory method for generating sparse representations when a suitable dictionary is used. The algorithm is simple in that it can easily be split into small independent parts; many of the other algorithms described herein lack this property. There exist efficient methods for calculating a discrete Fourier transform, but the method does not generally produce sparse representations.

2.2 FPGAs and ASICs

Sometimes the computing capabilities of a general-purpose CPU are not enough, or the demands on power efficiency are so high, that dedicated hardware is preferable. Two general categories exist: the ASIC, or application-specific integrated circuit, and the FPGA, or field-programmable gate array. The ASIC is static in nature and once created cannot easily be changed. The FPGA on the other hand is more dynamic and can be reprogrammed "in the field", hence the name.

It is the ability to reprogram the FPGA that is its strength. A configuration for all the gates of the chip is generated from software and passed on to the device. These gate configurations can be quite large and the process can take considerable time; still, it allows for rapid iteration and testing when compared to the ASIC. There is a trade-off, however: the clock speed is lower and the power consumption higher than what is possible in an equivalent ASIC implementation. Noteworthy is that a design deployable on an FPGA can easily be adapted for use in an ASIC.

In both cases it is the ability to tailor the design to specific needs that is appreciated. For example, executing many actions in parallel is non-trivial even on modern-day CPUs, but for simple computations a parallel adaptation is easy using an ASIC or FPGA.

An article by Kuon and Rose [17] compares the speed-ups one can expect when implementing a design on an ASIC instead of an FPGA. Their results show that if only look-up tables and flip-flops are used, the speed-up is of the order of 35. If the design utilises specialised MAC blocks or similar, a speed-up of the order of 18 is more common.


2.3 Matching pursuit

The literature establishes a terminology that will be adopted throughout this text. In general signal decomposition an orthogonal basis is used over which the signal is decomposed. In the matching pursuit framework the orthogonal basis is replaced with an over-complete, that is non-orthogonal, dictionary consisting of atoms where the atoms generally represent features of the signal.

In the matching pursuit context an atomic event is characterised by the triggering of an atom $g_\gamma$ at a time $\tau$ at which the event occurred, with an amplitude $\alpha$. Mathematically this can be expressed as

\[ \Lambda(\gamma, \tau, \alpha) = \alpha\, g_\gamma(t - \tau). \tag{2} \]

The time can be encoded both in an absolute sense and relative to some other event. The triggering atom belongs to a finite dictionary consisting of M atoms,

\[ D = \{\phi_0, \ldots, \phi_{M-1}\}. \tag{3} \]

The algorithm tries to expand a signal f in the atoms of the dictionary as a series of events in an iterative way. The atom with the largest corresponding amplitude triggers an event, and the signal can then be expressed as a series of events with ever decreasing amplitudes. If the approximation stops after N events have been triggered, the expansion is called an N-approximation. The error between the i:th approximation and the signal is the i:th residual, defined by

\[ R_0 = f, \qquad R_i = R_{i-1} - \Lambda_i(\gamma, \tau, \alpha). \]

Note that this definition leads to $f = R_i + \hat{f}_i$ if one solves the recursion; the signal is thus the sum of the i:th approximation and the i:th residual. The approximation can be written on the form

\[ \hat{f}_N = \sum_{i=1}^{N} \Lambda_i(\gamma, \tau, \alpha) = \sum_{i=1}^{N} \alpha_i\, g_{\gamma_i}(t - \tau_i). \tag{4} \]

The amplitude of an atom is defined by

\[ \alpha = \langle \phi(t - \tau), R_i \rangle : \phi \in D, \tag{5} \]

with $\langle\cdot,\cdot\rangle$ being the inner product of the arguments. The atom that triggers the event is then given by the atom that maximises the inner product between itself and the current residual for any time-shift of the atom,

\[ g_{\gamma_i} = \arg\max_{\phi_j} \left\{ \langle \phi_j(t - \tau), R_i \rangle : 0 \le j < M \right\}. \tag{6} \]

Here the choice of the indexing variable $\gamma_i$ becomes clearer: it corresponds to the index of an atom of the dictionary and requires less data to store than the triggering atom itself, provided the size of the dictionary is reasonably small. The time of the event is calculated in a way similar to that of the triggering atom,

\[ \tau_i = \arg\max_{\tau} \left\{ \langle g_{\gamma_i}(t - \tau), R_i \rangle \right\}. \tag{7} \]
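To make the iteration concrete, the following C sketch (illustrative only, not mpLib source; the function and variable names are assumptions) implements Equations 5 to 7 for a single event and then forms the next residual by subtracting the matched event:

    #include <stddef.h>

    /* Inner product between an atom and a window of the residual (Equation 5). */
    static float dot(const float *atom, const float *window, size_t len)
    {
        float s = 0.0f;
        for (size_t i = 0; i < len; i++)
            s += atom[i] * window[i];
        return s;
    }

    /* One matching pursuit iteration (Equations 5-7): find the atom and
     * time-shift with the largest inner product, then subtract the event. */
    static void mp_iterate(float *residual, size_t res_len,
                           const float *atoms, size_t n_atoms, size_t atom_len)
    {
        float  best_amp  = 0.0f;
        size_t best_atom = 0, best_tau = 0;

        /* Only shifts where the whole atom fits in the buffer are considered. */
        for (size_t tau = 0; tau + atom_len <= res_len; tau++) {
            for (size_t k = 0; k < n_atoms; k++) {
                float amp = dot(&atoms[k * atom_len], &residual[tau], atom_len);
                if (amp > best_amp) {          /* arg max over tau and k */
                    best_amp  = amp;
                    best_atom = k;
                    best_tau  = tau;
                }
            }
        }

        /* R_i = R_{i-1} - alpha * g_gamma(t - tau), Section 2.3 */
        for (size_t i = 0; i < atom_len; i++)
            residual[best_tau + i] -= best_amp * atoms[best_atom * atom_len + i];
    }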


2.4 Dictionary learning

The learning algorithm used here is the same one as used by Smith and Lewicki [12]. It belongs to the probabilistic approach to machine learning and is a gradient ascent algorithm that resembles Hebbian learning. The ascent is performed on the approximate log data probability and the key equation is

\[ \frac{\partial}{\partial \phi_m} = \frac{1}{\sigma^2} \sum_i \alpha_i^m \, R_N \big|_{t_i^m}, \tag{8} \]

which means that the gradient with respect to the current atom is, up to a factor of the inverse noise variance, the sum over all events corresponding to that atom of the event amplitude multiplied by the slice of the final residual corresponding to the time offset and length of the atom. The convergence of the gradient ascent method depends on the step size of the learning. To control this, a learning rate parameter $\gamma$ is introduced that is multiplied with the update rule. The resulting learning rule becomes

\[ \frac{\partial}{\partial \phi_m} = \frac{\gamma}{\sigma^2} \sum_i \alpha_i^m \, R_N \big|_{t_i^m}. \tag{9} \]

A consequence of the learning scheme is that not all atoms are learned at the same rate, as the learning rate is proportional to the number of atom activations. Another important observation is that in the basic gradient ascent the atoms can grow without bounds. This must be limited somehow. One way is to normalise the atoms after each learning iteration.
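As an illustration of Equation 9 together with the normalisation step, a minimal C sketch could look as follows (not the mpLearn source; the event record and its field names are hypothetical):

    #include <math.h>
    #include <stddef.h>

    /* Hypothetical event record; the field names are illustrative only. */
    typedef struct { size_t atom; size_t offset; float amplitude; } event_t;

    /* Hebbian-style dictionary update (Equation 9): each event adds the
     * residual slice aligned with it, scaled by its amplitude and the
     * learning rate, to the triggering atom. Atoms are then renormalised
     * so they cannot grow without bounds. */
    static void learn(float *atoms, size_t n_atoms, size_t atom_len,
                      const float *residual,
                      const event_t *events, size_t n_events, float rate)
    {
        for (size_t e = 0; e < n_events; e++) {
            float *g = &atoms[events[e].atom * atom_len];
            const float *r = &residual[events[e].offset];
            for (size_t i = 0; i < atom_len; i++)
                g[i] += rate * events[e].amplitude * r[i];
        }
        /* Normalise each atom to unit Euclidean norm. */
        for (size_t k = 0; k < n_atoms; k++) {
            float *g = &atoms[k * atom_len];
            float norm = 0.0f;
            for (size_t i = 0; i < atom_len; i++)
                norm += g[i] * g[i];
            norm = sqrtf(norm);
            if (norm > 0.0f)
                for (size_t i = 0; i < atom_len; i++)
                    g[i] /= norm;
        }
    }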

2.5 Algorithmic complexity

The pseudo code presented in Algorithm 1 gives an algorithmic description of matching pursuit. The residual is initialised to the piece of signal that is to be matched. A window is swept along the residual buffer. The contents of the window are passed to the atoms, and the inner product between each atom and the current window is calculated and compared for the largest product. When the whole buffer has been swept, the atom with the largest inner product is recorded and subsequently subtracted from the residual, and the process is repeated until a stopping criterion is reached. The time complexity for these steps is

\[ S \left( \frac{T}{2} (LK + K) + L \right), \tag{10} \]

where S is the resulting average event rate, T is the length of the residual buffer, L is the length of the atoms and K is the number of atoms in the dictionary.

Algorithm 1 The matching pursuit algorithm presented in pseudo code

    L ⇐ atom length
    while no stopping criterion reached do
        αm ⇐ 0, γm ⇐ 0, τm ⇐ 0
        for τ in 1 : L do
            α ⇐ max (⟨R, gγ(t − τ)⟩ : gγ ∈ D)
            γ ⇐ arg maxγ (⟨R, gγ(t − τ)⟩ : gγ ∈ D)
            if α > αm then
                αm ⇐ α, γm ⇐ γ, τm ⇐ τ
            end if
        end for
        R ⇐ R − αm gγm(t − τm)
    end while

2.5.1 Optimisations

The lines presented in Algorithm 2 are concerned with computing the inner products between the current residual window and all the atoms. Since each atom is independent of all the others, the calculation can be carried out in parallel. A speed-up on the order of K can thus be achieved, reducing the K component of the LK term in Equation 10 to 1.

Algorithm 2 A snippet from matching pursuit describing the calculation and comparison of the inner products between the residual and the atoms.

    α ⇐ max (⟨R, gγ⟩ : gγ ∈ D)
    γ ⇐ arg maxγ (⟨R, gγ⟩ : gγ ∈ D)

The comparison step, looping through all the results and extracting the maximum, originally takes K steps, but this can be improved to log K using a construct called a max-heap. A conceptual overview of a max-heap is shown in Figure 1. Values from a linear array are compared in parallel and in pairs; the largest half is propagated to the next stage, where the remaining values are compared in a similar fashion. The process is repeated until the largest value is identified.

Figure 1 – An illustration of how a logarithmic comparison tree is constructed. Each row is executed in parallel.

The internals of the computation of the inner products can also be optimised. The calculation consists of two buffers in which each element should be multiplied with the corresponding element of the other buffer, and the results summed. Using a tree-like structure the combined multiplication and summation can be performed in logarithmic time. The LK dependence of the inner product calculation can thus be reduced to K log L. The initial multiplications are performed in parallel and the results are summed in parallel pairs until only one term remains. The process is shown in Figure 2.
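For intuition, the multiply-and-accumulate tree of Figure 2 can be modelled in software. In the sketch below (illustrative C, assuming a power-of-two length as the VHDL design does), each pass of the outer loop corresponds to one pipeline stage; in hardware all operations within a pass execute simultaneously, giving the stated log L time.

    #include <stddef.h>

    /* Software model of the parallel multiply-and-accumulate tree (Figure 2).
     * len must be a power of two; work must hold len elements. Each
     * iteration of the outer loop models one pipeline stage. */
    static float tree_inner_product(const float *signal, const float *atom,
                                    float *work, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            work[i] = signal[i] * atom[i];        /* stage 0: parallel multiplies */
        for (size_t width = len / 2; width >= 1; width /= 2)
            for (size_t i = 0; i < width; i++)
                work[i] = work[2 * i] + work[2 * i + 1];  /* pairwise sums */
        return work[0];
    }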

The last part of the algorithm is a subtraction from the current residual, forming the residual for the next iteration. Ordinarily this is performed in time linear in the length of the array, and it can be improved to constant time if the process is made parallel. The complexity of the subtraction step is thus reduced from L to 1.

Figure 2 – An illustration of how the parallel operations of multiply and accumulate are performed. Each row is executed in parallel.

2.5.2 Optimised complexity

Using the optimisations described above, the original complexity, Equation 10, repeated here for ease of reading,

\[ S \left( \frac{T}{2} (LK + K) + L \right), \tag{11} \]

can be reduced to (with all coefficients discarded)

\[ S \left( T (\log L + \log K) + 1 \right), \tag{12} \]

and even if the final optimisation, the subtraction of the event, is left out, the large-scale complexity can be written as

\[ S\, T (\log L + \log K). \tag{13} \]

Since L is much larger than K, Equation 10 can be approximated with $STLK$, and comparing this to Equation 13 it can be seen that LK has practically been reduced to $\log L + \log K$. Thus the dependency on the atom length and dictionary size has been reduced significantly.
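As a concrete illustration, using the parameter values that appear elsewhere in this text (L = 128 and K = 32, taken purely as an example):

\[ LK = 128 \cdot 32 = 4096 \quad \rightarrow \quad \log_2 L + \log_2 K = 7 + 5 = 12, \]

a reduction of more than two orders of magnitude per window position.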

Figure 3 – An overview of the data flow when matching. Parts of the signal are acquired in chunks and fed to mpLib. The library maintains a residual consisting of two parts: the newly acquired chunk and half the remaining residual from the previous iteration.

3 Implementation

One goal of the thesis was the development of mpLib, a library for performing matching pursuit with dictionary learning; knowing this, it is easy to guess the origin of the library name. Besides matching pursuit it provides some basic auxiliary functions for signal acquisition suitable for development purposes. Another version of the library was also produced that implements the computational core on an FPGA. The FPGA computational core was modelled after the C implementation and designed in VHDL.

3.1 mpLib

One of the results of the project is a library implementing the matching pursuit algorithm. The library itself is divided into several modules that each implement distinct functionality. The library is implemented in C and is targeted for easy deployment on embedded systems. This is reflected in, among other things, the choice of implementation language and memory management. The modules in the library are mpInterface, mpLearn, mpReconstruct, mpio and mpUtility; they are all explained in this section, after a general overview.

3.1.1 Design philosophy and overview

The library mpLib is targeted at embedded systems and high performance. It is therefore implemented in C, providing low-level access and control. Embedded systems, and FPGAs in particular, sometimes lack a dedicated floating-point processor. This puts restrictions on the format used for data representation; more on this in Section 3.1.2. The library uses a fixed path of computing and limits memory use by putting a few restrictions on the matching pursuit algorithm.

First, there is no dynamic memory allocation in mpLib. The embedded system usually lacks dynamic memory management and keeps all memory on the stack, so all declarations need to be made at compile time. Configuration is performed by compiling an executable with different parameters. For generality, and to keep things low-level and modular, memory management is left to the programmer. The exact specification for this is presented in Section 3.1.3.

To achieve high performance the signal is processed sequentially in finite-size pieces, hereafter called chunks. The signal is sampled and saved to the residual buffer. The length of a chunk is currently set to one atom length but is subject to change in future versions. To achieve matching against different time-shifts of the atoms a second chunk is stored. This chunk is the first half of the remaining residual from the last iteration. In other words, the two chunks together constitute the residual buffer. When one matching iteration is complete, the old residual is shifted so that half of it is retained alongside the next newly acquired chunk.
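A minimal C sketch of this buffer management (illustrative, not the mpLib source; it assumes a residual buffer of two chunk lengths with the newest chunk at the front, as described in Section 3.1.4):

    #include <stdint.h>
    #include <string.h>

    /* Model of the frame update: the previous front chunk is shifted to the
     * back half of the residual buffer and the new chunk is copied in front. */
    static void frame_update(int16_t *residual, const int16_t *chunk,
                             size_t chunk_len)
    {
        memmove(&residual[chunk_len], residual, chunk_len * sizeof residual[0]);
        memcpy(residual, chunk, chunk_len * sizeof residual[0]);
    }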

Figure 4 – A schematic overview of data flow in mpLib. The data is acquired and sent to the residual buffer in chunks one atom length long. The residual buffer is then matched, and the atom and offset with the largest amplitude are recorded as an event. The matched atom is subtracted from the residual and matching continues until a stopping criterion is fulfilled.

There are two parameters controlling the matching process: the maximum number of events to extract from the signal and the minimal amplitude allowed for an event to be recorded. Matching continues until either enough events have been generated or the last event extracted has too small an amplitude.

A conceptual overview of the internals of the library is provided in Figure 4. The signal is acquired and passed to the current residual buffer. Half the previous residual is saved and shifted to the end of the buffer. The residual buffer is matched and the event with the largest amplitude is recorded. The matching is performed by letting a window slide over the residual buffer. The slice corresponding to the window is distributed to the atoms and inner products are calculated. When the window has slid across the whole residual buffer, it is updated by subtracting the largest candidate event from the residual, and the process is repeated until a stopping criterion is fulfilled.

3.1.2 Data representation

Depending on the platform on which mpLib is deployed, different limitations will apply. The floating point data format is convenient to work with but has a disadvantage in speed and hardware support. An alternative is fixed point representation, which has low quantisation error in relation to bit depth but suffers from low dynamic range. In mpLib the programmer can choose which data format to use


by specifying either of the two data format markers:

• MP_DATA_FORMAT_FLOAT for single precision floating point.
• MP_DATA_FORMAT_Q15 for 16-bit fixed point formatted in Q1.15.

When one of these is defined, the correct implementation of the library is automatically selected at compile time and the data type used for representing the data is set accordingly. It is currently float for the floating point option and int16_t for the fixed point option. Having this flexibility in data representation comes at a price: one extra code path must be maintained for each data type supported. To alleviate this problem and reduce code redundancy, a system was put in place that allows for common definitions and separate implementations of the library modules. The approach is similar to include guards in headers. The general structure is

#ifdef MP_DATA_FORMAT_Q15

/* Code goes here */

#endif

The modules have a common header file with all function declarations, while the implementation may be split into three files: one code path for each data format supported and one for common functionality.
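For reference, arithmetic in the Q1.15 format is typically carried out in a wider intermediate type; a generic sketch of a Q1.15 multiplication (standard fixed point practice, not code taken from mpLib) is:

    #include <stdint.h>

    /* Multiply two Q1.15 numbers. The 32-bit product is in Q2.30 format;
     * shifting right by 15 bits returns it to Q1.15 (extra fraction bits
     * are truncated, and saturation of the -1 * -1 corner case is omitted). */
    static int16_t q15_mul(int16_t a, int16_t b)
    {
        int32_t p = (int32_t)a * (int32_t)b;   /* Q2.30 intermediate */
        return (int16_t)(p >> 15);
    }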

3.1.3 Configuration

To control the behaviour of mpLib there are a number of options. They are all compile-time constants, as the library is targeted towards resource-constrained devices, and when the correct settings are in place they are unlikely to change.

MP_KERNEL_LENGTH Specifies the length of the atoms in the dictionary in samples.

MP_KERNEL_NUM Specifies how many atoms there are in the dictionary.

MP_SPIKE_MAX_ITERATIONS The maximum number of events that will be generated for each atom-length chunk of the signal.

MP_SPIKE_MIN_AMPLITUDE The minimum amplitude for an event. Currently this constant is not generic and must be specified in the correct data format for the current representation, for example 0.01f for floats.

MP_LEARN_SCALE The learning rate parameter; controls how fast learning is performed. A value of 13 corresponds to a learning rate of 2^-13.

These configuration parameters can be found in the mpConfig module. The data representation format, stored in mpData_t, is specified here, controlled by the MP_DATA_FORMAT_Q15 and MP_DATA_FORMAT_FLOAT markers. The representation of events is also defined here.

The programmer is responsible for managing the memory for the buffers used by the library. These are mpData_t reconstruction[], mpData_t residual[] and mpData_t atoms[]. The reconstruction buffer is used by the module mpReconstruct and holds the currently reconstructed segment of the signal; if no reconstruction is needed this buffer can be omitted. The residual buffer is used by mpMatch and holds the current residual to be matched. The atoms buffer holds the dictionary and all its atoms. The sizes of these buffers are defined at compile time and are subject to the requirements presented in Table 1.

If the module providing input and output functionality using files, mpio, is used, the programmer is responsible for managing all file handles.

Table 1 – A short overview of the requirements for the buffers used by mpLib for which memory must be managed.

Buffer name      Size
reconstruction   at least 2 atom lengths
residual         at least 2 atom lengths
atoms            atom length × number of atoms

3.1.4 mpInterface

The public interface of the matching pursuit algorithm is implemented in this module. There are three functions defined here, and they require that the residual and atom buffers are defined.

mpInit

void mpInit( mpData_t *pResidual, mpData_t *pKernels )

As the name implies, mpInit initialises mpLib. More specifically, it handles the set-up of the dictionary and residual buffer. Each element in the residual buffer is set to zero. Note that the length of the buffer is governed by the configuration option MP_KERNEL_LENGTH, which also controls the length of the atoms. Each atom in the dictionary is initialised with uniform random numbers provided by the C function rand. The tails of the atoms are then set to zero; the number of zeroed items is currently set to a tenth of the atom length.

mpFrameUpdate

void mpFrameUpdate( mpData_t *inputSignal, mpData_t *residual, int32_t length )

The mpFrameUpdate function provides the only way of inputting data to the framework. As Section 3.1.1 details, the residual buffer is defined to be two atom lengths. The frame update shifts a new block of input into the residual buffer while shifting out one of the old blocks from the other end. The first half of the buffer, counting from index zero, will contain the data most recently added and the second half will contain the block that was added during the previous frame update.

mpMatch

void mpMatch( mpData_t *pResidual,
              mpData_t *pKernels,
              spike_t *pSpikes,
              int32_t *nOutSpikes )

The mpMatch function generates at most MP_SPIKE_MAX_ITERATIONS events from a residual buffer when given a dictionary. Each event is added to the array pointed to by pSpikes, and the final residual after matching is left in the pResidual buffer. Internally, the function calculates the inner products for all atoms in the dictionary and for all time shifts; only time shifts where the complete atom fits in the residual buffer are considered. The event with the highest amplitude is subsequently subtracted from the residual, forming the new residual for the next iteration. The process is repeated until a stopping criterion is reached.
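Putting the interface together, a typical processing loop might look like the sketch below, assembled from the documented signatures and buffer requirements. The header name and the chunk acquisition function are assumptions for illustration.

    #include <stdint.h>
    #include "mpLib.h"   /* assumed header name for the mpLib interface */

    /* Statically allocated buffers, sized according to Table 1. */
    static mpData_t residual[2 * MP_KERNEL_LENGTH];
    static mpData_t atoms[MP_KERNEL_NUM * MP_KERNEL_LENGTH];
    static spike_t  spikes[MP_SPIKE_MAX_ITERATIONS];

    extern int acquire_chunk(mpData_t *chunk);   /* hypothetical signal source */

    void process_stream(void)
    {
        mpData_t chunk[MP_KERNEL_LENGTH];
        int32_t  nSpikes;

        mpInit(residual, atoms);        /* zero residual, randomise dictionary */
        mpInitLearning();

        while (acquire_chunk(chunk)) {
            /* Shift one atom length of new samples into the residual buffer. */
            mpFrameUpdate(chunk, residual, MP_KERNEL_LENGTH);

            /* Extract events until a stopping criterion is reached. */
            mpMatch(residual, atoms, spikes, &nSpikes);

            /* Adapt the dictionary using the events and the final residual. */
            mpLearn(residual, atoms, spikes, nSpikes);
        }
    }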

3.1.5 mpLearn

The mpLearn module implements dictionary learning according to Equation 9. It currently does not need initialisation but that may change in a future version.

mpInitLearning

void mpInitLearning()

The function mpInitLearning sets up the learning module. Presently the only thing it does is initialising a buffer that is reserved for storage between learning iterations. That storage may be needed because of the limited range of the 16-bit fixed point format, but it is not used in the current version.

mpLearn

void mpLearn( mpData_t *pResidual,
              mpData_t *pKernels,
              spike_t *pSpikes,
              int32_t numSpikes )

Learning is performed by the mpLearn function, which implements Equation 9. All the triggered events stored in the pSpikes array are iterated through, and the residual aligned with a particular event is scaled and added to the corresponding atom.

3.1.6 mpReconstruct

void mpReconstructSignal( spike_t *spikes,
                          int32_t length,
                          mpData_t *reconstruction,
                          int32_t reconstructionLength,
                          mpData_t *atoms,
                          int32_t kernelsLength )

The module for signal reconstruction, mpReconstruct, provides an interface with a single function. The function reconstructs a signal given a spike code and a dictionary of atoms, provided that the reconstruction buffer can fit the reconstructed signal. The length of the buffer must be one atom length longer than the largest offset in the spike array.
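Conceptually the reconstruction evaluates the sum in Equation 4 over the stored events. A simplified C sketch (illustrative, with hypothetical event field names; not the mpReconstruct source):

    #include <stddef.h>

    /* Hypothetical event record; the field names are illustrative only. */
    typedef struct { size_t atom; size_t offset; float amplitude; } event_t;

    /* Rebuild the signal as a sum of amplitude-scaled, time-shifted atoms
     * (Equation 4). The output buffer must extend one atom length past the
     * largest event offset and is assumed to be zeroed by the caller. */
    static void reconstruct(float *out,
                            const float *atoms, size_t atom_len,
                            const event_t *events, size_t n_events)
    {
        for (size_t e = 0; e < n_events; e++) {
            const float *g = &atoms[events[e].atom * atom_len];
            for (size_t i = 0; i < atom_len; i++)
                out[events[e].offset + i] += events[e].amplitude * g[i];
        }
    }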

3.1.7 Other modules

Figure 5 – A schematic overview of the parts of mpLib implemented for the FPGA.

The mpUtility module provides auxiliary functions, for example for conversion between floating and fixed point. The last module, mpio, provides input and output functionality for application of the library on a PC platform using ASCII files.

3.2 VHDL

A hybrid design utilising both an FPGA and a CPU is proposed here, where the computationally intensive parts are handled by the FPGA and the complementary functionality is provided by the CPU.

The optimisations discussed in Sections 2.2 and 2.5 are suitable for implementation on an FPGA. There are four different optimisations covered in the section on algorithmic complexity, but due to time limitations only three have been implemented.

The implemented optimisations are three parallelisations: the calculation of the inner products has been distributed so that each atom calculates its own inner product in parallel, the core of the inner product calculation has also been parallelised, and the comparison step is performed in parallel. These optimisations are implemented by the atom, the max-comparator and the candidate-event generator components respectively.

A schematic overview of the functionality implemented on the FPGA and the functionality handled by the CPU can be found in Figure 5. The FPGA handles the calculation of the inner products and the closely related structure required to output the largest event for a specific residual.

Figure 6 – A graphical description of the pipelined atom implemented in VHDL. All multiplications between the signal buffer and the coefficient buffer are calculated in parallel and then summed in logarithmic time using a pipelined adder tree.

3.2.1 Atom

The atom component is implemented as a parallel multiplication followed by a pipelined adder tree. The two input buffers are the signal and the atom, modelled as a FIR filter defined by a coefficient buffer. For ease of implementation the sizes of the input buffers are limited to powers of two. In Section 2.5 the time complexity of the inner product calculation is determined to be log L, where L is the length of the atom. The pipeline introduces a delay at the output proportional to the height of the adder tree, log L. Figure 6 gives a visual description of the pipelined implementation.

Figure 7 – A graphical description of the pipelined implementation of a max comparator with a configurable number of inputs. The comparator uses a max-heap structure to determine the largest input in logarithmic time.

3.2.2 Max-comparator

The max-comparator compares the values in a linear array and outputs the largest of them, as shown in Figure 7. The implementation uses a pipelined max-heap to calculate the result in log n clock cycles, where n is the number of elements in the input array. The pipeline introduces a delay of log n clock cycles because the heap is a tree-like structure, and the running time is similar to that of the atom implementation.

Figure 8 – A candidate event is determined by the inner products of the signal and the dictionary. The component distributes an input signal slice to a number of atoms, calculates their inner products and emits the index of the atom with the largest inner product.

3.2.3 Candidate-event generator

The candidate-event generator is a higher-level construct than the atom and max-comparator components. It is responsible for producing a candidate event by matching the input signal slice with the atoms of the dictionary; the atom and offset for the largest-amplitude event are output. Internally it uses the atom and max-comparator components. Figure 8 gives an overview of the component.

Figure 9 – The sweeper component processes the entire residual buffer using a sliding window, calculating inner products for all atoms. The atom and time offset giving the largest amplitude, together with the amplitude itself, are emitted as an event when processing completes.


3.2.4 Sweeper

The sweeper component, shown in Figure 9, implements the part of matching pursuit that iterates over the residual buffer. In each iteration it extracts a slice of the residual buffer and sends it to the candidate-event generator. It keeps track of the largest inner product generated so far and the corresponding atom and time offset. Once the whole residual buffer has been processed it outputs the largest-amplitude event.

3.2.5 Data representation

In this implementation the total number of bits required to represent an event is given by

\[ D_{ev} = D_s + \log T + \log K, \tag{14} \]

where $D_s$ and $D_{ev}$ are the bit depths (numbers of bits) per sample and per event respectively, T is the number of possible time offsets per processing period and K is the number of atoms in the dictionary. This gives a bit rate of

\[ R_{mp} = D_{ev} f_{ev}, \tag{15} \]

where $R_{mp}$ is the bit rate and $f_{ev}$ is the event frequency. The classical bit rate is given by the number of raw data samples per second times the bit depth per sample,

\[ R_c = D_s f_s. \tag{16} \]

In the case with 16 bits per sample the classical bit rate is

\[ R_c = 16 f_s. \tag{17} \]

In the complex application considered by Smith and Lewicki [12] no more than 32 atoms are needed in the dictionary, which can be coded using 5 bits. A residual length of twice the atom length and an atom length of 128 samples yields an offset that can be coded in 7 bits. Assuming that an average event rate of a tenth of the sample rate gives reasonable fidelity, which is a reasonable estimate considering the results obtained in the case study presented below, and again calculating for 16 bits per sample, yields

\[ R_{mp} \approx 28 \cdot \frac{1}{10} f_s. \tag{18} \]

Thus matching pursuit provides an improvement of the bandwidth requirements of about six times compared to sending raw data.
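As a check of the quoted factor, the numbers above can be combined explicitly (a worked example using the stated assumptions of 16-bit samples, 128-sample atoms, 32 atoms and an event rate of one tenth of the sample rate):

\[ D_{ev} = 16 + 7 + 5 = 28, \qquad R_{mp} = 28 \cdot \tfrac{1}{10} f_s = 2.8 f_s, \qquad \frac{R_c}{R_{mp}} = \frac{16 f_s}{2.8 f_s} \approx 5.7. \]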

4 Simulation results

The implementation described in Section 3 is simulated and verified, and the results are presented in this section, starting with a basic feature detection case study where a sine signal is processed with the C implementation, mpLib. Secondly, a case study where several distinct features are combined to form the input is considered.

The VHDL implementation is simulated using ghdl [18] and gtkWave [19]. Benchmarks and a profile for mpLib are also presented.

4.1 Simulation: Basic feature detection

A sinusoid was processed with mpLib two times, with the library configured differently each time. The first time, the atom length was 128 samples, the window length was 128 samples and the number of atoms was 1. The second time, the number of atoms was increased to 3. The sinusoid had a period of 0.75 times the atom length.

Two types of figures are shown: a combined graph of the input signal, the reconstruction provided by mpLib and the resulting residual between the two signals, and a spikegram plot showing the few largest component events in the reconstruction.

In Figure 10 the results for the first case, with only one atom, can be seen. Only the largest events are shown, and the events are shaded so that large amplitude corresponds to a dark colour. There is a significant periodic mismatch in the residual, but other than that the reconstruction is good. The atom is very oscillatory and, as the spikegram shows, two additions of the atom right next to each other are required for accurate reconstruction.

The three-atom case, shown in Figure 11, is significantly better at reconstructing the signal. The second atom is similar to the atom in the single-atom case. The first atom seems to be an auxiliary atom that corrects whatever the second atom cannot match. The third atom is practically unused.

4.2 Simulation: Extended feature detection

A simulation was performed to measure the feature detection capabilities of the library. A test signal consisting of a carrier sine wave with superimposed impulses and noise was processed with both the fixed point and floating point implementations of mpLib.

The sinusoid had a period of approximately 80 samples, and the impulses consisted of an exponentially decaying sinusoid reminiscent of acoustic emission waves and a Morlet wavelet. The impulses had fixed lengths: 40 samples for the Morlet and 60 samples for the exponential feature. The noise level was 31% and the feature density was around 10% for each impulse type; thus around 20% of the signal contains samples from the impulses. All features present in the signal are shown in Figure 12.

A signal consisting of these features was generated in Matlab and processed with mpLib. The signal was 5 · 10^5 samples long, corresponding to 50 seconds. The dictionary was learned with mpLib for 25 iterations through the signal with a learning rate of 10^-4. The signal was then run through mpLib a second time with learning turned off. The resulting events were recorded and used to reconstruct the signal. The result is shown in Figure 13a for the floating point version and Figure 14a for the fixed point version. The resulting signal-to-residual ratios were 12.3 dB and 10.3 dB respectively.


In Figure 13b and Figure 14b a spikegram of the first 256 samples is shown, with the corresponding part of the input signal included for reference. The atoms are also presented in this plot, with the uppermost atom corresponding to the uppermost line in the spikegram. Only the high-amplitude events representing significant matches between the signal and atoms in the dictionary are shown. In the spikegram for the floating point result it can be seen that the harmonic atoms, numbers 8 and 4, correlate well with the sinusoidal part of the signal. The step-like atoms 7 and 5 seem to compensate for the fact that the sinusoidal atoms did not begin at phase zero. Atom 1 looks much like the Morlet wavelet in Figure 12 and triggers at times when the signal exhibits this feature. Between samples 100 and 160 there is an exponential impulse where atom 2 triggers. Atoms 3 and 6 could also partake in the matching of the exponential, as there seem to be low-amplitude exponential features around samples 25 and 125 where they trigger.

The fixed point implementation shows a behaviour similar to that of the floating point version, though some small differences exist. The floating point version encodes the signal with a higher signal-to-residual ratio, but the fixed point version provides more convincing features. Both implementations show step-like atoms that are not part of the input feature space; these can be edge effects from the signal being processed in chunks.

Figure 10 – Simulation results for a simple feature detection experiment where a single atom is trained on an input sinusoid with a period length of 0.75 times the length of the atom. Shown in (10a) is the input signal, the reconstruction and the resulting residual. Part (10b) shows a spikegram for the first 256 samples, a zoomed-in reconstruction and details of the dictionary.

Figure 11 – Simulation results for a simple feature detection experiment where three atoms are trained on an input sinusoid with a period length of 0.75 times the length of the atom. Shown in (11a) is the input signal, the reconstruction and the resulting residual. Part (11b) shows a spikegram for the first 256 samples, a zoomed-in reconstruction and details of the dictionary.

Figure 12 – Features present in the signal considered in the feature detection simulation. There are four components: a continuous sine wave, a Morlet impulse, an exponential impulse and Gaussian noise. The impulses are repeated with random offsets.

Figure 13 – Learning of features in the signal described by Figure 12. Panel (13a) shows the result of running the floating point version of mpLib; the reconstruction signal-to-residual ratio is 12.3 dB. Part (13b) shows the first 128 samples of the floating point signal along with the events and atoms used for reconstruction.

Figure 14 – Fixed-point learning of the features in the signal described by Figure 12. Panel (14a) shows the result of running the fixed point version of mpLib; the reconstruction signal-to-residual ratio is 10.5 dB. Part (14b) shows the first 128 samples of the fixed point signal along with the events and atoms used for reconstruction.

4.3 Simulation: VHDL implementation

Figure 15 – A screenshot from the VHDL simulation viewer gtkWave showing a verification simulation of the VHDL implementation. Key signals and their timings can be seen. The time scale is arbitrary; only the propagation with respect to the number of clock cycles is relevant.

The VHDL implementation as specified in Section 3.2 was simulated using the ghdl tool and the simulation results were visualised using gtkWave. There were plans on synthesising the design on an FPGA chip that was developed in a parallel project but due to time limitations that was not possible.

The simulation shows that the implementation functions as expected. The signals in Figure 15 show the input to and the output from the sweeper component. Matching is done twice. The procedure is reset while the reset flag is high, and processing of the data residing in the residual buffer starts when it is toggled low. Process completion is indicated by the done flag; when it is toggled, the output can be read.

The input signal is 1, 2, 0, 0, 2, 1, 1 in the first matching run and 1, 2, 4, 4, 2, 1, 1 in the second. The atoms are initialised to 1, 0, 0, 1 and 0, 1, 1, 0. It is expected that the first atom will trigger in the first case at the second element, corresponding to the gap. In the second case the best match should be for the second atom in the same position.

In Figure 15 the first signal is the reference clock; the simulation is advanced on each rising clock edge. The next three signals are the output of the component and correspond to the amplitude, the atom index and the time-shift of the event with the largest amplitude. This is followed by the done signal (orange) that, as stated previously, indicates whether data can be read from the output or not.

The current amplitude and current offset signals correspond to the best match for the given offset. If the halt flag is high, processing has stopped and the pipeline is being emptied.

As can be seen in Figure 15, the correct atom is triggered at the correct offset, where offset 0 corresponds to the first element. The amplitudes are also correct. The processing takes 10 cycles, where 5 cycles are spent calculating and 5 cycles are required to empty the pipeline. Section 3.2 predicts that log L + log K cycles are required for the pipeline, which corresponds to what is seen.


4.4 Benchmarking of mpLib

The C implementation of the library was benchmarked on a standard laptop computer with an Intel Core i5-2557M at 2.7 GHz. A Matlab script was constructed that facilitated signal generation and batch benchmarking. A signal was prepared by the script and 1000 matching and learning iterations were executed by mpLib. The benchmarking executable was bare-bones and performed no operations other than reading the signal, matching and learning. No output was created.

The results can be seen in Tables 2 to 4 and the data is visualised in Figure 16. The plot also contains fits produced by the polyfit command in Matlab. The resulting fits are

\[ y = 210x + 73 \ \text{ms} \tag{19} \]

for the number of atoms in the dictionary,

\[ y = 0.19x^2 + 2.6x - 9.0 \ \text{ms} \tag{20} \]

for the atom length, and

\[ y = 340x + 84 \ \text{ms} \tag{21} \]

for the event rate.

Table 2 – Benchmarking results when varying the number of atoms in the dictionary. The number of events extracted were set to 10 and the length of the atoms to 128 samples. The benchmarking was run on a sine wave for 1000 matching and learning iterations.

Number of atoms          1        2        4        8        16       32       64
Fixed point
  Time [s]               0.38     0.76     1.25     2.53     4.91     9.79     19.36
  Spike rate [1/s]       25.82k   13.06k   7.95k    3.94k    2.03k    1.02k    0.51k
Floating point
  Time [s]               0.28     0.49     0.93     1.76     3.48     6.81     13.64
  Spike rate [1/s]       34.84k   20.30k   10.73k   5.66k    2.87k    1.46k    0.73k

(36)

Table 3 – Benchmarking results when varying the length of the atoms in the dictionary. The number of events extracted was set to 10 and the number of atoms in the dictionary to 16. The benchmarking was run on a sine wave for 1000 matching and learning iterations.

Atom length              32      35      64      75      110     128     161     256     348     512
Fixed point
  Time [s]               0.32    0.42    1.30    2.00    3.78    4.91    8.03    16.45   36.41   77.15
  Spike rate [1/s]       30.93k  23.77k  7.66k   4.98k   2.64k   2.03k   1.24k   0.50k   0.27k   0.12k
Floating point
  Time [s]               0.28    0.33    0.94    1.26    2.58    3.48    5.35    11.36   24.15   51.49
  Spike rate [1/s]       35.63k  30.17k  10.59k  7.92k   3.87k   2.87k   1.86k   0.76k   0.41k   0.19k

Table 4 – Benchmarking results when varying the number of spikes extracted per matching and learning iteration. The number of atoms in the dictionary was set to 16 and the length of the atoms to 128 samples. The benchmarking was run on a sine wave for 1000 matching and learning iterations.

Number of events         10      20      30      40      50      60      70      80      90
Fixed point
  Time [s]               4.91    10.08   14.91   20.33   25.82   29.01   34.01   38.81   45.38
  Spike rate [1/s]       2.03k   1.98k   2.01k   1.96k   1.93k   2.06k   2.05k   2.06k   1.98k
Floating point
  Time [s]               3.48    6.89    10.33   13.57   16.95   20.41   23.74   27.19   30.73
  Spike rate [1/s]       2.87k   2.90k   2.90k   2.94k   2.94k   2.93k   2.94k   2.94k   2.92k


Figure 16 – Benchmarking of mpLib: processing time as a function of the number of atoms, the length of the atoms (samples), and the number of spikes extracted, for the fixed- and floating-point versions together with the best-fit curves from Equations (19)–(21).


4.5 Profiling of mpLib

Two of the executables used in the benchmarking tests were profiled with Instruments to see where program execution time was spent. The executables were the floating- and fixed-point versions of mpLib, configured to use 128 samples as the atom length and 16 atoms in the dictionary, and to extract at most 10 events from each chunk. The profiling results can be seen in Table 5. The input was processed for 10000 matching and learning iterations.

Time spent in functions corresponding to matching the input constituted 97.5 % of the time for the floating-point case and 99.0 % of the time for the fixed-point implementation. In both cases the execution time was heavily dominated by mpKernelInnerProduct, a function internal to the mpMatch module that computes the inner product for an atom. Less than 1 % of the time was spent on learning in both cases.
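The shape of such a kernel explains the profile: for each candidate offset, L multiply-accumulate operations are performed per atom, so essentially all arithmetic in the algorithm passes through one tight loop. A minimal floating-point sketch of a kernel of this kind (illustrative; not the actual mpKernelInnerProduct) is:

    /* Correlate one atom of length len with the residual at a given
     * offset; called once per atom and candidate offset. */
    static float inner_product(const float *atom, const float *residual,
                               int offset, int len)
    {
        float acc = 0.0f;
        for (int i = 0; i < len; i++)
            acc += atom[i] * residual[offset + i];
        return acc;
    }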

Table 5 – Profiling results collected for the fixed- and floating-point implementations of mpLib running for 10000 matching and learning iterations.

         Running time [s]   Time spent matching [%]   Time spent learning [%]
Float    38                 97.6                      0.4


5 Discussion

The aim of this thesis is to explore the possibility to calculate sparse codes of high-frequency sensor signals in resource-constrained embedded systems using matching pursuit and unsupervised dictionary learning. In particular, processing of acoustic emission signals is a tentative application that requires sampling rates on the order of 1 MHz. A prototype software library in C and VHDL is designed and the possibility to use a modern FPGA to enable processing at 1 MHz is discussed. Sparse coding with a learned dictionary of atomic waveforms can reduce the sensor data rate significantly. The resulting atoms encode characteristic features in the signal, which is desirable for pattern recognition and monitoring purposes.

5.1 mpLib - Implementation

A problem encountered during the course of the project was the handling of fixed-point formats. The library is designed to work with 32-bit floats or 16-bit integers in Q15 format. Currently the library uses an ad-hoc solution to handle overflow that seems to work well. It is mainly the inner product calculation that can overflow. Therefore, the data format of the event amplitude is set to have extra guard bits, currently 6, although log L bits are needed to guarantee no overflow. Here, L is the length of the atoms in samples.
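As an illustration of the guard-bit requirement, the following sketch (assumptions: Q15 data and a 64-bit accumulator; this is not the mpLib code) computes an inner product without intermediate overflow:

    #include <stdint.h>

    /* Each Q15 x Q15 product fits in Q30; a 64-bit accumulator provides
     * far more than the log2(len) guard bits the summation needs. */
    static int32_t q15_inner_product(const int16_t *atom,
                                     const int16_t *residual, int len)
    {
        int64_t acc = 0;
        for (int i = 0; i < len; i++)
            acc += (int32_t)atom[i] * residual[i];   /* Q15 * Q15 -> Q30 */
        return (int32_t)(acc >> 15);                 /* back to Q15, with headroom */
    }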

The learning rate required for the gradient ascent learning scheme is problematic. The result is best when the learning rate is on the order of 10⁻⁵; however, this is also near the resolution limit of the Q15 fixed-point format, 2⁻¹⁵ ≈ 3 · 10⁻⁵. To circumvent this problem a scheme was devised to accumulate the gradients into a learning buffer and, once the magnitude of the accumulated buffer is large enough, multiply it by the learning rate and add it to the corresponding atom. In this way a low learning rate can be used with retained accuracy in the learning. However, when this approach was tested the results were erroneous, most probably due to a programming error by the author. Because of time limitations the idea was abandoned and a higher learning rate on the order of 10⁻⁴ is used, which is good enough for the cases presented here.
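For reference, the intended accumulation scheme could be sketched as follows (illustrative only; the mpLib version was abandoned, and the threshold and function name are assumptions):

    #include <stdint.h>

    #define ACC_THRESHOLD (1 << 20)   /* hypothetical trigger level */

    /* Sum gradients into a wide buffer and fold them into the Q15 atom
     * only when the accumulated magnitude is large enough for a small
     * learning rate to survive Q15 quantisation. */
    static void accumulate_gradient(int32_t *grad_buf, const int16_t *grad,
                                    int16_t *atom, int len, int32_t rate_q15)
    {
        int32_t peak = 0;
        for (int i = 0; i < len; i++) {
            grad_buf[i] += grad[i];
            int32_t mag = grad_buf[i] < 0 ? -grad_buf[i] : grad_buf[i];
            if (mag > peak)
                peak = mag;
        }
        if (peak >= ACC_THRESHOLD) {
            for (int i = 0; i < len; i++) {
                atom[i] += (int16_t)(((int64_t)grad_buf[i] * rate_q15) >> 15);
                grad_buf[i] = 0;   /* flush after applying the update */
            }
        }
    }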

5.2 mpLib - Simulation

The simple feature detection simulation includes a single atom for feature detection and learning. The feature in the simulation is shorter than the atom, yet there is still a significant residual with a periodic repetition. The atom does not closely resemble the feature it encodes, but several events triggered in short succession manage to accurately recreate the feature. This high-frequency oscillation might be due to the noise present in the signal or the initialisation of the original atom. Likely the gradient ascent amplifies the initial small variations. Possibly, initialising the atoms with lower-frequency noise would circumvent this phenomenon. Classification of the signal is still possible even though the atom does not directly represent the feature of the signal, because of the characteristic event pattern that occurs each time the signal feature is encoded.

The second use case includes several atoms in the dictionary, which implies that the model can represent the input signal more accurately. Here multiple atoms combine to encode the whole feature. This indicates that any common feature can be encoded as long as there are sufficiently many atoms in the dictionary to encode it (and given that it does not have a higher frequency content than the atoms can represent).


Another way to deal with features of varying length is to use a method similar to that of Smith and Lewicki [12], where the atoms grow and shrink adaptively. One could limit the initial length of the atom to a subset of the samples and, if the atom shows growth in the tails, increase the range in which the atom is allowed to grow. If an atom has a low amplitude in its tails the growable range is reduced. This can continue until the atom has reached its maximum size, potentially determined by the FPGA implementation. One possible formulation of this heuristic is sketched below.
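A sketch under assumed thresholds (adapt_atom_support is hypothetical and not part of mpLib):

    #include <math.h>

    /* The atom is active only on [*lo, *hi); grow the window when there
     * is energy at its edges, shrink it when the tails are quiet. */
    static void adapt_atom_support(const float *atom, int max_len,
                                   int *lo, int *hi,
                                   float grow_thr, float shrink_thr)
    {
        float tail = fmaxf(fabsf(atom[*lo]), fabsf(atom[*hi - 1]));
        if (tail > grow_thr) {            /* growth in the tails: extend */
            if (*lo > 0) (*lo)--;
            if (*hi < max_len) (*hi)++;
        } else if (tail < shrink_thr && *hi - *lo > 2) {
            (*lo)++;                      /* quiet tails: shrink support */
            (*hi)--;
        }
    }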

In the advanced feature detection use case a more realistic signal was considered and the floating- and fixed-point versions of the library were compared. The versions give similar results with some small differences. The floating-point atoms started growing from the left edge, while the fixed-point atoms have more centred features. The fixed-point version seems to learn the exponential term more accurately. This indicates that there are differences between the two versions. The centring of the atoms' features is desirable to avoid edge effects, and it could be achieved by the adaptive process described above. One could also imagine a modified algorithm that shifts the atom if the amplitude starts to grow too far out in the tail. The step-like atoms learned in both cases might be windowing effects that arise because the library only handles one atom length of samples at a time.

5.3 VHDL - Implementation

The VHDL simulation gives a good indication that the code is working correctly; however, testing on a real FPGA is still required. It would be interesting to know how many chip resources the implementation utilises, how much power it consumes, and to benchmark the VHDL-implemented parts.

If the VHDL implementation uses relatively few chip resources in terms of logic and DSP blocks, several instances could be run simultaneously at different sampling rates. The signal input to each instance can be provided as low-pass filtered versions of a single sensor signal. This could be used to detect features with both high and low frequency content. Currently the fixed and small size of the atoms and the sample buffer prevents large-scale features from being detected. The investigation conducted by Nilsson [4] indicates that an FIR-filter bank, which is the centre-piece of the matching pursuit algorithm, does indeed use a relatively small chip area, making the above point plausible. However, this design is more complex than an FIR-filter bank and the results from Nilsson are therefore not directly comparable.

5.4 VHDL - Simulation

Comparing the values found in Section 4.4 to the ones in Section 2.5, it is seen that the approximations used are sensible. The only dubious case is the second-order dependence on the atom length: the coefficient of the second-degree term seems small, but as a quadratic term it has a large influence.

The profiling shows that the calculation of the inner products is the operation that requires the most processing time, indicating that implementing this part in dedicated hardware is a sound optimisation. The results in Sections 2.5, 4.5 and 4.4 show that the optimisations provide a significant speed-up in comparison to the naive implementation.


5.5 VHDL - Bandwidth

The critical code section is the matching of events against the residual. This requires something on the order of ten events per 100 input samples or less, depending on the signal structure and noise. At a one-megahertz sampling rate this corresponds to roughly 10⁵ events per second. According to Section 4.4 an event is matched in roughly the same time as a sample chunk is acquired. Thus the FPGA must be clocked at around 10 MHz. The parallel project focusing on the development of an FPGA sensor platform found that an FIR-filter can be implemented and clocked at 77 MHz on a Xilinx Zynq-7000 [4]. This indicates that the design presented here is feasible, since the core computation is basically that of an FIR-filter bank.
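As a rough worked estimate (assuming one event is matched in about the time of a 100-sample chunk, at roughly one clock cycle per sample): 10⁵ events/s · 100 cycles/event ≈ 10⁷ cycles/s, which motivates the 10 MHz figure.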

The data reduction presented in Section 3.2.5 is interesting because it predicts an order of magnitude reduction in bandwidth requirements. This approach enables more sensors per area as more devices are able to talk on the same wireless band. This coupled with feature extraction can enable new applications in fault detection as stated in the introduction, and it enables interesting developments for the emerging “Internet of Things”.

In Section 3.2.5 it is stated that 28 bits are required to code the example event, yielding a very general representation. The number of bits can be reduced, however, according to the needs of a specific application. The event amplitude output can possibly be reduced to 12, 8 or fewer bits while still retaining a reasonably good approximation. In extreme cases, perhaps in fault diagnosis, where the existence of a feature yields sufficient information, the amplitude can be binary. In that case the event timing information might also be irrelevant, reducing the amount of information that needs to be transferred significantly.
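For illustration, one packing consistent with the 28-bit figure could look as follows (the field widths are assumptions, not the actual split used in Section 3.2.5):

    #include <stdint.h>

    /* Hypothetical 28-bit event packing. */
    typedef struct {
        uint32_t amplitude : 16;   /* Q15 event amplitude                  */
        uint32_t atom      : 5;    /* dictionary index, up to 32 atoms     */
        uint32_t offset    : 7;    /* time shift within a 128-sample chunk */
    } event_t;

With an 8-bit or binary amplitude, as suggested above, the event shrinks to 20 or 13 bits respectively.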

It might be possible to avoid coding event timing information as part of the event altogether by letting the receiver note the arrival time of the event locally. Thus a server might receive an atom index and an amplitude, while the time offset of the event is assumed to coincide with the time the network packet arrived. Assuming no network losses, a time-shifted version of the original signal would be reconstructed, where the time-shift would be equal to the latency of the network link. However, further investigation is required to determine the feasibility of the approach.

5.6 Future work

As a continuation of the work presented in this thesis the author would like to run more tests of the implementation. It is possible to vary the number of samples that the library processes at a time. In this text a buffer length of one atom length was considered. It would be interesting to see how the buffer length affects reconstruction quality and whether the appearance of artefact atoms is influenced.

The implementation of the two-step learning scheme that was devised should also be completed. Another task would be to replace the learning algorithm with a parameter-free alternative.

Other ideas to try out include a hybrid growth technique similar to that introduced by Smith and Lewicki [12], and an alternative algorithm adaptation where the atom is shifted if the amplitude starts to grow in the tails. The goal of both techniques is to centre the features.

The above points deal with the quality of the decomposition. It is also possible to further improve the processing speed of the VHDL implementation. One idea is to use multiple windows when calculating the inner products of the dictionary, giving a constant speed-up at the price of increased chip area.


Further work is needed to verify and evaluate the performance of the VHDL implementation, preferably also testing on real-world acoustic emission signals.

5.7 Conclusion

In this thesis a parallel implementation of the matching pursuit algorithm with dictionary learning is developed for synthesis on standard FPGAs. The parallel implementation uses fixed-point data, while a software reference implementation of the algorithm supports both fixed and floating-point data.

The combined theoretical results from this project and the joint project by Nilsson [4] suggest that it is possible to achieve the target processing frequency of 1 Msps with an FPGA implementation of the matching pursuit algorithm, which is needed for on-line processing of acoustic emission signals. It is further suggested that 1 Msps is reachable, in principle, using an ASIC.

The real-world applicability of this approach is demonstrated using simulations and ball-bearing vibration data in a conference paper [13] that was written in collaboration with other members of the SKF UTC during this work.

In general, other applications can potentially benefit from this approach of representing signals in a succinct format using sparse codes that are learned and calculated on-line near the sensors.


References
