Implementation and Evaluation of Historical Consistent Neural Networks Using Parallel Computing


Department of Science and Technology
Linköping University


LIU-ITN-TEK-A--15/051--SE

Implementation and Evaluation of Historical Consistent Neural Networks Using Parallel Computing

Thesis work carried out in Media Technology at the Institute of Technology, Linköping University

Johan Bjarnle

Elias Holmström

Examiner: Pierangelo Dell'Acqua

Norrköping 2015-06-15


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

Abstract

Forecasting the stock market is well known to be a very complex and difficult task, and is even considered by many to be impossible. The new model, Historical Consistent Neural Networks (HCNN), has recently been successfully applied for prediction and risk estimation on the energy markets.

HCNN is developed by Dr. Hans Georg Zimmermann, Siemens AG, Corporate Technology Dpt., Munich, and implemented in the SENN (Simulation Environment for Neural Networks) package, distributed by Siemens. The evaluation is made by tests on a large database of historical price data for global indices, currencies, commodities and interest rates. Tests have been done using the Linux version of the SENN package, provided by Dr. Zimmermann and his research team.

This thesis takes on the task, given by Eturn Fonder AB, of developing a sound basis for evaluating and using HCNN in a fast and easy manner. An important part of our work has been to develop a rapid and improved implementation of HCNN as an interactive software package. Our approach has been to take advantage of the parallelization capabilities of the graphics card, using the CUDA library together with an intuitive and flexible interface for HCNN built in MATLAB. We show that our CUDA implementation (using a cheap graphics device) is about 33 times faster than SENN.

With our new optimized implementation of HCNN, we have been able to test the model on large data sets consisting of multidimensional financial time series. We present the results with respect to some common statistical measures, evaluate the prediction quality and performance of HCNN, and give our analysis of how to move forward with further testing.

Contents

1 Introduction
  1.1 Motivation
  1.2 Purpose
  1.3 Questions
  1.4 Limitations
2 Background
  2.1 Introduction to Neural Networks
  2.2 Back-propagation
3 The HCNN Model
  3.1 Introduction
  3.2 Model
  3.3 Learning
  3.4 Initialization
  3.5 Forecasting
  3.6 Dynamics
4 Implementation
  4.1 SENN
  4.2 HCNNLab
    4.2.1 C Mex
    4.2.2 CUDA Mex
  4.3 Modified Back-propagation
5 Configuration
  5.1 Data Selection
  5.2 Data Pre-processing
  5.3 Model Configuration
  5.4 Computing
6 Results
  6.1 Performance
  6.2 Tests
    6.2.1 Comparison Models
    6.2.2 Error measurements
    6.2.3 Forecast Hit Rate
    6.2.4 Local Hit Rate
    6.2.5 Theil Coefficient
    6.2.6 MASE
    6.2.7 Examples
7 Discussion
  7.1 Answers to Thesis Questions
  7.2 Further Work
    7.2.1 Performance

1 Introduction

1.1 Motivation

Forecasting the stock market is well known to be a very complex and difficult task, and is even considered by many to be impossible. Neural networks are a popular approach when simulating the market dynamics, but they can be very computationally heavy. The new model, Historical Consistent Neural Networks (HCNN), has recently been successfully applied for prediction and risk estimation on the energy markets.

Nowadays, with the rise in GPU power, more and more compute-intensive operations are performed on a modern graphics device instead of the traditionally used CPU. This opens up huge possibilities to parallelize and do the calculations simultaneously.

The basic structure of neural networks, with nodes and connections, can be implemented as matrix and vector operations, which is very well suited for GPU calculations.

1.2 Purpose

Eturn Fonder AB is a small fund company based in Stockholm, founded in 2004. The company's founders and managers have long combined experience of model development and over 18 years of experience in successful model-based trading. This thesis takes on the task, given by Eturn Fonder AB, of developing a sound basis for evaluating and using HCNN in a fast and easy manner. To meet Eturn's goals and needs, we should deliver an easy-to-use software solution suited for daily analysis, as well as an evaluation of the HCNN model.

1.3 Questions

Our work aims to answer the following questions:

• Can we create a faster implementation of the HCNN model on our own?

• Can the GPU be utilized in an efficient way to speed up the learning phase?

• Is it possible to use HCNN for predictions on large financial data sets on a daily basis?

• Is it possible to predict the financial markets using HCNN?

1.4 Limitations

In order to fit the thesis within reasonable boundaries, the following limitations have been set up:

• The analysis has been focused on two data sets.

• All available data for the sets has been used - no correlation analysis has been made.

• Only weekly data and predictions are used.

• No variations of the HCNN model have been evaluated (e.g. RHCNN, CRCNN).

2 Background

2.1 Introduction to Neural Networks

The human brain is the most complex and sophisticated system that we know of in the universe. It is no wonder that scientists and engineers have tried to replicate its structure and features to imitate human intelligence for decades. Being able to create an artificial version of our brains would surely open up huge possibilities for artificial learning and computation.

To replicate the behaviour of the human brain, we would need to take into consideration its main characteristics: self-organization and learning capability, generalization capability and fault tolerance [1]. Even though the human brain is theoretically slower than a modern computer, it still outclasses computers in several areas. When processing, a huge part of the brain is active, with a massive number of neurons working together simultaneously. Combine this with its ability to store and access data, as well as its noise filtering, and we can begin to understand how complex it is to turn this biological system into an artificial one.

In 1943, Warren McCulloch and Walter Pitts introduced models of neurological networks. They created threshold switches based on neurons, and showed that even simple networks could calculate nearly any logic or arithmetic function. The first computer implementations of neurological networks were done, among others, by Konrad Zuse, who was tired of calculating ballistic trajectories by hand.


The fundamental idea of creating a neurological network is to model neurons and their connections to each other. A neuron is nothing more than a switch with information input and output. The switch is activated if there are enough stimuli from other neurons at its input. The neuron output will then send out a pulse to, for example, other neurons. The information is received through the neuron's dendrites in special connections called synapses, see figure 2. Through these tree-like connections, the information is then transmitted into the nucleus of the cell. When enough stimulation has been achieved through multiple synapses, the cell nucleus of the neuron activates an electrical pulse, which is then transmitted, via the axon, to the neurons connected to the current one.

Not all transmitted signals are of equal strength. The synapses can transfer both strong and weak signals, and over time they can form both stronger and weaker connections. This adjustability varies a lot and is one of the central points in the examination of the learning abilities of the human brain.

Figure 2: Artificial neural network, with input, hidden and output layers.

To imitate the learning process in the human brain, we need to model an artificial neural network consisting of the key elements of the human brain. Here we have the neurons with their cell nucleus, and the connections between them that vary in strength due to the synapses. Each neuron can be modelled as three serialized functions that give us the proper behaviour. First, we have the propagation function, which sums all the inputs from other neurons. Then we apply an activation function, which decides whether the stimulation of the nucleus was enough and a pulse to the connected neurons should be transmitted. Last, we use an output function that transforms the activation into output for other neurons.
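As an illustration (not code from the thesis), a single artificial neuron with these three serialized functions can be sketched in a few lines of MATLAB; the inputs, weights and bias below are made-up values:

% Minimal sketch of a single artificial neuron (illustration only).
% x holds inputs from other neurons, w the synaptic weights, b a bias.
x = [0.5; -1.2; 0.3];          % incoming signals
w = [0.8; 0.1; -0.4];          % connection strengths (synapses)
b = 0.05;                      % bias

net = w' * x + b;              % propagation function: weighted sum
act = tanh(net);               % activation function (here tanh)
out = act;                     % output function (identity in this sketch)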

Artificial neural networks can be arranged in a number of different topologies. One of the most common topologies is the feed-forward network. This network consists of input, hidden and output layers. Each layer consists of multiple neurons modelled as nodes, with weighted connections between them, representing the synapses. The hidden layer is invisible from the outside, which is why the neurons in this layer are referred to as hidden neurons.

When discussing neural networks, they are often referred to as black boxes. This means that even though the structure might be unknown, we can still observe the output generated from the fed input. This emphasizes the complexity of the network and its structure. One often-told anecdote from the world of neural networks concerns the US Army. There is no confirmed source that we know of, but it illustrates really well the powers, as well as the problems, that come with neural networks:

In the 1980s, the US army wanted to detect camouflaged enemy tanks and decided to use neural networks for this task. They trained their networks by using 50 pictures of tanks camouflaged in trees, and 50 pictures of trees without tanks. By adjusting the parameters and the network settings, the researchers managed to make the network separate pictures with and without tanks. This did not, however, guarantee that new images would be classified correctly at all. The network might only work for these 100 images. But the wise researchers had thought of this problem, so they had originally taken 200 images, leaving another 100 images to test the network on. They ran the tests and concluded that the network classified all these pictures correctly as well. Success confirmed!

The finished work was handed to the Pentagon, but they soon handed it back. They complained that in their own tests the neural network did no better than chance at detecting tanks.

It turns out that the photos of camouflaged tanks had been taken on cloudy days, while the photos of plain forest had been taken on sunny days. The neural network had, instead of recognizing camouflaged tanks, learned how to distinguish between cloudy and sunny days.

2.2 Back-propagation

The error back-propagation is a supervised learning algorithm that was first introduced in 1974 by Paul Werbos [1]. It calculates the first partial derivatives of the network error function, and updates the weights according to the gradient descent algorithm. It is an effective method to train supervised neural networks, where we want to minimize the deviation between the network output and the target values in the training pattern. This can be expressed as adjusting the weights so that the mean-square error function of the network is minimized:

MSE = \sum_{k=1}^{n} \frac{1}{2}(out_k - tar_k)^2 \rightarrow \min_{w_{ij}}    (1)

where n is the number of nodes, out_k is the network output and tar_k the target value. More details, with an example of back-propagation for a three-layer feed-forward neural network, can be found in [2].

Figure 3: The principle of the gradient descent algorithm illustrated in a two-dimensional error space. The algorithm helps us to minimize the error function step by step. By travelling in the direction of the steepest descent, the error is minimized.
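To make the weight update concrete, the following MATLAB sketch performs one gradient-descent step on the squared error of a single linear node; the values and names are illustrative, not the thesis implementation:

% Sketch of one gradient-descent step for the MSE in equation (1),
% for a single linear node (illustrative only).
x   = [0.2; 0.7; -0.5];        % node inputs
w   = [0.1; -0.3; 0.4];        % weights to be trained
tar = 0.6;                     % target value
eta = 0.05;                    % step length

out  = w' * x;                 % network output
err  = out - tar;              % deviation from the target
grad = err * x;                % dMSE/dw for this node
w    = w - eta * grad;         % move in the steepest-descent direction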

3 The HCNN Model

3.1 Introduction

Neural networks offer significant benefits for dealing with the typical challenges associated with forecasting. With their universal approximation properties, neural networks make it possible to describe non-linear relationships between a large number of factors and multiple time scales [2].

As described in the previous chapter, a neural network can be expressed as individual layers in the form of nodes, and the connections between the layers in the form of links. Many real-world technical and economic applications can be seen in the context of large systems, in which various non-linear dynamics interact with one another through time.

Recurrent Neural Networks (RNN) are a common type of neural network. RNN are universal approximators of dynamic systems and can be used to model the behavior of a wide range of complex systems. Unlike feed-forward neural networks, an RNN has an internal memory to process sequences of inputs, because of its recurrent nature. An RNN consists of input, hidden and output layers. The following set of equations is a general description of an RNN:

s_τ = f(s_{τ-1}, u_τ)    state transition    (2a)
y_τ = g(s_τ)    output equation    (2b)

The state transition s_τ is influenced by the inputs u_τ as well as by the previous state s_{τ-1}. Each hidden state generates outputs y_τ.


The RNN is used to model and forecast an open dynamic system using a non-linear regression approach. Every hidden state node is influenced by the previous state and the network input. This means that the network might be heavily dependent on the network inputs during the learning phase, and will have to make predictions without them in the future, where no inputs are available. To make predictions consistent with the training, further improvements in the network model are needed.

Figure 5: A recurrent neural network unfolded through time. A, B and C are transition matrices for the model.

In 2010, Zimmermann, Grothmann, Tietz and Jouanne-Diedrich [5] introduced a new type of RNN called Historical Consistent Neural Networks (HCNN). HCNN allows the modeling of highly-interacting non-linear dynamical systems across multiple time scales. HCNN is a closed system and does not draw any distinction between inputs and outputs, but models observables embedded in the dynamics of a large state space, see figure 6. The fundamental idea of the HCNN is to explain the joint dynamics of the observables in a causal manner, i.e. with an information flow from the past to the future.

3.2 Model

The HCNN describes the dynamics of all observables by the sequence of states s_τ, using a single state transition matrix A [5]. The state transition matrix contains the only free parameters in the system. When the network unfolds, the hidden state vector s_τ represents each hidden state through time until the last time step in the training, t. The model is defined as:

s_τ = tanh(A s_{τ-1})    state transition    (3a)
y_τ = B^T s_τ    output equation    (3b)

where B is a vector as defined in equation 5 and tanh is the activation function¹, illustrated in figure 7. The purpose of B^T is to filter out the observables, stored as the N first neurons in the state s_τ. The subsequent neurons in the state represent the hidden variables, which are the underlying dynamics of the observables. As previously stated, the model makes no distinction between observables and hidden variables.

The identification task for the HCNN model can be expressed as a function that minimizes the square error between the network outputs and the target values, by adjusting the parameters in the state transition matrix. This is expressed in the following equation:

E = \sum_{\tau=t-m}^{t} (y_\tau - y_\tau^d)^2 \rightarrow \min_A    system identification    (4)

where E represents the total residual error in the network, y_τ is the network output, y_τ^d are the target values and m is the number of time steps.
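The forward pass of the model can be sketched as follows in MATLAB; the sizes, the random initialization and the variable names are illustrative assumptions, not the thesis code:

% Sketch of the HCNN forward pass (equations 3a and 3b): unfold the
% state with a single transition matrix A and read out the N first
% neurons as observables. Sizes and data are illustrative.
N   = 3;                       % number of observables (time series)
dim = 20;                      % state dimension
m   = 50;                      % time steps to unfold
A   = 0.1 * randn(dim);        % state transition matrix (free parameters)
B   = [eye(N); zeros(dim - N, N)];

s = tanh(0.1 * randn(dim, 1)); % some initial state
Y = zeros(N, m);               % network outputs over time
for tau = 1:m
    s = tanh(A * s);           % state transition (3a)
    Y(:, tau) = B' * s;        % observables = first N state neurons (3b)
end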

Figure 6: Architecture of the HCNN model.

Figure 7: Activation function tanh.

1. The purpose of the activation function is to convert a neuron's weighted input into its output activation. Nonlinear activation functions are what give neural networks their nonlinear capabilities [7]. Symmetric sigmoids such as the hyperbolic tangent often converge faster than the standard logistic function and are commonly used in neural networks.

3.3 Learning

A technique that is frequently used in the learning task of neural networks is teacher forcing [8]. With this technique the actual output y_τ is replaced with the teacher signal y_τ^d in all time steps. [5] introduces a new approach to teacher forcing as an integrated part of the neural network architecture:

s_τ = tanh(A r_{τ-1}), τ ≤ t    state transition    (5)

where r_{τ-1} = C s_{τ-1} + B y_{τ-1}^d, B = \begin{pmatrix} Id \\ 0 \end{pmatrix} and C = \begin{pmatrix} 0 & 0 \\ 0 & Id \end{pmatrix}.

The identity part of B, Id, has the same dimension as the number of time series. C has the same dimension as A and is constructed so that C · B = 0. E.g. for one time series, this would give:

B = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix},    C = \begin{pmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}

This allows us to use the standard learning algorithm back-propagation through time [2, pp. 113-135] to solve the identification task (equation 4).

The optimization of the network parameters can be achieved by different methods, e.g. SNOPT [9] or Genetic Algorithms [10]. For the learning of feed-forward neural networks, standard error back-propagation is still an effective method, especially with teacher forced learning [6].
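A minimal MATLAB sketch of the teacher-forcing construction, assuming illustrative sizes and random data, could look as follows (this is not the SENN or HCNNLab implementation):

% Sketch of the construction in equation (5): B injects the teacher
% signal into the first N state neurons, C keeps the hidden part of
% the state, and C*B = 0. Sizes are illustrative.
N   = 3;                       % number of time series
dim = 20;                      % state dimension
A   = 0.1 * randn(dim);

B = [eye(N); zeros(dim - N, N)];
C = blkdiag(zeros(N), eye(dim - N));   % zero block on the observables

s_prev = tanh(0.1 * randn(dim, 1));
yd     = randn(N, 1);                  % teacher signal y^d at tau-1
r_prev = C * s_prev + B * yd;          % replace observables by teacher data
s      = tanh(A * r_prev);             % teacher-forced state transition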

Figure 8: HCNN architecture with teacher forcing, unfolded through time; the bias node with weights s_0, the states s_τ, the teacher-forced states r_τ and the matrices A, B, B^T and C.

3.4 Initialization

In large recurrent neural networks, such as HCNN, the sparsity of the state transition matrix is essential [6]. It improves long-term memory and is a natural way to limit the operations and prevent overflow in the system. A large network with nonlinear tanh connectivities would not be stable if initiated with full connectivity. Thus, the state transition matrix A is defined as a uniformly randomized matrix, with a square dimension of n × n and a sparsity λ. All non-zero elements in A can be seen as weights w_ij in the network, and must satisfy the following condition:

{w_{ij} ∈ R | -α < w_{ij} < α}    (6)

The first hidden state is initialized from the network bias [6]. The bias consists of a vector of all ones, with a randomized vector s_0 of dimension n as weights (see figure 8). In order to stabilize the network against uncertainties, s_0 is adjusted throughout the learning phase. All weights w_i in s_0 must satisfy the following condition:

{w_i ∈ R | -β < w_i < β}    (7)

After the initialization, the network unfolds forward in time from t - N to t, where N is the number of time steps in the past (see equation 3a). The first unfolded time step, t - N, is calculated as:

s_{t-N} = tanh(s_0 ∘ bias)    (8)
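A minimal sketch of this initialization in MATLAB, with example parameter values, could look as follows (illustrative only):

% Sketch of the initialization: a sparse, uniformly randomized A with
% weights in (-alpha, alpha), a randomized s0 in (-beta, beta), and the
% first unfolded state from equation (8). Parameter values are examples.
dim    = 160;                  % dimension of A
lambda = 0.12;                 % sparsity (share of alive weights)
alpha  = 0.2;
beta   = 0.2;

mask = sprand(dim, dim, lambda) > 0;            % positions of alive weights
A    = sparse(dim, dim);
A(mask) = alpha * (2 * rand(nnz(mask), 1) - 1); % uniform in (-alpha, alpha)

s0   = beta * (2 * rand(dim, 1) - 1);           % weights on the bias node
bias = ones(dim, 1);
s_first = tanh(s0 .* bias);                     % s_{t-N} = tanh(s0 o bias)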

3.5 Forecasting

The solution of the system identification task will depend on the initialization of the network weights (see section 3.4). The HCNN model is over-parameterized to fit the data perfectly, and it is not possible to know whether a solution represents the true dynamics; each individual solution is a reasonable forecast scenario. Thus, an ensemble of scenarios can together be used as a reasonable risk estimation.

By calculating the average (or median) of an ensemble, we can define the forecast as:

forecast(τ) = \frac{1}{n} \sum_{i=1}^{n} scenario_i(τ),  τ > t    (9)

where n is the number of scenarios and τ is future time steps.
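Assuming the scenario forecasts are collected in a matrix (one scenario per row), the ensemble forecast of equation (9) reduces to a row-wise mean or median; the sketch below uses placeholder data:

% Sketch of the ensemble forecast in equation (9): combine n scenario
% forecasts into one forecast per future time step. The matrix
% 'scenarios' (n x horizon) is assumed to hold one scenario per row.
n       = 100;                         % number of scenarios
horizon = 20;                          % future time steps tau > t
scenarios = cumsum(0.01 * randn(n, horizon), 2);  % placeholder scenarios

forecast_avg = mean(scenarios, 1);     % equation (9)
forecast_med = median(scenarios, 1);   % robust alternative used in the tests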

3.6 Dynamics

To demonstrate HCNN's ability to adapt, we show forecast examples on sine data. The data consists of two sine periods over one hundred data points, with an additional random part added that varies in size for each example. Each forecast consists of 100 scenarios. The charts show the original data, the individual scenarios, as well as the forecast median and average of the scenarios.

Figure 9: A two-period forecast of sine data with an additional random part of maximum 10% of the magnitude.

Figure 10: A two-period forecast of sine data with an additional random part of maximum 20% of the magnitude.

Figure 11: A two-period forecast of sine data with an additional random part of maximum 40% of the magnitude.

Figure 12: An eighteen-period forecast of sine data with an additional random part of maximum 10% of the magnitude.

As seen in the figures, HCNN easily manages to describe the dynamics of a sine curve with various amounts of random noise. However, we can see tendencies of undershooting related to the amount of noise and the prediction length.

4 Implementation

4.1 SENN

SENN (Software Environment for Neural Networks) is a development environment for the creation of artificial neural network based forecasting and classification models [11]. It is developed by Siemens in Munich, Germany. They use it in various applications, e.g. to improve their timing for electricity and worldwide copper purchases [12].

SENN contains tools for a finance-analytical interpretation of the network outputs and for the solution of general, complex problems. It supports various network types and optimization algorithms (such as the back-propagation method). The networks are described in topology files, which contain the network structure. The application can be controlled by a graphical interface or by running TCL (Tool Command Language) scripts [13].

TCL scripts are robust and flexible, and with the help of the TCL reference and the SENN coding examples we created our own script library for SENN. Using our library, we could automate the whole process, including data selection, learning and charting, for our HCNN tests.

Figure 13: Neural network node structure and charting component within the SENN environment.

Even though we kept our tests simple and only used a few time series, the network learning time was rather long. We felt the need to solve the identification task in a shorter time span than we managed with SENN.

4.2 HCNNLab

To be able to run daily forecasts, or forecasts with large amounts of data, using SENN, something similar to a computer farm would be required. For this thesis, as well as for real-world usage by Eturn, a faster and more powerful implementation would be desirable - one where the computers we had available would suffice. We thought that by utilizing the graphics card and implementing the HCNN model within MATLAB, we should be able to achieve a faster and more flexible solution. This is where the fundamentals of our own software were born. HCNNLab stands for Historical Consistent Neural Networks Lab, and is our own neural network environment built within MATLAB. It covers the whole pipeline, including automatic data transformation and adjustments, network modelling, network weight initialization, network learning, forecast charting and statistical measurements. The components and the overall structure can be seen in figure 14 below.

The user can implement general neural network models in HCNNLab, and make faster and more optimized code implementations using MATLAB MEX files. These are written in C and compiled for faster execution. Using MEX files with the CUDA library [4] enables the utilization of the GPU. We have built both CPU and GPU MEX files for the HCNN and RHCNN models in HCNNLab. Multiple scenarios can be queued or run in parallel, both on the CPU and the GPU. This means that the user is able to run a whole ensemble of scenarios simultaneously on a home computer equipped with a graphics device with CUDA support. By allowing the user to pause and resume during the network learning, the flexibility and usability are increased.

By using the powerful built-in plotting tools in MATLAB, the forecast results can easily be analyzed. HCNNLab automatically adds the interesting elements to the charts, including the original data and the forecast ensemble with averages, medians and standard deviations.

% Load data into variable D
load('data/world_data.mat');

% Settings
data = D;
dataCols = [2,3];
dataLabels = {'DAX close', 'EU close'};
nScenarios = 100;
m = 100;
lastEpoch = 400;

% Create hcnn object
hcnn = HCNN('DAX');
hcnn.setDescription('A HCNN Test.');

% Set data
hcnn.setData(data, dataCols, dataLabels);
hcnn.setTrainingInterval(lastEpoch, m);

% Set nr of scenarios
hcnn.setScenarios(nScenarios);

% Run on GPU with CUDA
hcnn.setMethod(Model.CUDA_MEX);

% Set model
hcnn.setModel(HCNN.MODEL_HCNN);

% Learn
hcnn.learn();

% Plot
hcnn.plot();

Figure 15: Simple code example from HCNNLab with two time series, where scenarios are run in parallel on the GPU.

4.2.1 C Mex

Since the HCNN learning phase is computationally heavy, with its complex algorithm and iterative process, we benefit from rewriting the code in the C language instead of using the simpler MATLAB language. C can handle different kinds of data structures more efficiently. Read more about using MEX files in MATLAB in the online documentation [14].

4.2.2 CUDA Mex

Because the gradient information for each dimension can be calculated in parallel, a suitable approach to minimize the calculation costs is to introduce GPU acceleration. This also makes it possible to multiply each element in the hidden network state with its transition matrix in parallel. We used CUDA [4] from Nvidia to extend our C MEX to fully support GPU acceleration. CUDA is a parallel computing platform and programming model that makes it possible to communicate with the GPU and utilize the highly parallel computational power that it holds.

Figure 16: Blocks and threads in CUDA.

The main topics that we needed to consider were the maximum number of threads running in parallel, the size of the shared memory, and the synchronization of threads. On Nvidia cards with CUDA compute capability 2.x, the maximum number of threads per block is 1024, and the maximum amount of shared memory per multiprocessor is 48 KB (approximately 2000 double values). When calculating the gradients for HCNN, the time steps for each gradient are recursive in time and therefore not parallelizable. However, within each time step we are able to parallelize the calculations. Since all the scenarios in the ensemble are independent of each other, they can also be parallelized. Maximum performance is reached when all the GPU threads are active simultaneously, which is achieved when the batch of scenarios is large enough. The maximum size of the batch is limited by the size of the specific graphics device's memory.

We introduce a new approach with only one network identification task for all scenarios in an ensemble. This makes it possible to solve the HCNN in a computationally efficient way on the GPU. The new model definition is similar to the HCNN definition in section 3.2. The difference is that we combine all the scenarios into one single network. For this new ultra-large network, the combined variables are defined as:

s_τ = \begin{pmatrix} s_τ^{i} \\ s_τ^{ii} \\ \vdots \\ s_τ^{n} \end{pmatrix},  y_τ = \begin{pmatrix} y_τ^{i} \\ y_τ^{ii} \\ \vdots \\ y_τ^{n} \end{pmatrix},  A = \begin{pmatrix} A^{i} & 0 & \cdots & 0 \\ 0 & A^{ii} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A^{n} \end{pmatrix}    (10)

where n is the number of scenarios.
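In MATLAB terms, this corresponds to stacking the scenario states and placing the scenario matrices on the block diagonal of one large sparse matrix, as in the following sketch (sizes and data are illustrative, not our production code):

% Sketch of the combined ensemble network in equation (10): the
% scenario states are stacked and the scenario matrices are placed on
% the block diagonal of one large sparse matrix.
nScen = 4;                     % number of scenarios
dim   = 160;                   % state dimension per scenario
Ascen = cell(1, nScen);
for i = 1:nScen
    Ascen{i} = sprandn(dim, dim, 0.12);     % one sparse A per scenario
end
Abig = blkdiag(Ascen{:});                   % block-diagonal ensemble matrix

sBig = randn(nScen * dim, 1);               % stacked state vector
sBig = tanh(Abig * sBig);                   % one transition for all scenarios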

The dimension of the problem is huge and varies depending on the problem task. We have taken advantage of the CUDA library CUSPARSE to handle the matrix-vector calculations; it provides a robust and flexible implementation that handles the memory allocation and matrix pointers very well, using the HYB format¹ [15].

1. "The HYB format, a combination of the ELL and COO sparse matrix data structures, offers the speed of ELL and the flexibility of COO. Often, unstructured matrices that are not efficiently stored in ELL format alone are readily handled by HYB. In such matrices, rows with exceptional lengths contain a relatively small number of the total matrix entries. As a result, HYB is generally the fastest format for a broad class of unstructured matrices." [16]

4.3 Modified Back-propagation

The learning of neural networks using back-propagation (see section 2.2) is divided into training epochs. Each epoch contains the network evaluation, the iteration of one complete back-propagation, and the weight update process. The purpose of each epoch is to move towards a better position in the weight space. How far we travel in each epoch is determined by the step length, the parameter η (Greek eta). To minimize the time it takes to fully train a network, the number of training epochs needs to be reduced.

One way to achieve this, without affecting other network parameters, is to increase the η value. However, an increased step length also leads to instability in the learning process. By introducing two learning phases with two different step lengths, combined with a limit on the value of the weight update, we have drastically improved the learning curve, and thereby the time to train the network.

The network error is larger in the beginning of the learning phase. To prevent an overload of the error flow, a constant η value is used to restrict the error to a smaller magnitude when updating the weights. As η becomes smaller, the training process becomes more robust, but at the same time it takes more epochs to complete the identification task. This is a common trade-off in back-propagation. To solve this dilemma, we have, as previously mentioned, a first phase of the learning where we limit the weight updates, combined with a relatively high η.

This works well in the case of an over-parameterized network, such as HCNN. This new approach equalizes the error distribution among the weights within a short number of epochs, and enables us to use a stable and comparatively large η for the rest of the training. We think that, for the case of HCNN, this is a better approach than standard back-propagation. We define it as:

ΔA = η · \frac{1}{N} \sum_{\tau=1}^{N} ΔA_τ,  -L < w_{ΔA} < L    (11)

where ΔA is the total weight update for A in the current epoch, N is the number of time steps, L is the update limit for each weight w, and ΔA_τ is the weight update in each time step. The same approach is used for updating s_0.
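A sketch of one such limited update, with example values for η, L and the gradients, is shown below; the sign convention and the exact clipping point are assumptions made for illustration:

% Sketch of the modified update in equation (11): the per-time-step
% updates dA_tau are averaged, scaled by eta and clipped element-wise
% to the interval (-L, L) before A is changed. Values are examples.
dim  = 160;
Nt   = 200;                    % number of time steps
eta  = 10;                     % step length in phase one
L    = 0.01;                   % update limit per weight

dAtau = 0.001 * randn(dim, dim, Nt);        % placeholder gradients per step
dA    = eta * mean(dAtau, 3);               % eta * (1/N) * sum of dA_tau
dA    = max(min(dA, L), -L);                % clip each weight update to (-L, L)
A     = 0.1 * randn(dim);                   % current weights (placeholder)
A     = A - dA;                             % apply the limited update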

Note that it is critical how η, L and the number of epochs are configured. The learning is sensitive to small changes that significantly influence the method's effectiveness.

Figure 17: Average residuals during HCNN learning, comparing the modified and the standard back-propagation algorithm for different combinations of dimension, sparsity and η. The Y-axis represents the average residual error, and the X-axis the number of trained epochs.

As seen in the figure above, the modified back-propagation can drastically reduce the residual errors in the beginning of the learning phase. It also allows for fast convergence in the later stages, heavily reducing the total number of epochs.

5 Configuration

We have chosen to use HCNNLab for our test runs. The reason for this is that we have a full understanding of the software, and it performs well, as shown in section 6.1.

5.1 Data Selection

We have chosen to test HCNN on two different data sets. Some of the data is provided by Eturn, but most is downloaded from Reuters. The first data set is constructed from large world indices, currencies, commodities and financial rates. The second data set consists of some of the most traded currencies. Our belief and our understanding is that HCNN works well with a large number of time series and with high degrees of correlation. We wanted to construct the data sets from as many time series as possible, limited by the amount and quality of the collected data. We also thought of this as a very interesting challenge for our HCNN implementation. The time series used in the final runs for the two data sets are listed in tables 1 and 2.

Currency data time series
US Dollar / British Pound
US Dollar / Canadian Dollar
US Dollar / Euro
US Dollar / Hungarian Forint
US Dollar / Japanese Yen
US Dollar / Norwegian Krone
US Dollar / Polish Zloty
US Dollar / Russian Ruble
US Dollar / Singapore Dollar
US Dollar / South African Rand
US Dollar / Swedish Krona
US Dollar / Swiss Franc
US Dollar / Thai Baht

Table 1: The 13 time series included in the currency data set.

World data time series

AEX (Holland) NASDAQ Tran Aluminium 3-Month (Composite) New Zealand .NZ50 Australia .AORD Nickel

Baltic .OMXBPI Nikkei 225

Brazil .BVSP NYSE Composite Index Brent Blend Oats CBT

Brussels BEL20 Index OMX Copenhagen 20 BSE 30 Bombay OMX S30

Copper Cash (Composite) Oslo Børs All-share Index Corn CBT Pakistan .TRXFLDPKP Crude Oil Palladium Spot XPD EQUAL S Czech Republic .TRXFLDCZP Paris CAC40 Index

D J Composite Platinum Spot XPT EQ S D J Industrials Pork Bellies

D J Transports RTS (Ryssland) D J Utilities Russel 2000 DAX IBIS Tyskl S&P 100 DJ Wilshire 5000 .Wil5 S&P 400 Midcap DJI Yahoo S&P 500

Dow Jones Developed Markets ex-U.S. Index Shanghai SE Composite Index Dow Jones Global Basic Materials Index Silver SPot XAG EQ S Dow Jones Global Consumer Goods Index Singapore

Dow Jones Global Consumer Services Index Soybean Oil Dow Jones Global Financials Index Spain .SMSI Dow Jones Global Health Care Index SSV 30 Dow Jones Global Index Swiss M. Index Dow Jones Global Industrials Index Taiwan Weigthed Dow Jones Global Oil & Gas Index Tin 3-Month (Composite) Dow Jones Global Technology Index US Dollar / British Pound Dow Jones Global Telecommunications Index US Dollar / Canadian Dollar Dow Jones Global Utilities Index US Dollar / Euro

Gas oil US Dollar / Hungarian Forint Gold Spot XAU EQUAL S US Dollar / Japanese Yen HANG SENG INDEX US Dollar / Norwegian Krone IPC G. (Mexico) US Dollar / Polish Zloty Ireland .TRXFLDIEP US Dollar / Russian Rubel Jakarta Comp US Dollar / Singapore Dollar Korea Composite US Dollar / South African Rand Kuala Lu (Malay) US Dollar / Swedish Krona Lead US Dollar / Swiss Franc Lean Hogs US Dollar / Thai Bath London FTSE 100 USD 10-Years Obl Marocco .TRXFLDMAP USD 30-Years Obl Merval (Argent.) Value Line Index NASDAQ 100 Wheat CBT NASDAQ Comp Zink NASDAQ Tele

Table 2: The 93 time series included in the world data set.

5.2 Data Pre-processing

When it comes to simulations, forecasting and other types of real-world applications, the most fundamental building block is to have good data. The old saying garbage in, garbage out pretty much sums it up. Therefore, it is very important to acquire good data, and to treat it well. We need to process it properly before we can feed it to our networks. Our pre-processing consists of the following steps, in order:

• Data cleanup: Remove all data and time series that are insufficient or of bad quality.

• Remove outliers: Data points with erroneous values, and values that deviate significantly, create problems and drastically slow down the learning process. They are therefore deleted.

• Extract weekly data: There are many ways to extract weekly data. We chose to average all available data points for each week to get the weekly data values.

• Repair data: To get data with continuous values, we need to repair missing data. This is done by replacing a missing value with the previous existing value. We had very few occurrences of this in our weekly data.

All data pre-processing was performed in MATLAB, where the final data sets are stored and used as .mat files.
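A minimal sketch of the weekly-averaging and repair steps, with made-up daily data and assumed variable names, could look like this:

% Sketch of the weekly extraction and repair steps, assuming a vector
% of daily prices and a matching vector of week numbers.
daily = [100 101 NaN 103 104 106 105 107 NaN 108]';   % daily closes
week  = [  1   1   1   1   2   2   2   2   3   3]';   % week index per day

nWeeks = max(week);
weekly = accumarray(week, daily, [nWeeks 1], @(v) mean(v, 'omitnan'));

% Repair: replace missing weekly values with the previous existing value.
for w = 2:nWeeks
    if isnan(weekly(w))
        weekly(w) = weekly(w - 1);
    end
end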

5.3 Model Configuration

Parameter | Description | Value
Nm | Number of time series in the training set. | 93
m | Number of time steps in the training set. | 200
dim(A) | Dimension of the state transition matrix A. | 800
sparsity(A) | Sparsity of the state transition matrix A (the share of alive weights). | 12%
max(|A_init|) | Max abs value of each randomized weight in A_init. | 0.2
max(|s_0_init|) | Max abs value of each randomized weight in s_0_init. | 0.2
max(|ΔA|) | Max abs weight update contribution value for A. | 0.01
max(|Δs_0|) | Max abs weight update contribution value for s_0. | 0.01
max(|A|) | Max abs weight value allowed in A. | 2
max(|s_0|) | Max abs weight value allowed in s_0. | 2.5
P1 epochs | Number of epochs in learning phase one. | 8000
η_P1 | Step length in phase one. | 10
η_P2 | Step length in phase two. | 1.5
E_max | The maximum residual error allowed. | 1E-4

Table 3: Model configuration for the world data set.

Parameter | Description | Value
Nm | Number of time series in the training set. | 13
m | Number of time steps in the training set. | 250
dim(A) | Dimension of the state transition matrix A. | 160
sparsity(A) | Sparsity of the state transition matrix A (the share of alive weights). | 60%
max(|A_init|) | Max abs value of each randomized weight in A_init. | 0.2
max(|s_0_init|) | Max abs value of each randomized weight in s_0_init. | 0.2
max(|ΔA|) | Max abs weight update contribution value for A. | 0.01
max(|Δs_0|) | Max abs weight update contribution value for s_0. | 0.01
max(|A|) | Max abs weight value allowed in A. | 2
max(|s_0|) | Max abs weight value allowed in s_0. | 2.5
P1 epochs | Number of epochs in learning phase one. | 2000
η_P1 | Step length in phase one. | 5
η_P2 | Step length in phase two. | 1.5
E_max | The maximum residual error allowed. | 1E-4

Table 4: Model configuration for the currency data set.

5.4 Computing

We knew that, even with our own software with increased performance, the time to complete many runs would be very long. Therefore, we needed access to dedicated computers, allowing us to run neural network calculations 24/7. We were given access to the following two computers, on which almost all network training was performed:

• Computer 1
  CPU: Quad-Core Intel Xeon E5410 2.33 GHz

• Computer 2
  CPU: Hexa-Core Intel Xeon E5-2620 2.0 GHz (hyper-threading enabled)
  GPU: Geforce GTX 670

We have performed large GPU tests on the high-end graphics device in Computer 2, the Geforce GTX 670. The tests include both data sets. The performance is stated in the following two tables, where epochs means the total number of epochs reached for the scenarios with the slowest convergence, and time per scenario epoch means the effective time it takes to complete one epoch in each scenario. Each test is set up according to the parameters in tables 3 and 4.

Description | Value
Number of scenarios | 50
Memory allocation on GPU | 614 MB of 2048 MB
Total time | 268 min
Time per scenario | 5.4 min
Epochs | 39200
Time per scenario epoch | 8.2 ms

Table 5: GPU performance for the world data set.

Description | Value
Number of scenarios | 500
Memory allocation on GPU | 1241 MB of 2048 MB
Total time | 289 min
Time per scenario | 0.6 min
Epochs | 21200
Time per scenario epoch | 1.6 ms

Table 6: GPU performance for the currency data set.

The statistics for the final runs, on both the world data set and the currency data set, are listed in the tables below. Note that since the runs were performed running multiple scenarios simultaneously, the times are not equivalent to the total time it took to complete the training of all scenarios.

Description | Value
Total time | 696 days
Total epochs | 49398700
Average epochs | 24298
Total number of scenarios | 23118
Scenarios / batch | 11.4
Average time / scenario | 43 min
Average time / scenario epoch | 107.3 ms

Table 7: Computational statistics for the world data set.

Description | Value
Total time | 254 days
Total epochs | 19838900
Average epochs | 33625
Total number of scenarios | 14692
Scenarios / batch | 24.9
Average time / scenario | 25 min
Average time / scenario epoch | 44.6 ms

Table 8: Computational statistics for the currency data set.

Even though the runs consist of enormous amounts of data, with a huge number of total scenarios and epochs, we have been able to configure the networks to achieve reasonable run times: 43 and 25 minutes per scenario, respectively (see tables 7 and 8). If we had had access to more GPUs, and had not been forced to run the majority of the computations on the CPU, the run times would have been significantly shorter. See the example runs in tables 5 and 6, where the run times are much shorter.

Without our own software, our own model implementation, and the possibility to run the networks in parallel, it would have been hard to get such a good ratio between run time and calendar time. As we can see in table 6, when we try to maximize the utilization of the GPU in HCNNLab, the interesting measure time per scenario epoch is drastically improved. The effective time is a remarkable 1.6 ms, compared to 44.6 ms for the mixed runs (with both CPU and GPU computations). This can be explained by the superior effectiveness of a modern high-end GPU compared to a mid-range GPU, as well as to a CPU.

6 Results

6.1 Performance

To be able to compare the different software solutions from section 4 in terms of performance and speed, we have performed measurements in test runs. To get a fair performance comparison, we have averaged the time it takes to complete 1000 training epochs. The runs were performed using the standard back-propagation algorithm on a computer with an Intel i7 920 CPU @ 3.20 GHz (running one core, hyper-threading off) and a mid-range graphics device, a Geforce GTX 460. On the CPU we ran one scenario, while on the GPU we ran 300 scenarios in parallel.

Processor | Scenarios | Time per scenario | Software
CPU | 1 | 54.43 s | SENN
CPU | 1 | 12.59 s | HCNNLab CMex
GPU | 300 | 1.65 s | HCNNLab CUDAMex

Table 9: Times for completing 1000 epochs with HCNN using the standard back-propagation algorithm. The tests average the time to complete 1000 epochs, with a state transition matrix of dimension 300 and sparsity 12%. The data set includes 200 time steps, and comes from the tutorial package in SENN.

As seen in table 9, the times are greatly improved by using HCNNLab. Note that the results only reflect computational power when using the same learning algorithm. One key difference between SENN and HCNNLab is that our calculations are performed with doubles instead of floats, which allows for higher accuracy and greater network memory. The trade-off for using doubles is a slight decrease in performance.

For an easier comparison, we have put together the results represented as speedup factors, see figure 18. The chart shows that HCNNLab is around 4 times faster than SENN running on the CPU, and 33 times faster running on the GPU. This allows us, as discussed earlier in section 4.1, to run more data simultaneously and keep the duration of the learning phase at a satisfactory level.

Figure 18: Computational performance comparison between SENN and HCNNLab (speedup factors: SENN 1.0, C MEX 4.3, CUDA MEX 33.0).

6.2 Tests

6.2.1 Comparison Models

HCNN is a neural network model that adapts according to historical data. This is done by adjusting a huge state transition matrix of weights to the data, and acquiring a prediction in the feed-forward process. To see how good these predictions are, we need to measure them somehow and put them into context. Measurement values themselves won't give us the bigger picture if we can't relate them to values from other models. Therefore, we need to collect prediction data and relevant error measurements not only from HCNN, but from other models as well. This creates the need to find or develop robust, yet simple enough, models to relate to.

A commonly used model when it comes to predictions and stock market forecasting is the naïve predictor. The idea behind it is that today's stock price is the best estimate of tomorrow's price [17], which is a consequence of the efficient-market hypothesis [18]. We think it is a good idea to put the HCNN model in relation to trivial predictors, and that is why we have chosen to include the Theil coefficient (see section 6.2.5) and MASE (see section 6.2.6) in our tests. Our implementation of the predictor is to use the latest known value as the prediction for all future time steps. This gives the prediction the same conditions as the predictions of the HCNN ensemble. The formula for our implementation is very simple, and is defined as

\hat{x}_t = x_0,  t > 0    (12)

where \hat{x} is the prediction based on the actual values in x for each time step.

Because neural networks, and HCNN in particular, are complex systems with advanced dynamics, it seems fair to include a more sophisticated model as well. Naïve prediction is easy to relate to, and might even prove useful in some areas, but it is fundamentally far from HCNN. It totally ignores the history, and basically says that the weather of tomorrow will be the same as that of today. It would be nice to find a system that adapts to the history, but yet is simple enough. Our choice for such a model is linear regression [19]. It fits a straight line through a set of n points, while keeping the sum of the squared residuals of the model as small as possible. It is defined as:

y = ax + b    (13)

where a is the gradient constant, defined as

a = \frac{cov(x, y)}{var(x)} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}    (14)

and b is the y-value when x = 0, which in our case is the last value in the historical data. The functions cov and var stand for covariance¹ and variance, respectively.
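A small MATLAB sketch of the two comparison predictors, with placeholder data and an assumed LinReg-30 setting, is shown below; it follows our reading of equations (12)-(14) and is not the exact test code:

% Sketch of the comparison predictors: the naive predictor of
% equation (12) and linear regression of equations (13)-(14) fitted to
% the last nHist historical values. Data is illustrative.
hist    = cumsum(randn(200, 1)) + 100;   % historical series
horizon = 20;                            % future prediction steps
nHist   = 30;                            % LinReg-30 setting

% Naive predictor: repeat the last known value.
naive = repmat(hist(end), horizon, 1);

% Linear regression on the last nHist points (equation 14).
y = hist(end - nHist + 1:end);
x = (1:nHist)';
a = sum((x - mean(x)) .* (y - mean(y))) / sum((x - mean(x)).^2);
b = hist(end);                           % intercept anchored at the last value
linreg = b + a * (1:horizon)';           % extrapolate the fitted slope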

Since linear regression adapts the predictions to the historical data, the predictions are heavily dependent on the only parameter of the system - the length of the data set. If we use a data set with a large number of historical time steps, we get a slow system. If we decrease the amount of data and the historical time span, we get a more aggressive and probably steeper system. Each point in time will be different, and when the trends turn, the weighting of the historical data will vary. We will therefore run our tests against multiple linear regression models, each with a different time span. We have chosen to use three separate systems, with historical data memory varying between long and short. This is achieved by adjusting the number of time steps to include prior to the prediction date. The different parameter settings can be seen in the following table.

Setting | Time steps included
Slow | 200
Medium | 100
Short | 30

Table 10: Different parameter settings used for the linear regression model.

6.2.2 Error measurements

The most solid way to create a prediction from an HCNN ensemble is to compute the median or average of all scenarios. We have chosen to use the median, since it reduces the influence of outliers.

1. http://en.wikipedia.org/wiki/Covariance

By default, all predictions are included in our error measurement calculations. This means that a total of 14 229 (153x93) predictions for the world data set, and 1 235 (93x13) predictions for the currency data set, were used.

6.2.3 Forecast Hit Rate

The forecast hit rate is the average of the binary outcomes that are given by comparing the forecast gradient to the actual gradient over the prediction timeframe. The gradient is derived from the straight line between the last prediction step and the last value in the historical data. It can be either upwards or downwards. We can look upon this hit rate as a predictor of bear and bull markets [20]. We define an outcome h as:

h_t = \begin{cases} 1, & a > 0 \\ 0, & a < 0 \end{cases},  where  a = \frac{\hat{x}_t - x_0}{x_t - x_0}    (15)

where t represents a prediction step (t > 0). By taking the average of all outcomes, we get the forecast hit rate value. The hit rate varies when we gradually increase the number of prediction steps. In the graph below, we let the number of prediction steps, p, go from 1 to 20 for both the world and the currency data sets.

Figure 19: Forecast hit rate when p goes from 1 to 20.
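For one time series and one prediction of length p, the outcome of equation (15) can be computed as in the following sketch (all input values are made up):

% Sketch of the forecast hit rate outcome in equation (15): compare
% the sign of the predicted move with the sign of the actual move from
% the last known value x0 to step p. Inputs are illustrative.
x0   = 100;                              % last known value
xhat = [101 102 101 103];                % predicted values, steps 1..p
xact = [100.5 99 101.5 104];             % actual values, steps 1..p
p    = 4;

a   = (xhat(p) - x0) / (xact(p) - x0);   % same direction if a > 0
hit = double(a > 0);                     % binary outcome h_t

% The reported hit rate is the average of these outcomes over all
% predictions and time series.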

These results must be put in relation to our other prediction models. Since our version of the naïve predictor is a straight line for all future time steps t > 1, we need to look at the change from the previous historical value to be able to extract its hit rate. The tables below show the results when we compare the HCNN forecast hit rates with the naïve and the linear regression (linreg) predictors, for p = 1.

HCNN | Naïve | LinReg-30 | LinReg-100 | LinReg-200
58.5% | 61.5% | 55.0% | 55.4% | 55.1%

Table 11: Forecast hit rate when t = 1 for the currency data set.

HCNN | Naïve | LinReg-30 | LinReg-100 | LinReg-200
53.9% | 59.2% | 54.0% | 53.3% | 52.7%

Table 12: Forecast hit rate when t = 1 for the world data set.

For HCNN and linreg we can compare the forecast hit rate over time. The results of this comparison are shown in the following graphs.

Figure 20: Forecast hit rate for the currency data set.

Figure 21: Forecast hit rate for the world data set.

6.2.4 Local Hit Rate

The local hit rate is the average of the binary outcomes that are given by comparing the gradient in each prediction step to the actual gradient over the prediction timeframe. We define the binary outcome as:

h_t = \begin{cases} 1, & a > 0 \\ 0, & a < 0 \end{cases},  where  a = \frac{\hat{x}_t - x_{t-1}}{x_t - x_{t-1}}    (16)

where t represents a prediction step (t > 0). By taking the average of all outcomes, we get the local hit rate value. The hit rate varies when we gradually increase the number of prediction steps. In the graph below, we let the number of prediction steps go from 1 to 20 for both the world and the currency data sets.

Figure 22: Local hit rate when t goes from 1 to 20.

These results must be put in relation to our other prediction models. The forecast and local hit rates are the same when t = 1, which makes tables 11 and 12 relevant for this measurement as well. Just as with the forecast hit rate, we can't use the naïve predictor for time steps t > 1. For HCNN and linreg, we can compare the local hit rate over time. The results of this comparison are shown in the following graphs.

Figure 23: Local hit rate for the currency data set.

Figure 24: Local hit rate for the world data set.

6.2.5 Theil Coefficient

The Theil coefficient [17, p. 20] was proposed by the econometrician Henri Theil from the Netherlands. It is expressed as the square error sum between the prediction and the original, in relation to that of the naïve predictor, in every time step. This makes it very useful for measuring the effectiveness of a predictor, since it is more or less independent of the scaling of the time series. As the naïve predictor, we use our earlier mentioned straight-line implementation. The Theil coefficient T is defined as:

T = \frac{\sqrt{\sum_{t=1}^{N}(y(t) - \hat{y}(t))^2}}{\sqrt{\sum_{t=1}^{N}(y(t) - y(0))^2}}    (17)

Values below one indicate that the error sum for our predictor is lower than the error sum for the naïve predictor, which is desirable. By comparing with LinReg while setting the number of time steps p to 1, we get the following tables:

Theil coefficient | HCNN | LinReg-30 | LinReg-100 | LinReg-200
Median | 1.03 | 1.00 | 0.98 | 0.98
MAD | 0.51 | 0.29 | 0.20 | 0.12

Table 13: Theil coefficient for the currency data set.

Theil coefficient | HCNN | LinReg-30 | LinReg-100 | LinReg-200
Median | 1.12 | 1.01 | 0.99 | 1.00
MAD | 0.58 | 0.29 | 0.18 | 0.12

Table 14: Theil coefficient for the world data set.

Median refers to the median of the Theil coefficients over all predictions. MAD is the median absolute deviation, and measures how volatile the values are.
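For a single prediction, the Theil coefficient of equation (17) can be computed as in the sketch below; the observed and predicted values are made-up examples:

% Sketch of the Theil coefficient in equation (17) for one prediction:
% the error of the predictor relative to the error of the naive
% (last-known-value) predictor. Inputs are illustrative.
y    = [100 101 99 102 104];             % observed values y(1..N)
yhat = [100.5 101.5 100 101 103];        % predicted values
y0   = 99.5;                             % last known value, y(0)

T = sqrt(sum((y - yhat).^2)) / sqrt(sum((y - y0).^2));
% T < 1 means the predictor beats the naive predictor.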

By varying p from 1 to 20, we can plot the results as charts, for both the currency- and world data set.


Figure 25: Theil coefficient when t goes from 1 to 20.

The following charts compare HCNN and LinReg for both data sets:


Figure 26: Theil coefficient for the currency data set.


Figure 27: Theil coefficient for the world data set.

6.2.6 MASE

MASE, or Mean Absolute Scaled Error [21], is an error measurement that takes into consideration different scales for different time series. This makes it useful as an error measurement for predictors. MASE is defined as:

q(t) = \frac{y(t) - \hat{y}(t)}{\frac{1}{n}\sum_{i=1}^{n}|y(i) - y(i-1)|}    (18)

MASE = \frac{1}{n}\sum_{t=1}^{n}|q(t)|    (19)

where y(t) are the observed values and \hat{y}(t) is the prediction. All values below one are to be considered good. Due to the definitions of MASE and the Theil coefficient, they will be equal when t = 1. Therefore, tables specifically for MASE won't be presented; instead, see the tables in section 6.2.5, as they are relevant here as well. The following charts show MASE for all future time steps, for both the world and the currency data sets.

[Figure 28 chart: MASE vs. prediction length [weeks]; series: Currencies, World.]

Figure 28: MASE when t goes from 1 to 20.
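As with the Theil coefficient, a minimal C sketch of equations (18) and (19) is shown below; the function name and array conventions are our own illustrative choices.

#include <math.h>
#include <stddef.h>

/* Mean Absolute Scaled Error, cf. equations (18)-(19).
 * y[0..n]    observed values, where y[0] is the last known value before the
 *            forecast window and y[1..n] are the realised values
 * yhat[1..n] model predictions
 * The scaling term is the mean absolute one-step change of the observed series,
 * so MASE < 1 means the predictor on average beats a naive one-step forecast. */
double mase(const double *y, const double *yhat, size_t n)
{
    double scale = 0.0;
    for (size_t i = 1; i <= n; ++i)
        scale += fabs(y[i] - y[i - 1]);
    scale /= (double)n;

    double sum = 0.0;
    for (size_t t = 1; t <= n; ++t)
        sum += fabs((y[t] - yhat[t]) / scale);
    return sum / (double)n;
}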

The following charts compare HCNN and LinReg for both data sets:

[Figure chart: MASE vs. prediction length [weeks] for the currency data set; series: HCNN, LinReg-30, LinReg-100, LinReg-200.]

[Figure 30 chart: MASE vs. prediction length [weeks]; series: HCNN, LinReg-30, LinReg-100, LinReg-200.]

Figure 30: MASE for the world data set.

6.2.7 Examples

[Figure 31 chart: weekly values from 2006-12-18 to 2007-07-16; series: Original, Scenarios, Forecast median, Forecast average.]

Figure 31: Forecast example taken from the runs for the currency data set. The chart shows a USD/SEK prediction, with 150 scenarios in the ensemble over 30 future prediction steps.


7 Discussion

Our observation is that the currency market is trending and slow. If we look at the Theil Coefficient in figure 26, linear regression beats the naïve prediction (by having values below 1). This shows the strength of predictors that acknowledge market trends. In general, financial markets, including our data sets, contain a high degree of trending data. This gives linear predictors, such as linear regression, an advantage during trending phases, but it also becomes their downfall when the trend shifts, e.g. from a bull to a bear market.

Regardless of results and accuracy, HCNN is a more consistent and flexible predictor, since it uses a large number of nonlinear parameters to adapt to the history. This creates, to a greater extent, a more independent analytical model. With the right data and good conditions, it is most likely possible to benefit from HCNN and its features. Siemens in Germany has in recent years used HCNN in various predictions, analytics and other applications [23], for example copper price predictions and weather forecasts.

[Figure 32 chart: index level (0–1400) from Mar 2002 to Mar 2010.]

Figure 32: The Swedish index OMXS30 from our world data set.

In figure 32, we see the Swedish stock exchange index OMXS30 over the time period used in our runs, with clear trends in the price.



Figure 33: Hit rate over time for HCNN and the world data set.


Figure 34: Hit rate over time for linear regression and the world data set.

In figures 33 and 34, the hit rate in the first prediction step for all time series is plotted in time-consecutive order, for both HCNN and LinReg-200. The time series are sorted by best average hit rate. A white dot indicates a hit, and a black dot a miss. These figures show that HCNN is not as trend sensitive as the linear regression: the dots are more evenly distributed in the HCNN plot. In the plot for the linear regression, the hits are clearly more periodic, and the hits and misses sometimes appear visually as vertical stripes. The total averages of the hits and misses are shown in table 12 for the forecast hit rates.

Point 80 represents the first value in the results for the year 2008, a year in which the financial markets crashed. In October 2008, the Black Week [22] lasted for five trading sessions, during which the Dow Jones Industrial Average fell 18.1%. If we look at both charts (figures 33 and 34), we can see that HCNN has a higher hit rate during the crash than the linear regression.

7.1 Answers to Thesis Questions

Can we create a faster implementation of the HCNN model on our own?

At first, it seemed like an impossible task, but with great effort and extensive research, we managed to recreate both the HCNN model and the back-propagation algorithm, implemented as optimized C code. As seen in figure 18, our implementation is 4.3 times faster than SENN.

Can the GPU be utilized in an efficient way to speed up the learning phase?

We have presented a unique solution to maximize GPU utilization by creating one single network identification task for all scenarios in an ensemble, see page 21. This, together with efficient calculations and CUDA's HYB sparse matrix format, has given us an effective and modern implementation of this complex, high-dimensional neural network. Using our software HCNNLab, we have been able to increase performance by a speedup factor of 33, as seen in figure 18.
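To give an idea of the kind of GPU computation involved, the sketch below shows a simple ELL-style sparse matrix-vector product kernel, in the spirit of the ELL/COO hybrid (HYB) layout described by Bell and Garland [16]. It is a simplified stand-in for illustration only, not the actual HCNNLab kernel, and all names are ours.

// Minimal ELL-format sparse matrix-vector product y = A*x, one thread per row.
// The HYB format combines such a padded ELL part with a COO part for the few
// rows that overflow the fixed width; only the ELL part is sketched here.
__global__ void ell_spmv(int n_rows, int ell_width,
                         const int    *col_idx,  // n_rows * ell_width, column-major
                         const double *vals,     // same layout, padded entries unused
                         const double *x,
                         double       *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    double sum = 0.0;
    for (int k = 0; k < ell_width; ++k) {
        int idx = k * n_rows + row;   // column-major layout gives coalesced loads
        int col = col_idx[idx];
        if (col >= 0)                 // padding entries are marked with -1
            sum += vals[idx] * x[col];
    }
    y[row] = sum;
}

A launch such as ell_spmv<<<(n_rows + 255) / 256, 256>>>(...) applies the sparse state-transition matrix to all rows in parallel; stacking the states of all ensemble scenarios into one large system, as described on page 21, is what keeps the GPU busy.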

Is it possible to use HCNN for predictions on large financial data sets on a daily basis?

With HCNNLab, the whole process, from data pre-processing to final prediction, is made very intuitive and user-friendly. The complete HCNN configuration can be set up with only a few lines of code, see figure 15. We have presented a technique that stabilizes the learning faster than traditional back-propagation (section 4.3). This is implemented in HCNNLab, and combined with the efficient GPU implementation, it makes HCNNLab a competent and powerful tool for achieving the goal of complex market analysis on a daily basis.

Is it possible to predict the financial markets using HCNN?

For our test data and time span, HCNN does not significantly beat the simpler prediction models. When it comes to hit rate, HCNN is better than linear regression for one time step forward, but a naïve predictor based on the slope between the previous and current value gives a higher accuracy.

None of the gathered statistics and error measurements in section 6.2 (hit rates, Theil coefficient and MASE) indicate that HCNN has a strong edge in any of our data sets.

7.2 Further Work

7.2.1 Performance

As modern GPUs start to outperform CPUs, our implementation with CUDA is a suitable approach. With a combination of back-propagation and gradient limit, we can take on more challenging and complex data sets. Finding the optimal configuration for each problem is still somewhat of a trial-and-error approach, and this could be further improved with a process that identifies the configuration within HCNNLab. We have limited our tests to weekly data, and it would be worthwhile to explore other time horizons as well.

As the computational power increases, we could also try more complex problems and use more frequently sampled data.

7.2.2 Tests

One interpretation we can make from the results of the test runs is that the data needs to be more specific and chosen more carefully. We believe that the data selection process could be improved by a profound analysis of the data dynamics, e.g. choosing an index and including specific time series that explain the underlying causes of the index price movements. In our tests, we approached the creation of our data sets with the fundamental idea that all time series influence each other. The other approach, to more specifically predict only a subset of the data set, could turn out to be more favourable.

The easiest way to find correlations between time series would be to use linear correlation [24]. It basically matches time series with similar linear dynamics with each other. The obvious weakness of this method is that more complex dynamics, as seen in the financial markets, where many different types of time series affect each other, will not be found.
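A minimal C sketch of such a linear (Pearson) correlation between two series is shown below; in practice one would typically correlate returns rather than raw prices and evaluate it pairwise over all candidate series. The function name and conventions are our own.

#include <math.h>
#include <stddef.h>

/* Pearson (linear) correlation between two equally long series a and b.
 * Returns a value in [-1, 1]; magnitudes near 1 indicate similar linear dynamics. */
double pearson(const double *a, const double *b, size_t n)
{
    double ma = 0.0, mb = 0.0;
    for (size_t i = 0; i < n; ++i) {
        ma += a[i];
        mb += b[i];
    }
    ma /= (double)n;
    mb /= (double)n;

    double cov = 0.0, va = 0.0, vb = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double da = a[i] - ma, db = b[i] - mb;
        cov += da * db;
        va  += da * da;
        vb  += db * db;
    }
    return cov / sqrt(va * vb);
}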

Another way of building good data sets is to use a nonlinear method called sensitivity analysis, developed by Hans-Georg Zimmermann, Siemens [6, p. 57]. This method helps find complex nonlinear relations between the time series inside a given network, and ranks them depending on their influence.

There are also possibilities to further analyze the ensemble to estimate forecast risk. By, for example, looking at the standard deviation of the scenarios around the ensemble median, the network uncertainty could be interpreted as a risk measure.
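A minimal C sketch of one such risk estimate is shown below, assuming the scenario forecasts for a single future timestep have been collected into an array. The formulation (median as point forecast, spread of the scenarios around it as risk) is our own illustration, not an established HCNN risk model.

#include <math.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static int cmp_double(const void *p, const void *q)
{
    double a = *(const double *)p, b = *(const double *)q;
    return (a > b) - (a < b);
}

/* s[0..m-1] holds the m ensemble scenario forecasts for one future timestep.
 * The median is used as the point forecast; the standard deviation of the
 * scenarios around that median is returned as an uncertainty (risk) estimate. */
void ensemble_forecast(const double *s, size_t m, double *median, double *risk)
{
    double *tmp = malloc(m * sizeof *tmp);
    if (tmp == NULL) { *median = *risk = 0.0; return; }
    memcpy(tmp, s, m * sizeof *tmp);
    qsort(tmp, m, sizeof *tmp, cmp_double);
    *median = (m % 2) ? tmp[m / 2] : 0.5 * (tmp[m / 2 - 1] + tmp[m / 2]);
    free(tmp);

    double var = 0.0;
    for (size_t i = 0; i < m; ++i) {
        double d = s[i] - *median;
        var += d * d;
    }
    *risk = sqrt(var / (double)m);
}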


References

[1] D. Kriesel, "A Brief Introduction to Neural Networks". [Online]. Available: http://www.dkriesel.com/media/science/neuronalenetze-en-zeta2-2col-dkrieselcom.pdf. [Accessed Dec 30, 2014].

[2] R. Grothmann, "Multi-Agent Market Modeling Based on Neural Networks," Ph.D. Thesis, University of Bremen, Germany, 2002.

[3] MATLAB, MathWorks, Software. [Online]. Available: http://www.mathworks.com

[4] CUDA, NVIDIA Corporation, Software. [Online]. Available: http://www.nvidia.com

[5] H.-G. Zimmermann, R. Grothmann, C. Tietz and H. Jouanne-Diedrich, "Market Modeling, Forecasting and Risk Analysis with Historical Consistent Neural Networks," Operations Research Proceedings 2010, Siemens AG, Corporate Technology, Munich, Germany, 2010. [E-book]. Available: SpringerLink.

[6] H.-G. Zimmermann, "Neural Networks in System Identification, Forecasting & Control," in MEAFA Workshop, 15-17 Feb 2010, The University of Sydney, 2010.

[7] Y. LeCun, L. Bottou, G. B. Orr and K.-R. Müller, "Efficient BackProp," Image Processing Research Department, AT&T Labs, Wilmette University, USA, 1998. [Online]. Available: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf. [Accessed Dec 28, 2014].

[8] R. J. Williams and D. Zipser, "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks," in Neural Computation, 1, pp. 270-280, 1989. [Online]. Available: ftp://ftp.ccs.neu.edu/pub/people/rjw/rtrl-nc-89.ps. [Accessed Dec 28, 2014].

[9] P. E. Gill, W. Murray and M. A. Saunders, "SNOPT: An SQP Algorithm for Large-Scale Constrained Optimization," in SIAM Journal on Optimization, volume 12, number 4, pp. 979-1006, 2002. [Online]. Available: http://web.stanford.edu/group/SOL/papers/SNOPT-SIGEST.pdf. [Accessed Dec 30, 2014].

[10] E. Alba and J. F. Chicano, "Training Neural Networks with GA Hybrid Algorithms," Departamento de Lenguajes y Ciencias de la Computación, University of Málaga, Spain, 2004. [Online]. Available: http://www.lcc.uma.es/~eat/pdf/gecco04f.pdf.

[11] C. Tietz, "SENN V3.1 User Manual," Siemens AG, CT T, 2010.

[12] A. Pease, "The Science of Prediction," in Pictures of the Future, Siemens AG, 2011. [Online]. Available: http://www.siemens.com/innovation/pool/en/publikationen/publications_pof/pof_fall_2011/machine_learning/pof0211_ml_prognosen_en.pdf. [Accessed Jan 25, 2015].

[13] Wikipedia contributors, "Tcl," in Wikipedia, The Free Encyclopedia. [Online]. Available: http://en.wikipedia.org/wiki/Tcl. [Accessed Jan 25, 2015].

[14] MATLAB MEX, MathWorks, Software. [Online]. Available: http://se.mathworks.com/help/matlab/ref/mex.html

[15] CUSPARSE Hybrid Format (HYB), NVIDIA Corporation. [Online]. Available: http://docs.nvidia.com/cuda/cusparse/#hybrid-format-hyb

[16] N. Bell and M. Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA," December 11, 2008. [Online]. Available: http://sbel.wisc.edu/Courses/ME964/Literature/techReportGarlandBell.pdf. [Accessed March 8, 2015].

[17] T. Hellström and K. Holmström, "Predicting the Stock Market," Department of Mathematics and Physics, Mälardalen University, Sweden, 1998. [Online]. Available: http://www.e-m-h.org/HeHo98.pdf. [Accessed Apr 6, 2015].

[18] Wikipedia contributors, "Efficient-market Hypothesis," in Wikipedia, The Free Encyclopedia. [Online]. Available: http://en.wikipedia.org/wiki/Efficient-market_hypothesis. [Accessed Apr 6, 2015].

[19] Wikipedia contributors, "Simple Linear Regression," in Wikipedia, The Free Encyclopedia. [Online]. Available: http://en.wikipedia.org/wiki/Simple_linear_regression. [Accessed Apr 6, 2015].

[20] Wikipedia contributors, "Market trend," in Wikipedia, The Free Encyclopedia. [Online]. Available: http://en.wikipedia.org/wiki/Market_trend. [Accessed Apr 7, 2015].

[21] Wikipedia contributors, "Mean absolute scaled error," in Wikipedia, The Free Encyclopedia. [Online].
