
A Data-Warehouse Solution for OMS Data Management

Mikael Öhman

June 15, 2013

Master's Thesis in Computing Science, 30 credits
Supervisor at CS-UmU: Stephen Hegner

Examiner: Fredrik Georgsson

Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN


Abstract

A database system for storing and querying data of a dynamic schema has been developed, based on the kdb+ database management system and the q programming language, for use in a financial setting of order and execution services. Some basic assumptions are made about mandatory fields of the data to be stored, including that the data are time-series based.

A dynamic schema enables an Order-Management System (OMS) to store information not suitable or usable when stored in log files or traditional databases. Log files are linear, cannot be queried effectively and are not suited to the volumes produced by modern OMSs. Traditional databases are typically row-oriented, which does not suit time-series data, and rely on the relational model, which uses statically typed sets to store relations.

The created system includes software that is capable of mining the actual schema stored in the database and visualizing it. This eases exploratory querying and the development of applications that use the database. A feedhandler optimized for handling high volumes of data has been created. Volumes in finance are steadily growing as the industry continues to adopt computer automation of tasks. Feedhandler performance is important for reducing latency and for cost savings, since a fast feedhandler avoids the need to scale horizontally. A study of the area of algorithmic trading has been performed with focus on transaction-cost analysis.

Fundamental algorithms have been reviewed.

A proof of concept application has been created that simulates an OMS storing logs on the execution of a Volume Weighted Average Price (VWAP) trading algorithm. The stored logs are then used in order to improve the performance of the trading algorithm through basic data mining and machine learning techniques. The actual learning algorithm focuses on predicting intraday volume patterns.


Contents

1 Introduction: Problem Overview and Goals
1.1 Order-Management Systems and Order and Execution Data
1.1.1 Data Description
1.1.2 Common Uses of Order-Management System Data
1.2 Motivation and Goals
1.3 Contributions of this Thesis
1.4 Thesis Outline

2 Handling Incoming Feeds
2.1 System Requirements
2.2 System Design and Implementation

3 Extended Order and Execution Service
3.1 log4eoes
3.2 feedhandler4eoes
3.3 kdb+ Systems
3.4 Empirical Analyzer

4 Using EOES to Improve Intra-Daily Volume Prediction
4.1 Introduction
4.2 Data-clustering for Intra-Daily Volume Prediction
4.2.1 Data-clustering on Intra-Daily Volume
4.3 Method
4.4 Result
4.5 Discussion

5 Thesis Discussion

6 Acknowledgements

References

A Technology Prerequisites
A.1 kdb+
A.1.1 Essentials of q
A.1.2 Real-time Databases
A.1.3 Historical Databases
A.2 TIBCO Rendezvous

B Algorithmic Trading
B.1 Introduction
B.1.1 Definition of a Trading Strategy
B.1.2 Further Comments on Notation
B.2 Prerequisites
B.2.1 Stock Returns and the Movement of Stock Prices
B.2.2 Market Microstructure
B.2.3 Asset Characteristics
B.3 Execution Concerns
B.4 Transaction-Cost Analysis
B.4.1 Transaction Cost Components
B.4.2 Pre-Trade Analysis
B.4.3 Post-Trade Analysis
B.5 Algorithms
B.5.1 Impact-driven
B.5.2 Cost-driven
B.5.3 Opportunistic


List of Figures

1.1 Overview of actors involved and flow. A are agencies, OMSP are high-touch OMSs, OMS are low-touch OMSs and M are markets.
1.2 Market data for some stock symbols found on the BATS exchange.
1.3 Overview of the way related messages are linked in FIX within and between different OMSs. O are order messages and E are execution reports. A1 and A2 are related orders and the same goes for B1 and B2 as well as C1 and C2.
2.1 Volume for symbol BARC.L
2.2 Overview of feedhandler solution. TIBCO RV is the messaging system. Messages are then processed by the Java implementation and sent to the kdb+ process for storage.
3.1 Finance Data Warehouse Setup. Data is produced by OMSs and market feeds. They are first stored in in-memory databases before being sent to historical storage as part of a nightly job.
3.2 Flow of an Institutional Order
3.3 Usage During Internal Dispute
3.4 Machine Learning Scenario Using Historic Data
3.5 Folder structure of a historical database
3.6 A complex sample schema
3.7 Simple key-value schema inspection.
4.1 A decreasing volume pattern.
4.2 A U-shaped volume pattern.
4.3 A high middle volume pattern.
4.4 Clustering results on a high middle volume pattern.
4.5 An overview of the setup.
4.6 U-shaped pattern
4.7 Middle volume
4.8 High volume at the tail
4.9 Successively increasing volume pattern
4.10 An example of the naive algorithm working fairly well
4.11 An example of the naive algorithm not working very well due to the pattern being different than the average volume profile.
4.12 Learning algorithm, using no compensation number
4.13 Learning algorithm using no compensation
4.14 Learning algorithm with compensation number
B.1 The investment process simplified
B.2 The efficient frontier defined by Harry Markowitz
B.3 The efficient trading frontier defined by Kissell and Glantz
B.4 Result of VWAP algorithm. Algorithm volume in red, market volume in black. Asset price in blue.


List of Tables

4.1 Distances between correct volume and algorithm volume
4.2 Distances for days when the clustering algorithm leads to a bad final time bin volume
4.3 RMSE for the improved learning algorithm that compensates based on previous volumes
B.1 Fictitious order book for NOK
B.2 Fictitious order book for NOK post trade
B.3 Categorized transaction cost components
B.4 Implementation Shortfall notation


Chapter 1

Introduction: Problem Overview and Goals

To understand and place in perspective the work presented in this thesis, it is necessary to have a basic understanding of Order-Management Systems and the issues surrounding them.

This chapter provides the necessary background material to develop such an understanding.

1.1 Order-Management Systems and Order and Execution Data

Investment banks have Order-Management Systems (abbreviated OMS) that receive and send messages from/to clients, other OMSs and market exchanges. These can be built in-house but are also available from several different vendors such as Fidessa [3], SunGard [12], FlexTrade [6] and Orc Group [11]. A clear distinction can be made between high-touch and low-touch OMSs. High-touch OMSs are systems that traders interact with whereas low-touch OMSs have little, if any, human intervention. Both types of systems use the FIX (Financial Information eXchange) protocol to communicate. Some examples of the dominant message types used in FIX are given in this section; a more thorough introduction is then given in Section 1.1.1. The protocol specification is also available online [4] for reference.

Thus there are three classes of actors to consider: clients, OMSs and exchanges. The situation is illustrated in Figure 1.1, which presents an overview of all actors involved and the flow. Rectangles labeled OMSP are Order-Management Systems that interact with traders sitting at desks using software. Rectangles labeled OMS are Order-Management Systems without human interaction. Rectangles labeled A are agencies/clients trading using the OMS flows. Finally, rectangles labeled M are market exchanges where sell orders get crossed with buy orders. Crossing in this context means matching buy and sell orders for trading.

The basic scenario involves clients issuing orders to OMSs. These OMSs then either propagate these orders to other OMSs or, if they have a market connection, submit them to a market. It is also possible for OMSs to be markets themselves in which case they have a built-in crossing engine (that is, software matching buy and sell orders for trading) and are called Multi-Lateral Trading Facilities (MTFs). Orders arriving to such an OMS may either cross with other matching orders or they may be sent to another OMS for execution.

Markets then send messages when orders are filled, partially filled, rejected or whenever some other event relevant to the order has occurred. These messages again propagate through the system of OMSs until reaching the original client. Clients also have the possibility of issuing cancel and replace requests as well as some other types of messages. The real-world scenario is more complicated, with human intervention and more.

Figure 1.1: Overview of actors involved and flow. A are agencies, OMSP are high-touch OMSs, OMS are low-touch OMSs and M are markets.

In this work, a simplification which considers only a few types of messages rather than the actual full set is suitable. Messages are either sent from clients (and OMSs) in a downstream direction towards exchanges or sent from exchanges in an upstream direction towards clients. Downstream message types to be considered include:

– New Order
– Cancel

– Cancel and Replace

while the only upstream message type is:

– Execution Report

Systems called feedhandlers can be created for gathering and normalizing order and execution data into databases. The data are used to produce reports for regulatory compliance, reports for clients and for internal use, and are also used to add value to other applications. The data are a wealth of information, potentially capable of improving algorithms and establishing facts to settle disputes. An example of an internal dispute is when, for two communicating OMSs, one accuses the other of not behaving according to policy. An example of an external dispute is when regulatory agencies need to know the exact sequence of messages in order to determine if regulations have been followed.


1.1.1 Data Description

This section aims to develop an understanding of the structure of the data involved. The data can broadly be separated into three categories. The first category consists of messages describing orders and executions. These messages follow some version of the FIX protocol.

The second category is financial market data. These data are provided by different sources and cannot be assumed to follow some standard specification, though Reuters market feeds dominate. The third category is general, containing any type of BLOB such as could be represented by an XML or JSON document.

FIX

FIX stands for The Financial Information eXchange Protocol and is a specification for the electronic communication of trade-related messages [4]. It is the de-facto message standard for pre-trade and trade communication globally within equity markets. It has been developed through collaboration of banks, exchanges, institutional investors and other actors in the financial industry. Further information and tutorials are readily available online [5] [1].

FIX messages are text messages with the structure controlled by key-value pairs connected by the equality sign and groupings controlled by braces. This shares similarities with other text protocols often specified through XML. In fact, FIX messages could easily instead be represented by XML documents if so desired.

An example execution report message can be seen below in Listing 1.1. For clarity this message has been cut in size, as indicated by the dots, as well as obscured to avoid revealing the identities of involved parties. Most elements are key-value pairs while more advanced elements at the end have multiple values as well as nested key-value pairs.

Examining this message we can see that it is an execution report (ExecutionReportMsg) which is also specified by the MsgType tag set equal to eight. The ExecType tag reveals that the message reports the status of the referenced order. The referenced order can be seen to have been a market order to buy 1200 shares of a security (the identity of the security has been cut from the message). The OrderCapacity is set to A for agency which means the original order was placed by a client (as opposed to a proprietary order). The ComplianceID is used to partially link the execution report to the original order.

Some information such as other identification fields (needed to fully link the execution report with previous execution reports and the original order) and the actual status of the order have been cut from this message. Examples of order statuses are new, partially filled and replaced.

Financial Market Data

Whereas order and execution data are generated by the OMS itself, market data come into the system via an external provider such as Reuters [2]. Market data are very important in finance. They are for example used for time-series analysis to study past volatility of prices and so predict future volatility. Volatility is one of the most important parameters when pricing instruments. Market data are also used to create volume profiles of instruments. Vol- ume, and specifically intra-day volume patterns, are very important for algorithmic trading systems as many algorithms require accurate predictions of intra-day volume.

Though this section does provide a very brief introduction, the interested reader is assumed to have other sources available for a deeper understanding. Information is widely available on the internet through, for example, the homepages of the major exchanges.


Listing 1.1: Example execution report message

FIX:
{
ExecutionReportMsg
(MsgType=8[ExecutionReport]) (ExecType=I[OrderStatus]) (OrdType=1[Market])
(Side=1[Buy]) (OrderQty=1200.0) (TimeInForce=0[Day])
(ComplianceID=1/SABERWOLF/REGION/partofid/20110222/XXX/1) (Price=0.0)
(OrderCapacity=A[Agency]) (LeavesQty=1200.0)
(ListName=BLUE 0222 3077814) (SenderCompID=SABERWOLF) (OnBehalfOfCompID=SABERWOLF) (Currency=GBP)
(AvgPx=0.0)
(SenderSubID=1570) ...
(BookingType=0[RegularBooking]) (ExDestination=LSE)
(TradeCaptureSystem=SABERWOLF[SABERWOLF]) (TradeExecutionSystem=SABERWOLF[SABERWOLF])
(NoPartyIDs [{Group (PartyRole=36[EnteringTrader]) (PartyIDSource=D[PropCode]) (PartyID=mominat)}
  {Group (PartyRole=11[InitiatingTrader]) (PartyIDSource=D[PropCode]) (PartyID=mominat)}
  {Group (PartyRole=13[OrderOriginator]) (PartyIDSource=D[PropCode]) (PartyID=YYY BANK OF XXX)}
  {Group (PartyRole=3[ClientID]) (PartyIDSource=D[PropCode]) (PartyID=XXX) (NoPartySubIDs
    [{Group (PartySubIDType=4002[ClientName]) (PartySubID=XXX)}
     {Group (PartySubIDType=4007[CustomerParentAccount]) (PartySubID=XXX)}])}
  {Group (PartyRole=24[CustomerAccount]) (PartyIDSource=D[PropCode]) (PartyID=XXX)}
  {Group (PartyIDSource=C[AccptMarketPart]) (PartyRole=22[Exchange])
    (NoPartySubIDs [{Group (PartySubIDType=25[LocationDesk]) (PartySubID= )}])}
  {Group (PartyRole=38[PositionAccount]) (PartyIDSource=D[PropCode]) (PartyID=XXX)}
  {Group (PartyRole=99[RR FC IR Number]) (PartyIDSource=D[PropCode]) (PartyID=61889)}])
}


Information on stock prices used to be transmitted over telegraph lines and printed out on ticker tape. In finance jargon, stock quotes are still referred to as tick data. These quotes are now delivered electronically and constitute arguably the most widely known form of financial data.

Figure 1.2 shows a snapshot of a table of stock quotes taken from one of the US stock exchanges. Symbols signify what stock (belonging to what company) is being traded. The bid price is the price offered to buy stocks while the ask price is the price given by sellers.

The difference is called the bid-ask spread.

For our purposes, market data are just a flat collection of key-value pairs and add no further complexities to what has already been described.

Figure 1.2: Market data for some stock symbols found on the BATS exchange.

1.1.2 Common Uses of Order-Management System Data

Data History

Financial orders can pass through any number of systems within and outside an investment bank’s infrastructure. The system that originated the order has no way of accurately predicting the path and the changes to the order. Tracing orders and gaining an understanding of data history is therefore an important problem.

An order’s lifetime is complicated. A system may decide to split an order. Thus an entity previously only associated with one ID will now be associated with at least two (further splits can occur as deemed necessary by other systems). Worse, orders may be merged. Thus two IDs will be replaced by one. There can also be multi-day orders even though all systems may not support such orders. These systems employ various methods in order to keep track of IDs. Multiple regions further complicate tracking of orders and IDs. In the end, an order may be associated with several different IDs for any multitude of reasons.

Data history is very important because regulations often require it in some way. For example, it is common to want to connect an original client order to all orders that were sent to the market as a result. This is used to provide execution summary reports to clients, where they can see the resulting orders and the prices that were achieved. For software modernization and improvement, it can be used to track actual results of actions.

The way data history is established is based on the FIX protocol. Again, because institutions may use a custom implementation of the FIX protocol within their own infrastructure, the specifics may differ from what is presented here. Figure 1.3 shows how orders within and between three different OMSs titled A, B, and C are linked together. The boxes represent messages, O stands for orders, and E stands for execution reports. The orders have been named A1, A2, B1, B2, C1, and C2. A2, B2, and C2 are cancel replacement orders for A1, B1, and C1 respectively. Further, B1 is a result of A1 passing through to OMS B. C1 is a result of B1 passing through to OMS C. The same relationship goes for the respective A2, B2, and C2 orders.

Each order has a client order ID, which is always created by the system issuing the order, and which is represented by the ClOrdID field in the associated FIX statement.

Each execution report for an order shares the ClOrdID of the order with which it is associated. This is illustrated in Figure 1.3 as an arrow, labeled ClOrdID, from the execution report stacked under order A1 to that order in the upper part for OMS A. Markets will preserve the client order ID value in execution reports delivered for this order. The situation for A2 is analogous.

Each order also has an original client order ID, which is represented by the OrigClOrdID field in the associated FIX statement. Orders which cancel and replace an existing order use this ID to refer to the original order’s client order ID. This is illustrated in Figure 1.3 as an arrow from A2 to A1 labeled A1.ClOrdID = A2.OrigClOrdID.

Finally, within an OMS, all of these orders (and their associated execution reports) can share a common order ID if so implemented. Order IDs are created by the broker and function as sell-side IDs, making them distinct from client order IDs, which are buy-side IDs.

This explains the relationships between all orders in OMS A in Figure 1.3. The situation is analogous for OMS B and OMS C.

Relationships between orders from adjacent OMSs are managed as follows. Each order has a secondary client order ID, which is represented by the SecondaryClOrdID field in the associated FIX statement. An order created in an OMS as a result of receipt of an order from another OMS uses the secondary client order ID to refer to the original order’s client order ID. This is illustrated in Figure 1.3 as an arrow labeled A1.ClOrdID = B1.SecondaryClOrdID for orders A1 and B1 belonging to OMS A and OMS B respectively. The relationship between related orders in OMS B and OMS C is analogous.

All of these orders and execution reports are related to each other, as they are all associated with the original order A1. To establish this, the compliance ID (ComplianceID field in FIX) is the same on all of the messages.

Together, OMS A, OMS B, and OMS C form an order-management flow.
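As an aside, the linkage rules above can be summarized in a small sketch (illustrative Java, not code from the thesis): execution reports share the ClOrdID of their order, a cancel/replace refers to its predecessor through OrigClOrdID, an order created downstream refers to its upstream parent through SecondaryClOrdID, and every message in the flow carries the same ComplianceID.

import java.util.List;
import java.util.Objects;

// Illustrative model of the FIX linkage fields described above
// (field names follow the text; the class itself is hypothetical).
public class OrderLinkage {

    record Order(String clOrdID, String origClOrdID,
                 String secondaryClOrdID, String complianceID) {}

    record ExecReport(String clOrdID, String complianceID) {}

    // A cancel/replace refers to its predecessor through OrigClOrdID.
    static boolean replaces(Order replacement, Order original) {
        return Objects.equals(replacement.origClOrdID(), original.clOrdID());
    }

    // A downstream order refers to its upstream parent through SecondaryClOrdID.
    static boolean childOf(Order downstream, Order upstream) {
        return Objects.equals(downstream.secondaryClOrdID(), upstream.clOrdID());
    }

    // Execution reports share the ClOrdID of the order they report on.
    static boolean reportsOn(ExecReport report, Order order) {
        return Objects.equals(report.clOrdID(), order.clOrdID());
    }

    // Everything belonging to the same flow carries the same ComplianceID.
    static boolean sameFlow(List<String> complianceIDs) {
        return complianceIDs.stream().distinct().count() == 1;
    }

    public static void main(String[] args) {
        Order a1 = new Order("A1", null, null, "C-1");
        Order a2 = new Order("A2", "A1", null, "C-1");   // cancel/replace of A1
        Order b1 = new Order("B1", null, "A1", "C-1");   // created in OMS B from A1
        ExecReport e1 = new ExecReport("A1", "C-1");

        System.out.println(replaces(a2, a1));   // true
        System.out.println(childOf(b1, a1));    // true
        System.out.println(reportsOn(e1, a1));  // true
        System.out.println(sameFlow(List.of(
                a1.complianceID(), a2.complianceID(), b1.complianceID()))); // true
    }
}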


Figure 1.3: Overview of the way related messages are linked in FIX within and between different OMSs. O are order messages and E are execution reports. A1 and A2 are related orders and the same goes for B1 and B2 as well as C1 and C2.


Verification of Software Contracts

Systems perform actions according to system rules. We distinguish between rules specified in external documents and internal rules.

Rules specified in external documents form a description of the functionality of the service. These can be used by humans using or dependent on the system to predict how it will act given an action taken by them. A major concern is being able to prove that the system acted according to its interface, for example in the case of a dispute arising from a user of the system who is dissatisfied with its operations.

Internal rules are specified by the developers and related to algorithms and the source code. Concerns include ensuring that the intentions behind the rule match the results given by adhering to it. This includes tracing results in other systems. It is also of interest to know how often the rule is referenced and under what conditions. Rules existing in source code can be used by developers for multiple purposes. First and foremost they can be used to ensure that the system does what the developers expect it to do.

These rules may be specified in external documents or exist explicitly in the source code.

Once verified, the outcome of following the rules may be studied, and the conclusions drawn can be used to further improve the systems. A more complex scenario would attempt to study how rules in different systems affect each other. Even though each system behaves as it should, there may be situations where rules in different systems counteract each other, affecting overall performance.

Software Modernization and Improvement

The data can be used to improve the software driving OMSs. This is especially useful for any algorithmic trading or pricing platform, as these produce more data than systems partially operated by humans. A rough estimate is that the volumes of these systems exceed those operated by traders by a factor of 10 or more. Given these volumes, a systematic way of improving the algorithms used through quantitative analysis is needed.

Data stored in such a way that they are readily queryable by the system can be used with data mining and/or machine learning techniques to improve the algorithms. As an example, an algorithmic trading system responsible for splitting large orders into several smaller orders could, upon receipt of a new order, study the results of actions taken previously upon receipt of similar orders, compare the performance of different strategies in terms of execution completion, transaction costs, market impact costs and more, and decide on a course of action using these calculated performance scores.
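As a toy illustration of that decision step (all names and the scoring formula below are invented here, not taken from the thesis), a system could average a performance score per strategy over historical runs on similar orders and pick the best one:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy example: pick the execution strategy whose historical runs on
// similar orders achieved the best average score (higher is better here).
public class StrategySelector {

    record Run(String strategy, double completion, double transactionCost, double impactCost) {
        // A simple composite score; real systems would weight these differently.
        double score() {
            return completion - transactionCost - impactCost;
        }
    }

    static String bestStrategy(List<Run> historicalRuns) {
        Map<String, Double> avgScore = historicalRuns.stream()
                .collect(Collectors.groupingBy(Run::strategy,
                        Collectors.averagingDouble(Run::score)));
        return avgScore.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Run> history = List.of(
                new Run("VWAP", 0.98, 0.10, 0.05),
                new Run("VWAP", 0.97, 0.12, 0.04),
                new Run("TWAP", 0.99, 0.20, 0.09));
        System.out.println(bestStrategy(history)); // VWAP
    }
}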

1.2 Motivation and Goals

The uses mentioned in Section 1.1.2 imply a data warehouse. Data from different OMSs need to be accessible in a common way rather than stored isolated on a per-OMS basis.

The lack of a data warehouse hinders linking related data across OMSs since the per-OMS storage solutions may diverge in structure and functionality. This in turn affects the delivery of data-history information. If data from OMSs are stored in isolated database management systems rather than in a common data warehouse, linking orders and executions across OMSs becomes fragile at best. Users of the data would have to adapt their programs to extract data from potentially heterogeneous sources and create detailed logic to remove any discrepancies with regard to how the data are stored. Several teams may end up doing this independently of each other and, aside from inefficiencies, changes to OMSs may cause inconsistencies.


Storing OMS data outside a data warehouse hinders effective verification of software contracts. Because there is no common data warehouse, teams managing different OMSs have no common source to use when settling disputes. Extraction and homogenization of data from different storage solutions and establishment of data history incur significant overhead.

An OMS that provides its own storage solution can use its own data to improve the software it consists of. Again, there is no easy access to data from other OMSs. Because an OMS rarely, if ever, works in isolation in practice, such a storage solution will be of limited use.

A basic data warehouse solution for plain order and execution data is straightforward.

Systems rely on the FIX protocol, and a common schema can be developed to store the generated data. This solves many issues, primarily concerning data history. The data stored would, however, be devoid of any information relating to the Order-Management Systems themselves. The services would focus on the data rather than the OMS itself. This leads to a loss of interesting information. There is, for example, no data on why an order was not placed. More generally, there is no data on why an event occurred.

This could be partly solved using system logs. Logs are used in most applications and are usually written to plain text files easily available for inspection by humans. The volumes involved with Order-Management Systems, however, make this solution impractical save for use when, for example, a system crashes. A single Order-Management System can receive more than 50 million orders on an active day, along with millions of execution reports associated with the orders. Even during very short intervals the volumes can become an unmanageable problem, with peaks around 7000 incoming messages per second. This hinders effective searching and investigation of log files, in addition to presenting a problem of sheer disk space if logs are to be saved in plain text format.

This implies a data warehouse solution for enriched OMS data rather than plain order and execution data. This solution must be based partly on a dynamic schema, as order and execution data, financial market data and, in general, arbitrary OMS data need to be stored together. This thesis aims to investigate the feasibility of, and implement, such a data warehouse solution using the industry-popular kdb+ as the underlying DBMS.

1.3 Contributions of this Thesis

The main contribution of this thesis is the design and implementation of a data warehouse solution called Extended Order and Execution Service (or EOES) for enriched OMS data, capable of storing any data from heterogeneous OMSs using existing technology. The system uses a dynamic database schema in order to accomplish this. The system consists of several components, all of which are needed for it to be useful.

This includes a blueprint feedhandler for accepting incoming data of any form from any source, which could be used to implement a production-ready feedhandler with superior throughput capacity, flexibility, fault-tolerance and so on. In addition, the design and implementation of a feedhandler aimed at maximizing message throughput is given. A feedhandler functions as the basic normalization component needed to massage data from heterogeneous sources into a common shape. The performance of feedhandlers is vital in financial-software infrastructure.

A lightweight client library has been developed that can easily be plugged into an existing OMS and enable that OMS to send data to the feedhandler. The library’s main functionality is the serialization of any object into an interoperable text-only message. The library also enables communication with the feedhandler.


Supporting software for analyzing the contents of a data warehouse using a dynamic schema, as presented in this thesis, has been developed. This provides a visualization of the structure of the data stored and enables more structured querying, both manually and from other systems. The information given includes the hierarchical structure, the names of all keys and the types of all the associated values.

Scripts that enable storing and querying historical data in the data warehouse have been developed to bridge the gap between the built-in functionality of the technology used and the basic requirement of including historical data in a data warehouse. These programs migrate data from realtime databases to the data warehouse. They further provide read functionality when the data warehouse is queried.

This thesis also demonstrates a potential usage of the system. A simple algorithmic trading system has been created that supports a single, well-known and widely used algorithm for trading large orders, called Volume Weighted Average Price (or VWAP). A simple market simulator, focusing solely on simulating the volume of orders for a single symbol, has been created and is used to feed the algorithmic trading system with data. The algorithmic trading system attempts to follow VWAP and stores data to EOES. An enhanced algorithm is then created that reads the stored data and uses a data-clustering technique to improve upon the original algorithm.

1.4 Thesis Outline

Appendices should be read and understood before continuing with the rest of the thesis.

They are provided for readers unfamiliar with the technology and financial concepts that this thesis relies heavily on. Appendix A gives an introduction to important software technology used in finance and relevant to this thesis: kdb+ and TIBCO Rendezvous. Appendix B provides a review of the field of algorithmic trading with particular focus on two of the most popular algorithm types: Volume Weighted Average Price (VWAP) and Implementation Shortfall (IS). Prerequisites such as the movement of stock prices are covered. This leads to the definition of important asset characteristics such as volatility. Transaction-cost analysis is identified as a major part of algorithmic trading.

Chapter 2 presents the design of a feedhandler aimed at high throughput capacity and low latency using the technology introduced in Appendix A. This feedhandler is designed for processing of regular order and execution data.

Chapter 3 presents the Extended Order and Execution Service which is the main system developed by this thesis. It is a family of system components that together form a data warehouse solution for extended order and execution data.

Chapter 4 connects previous chapters by demonstrating the use of a simple data-clustering algorithm in the context of an OMS providing the VWAP algorithm supported by EOES.

The results obtained are compared to a naive solution. Appendix B should be consulted prior to reading this chapter.

Some final conclusions are presented in Chapter 5.


Chapter 2

Handling Incoming Feeds

By a feedhandler, we mean a system that listens to feeds of data produced by OMSs. The data received are processed and stored in a database. They may also be sent to client systems by the feedhandler after processing. This chapter describes the requirements, design and implementation of a feedhandler listening to FIX data feeds of orders and execution reports produced by OMSs.

2.1 System Requirements

The volume of messages and the throughput with which messages are created by OMSs varies between OMSs, days and significantly intraday. This places pressure on the software accepting the incoming feeds. Section 1.1 briefly mentioned high-touch vs low-touch systems.

Because high-touch systems require human intervention, the volume of data produced by them is less than that produced by low-touch systems.

Figure 2.1: Volume for symbol BARC.L

Figure 2.1 shows the volume during a period of time for the symbol BARC.L, which is Barclays PLC trading on the London Stock Exchange. The aggregate volume is generally large, and a system capable of handling such a volume of order and execution data efficiently is necessary. This consideration is an important parameter in the design of the architecture for a system used to accept and process order and execution data, as a low throughput capacity can result in an unacceptable latency between receiving a message and its availability in the database.

Different levels of time granularity give different insights into the volume of data produced. Figure 2.1 shows total volume per day for a number of days. On that time scale, volume increases during periods of uncertainty, high volatility and crises. Within a day, volume profiles of symbols, and of stock markets in general, often display some day-invariant pattern. A common pattern is the U-shape, with increased volume at the open and the close. Within a single hour, there may be minutes that contain bursts of activity.

A feedhandler should be capable of dealing with these bursts rather than simply keeping up with the average throughput.

An application written in a high-level language such as Java uses some messaging setup to receive incoming data. The received data are then typically processed in stages. The processing can include translation and transformation of values before database records are created and stored in the database. By translation we mean replacing a value with a synonym. By transformation we mean performing an operation on one or several values in order to produce a new value. Post-processing, the feedhandler may broadcast the records produced to systems subscribing to the data feed produced by the feedhandler.

Batch processing is typically employed as it decreases overhead. This increases throughput while affecting latency negatively. There is a trade-off between high throughput capability and latency. The volumes may not allow small batch sizes, while systems using the records produced by the feedhandler (either through queries or a subscription) may not function properly if the latency is too high, because of the time value of information.

2.2 System Design and Implementation

The feedhandler created as part of this thesis has two parts: the server, written in Java, accepting the incoming messages, and the kdb+ database ultimately storing the records produced. The processing performed in the kdb+ process is limited in comparison to that performed in the server process. The server process needs to translate text or binary data into objects. These objects may then need to be manipulated. Finally, they have to be converted into bytes that can be sent to the kdb+ process according to the kdb+ inter-process communication protocol. The IPC communication is done through sockets.

The raw data are sent to the server using TIBCO Rendezvous. The server can easily be configured to accept feeds from a range of different broadcast groups and topics using the TIBCO Rendezvous API. As explained in Section A.2, messages are stored in the ledger until they have been confirmed by the feedhandler. This means that, in addition to systems depending on the state of the database being near real-time, a low throughput capacity can result in disk space issues and increased memory usage, ultimately resulting in an out-of-memory error if the ledger structure grows faster than it can be consumed.

The system described is illustrated in Figure 2.2. The feedhandler views a configurable batch size of messages as a work item and utilizes a thread pool to process these batches in parallel. Each message arrives in the FIX format and is parsed into Java objects resembling an XML structure compatible with the database schema. The Java objects are then converted into bytes that the kdb+ database understands and sent to it via sockets.

Figure 2.2: Overview of feedhandler solution. TIBCO RV is the messaging system. Messages are then processed by the Java implementation and sent to the kdb+ process for storage.

All required processing is trivially parallelizable, but the order of the received messages is often required to be kept intact. This keeps different timestamps from showing time inconsistencies and gives the true chronological view of what actually happened. Therefore, all batches are associated with a sequence number that is monotonically increasing throughout the day. Before being stored in the database, the batches are placed in a queue that keeps track of which batch should be stored next. This maintains the proper ordering of messages, as messages are correctly ordered within each batch as well.
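To make the ordering scheme concrete, the following sketch (illustrative Java, not the thesis implementation) processes batches in parallel but stores them strictly in sequence-number order: workers submit finished batches to a priority queue keyed on the batch number, and a single writer only releases a batch when its number is the next one expected.

import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.PriorityBlockingQueue;

// Sketch of in-order storage of batches processed out of order by a thread pool.
public class BatchReorderer {

    record Batch(long seqNo, List<String> processedRecords) {}

    private final PriorityBlockingQueue<Batch> ready =
            new PriorityBlockingQueue<>(64, Comparator.comparingLong(Batch::seqNo));
    private long nextToStore = 0;

    // Called by worker threads once a batch has been parsed and converted.
    public void submit(Batch batch) {
        ready.put(batch);
    }

    // Single writer thread: store batches strictly in sequence-number order.
    public void drain() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            Batch head = ready.take();
            if (head.seqNo() != nextToStore) {
                ready.put(head);     // not its turn yet; put it back and wait briefly
                Thread.sleep(1);
                continue;
            }
            store(head);
            nextToStore++;
        }
    }

    private void store(Batch batch) {
        // Placeholder for the kdb+ IPC write described in the text.
        System.out.println("storing batch " + batch.seqNo());
    }

    public static void main(String[] args) throws Exception {
        BatchReorderer r = new BatchReorderer();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (long i = 0; i < 8; i++) {
            final long seq = i;
            pool.submit(() -> r.submit(new Batch(seq, List.of("msg-" + seq))));
        }
        Thread writer = new Thread(() -> {
            try { r.drain(); } catch (InterruptedException ignored) { }
        });
        writer.start();
        pool.shutdown();
        Thread.sleep(200);   // let the example run briefly
        writer.interrupt();
    }
}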

The implementation was tested by sending 400 000 messages consecutively using machines with a low workload. The result is a stable average of 1000 messages per 13 milliseconds, that is, a peak capacity of over 76 000 messages per second (1000 / 0.013 s ≈ 77 000). These numbers were obtained for messages containing FIX data and are thus indicative of performance capacity in a production setting. Different numbers may apply depending on the size of data in each message. Different numbers may also result depending on the load on the machine and, in general, other external factors. The machines were typical high-end servers, circa 2011, with 8 cores at 2.0 GHz and 128 GB of DDR2 RAM.


Chapter 3

Extended Order and Execution Service

In a finance setting of order, execution and tick data, a potential data warehouse solution could be that of Figure 3.1. The producers of the raw data are external providers, such as Reuters for market data, and OMSs for order and execution data. The data then pass the feedhandler layer, where they may be cleaned and reformatted, before entering intraday, volatile storage (with continuous backups on both sending and receiving sides). This storage is typically in-memory and used by applications requiring real-time data.

The data are migrated at intervals, say nightly, to non-volatile storage. This layer could potentially perform any cleaning and reformatting needed. The data are then finally accessible as part of the data warehouse. The bottom of Figure 3.1 shows the combined data warehouse consisting of order, execution, and tick data.

The entire system which has been developed for this thesis is called Extended Order and Execution Service and will be referred to as EOES for short. Referring to Figure 3.1, EOES includes subsystems that are a part of the OMS layer, the feedhandler layer, the resave layer, the data warehouse and includes applications built on-top of this.

The log4eoes subsystem is a lightweight logging system to be used by OMSs for publishing data. The feedhandler4eoes subsystem is a feedhandler server capable of processing any custom message published through the use of log4eoes. The eoes realtime subsystem is a kdb+ database that, coupled with feedhandler4eoes, forms the complete feedhandler solution. The eoes historical subsystem is the data warehouse for extended order and execution data. The Empirical Analyzer is a support system used to make sense of, and enable easier querying of, the data warehouse.

Figure 3.2 shows parts of the system in action upon receipt of an order. In this example, the order is placed by an institutional investor such as a hedge fund or an endowment in step 1. The order arrives at OMS A in step 2. A trader at one of the associated desks decides how to work the order and the event is logged in step 3 before being forwarded to OMS C in step 4. This OMS has market connections and sends the order(s) to one or more markets after, for example, potentially splitting the incoming order(s). The event is again logged. All events are logged using log4eoes; feedhandler4eoes processes the incoming logs and they are subsequently available in eoes realtime. Received execution reports are logged similarly.

Figure 3.1: Finance Data Warehouse Setup. Data is produced by OMSs and market feeds. They are first stored in in-memory databases before being sent to historical storage as part of a nightly job.

Figure 3.3 shows an application of eoes realtime and refers back to Section 1.1.2. The two teams jointly inform the technology support team of the details (such as time, identification numbers and anything else deemed relevant). Through interactive data exploration, all relevant events can be found in eoes realtime and a quick report with the events in chronological order can be prepared and distributed back to the teams. The power lies in the report containing more than basic information about the orders and executions; it will contain the additional information logged by the OMSs.

Moving data from eoes realtime to eoes historical is a simple job. Once historical, the data can be used by an algorithmic trading OMS (that has previously logged events). As first stated in Section 1.1.2, the data can for example be used for back-propagation testing, software error resolution and machine learning applications.

Figure 3.4 shows a scenario similar to the one in Figure 3.2. The difference here is that OMS C uses data from eoes historical as input to a machine learning algorithm to work the order(s). The power again lies in the data being rich with information rather than stripped to the bare essentials.

3.1 log4eoes

The log4eoes subsystem constitutes the logging system used by software that wishes to publish logs. The design aims for this system to be lightweight and to allow for lazy exploitation. By lazy exploitation is meant a non-intrusive API that clients can use without much adaptation.

Figure 3.2: Flow of an Institutional Order

The signature for logging data is as follows (in Java code):

log(LogLevel logLevel, String system,
    String project, String rule,
    String miscellaneous, Object object)

with some overloads that allow for increased flexibility (for example being able to add a custom date rather than the current date). This signature allows for logging using a log level enumeration with members: DEBUG(0), INFO(1), WARN(2), ERROR(3) and FATAL(4).

DEBUG is the lowest filter and FATAL is the highest. The filters are relevant when performing queries against the data warehouse as only events above, below or at a certain level may be of interest.

The argument system is the name of the system logging the event and is a string. The argument project can be used by a system to further specify what is being logged. The argument rule is used to signify what specific system rule caused the event and is a string as well to allow for flexibility. The argument miscellaneous can be used to specify any other details that don’t fit in the other columns and is convenient to have easily accessible.

The argument object is an object of any type that is to be stored and associated with the event. Every object is really an object-graph of key-value pairs where the values may again be objects themselves. This enables logging of any valid Java object.

Log arguments are placed in a new LogData instance. LogData is a simple class that just stores the arguments. This instance is then serialized to a text-based format. This can be any format but the current implementation uses JSON [23]. The messages in JSON format are then sent using REST-style POST requests [24] to a server capable of deserializing and storing the content (see Section 3.2 below). JSON serialization is done via the Google Java library GSON [7]. This transforms any object into an equivalent JSON string.

A settings file enables configuration of endpoint address to the REST web service, set- tings, the name of the system and the log level.
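To make the publishing path concrete, here is a minimal, self-contained sketch of a log4eoes-style client. It is illustrative only: the log() signature, the LogLevel enumeration and the LogData holder follow the description above, while the class name, field names and endpoint URL are assumptions, as is the use of HttpURLConnection for the REST-style POST.

import com.google.gson.Gson;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical sketch of the log4eoes client path: wrap the arguments in a
// LogData instance, serialize to JSON with GSON and POST to feedhandler4eoes.
public class Log4EoesSketch {

    // Ordinals match the levels in the text: DEBUG(0) ... FATAL(4).
    enum LogLevel { DEBUG, INFO, WARN, ERROR, FATAL }

    // Plain holder for the log arguments (field names are illustrative).
    static class LogData {
        LogLevel logLevel;
        String system, project, rule, miscellaneous;
        Object data;

        LogData(LogLevel lvl, String system, String project,
                String rule, String misc, Object data) {
            this.logLevel = lvl; this.system = system; this.project = project;
            this.rule = rule; this.miscellaneous = misc; this.data = data;
        }
    }

    private static final String ENDPOINT = "http://localhost:8080/log"; // assumed address

    public static void log(LogLevel logLevel, String system, String project,
                           String rule, String miscellaneous, Object object) throws Exception {
        String json = new Gson().toJson(
                new LogData(logLevel, system, project, rule, miscellaneous, object));
        HttpURLConnection conn = (HttpURLConnection) new URL(ENDPOINT).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(json.getBytes(StandardCharsets.UTF_8));
        }
        conn.getResponseCode(); // force the request; a real client would check the status
    }

    public static void main(String[] args) throws Exception {
        // Any Java object can be attached; here a simple key-value map.
        log(LogLevel.DEBUG, "algo", "general", "test1", "sample test",
            Map.of("OrderID1", "PUMA.2011.11.20.D.ABCD", "Price", 30.00));
    }
}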


Figure 3.3: Usage During Internal Dispute

3.2 feedhandler4eoes

The feedhandler4eoes accepts incoming log requests and attempts to store the log data in the kdb+ database. In production use, the feedhandler would likely use the same software and follow the same architecture as that described in Chapter 2. In the interest of simplicity and testability, however, TIBCO Rendezvous was not added as a dependency to this project. No batch processing or multithreading is used either, for the same reason.

The project is a REST-style web service and accepts POST requests to accept incoming logs. The implementation uses the lightweight HTTP server Jetty [8] to enable this scenario.

Currently, all requests other than POST are ignored. It would be possible to extend the web service to enable, for example, GET requests to read data from the database. Rather than extending feedhandler4eoes, however, it would be more suitable to create a wholly new web service that reads from the data warehouse rather than the kdb+ database written to by feedhandler4eoes. Furthermore, kdb+ databases are readily accessed as services themselves.

GSON is used to deserialize the received JSON string. Because any object, whether it is known as a type by feedhandler4eoes or not, is to be capable of being logged, a custom parser is used. The parser transforms the parse tree generated from the string into an object-graph structure using dictionaries. The data are finally sent to the kdb+ database.
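The parsing step can be pictured as follows. This is a hedged sketch rather than the thesis code: it uses GSON's JsonParser to turn an arbitrary JSON payload into nested Java maps and lists, mirroring the dictionary-based object graph that is ultimately sent to kdb+. The class and method names are invented for the example.

import com.google.gson.JsonElement;
import com.google.gson.JsonParser;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turn an arbitrary JSON payload into a nested
// dictionary/list structure, which can then be converted to kdb+
// dictionaries and lists for storage in the complex "data" column.
public class JsonToObjectGraph {

    static Object toGraph(JsonElement e) {
        if (e.isJsonObject()) {
            Map<String, Object> dict = new LinkedHashMap<>();
            e.getAsJsonObject().entrySet()
                    .forEach(kv -> dict.put(kv.getKey(), toGraph(kv.getValue())));
            return dict;
        }
        if (e.isJsonArray()) {
            List<Object> list = new ArrayList<>();
            e.getAsJsonArray().forEach(el -> list.add(toGraph(el)));
            return list;
        }
        if (e.isJsonNull()) {
            return null;
        }
        // Primitive: keep numbers as numbers, everything else as a string.
        return e.getAsJsonPrimitive().isNumber()
                ? e.getAsJsonPrimitive().getAsNumber()
                : e.getAsJsonPrimitive().getAsString();
    }

    public static void main(String[] args) {
        String json = "{\"a\":{\"x\":1,\"y\":[2,3]},\"b\":\"text\"}";
        Object graph = toGraph(JsonParser.parseString(json));
        System.out.println(graph); // {a={x=1, y=[2, 3]}, b=text}
    }
}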

3.3 kdb+ Systems

Figure 3.4: Machine Learning Scenario Using Historic Data

Two systems relying on kdb+ have been implemented. The first system uses an in-memory kdb+ database and receives log data from a feedhandler4eoes instance. This system is called eoes realtime. The database contains one table and has a stored procedure addLog that accepts a record to insert into the table. The schema definition of the table is given in Listing 3.1. All columns are defined as untyped general lists. Upon receipt of the first record, all columns will receive correct types except for the data column, which is the complex column.

Listing 3.1: logTable schema

logTable:([]
  date:(); logkey:(); oms:();
  project:(); rule:(); misc:(); loglvl:();
  transacttime:(); lastmodified:(); data:());

At the beginning of a new day, a resave job is responsible for migrating the data from the previous day to a historical database. This system is called eoes historical. Because the table includes a complex column, it can’t be partitioned using standard kdb+ commands. A different solution from the standard one provided by kdb+ is therefore used: serialization of data combined with a custom folder structure. A subset of the columns common to all log records is used to segregate data into a folder structure somewhat similar to that used by kdb+ for partitioned databases.

Data are arranged first by date, then by OMS and finally by project. This is logical because date provides the largest restriction of data and eliminates almost all other records in the historical database from further consideration. This is also the strategy used by kdb+ for arranging data in partitioned databases (the partitioning column can be of any type that is based on an integer). Under a date folder, a folder for each known OMS is found. Under an OMS folder, the actual data files reside, named after the project. Note that the date, OMS and project columns are all mandatory when performing queries. Figure 3.5 shows the folder structure implied by this organization.

Figure 3.5: Folder structure of a historical database

This format can be accomplished through a script such as that in Listing 3.2. Here, triples of date, OMS and project are generated. These are then used to select the relevant data, create the path to save the data to, and finally save the data through regular serialization.

Listing 3.2: Extracts triples from the date, OMS, and project and creates paths

dateTime:.z.Z;
saveDate:dateTime.date - 1;
omsList:exec distinct oms from logTableLive;
dtOmsProjTriples:{[omssrc]
  projList:exec distinct project from logTableLive where oms=omssrc;
  {[omssrc;p] (dt;omssrc;p)}[omssrc] each projList
  } each omsList;
...

The second system, the historical database, uses this setup. Access to the historical database is restricted to a stored procedure, query, that accepts as arguments lists of the desired OMSs and projects to query, as well as a custom query encoded as a string. The string is parsed into a parse tree and then evaluated. It should always query the table called logTable.

The contents of the logTable are determined by joining together all tables currently available in memory. The arguments provided to the stored procedure regarding OMSs and projects are used to build paths to serialized data files that will be necessary to load in order to complete the query. A table in the database keeps track of all loaded tables.

A configuration setting limits the number of tables that can be loaded at the same time to avoid memory exhaustion. If a table is already loaded, it is not loaded again. Newly referenced tables are however placed first in the table of loaded tables. This is a kind of temporal software-caching to make sure frequently referenced data is not unloaded and reloaded unnecessarily.
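The path layout and the bounded table cache can be illustrated with a small sketch. The real implementation is q code inside the historical database; the Java below, with all names invented, only mirrors the described behaviour: build a date/OMS/project path, load a serialized table at most once, and evict the least recently referenced table when the configured limit is exceeded.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch (not the thesis implementation, which lives in q):
// build the folder path used by eoes historical and keep at most N tables
// loaded at once, evicting the least recently referenced one first.
public class HistoricalTableCache {

    private final LinkedHashMap<String, Object> loaded;

    public HistoricalTableCache(int maxLoaded) {
        // accessOrder=true makes iteration order "least recently used first".
        this.loaded = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                return size() > maxLoaded; // unload the least recently used table
            }
        };
    }

    // Path layout: <root>/<date>/<oms>/<project>, mirroring Figure 3.5.
    static String pathFor(String root, String date, String oms, String project) {
        return root + "/" + date + "/" + oms + "/" + project;
    }

    // Load a serialized table unless it is already cached.
    public Object table(String path) {
        return loaded.computeIfAbsent(path, this::deserialize);
    }

    private Object deserialize(String path) {
        // Placeholder for reading the serialized q table from disk.
        return "table@" + path;
    }

    public static void main(String[] args) {
        HistoricalTableCache cache = new HistoricalTableCache(2);
        cache.table(pathFor("/hdb", "2011.11.20", "algo", "general"));
        cache.table(pathFor("/hdb", "2011.11.21", "algo", "general"));
        cache.table(pathFor("/hdb", "2011.11.20", "nx", "discrepancy")); // evicts the oldest
        System.out.println(cache.loaded.keySet());
    }
}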

Querying of complex columns was demonstrated in Section A.1.

3.4 Empirical Analyzer

Integration is a vital part of data warehouse solutions. Given the requirements, the systems described so far cannot fulfill this role adequately. Common fields used by all records are integrated to a certain extent, but no common format is defined for, for example, the rule field.

Using exploration, records can still be viewed and queried. But for programmatic access and data selection, extraction, and data mining, more powerful methods are required. The Empirical Analyzer attempts to solve part of this problem.

The Empirical Analyzer is a support system that, given some selection criteria (expressed as a kdb+ select expression), mines the selected records and gathers data on the structure of their complex column. It collects information on what attributes are found at the different levels of the nested dictionaries found in the complex values. The name, type and count (number of occurrences) of the attribute are recorded. For levels beneath the first, the parent of the attribute is also recorded to create a tree view of the implicit object graph stored.

This gives a good view of the actual structure of records stored in the data warehouse.

It can be used to make informed decisions on what and how to select and extract data from the data warehouse in case a data mining method is to be employed. It can also be used to find faulty records in case tested stored procedures fail with kdb+ type errors (which are notoriously hard to debug).

In the tree, each nested dictionary results in one new node for every identifier contained therein. Every list found also results in a new node. Once all nested dictionaries and lists (in whatever order) have been traversed, information on the basic values (i.e., values of type integer, symbol and so on) contained therein is gathered and stored in a new node. Listing 3.3 creates several complex records. The resulting table can then be mined using the Empirical Analyzer to give a visual overview of the structure of the records stored.


Listing 3.3: Creation of a sample database

////////////
/// Day 1
////////////
data1:`a`b`c!100 200 300;
data2:`a`b`c!(extraDict1;100;`basdf);
data3:`d`e`f`g!("string1";"string2";100;`oval);
data4:`a`b`f`g!(extraList1;extraList2;extraDict2;extraDict3);
`deb20111120algogeneral insert (2011.11.20;"2011.11.20.algo.general.1";
  `algo;`general;"test1";"sample test";`DEBUG;
  2011.11.20D10:00:00.000;2011.11.20D10:00:00.300;data1);
`deb20111120algogeneral insert (2011.11.20;"2011.11.20.algo.general.2";
  `algo;`general;"test1";"sample test";`DEBUG;
  2011.11.20D10:30:10.040;2011.11.20D10:30:10.534;data2);
`deb20111120algogeneral insert (2011.11.20;"2011.11.20.algo.general.3";
  `algo;`general;"test1";"sample test";`DEBUG;
  2011.11.20D14:04:00.000;2011.11.20D14:04:00.356;data3);
`deb20111120algogeneral insert (2011.11.20;"2011.11.20.algo.general.4";
  `algo;`general;"test1";"sample test";`DEBUG;
  2011.11.20D15:00:55.400;2011.11.20D15:00:56.102;data4);
////////////
/// Day 2
////////////
data1:`a`b`c`d`e`f!(extraDict4;extraList3;100;200;300;"asdf");
data2:1000 2000!`a`b;
data3:`a`b`c!(extraList4;100;2011.11.20D10:00:01.000);
`deb20111121algogeneral insert (2011.11.21;"2011.11.21.algo.general.1";
  `algo;`general;"test1";"sample test";`DEBUG;
  2011.11.21D11:10:30.040;2011.11.21D11:10:30.704;data1);
`deb20111121algogeneral insert (2011.11.21;"2011.11.21.algo.general.2";
  `algo;`general;"test1";"sample test";`DEBUG;
  2011.11.21D11:11:22.311;2011.11.21D11:29:56.122;data2);
`deb20111121algogeneral insert (2011.11.21;"2011.11.21.algo.general.3";
  `algo;`general;"test1";"sample test";`DEBUG;
  2011.11.21D16:00:01.211;2011.11.21D16:00:01.500;data3);
////////////
/// Day 3
////////////
data1:`a`b!10 10;
data2:`a`b!(extraList5;"asdf2");
`deb20111122algogeneral insert (2011.11.22;"2011.11.22.algo.general.1";
  `algo;`general;"test1";"sample test";`DEBUG;
  2011.11.22D16:16:30.040;2011.11.22D16:16:30.704;data1);
`deb20111122algogeneral insert (2011.11.22;"2011.11.22.algo.general.2";
  `algo;`general;"test1";"sample test";`DEBUG;
  2011.11.22D16:17:30.780;2011.11.22D16:17:30.804;data2);


Figure 3.6 shows the result of querying the data using the Empirical Analyzer. All basic values are shown together under a single Base node, each dictionary gets its own node with a custom name, and lists become nodes with List as heading and their contents displayed underneath.

More commonly, the structure will be fairly simple. An example of a much more straightforward structure is shown in Listing 3.4. Mining this structure with the Empirical Analyzer gives the result shown in Figure 3.7. As can be seen, this structure contains no nested lists or dictionaries; it is simply a flat mapping from keys to values.

Figure 3.6: A complex sample schema


Listing 3.4: Sample database using a simpler structure

////////////
/// Day 1
////////////
data1:`OrderID1`ClOrderID1`OrderID2`ClOrdID2`SecurityID`Price`ReferencePrice!
    ("PUMA.2011.11.20.D.ABCD";"ABCD";
     "PUMA.2011.11.20.D.EFGH";"EFGH";
     `MSFT;30.00;33.50);
data2:`OrderID1`ClOrderID1`OrderID2`ClOrdID2`SecurityID`Price`ReferencePrice!
    ("VF1.2011.11.20.D.HIJK";"HIJK";
     "PUMA.2011.11.20.D.RATT";"RATT";
     `GOOG;60.50;65.50);
data3:`OrderID1`ClOrderID1`OrderID2`ClOrdID2`SecurityID`Price`ReferencePrice!
    ("VF1.2011.11.20.D.ORVD";"ABCD";
     "NP3.2011.11.20.D.GATT";"GATT";
     `NMRA;3.00;4.50);

`deb20111120nxdiscrepancy insert (2011.11.20;"2011.11.20.nx.discrepancy.1";
    `nx;`discrepancy;"reference price mismatch";"sample test";`DEBUG;
    2011.11.20D11:10:30.491;2011.11.21D11:10:30.714;data1);
`deb20111120nxdiscrepancy insert (2011.11.20;"2011.11.20.nx.discrepancy.2";
    `nx;`discrepancy;"reference price mismatch";"sample test";`DEBUG;
    2011.11.20D11:10:30.138;2011.11.21D11:10:30.124;data2);
`deb20111120nxdiscrepancy insert (2011.11.20;"2011.11.20.nx.discrepancy.3";
    `nx;`discrepancy;"reference price mismatch";"sample test";`DEBUG;
    2011.11.20D11:10:30.243;2011.11.21D11:10:30.660;data3);

////////////
/// Day 2
////////////
data1:`OrderID1`ClOrderID1`OrderID2`ClOrdID2`SecurityID`Price`ReferencePrice!
    ("VF1.2011.11.21.D.ABCD";"ABCD";
     "NP3.2011.11.21.D.EFGH";"EFGH";
     `MSFT;30.00;33.50);
data2:`OrderID1`ClOrderID1`OrderID2`ClOrdID2`SecurityID`Price`ReferencePrice!
    ("PUMA.2011.11.21.D.HIJK";"HIJK";
     "PUMA.2011.11.21.D.RATT";"RATT";
     `VOD;15.00;13.50);

`deb20111121nxdiscrepancy insert (2011.11.21;"2011.11.21.nx.discrepancy.1";
    `nx;`discrepancy;"reference price mismatch";"sample test";`DEBUG;
    2011.11.21D11:15:53.192;2011.11.21D11:15:54.002;data1);
`deb20111121nxdiscrepancy insert (2011.11.21;"2011.11.21.nx.discrepancy.2";
    `nx;`discrepancy;"reference price mismatch";"sample test";`DEBUG;
    2011.11.21D11:16:40.323;2011.11.21D11:16:44.998;data2);

////////////
/// Day 3
////////////
data1:`OrderID1`ClOrderID1`OrderID2`ClOrdID2`SecurityID`Price`ReferencePrice!
    ("PUMA.2011.11.22.D.ABCD";"ABCD";
     "VF1.2011.11.22.D.GATT";"GATT";
     `MSFT;31.00;34.50);
data2:`OrderID1`ClOrderID1`OrderID2`ClOrdID2`SecurityID`Price`ReferencePrice!
    ("VF1.2011.11.22.D.ORVD";"ORVD";
     "PUMA.2011.11.22.D.EFGH";"EFGH";
     `MSFT;32.50;36.50);

`deb20111122nxdiscrepancy insert (2011.11.22;"2011.11.22.nx.discrepancy.1";
    `nx;`discrepancy;"reference price mismatch";"sample test";`DEBUG;
    2011.11.22D08:33:12.212;2011.11.22D08:43:10.414;data1);
`deb20111122nxdiscrepancy insert (2011.11.22;"2011.11.22.nx.discrepancy.2";
    `nx;`discrepancy;"reference price mismatch";"sample test";`DEBUG;
    2011.11.22D11:10:30.491;2011.11.22D11:10:30.016;data2);


Figure 3.7: Simple key-value schema inspection.


Chapter 4

Using EOES to Improve Intra-Daily Volume Prediction

4.1 Introduction

Algorithmic trading is a large and dynamic field. Its size has been expanding as a result of increased technology adoption in the financial markets and the expansion of the financial markets in general, and the field keeps changing through innovation, technological advancement and regulation.

An overview is given in Appendix B. Market microstructure, transaction-cost analysis and algorithms are covered there. The appendix should be used as support and reference when reading this chapter. This introduction uses a sample scenario to intuitively provide the reader with prerequisites required to read the rest of the chapter.

In this example we assume a mutual fund has a portfolio of some forty stocks. An investment decision is made to sell some minor holdings in exchange for a significant holding in another company, ABC. We focus on the purchasing of shares of company ABC.

Stock markets follow the same laws of supply and demand present in all markets. Intuitively, if there are more sellers than buyers, the buyers have a stronger position and the price is driven downwards. A new player entering the market with an order will affect the imbalance between supply and demand in proportion to how her order size compares to the typical turnover of the equity traded. Thus a small order in absolute terms may have a much larger impact on the price of an equity with a small daily turnover than a large order in absolute terms has on the price of an equity with a large daily turnover.

In finance, this effect on price is referred to as market impact. A distinction is made between temporary and permanent market impact. Temporary market impact occurs as a result of the instantaneous imbalance created between supply and demand relative to before the order was placed. As time passes, the price level partially reverts to that before the order was placed. Permanent market impact instead refers to the permanent change in price level as a result of the order leaking information that the previous level was inefficiently priced.

The difference between the original price and the price after reversion from the level caused by temporary market impact is the price change caused by permanent market impact.

Making our example more concrete, we assume that the average daily volume for ABC is 600000 shares and that we intend to buy 2 million shares in total.


Placing a single order to buy 2 million shares against an average daily volume of 600000 would result in enormous market impact, as the order size is larger than the average volume traded during an entire day. It is therefore logical to "work the order" over several days. The order can, for example, be divided into ten equal pieces, $X = X_1 + X_2 + \cdots + X_{10}$.

This gives each sub-order $X_i$ a size of 200000 shares, which still accounts for more than 33% of the average daily volume and is therefore a significant amount. Each sub-order can be further divided into additional sub-orders that are traded during different intervals of the day. If we arbitrarily pick 16 intervals, we have $X_i = \sum_{j=1}^{16} x_j$. This time, however, we may want to choose a partition scheme other than equal size in each interval.

The reasoning behind this is that many symbols, and the market as a whole, show volume patterns in which different volumes are traded during different times of the day. The most common pattern is a U-shaped pattern with increased volume at the open and at the close. It is common to create a historical volume profile from, say, the last 30 days of trading. This profile can then be used to ensure that more shares are traded during intervals when there is more volume in the market for the symbol in general. By the same reasoning as before, even though the absolute size of the trades is large, as long as they match the trade sizes during other intervals in relative terms, the market impact (temporary and permanent) will not be excessive. Similarly, if the naive partition scheme is used, significant market impact may occur during intervals of low market volume.
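As a sketch of how such a profile could be computed in q, the following assumes a hypothetical trade table with columns date, sym, time and size; the table and column names are assumptions and are not part of the EOES schema.

/ Bucket each day's trades for ABC into 30-minute intervals, convert the
/ interval volumes to fractions of that day's total volume, and average the
/ fractions over the last 30 trading days to obtain a historical profile.
t:select from trade where sym=`ABC, date within (.z.d-31;.z.d-1)
bins:select vol:sum size by date, bucket:30 xbar time.minute from t
fracs:update frac:vol%sum vol by date from 0!bins
profile:select avg frac by bucket from fracs

The resulting profile maps each intraday interval to the average fraction of the daily volume traded in that interval, which is exactly what is needed to weight the sub-orders.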

As implied in the previous discussion, minimizing market impact is a common goal when placing large orders. Decreasing the size of each sub-order will lead to less and less market impact. However, there are almost always constraints on the way the order can be handled.

For example, the investor may be wary of the price moving far away from the initial decision price. This is referred to as timing risk.

If we assume that the mutual fund is not too concerned with timing risk, it will accept dividing the order over several days as stated above and, during each day, use an algorithm aimed at minimizing market impact. One such algorithm is based on the Volume Weighted Average Price (VWAP). Details can be found in Appendix B, but the definition of the benchmark is straightforward:

$$\mathrm{VWAP} = \frac{\sum_j p_j v_j}{\sum_j v_j}$$

This is simply the total value of all trades (price times volume at each time $j$) divided, or "weighted", by the total volume traded. It can be shown that to follow VWAP is to trade in proportion to the market volume.
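In q, the benchmark can be computed directly with the built-in weighted-average primitive wavg. The trade table below is a toy example; the column names price, size and sym are assumptions.

/ Toy trade table
trade:([] sym:`ABC`ABC`ABC; price:10.00 10.10 10.05; size:100 300 200)
/ VWAP per symbol: sum(price*size) % sum(size)
select vwap:size wavg price by sym from trade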

In this chapter, we describe the simulation of an OMS offering the VWAP algorithm and use the EOES described in Chapter 3 to log its data. The same OMS then uses these data to improve upon the VWAP algorithm through the data mining techniques covered in this chapter; this is one of the uses of OMS data given in Section 1.1.2.


4.2 Data-clustering for Intra-Daily Volume Prediction

Data-clustering is a part of data mining; an educational introduction to both is given in [25]. This section gives a summary of instance-based learning.

Data classification takes an instance and assigns it to one of a predefined set of classes. A class need not be a noun: a classification algorithm could, for example, classify an instance of weather conditions into either the class play or the class stay inside.

An example of a classification algorithm is the k-nearest-neighbor algorithm. Each instance is represented as a d-dimensional vector; the distance between the instance to classify and all previously classified instances is computed, and the majority class among the k nearest neighbors is assigned to the instance. Usually the distance metric used is the Euclidean distance shown in Equation (4.1). Problems can arise with this metric when the data are high-dimensional; this was, however, not found to be the case in this thesis.

$$\sqrt{\sum_i (a_i - b_i)^2} \qquad (4.1)$$
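A compact q sketch of the procedure is given below; the function names dist and knn are illustrative and not part of the thesis code. The distance function implements Equation (4.1), and ties in the majority vote are broken by first occurrence.

/ Euclidean distance between two numeric vectors, as in Equation (4.1)
dist:{sqrt sum e*e:x-y}
/ k-nearest-neighbor classification: X is a list of training vectors,
/ y their class labels, v the vector to classify and k the neighborhood size
knn:{[X;y;v;k]
  d:dist[;v] each X;                     / distance from v to every training instance
  lbl:y k#iasc d;                        / labels of the k nearest neighbors
  first idesc count each group lbl}      / majority class among them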

Data-clustering is used to partition instances into natural clusters (groups). The difference between clustering and classification is that in clustering there are no predefined classes; instances are grouped into clusters in order to discover which instances are in some way similar to one another.

An example of a clustering algorithm is the k-means clustering algorithm. A common procedure is an iterative approach in which k centroid values are initialized. The algorithm then assigns each instance to the cluster whose centroid is closest to it, after which the centroid of each cluster is recalculated as the mean of the positions of all instances belonging to that cluster. This is repeated until the assignments (hopefully) converge.
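The iteration can be sketched in q as follows. The names kmeans and nearest are illustrative; the sketch runs a fixed number of iterations and does not handle empty clusters or test for convergence.

/ squared Euclidean distance (the square root is irrelevant for comparisons)
dist2:{sum e*e:x-y}
/ index of the centroid in C closest to vector v
nearest:{[C;v] first iasc dist2[v] each C}
/ k-means: X is a list of instance vectors, C the initial centroids, n the iteration count
kmeans:{[X;C;n]
  do[n;
    a:nearest[C] each X;                                  / assignment step
    C:{[X;a;i] avg X where a=i}[X;a] each til count C];   / update step: mean of each cluster
  C}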

4.2.1 Data-clustering on Intra-Daily Volume

Given x instances, each consisting of n intervals of volume data, a vector can be formed for each instance directly from the volume values. The k-nearest-neighbor algorithm can then be applied to these vectors as described previously.
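Assuming the interval volumes are stored in a table such as the hypothetical dayvol (columns date, bucket and vol, one row per intraday interval, with the same number of intervals each day), the instance vectors can be extracted in one line:

/ one n-dimensional volume vector per trading day
instances:value exec vol by date from dayvol

Each element of instances can then be fed to the distance-based procedures sketched above.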

Three instances are shown in Figure 4.1. All of them show a clear intra-day trend of decreasing volume as time increases. Figure 4.2 shows three instances displaying a clear intra-day trend of increased volume at the open and at the close. This is a more realistic pattern and is referred to as a U-shaped pattern.


Figure 4.1: A decreasing volume pattern.

Figure 4.2: A U-shaped volume pattern.
