DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Applications of graph theory in the energy sector, demonstrated with feature selection in electricity price forecasting


Applications of graph theory in the energy sector, demonstrated with feature selection in electricity price forecasting

DUC TAM VU

Degree Projects in Optimization and Systems Theory (30 ECTS credits)
Master's Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, 2020

Supervisors at Fortum Sverige AB: Hans Bjerhag, Alexandra Bådenlid, Linda Marklund Ramstedt


TRITA-SCI-GRU 2020:219
MAT-E 2020:062

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)


Abstract

Graph theory is the mathematical study of objects and their pairwise relations, known as nodes and edges respectively. The birth of graph theory is often considered to have taken place in 1736, when the Swiss mathematician Leonhard Euler tried to solve a routing problem involving the seven bridges of Königsberg in Prussia. In more recent times, graph theory has caught the attention of companies from all types of industries due to its power of modelling and analysing exceptionally large networks.

This thesis investigates the usage of graph theory in the energy sector for a utility company, in particular Fortum, whose activities consist of, but are not limited to, the production and distribution of electricity and heat. The output of the thesis is a wide overview of graph-theoretic concepts and their practical applications, as well as a study of a use-case where some concepts are put into deeper analysis. The chosen use-case within the scope of this thesis is feature selection: a process for reducing the number of features, also known as input variables, typically performed before a regression model is built to avoid overfitting and increase model interpretability.

Five graph-based feature selection methods with different points of view are studied. Experiments are conducted on realistic data sets with many features to verify the legitimacy of the methods. One of the data sets is owned by Fortum and used for forecasting the electricity price, among other important quantities. The obtained results look promising according to several evaluation metrics and can be used by Fortum as a support tool to develop prediction models. In general, a utility company can likely take advantage of graph theory in many ways and add value to their business with enriched mathematical knowledge.


Sammanfattning

Graph theory is a mathematical field in which objects and their pairwise relations, called nodes and edges respectively, are studied. The birth of graph theory is often considered to have taken place in 1736, when the Swiss mathematician Leonhard Euler tried to solve a route-finding problem involving seven bridges of Königsberg in Prussia. In recent times, graph theory has attracted attention from companies in several industries due to its power to model and analyse substantially large networks.

This work investigates the use of graph theory in the energy sector for a utility company, more precisely Fortum, whose activities consist of, but are not limited to, the production and distribution of electricity and heat. The work results in a broad overview of graph-theoretic concepts and their practical applications, together with a case study in which some concepts are put into deeper analysis. The chosen case study within the scope of this work is feature selection: a process for reducing the number of input variables, usually carried out before a regression model is created in order to avoid overfitting and increase the interpretability of the model.

Five graph-based feature selection methods with different points of view are studied. Experiments are carried out on realistic data sets with many input variables to verify the validity of the methods. One of the data sets is owned by Fortum and is used to forecast the electricity price, among other important quantities. The obtained results look promising according to several evaluation measures and can be used by Fortum as a support tool for developing prediction models. In general, an energy company can likely benefit from graph theory in many ways and create value in its business with the help of enriched mathematical knowledge.


To my dear soulmate Flûte (啟玄元君) and all the marvellous 26 years I have got to spend with you


Declaration

This individual thesis is part of a project I conducted together with my student partner Kristofer Nils Harald Wannheden Espinosa, also known as Kristofer Espinosa, at Fortum Sverige AB in spring 2020.

Our studies both concern the applications of graph theory in the energy sector, but with different foci. The thesis of Kristofer focuses on the assessment of techno-economic aspects of graph theory in the energy business, while my thesis focuses on mathematical concepts and algorithms.


Förord

I would like to thank my supervisors at Fortum, Alexandra Bådenlid and Linda Marklund Ramstedt, as well as my thesis director and mentor Hans Bjerhag, for their constant support and commitment. Alexandra and Linda are two young, energetic ladies with innovative thoughts. Hans is a wise and magnanimous gentleman with exceptional falcon eyes, who managed to find linguistic slips in my early mathematical texts.

I am grateful for all the help from KTH that I have received from my supervisor Xiaoming Hu as well as my adoptive supervisor Elena Malakhatka. With good humour, solid academic experience and honourable sympathy, Xiaoming has made the degree project easier for me. Elena was the supervisor of my student partner Kristofer, but also took care of me along the way with benevolence and good fika, and guided me to the end of the project.

My gratitude also goes to my student partner Kristofer Espinosa, who has worked on this project and fought à travers le feu et l'eau together with me ever since the beginning. Without him, I would not have found the joys of working at an energy company. A warm thank-you goes to our mutual friend Adnan Jamil, who supported us with his expertise in computer engineering during the project.

A special place on this page I dedicate to three friends who have been of great importance to me during my studies 2018-2020. I thank Acernius (穎松公) in an oriental manner for your highly calming wave and warmth. However insignificant you think everything has been, you are my friend. A spectral thank-you goes to Gualterius (永莞公), who with absolute devotion gave me a throne where my heart finds peace. School is a weave of smoke; you are a cherry orchard. The last words I send to Halkyones (忠勇公), who has gone to the edge of time and brought my monochrome memories to life. E ad un tratto, I marvel that the world ends in sunshine.

Stockholm, May 2020
Tâm Vũ


Contents

Abstract ii
Sammanfattning iii
Dedication iii
Declaration v
Förord 0

1 Introduction 6
1.1 Background and purpose . . . 6
1.2 Disposition . . . 7
1.3 Miscellaneous preferences . . . 7

I Graph theory & applications 8

2 Elementary terminology 10
2.1 Graphs and subgraphs . . . 10
2.2 Graph traversal . . . 12
2.3 Trees and connectivity . . . 13
2.4 Matching . . . 15
2.5 Colouring . . . 15
2.6 Directed graph . . . 16
2.7 Weighted graph . . . 17

3 Selected applications of graphs 20
3.1 University timetabling . . . 20
3.2 Staff assignment . . . 21
3.3 Cost-effective railway building . . . 22
3.4 Logistic network and optimal routing . . . 23
3.5 Planar embedding of graphs . . . 24
3.6 Winter road maintenance . . . 25
3.7 Social network . . . 26
3.8 Tournament ranking system . . . 28
3.9 Image segmentation . . . 29

Summary 31

II Graphs, feature selection & electricity price forecasting 32

4 Introduction to feature selection 34
4.1 Background . . . 34
4.2 Purpose and outline . . . 35

5 Preliminaries 36
5.1 Laplacian matrix . . . 36
5.2 Nearest neighbour graph . . . 37
5.3 Graph clustering . . . 38
5.4 Comparison of two clusterings . . . 38
5.4.1 Clustering accuracy . . . 39
5.4.2 Normalised mutual information . . . 39
5.4.3 Adjusted mutual information . . . 40
5.5 Similarity comparison of two sets . . . 40

6 Feature selection methods 42
6.1 Laplacian score (LS) . . . 44
6.1.1 Preparation . . . 44
6.1.2 Optimisation problem . . . 44
6.1.3 Feature selection algorithm . . . 46
6.2 Multi-cluster feature selection (MCFS) . . . 47
6.2.1 Preparation . . . 47
6.2.2 Optimisation problem . . . 47
6.2.3 Feature selection algorithm . . . 48
6.3.1 Preparation . . . 49
6.3.2 Optimisation problem . . . 50
6.3.3 Feature selection algorithm . . . 52
6.4 Feature selection via non-negative spectral analysis and redundancy control (NSCR) . . . 53
6.4.1 Optimisation problem . . . 54
6.4.2 Feature selection algorithm . . . 55
6.5 Feature selection via adaptive similarity learning and subspace clustering (SCFS) . . . 57
6.5.1 Optimisation problem . . . 57
6.5.2 Feature selection algorithm . . . 57

7 Experiments and results 60
7.1 Line of action . . . 60
7.2 Data sets . . . 61
7.3 Parameter setting . . . 62
7.4 Experiment: Eight public data sets . . . 62
7.4.1 Clustering accuracy . . . 62
7.4.2 Normalised and adjusted mutual information . . . 63
7.4.3 Stability . . . 65
7.5 Experiment: CELEBI of Fortum . . . 66

8 Discussion 70
8.1 Convergence speed . . . 70
8.2 Parameter sensitivity . . . 72
8.3 Jaccard index . . . 75
8.4 Conceivable challenges . . . 76
8.5 Scalability . . . 77
8.6 Implications for forecasting activities . . . 77

9 Conclusion 78

Outro 80

Glossary 82


List of Figures

2.1 Examples of graphs and subgraphs. . . 11
2.2 Various kinds of walks in a graph. . . 13
2.3 Examples of graphs which are not trees. . . 14
2.4 Differently configured trees with 5 vertices and 4 edges each. . . 14
2.5 Examples of spanning trees in a connected graph. . . 14
2.6 Examples of matchings. . . 16
2.7 Examples of vertex colourings using as few colours as possible. . . 17
2.8 A weighted directed graph of a small fictitious kingdom far, far away. . . 18
3.1 The complete bipartite graph K3,3 is not planar. . . 25
3.2 The vertices in a graph can be most central in different aspects. . . 28
7.1 Jaccard index for eight different data sets. . . 66
7.2 Average Jaccard index. . . 67
8.1 Relative change of the objective value in iterative methods. . . 72
8.2 Clustering quality for different α and β, obtained with the

List of Tables

3.1 The centrality measures of the vertices in Figure 3.2. . . 28
6.1 Notations associated with a given data set. . . 43
7.1 Data sets with their numbers of samples (m) and features (n). . . 61
7.2 Clustering accuracy (ACC) [%] corresponding to different data sets and feature selection methods. . . 63
7.3 Normalised mutual information (NMI) [%] corresponding to different data sets and feature selection methods. . . 64
7.4 Adjusted mutual information (AMI) [%] corresponding to different data sets and feature selection methods. . . 64
7.5 Clustering quality measures for feature selection with SCFS

CHAPTER 1

Introduction

1.1 Background and purpose

Societies are currently undergoing a transition toward a low-carbon energy system. The main drivers of this change, electric utility companies, are digitalising and innovating on new products to stay or become more competitive. The technological advancements in the energy and information technology sectors provide new opportunities for utilities to optimise their operations or find new revenue streams. Graph theory appears as a potentially helpful technology to support such endeavours. It has been used in the contexts of the internet of things for routing, fraud detection, customer analysis, advanced search, scheduling and much more. Fortum wants to investigate which possibilities graph theory can bring to their business in the energy sector.

The purpose of this thesis is to find a set of fitting areas where graph technology and optimisation algorithms could create business value for Fortum. Particularly, this thesis aims to answer the following questions:

I. Where can graph theory be useful in the energy sector where Fortum is currently or potentially active?


1.2 Disposition

This paper comprises two parts.

Part I contains a wide overview of elementary concepts in graph theory and selected applications, supported by published research. The aim is to provide an inspiring mathematical background for practitioners needing mathematical foundations for implementing graph analytics applications. This part answers Question I stated in 1.1.

Part II focuses on a specific area relevant for Fortum to bring the general concepts into deeper analysis. In this thesis, the case study is about feature selection for electricity price forecasting using graph-based methods. This part answers Question II stated in 1.1.

1.3 Miscellaneous preferences

Some word formations in this paper are intentional and due to linguistic reasons and personal taste. For example, the expression “graph-theoretic concept” is preferred to “graph-theoretical concept”.

Considering how the last name of the Swiss mathematician Leonhard Euler is pronounced in German, the indefinite article “an” will be used in such expressions as “an Eulerian graph” instead of “a Eulerian graph” as it has appeared in several papers written in English.

Some verbs will be conjugated in praesens historicum, also known as historical present, instead of simple past when they describe the work done in connection with this thesis. For example, the formulation “an experiment is done” will be used instead of “an experiment was done” even though the mentioned experiment was in fact already done a long time before the publication of this thesis.

If a mathematical expression is written in display mode, commas and full stops will not be used after the mathematical expression for the sake of aesthetics. “Display mode” here means that a mathematical expression is written on a separate line and not as a part of a text.


I don't know what chapter I'm on. I only know where I am.

Timothée Chalamet

Part I: Graph theory & applications


CHAPTER 2

Elementary terminology

This chapter conveys a number of selected concepts in graph theory which are typically presented in introductory literature on graph theory. The definitions and properties are intentionally written in running text to create coherent paragraphs and relaxed reading. If no specific reference is mentioned, the mathematical definition, theorem, property, example or statement in question is common and can be found in such textbooks as [20], [21] and [25].

2.1 Graphs and subgraphs

A graph in graph theory is, simply put, a collection of vertices (synonym: nodes) and edges, where an edge can be drawn between two vertices to show that these vertices are somehow related to each other. In engineering problem solving, vertices can represent elements in a system, such as people, buildings, electric components, financial assets and data samples.


Mathematically speaking, a graph G is an ordered pair (V, E), where V is a finite set and E is a set of pairs of elements in V. The set V consists of the vertices and E of the edges in G. A graph H is called a subgraph of G if each vertex and each edge in H is also in G. In Figure 2.1 (a), the graph G = (V, E) has 7 vertices and 9 edges, where V = {a, b, c, d, e, f, g} and E = {{a, c}, {a, e}, {b, c}, {b, d}, {c, e}, {c, f}, {d, e}, {e, g}, {f, g}}.

The adjacency matrix of G, shown in Figure 2.1 (b), is

    0 0 1 0 1 0 0
    0 0 1 1 0 0 0
    1 1 0 0 1 1 0
    0 1 0 0 1 0 0
    1 0 1 1 0 0 1
    0 0 1 0 0 0 1
    0 0 0 0 1 1 0

Figure 2.1: Examples of graphs: (a) the graph G = (V, E); (b) the adjacency matrix of G; (c) a graph H = (V, E′); (d) a graph K = (V, E′′). Here H is a subgraph of G and K is another subgraph of G. The graph K is not a subgraph of H since the edge {b, c} in K is not in H.

Two vertices x, y ∈ V are called neighbours if there is an edge between them, or in other words if {x, y} ∈ E. In this case, one can even say that x and y are adjacent vertices and that each of x and y is incident to the edge {x, y}. Two edges are adjacent edges if they have a common vertex. The structure of a graph, regarding whether the n vertices v1, v2, · · · , vn are pairwise adjacent, can be summarised in an adjacency matrix A, where Aij = 1 if vi and vj are adjacent, and Aij = 0 otherwise.

A basic and important metric in a graph is degree (synonym: valency) of a vertex x, denoted deg(x), which is the number of edges incident to x. A graph where each vertex has degree k is called a k-regular graph. The maximum degree of a graph, denoted ∆(G), is the maximum degree of its vertices. In the same manner, the minimum degree of a graph, denoted δ(G), is the minimum degree of its vertices.
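These definitions can be made concrete with a small sketch in Python (plain data structures, no particular graph library assumed), encoding the graph G from Figure 2.1 and deriving its adjacency matrix and vertex degrees directly from the definitions above:

```python
# Graph G = (V, E) from Figure 2.1 (a), encoded with plain Python sets.
V = ["a", "b", "c", "d", "e", "f", "g"]
E = [{"a", "c"}, {"a", "e"}, {"b", "c"}, {"b", "d"}, {"c", "e"},
     {"c", "f"}, {"d", "e"}, {"e", "g"}, {"f", "g"}]

# Adjacency matrix: A[i][j] = 1 if the i-th and j-th vertices are adjacent.
A = [[1 if {x, y} in E else 0 for y in V] for x in V]

# Degree of a vertex = number of edges incident to it.
deg = {x: sum(x in e for e in E) for x in V}

print(deg["c"], sum(map(sum, A)))   # 4 18 -- each edge contributes 2 to A
```

Note how the sum of all matrix entries is twice the number of edges, since every undirected edge {x, y} appears both as Axy and as Ayx.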

2.2 Graph traversal

If a graph represents for example a network of buildings (vertices) and roads (edges), it is natural to define some kinds of walks through the vertices:

◦ A walk is a sequence of vertices v1v2 . . . vk, where v1, v2, · · · , vk ∈ V and {vi, vi+1} ∈ E for all i = 1, 2, · · · , k − 1. Not all vertices in V need to be included in a walk.

◦ A trail is a walk which does not go through any edge more than once.

◦ A circuit is a closed trail, meaning that it starts and ends at the same vertex.

◦ A path is a trail which does not go through any vertex more than once.

◦ A cycle is a closed path.

By these definitions, a cycle is a kind of path, which in turn is a kind of trail. Each path is also a trail, but not all trails are paths. Do note that these names are not standardised and different authors use them differently. In this paper, the terms will be consistently used as described above.
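As a sketch of how these definitions nest, the following Python function (a hypothetical helper, not from any library) classifies a vertex sequence as a walk, trail, circuit, path and/or cycle:

```python
# Classify a vertex sequence according to the definitions above.
# Edges are stored as frozensets so that {x, y} and {y, x} compare equal.
def classify(seq, edges):
    edges = {frozenset(e) for e in edges}
    steps = [frozenset({seq[i], seq[i + 1]}) for i in range(len(seq) - 1)]
    if any(s not in edges for s in steps):
        return None                        # not even a walk
    kinds = ["walk"]
    if len(steps) == len(set(steps)):      # no edge used twice: a trail
        kinds.append("trail")
        closed = seq[0] == seq[-1]
        if closed:
            kinds.append("circuit")
        inner = seq[:-1] if closed else seq
        if len(inner) == len(set(inner)):  # no vertex visited twice: a path
            kinds.append("path")
            if closed:
                kinds.append("cycle")
    return kinds

triangle = [("a", "b"), ("b", "c"), ("c", "a")]
print(classify(["a", "b", "c", "a"], triangle))
# ['walk', 'trail', 'circuit', 'path', 'cycle']
```

The sequence abca around a triangle is simultaneously a walk, a trail, a circuit, a path and a cycle, while aba is only a walk since the edge {a, b} is used twice.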

Figure 2.2: Various kinds of walks in a graph: (a) a graph G; (b) a trail in G; (c) a path in G; (d) a circuit in G; (e) a cycle in G. Note that the circuit in (d) is not a cycle since it passes the vertex b twice. The cycle in (e) is, however, also a circuit.

2.3 Trees and connectivity

A graph is said to be connected if there is at least one path from any vertex to any other vertex in the graph. A graph which is not connected is called non-connected or disconnected. An important type of connected graph is a tree, which is defined as a connected graph without cycles.

If a graph G = (V, E) is connected, it has at least one subgraph which is a tree containing all vertices in V. This tree is called a spanning tree¹ of G. A graph may have multiple spanning trees.

¹The term "spanning tree" has been translated to "spännande träd" in some textbooks.

Figure 2.3: Examples of graphs which are not trees: (a) the graph G is connected but is not a tree since it has a cycle; (b) the graph H has no cycles but is not a tree either since it is non-connected.

Figure 2.4: Differently configured trees with 5 vertices and 4 edges each.

Figure 2.5: Examples of spanning trees: (a) a connected graph G; (b) a spanning tree of G; (c) another spanning tree of G.


Two important metrics related to the notion of connectivity in a graph G = (V, E) are vertex connectivity and edge connectivity. The vertex connectivity of G, denoted κ(G), is the least number of vertices in V which need to be removed (together with their incident edges) to make G disconnected. The edge connectivity of G, denoted λ(G), is the least number of edges in E which need to be removed to make G disconnected. It can be shown that κ(G) ≤ λ(G) ≤ δ(G) in any graph. As explained in 2.1, the notation δ(G) stands for the minimum degree of G.
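Connectedness, and a spanning tree when one exists, can be checked with a breadth-first search; a minimal Python sketch (the function name is a hypothetical helper):

```python
from collections import deque

# Breadth-first search from an arbitrary vertex; the tree edges that the
# search discovers form a spanning tree whenever the graph is connected.
def bfs_spanning_tree(V, E):
    adj = {v: [] for v in V}
    for x, y in E:
        adj[x].append(y)
        adj[y].append(x)
    start = next(iter(V))
    seen, tree, queue = {start}, [], deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                tree.append((v, w))
                queue.append(w)
    return len(seen) == len(V), tree   # (connected?, tree edges)

connected, tree = bfs_spanning_tree("abcd", [("a", "b"), ("b", "c"),
                                             ("c", "a"), ("c", "d")])
print(connected, len(tree))   # True 3 -- a tree on 4 vertices has 3 edges
```

The search keeps exactly one discovery edge per new vertex, so the resulting tree has |V| − 1 edges and no cycles, matching the definition of a spanning tree.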

2.4 Matching

Consider a group of people where some of them are mutually in love with each other. If two people, who are already in love with each other, get married to each other, this can be considered a matching. Typically, each person can take part in at most one marriage (matching). In graph theory, a matching M in a graph G = (V, E) is defined as a subset of E, where the edges in M are pairwise disjoint, meaning that any two edges in M do not share a vertex.

If a matching contains as many vertices in V as possible, it is called a maximum matching. If a matching contains all vertices in V, it is called a perfect matching. Some graphs, however, do not have a perfect matching, which aligns with the reality where not all people can in the end get married to their dream partners.
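A minimal sketch of the disjointness requirement: greedily picking edges whose endpoints are still unmatched always yields a maximal matching (no further edge can be added), though not necessarily a maximum one:

```python
# Greedily pick any edge whose endpoints are both still unmatched.  The
# result is always a maximal matching (no further edge fits), though not
# necessarily a maximum matching.
def greedy_matching(edges):
    matched, matching = set(), []
    for x, y in edges:
        if x not in matched and y not in matched:
            matching.append((x, y))
            matched.update((x, y))
    return matching

print(greedy_matching([("a", "b"), ("b", "c"), ("c", "d")]))
# [('a', 'b'), ('c', 'd')] -- here the greedy matching is also maximum
```

Presenting the same edges in a different order, starting with {b, c}, would leave only the single edge {b, c} in the matching, illustrating why maximal and maximum are not the same notion.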

2.5 Colouring

Given a graph G = (V, E) drawn on a surface, one may sometimes want to put a colour on each vertex in V such that no two adjacent vertices get the same colour. More formally speaking, a proper vertex colouring for G is a function c : V → N such that if {x, y} ∈ E, then c(x) ≠ c(y). If nothing else is stated, a vertex colouring should always be interpreted as a proper vertex colouring.

The least number of colours required to vertex colour a graph G is called the chromatic number (synonym: vertex colouring number) of G, denoted χ(G) or simply χ if there is no ambiguity. The Greek letter χ (chi) has to do with the fact that the Greek word for "colour" is "χρώμα" (chróma).

Figure 2.6: Examples of matchings in a graph G, where the edges in each matching are marked red: (a) G plainly drawn; (b) G with a matching M1; (c) G with another matching M2; (d) G with another matching M3. The matchings M1 and M2 are not maximum. The matching M3 is both maximum and perfect.

In the same manner, a proper edge colouring is a colouring of edges where no two edges which share a vertex get the same colour. The least number of colours required to edge colour a graph G is called the chromatic index (synonym: edge colouring number) of G, denoted χ′(G) or simply χ′.
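A simple greedy heuristic illustrates proper vertex colouring; note that it only guarantees at most Δ(G) + 1 colours, which is an upper bound on χ(G), not the chromatic number itself:

```python
# Greedy proper vertex colouring: give each vertex the smallest colour not
# used by any already-coloured neighbour.  It uses at most Delta(G) + 1
# colours, which is only an upper bound on the chromatic number chi(G).
def greedy_colouring(V, E):
    adj = {v: set() for v in V}
    for x, y in E:
        adj[x].add(y)
        adj[y].add(x)
    colour = {}
    for v in V:
        used = {colour[w] for w in adj[v] if w in colour}
        colour[v] = next(c for c in range(len(V)) if c not in used)
    return colour

# A triangle has chromatic number 3: all three vertices get distinct colours.
print(greedy_colouring("abc", [("a", "b"), ("b", "c"), ("a", "c")]))
```

Computing χ(G) exactly is NP-hard in general, which is why such heuristics are common in practice.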

2.6 Directed graph

Figure 2.7: Examples of vertex colourings using as few colours as possible: (a) graph G with χ(G) = 3; (b) graph H with χ(H) = 4. The graphs have the same number of vertices and the same number of edges, but different chromatic numbers.

In a graph modelling friendships between people, an edge does not need a direction if one considers the friendship as symmetric. Here, "symmetric" means that if x is a friend of y, then y is also a friend of x.

Nevertheless, not all relationships need to be symmetric. Imagine two vertices representing two objects x and y in a power grid, where x is a generating plant and y is a household. If x delivers electricity to y, one can choose to describe this flow with an arrow from x to y, implying at the same time that there is no electricity sent from y to x. This arrow is called a directed edge, alternatively an arc.

A graph where each edge is directed is called a directed graph. The contraction digraph is also used for convenience.

2.7 Weighted graph

Figure 2.8: A weighted directed graph showing the number of people (red edge labels) travelling between some cities (round vertices) at some point of time.


CHAPTER 3

Selected applications of graphs

The applications described in this chapter are selected from a wide range of areas where graph theory has proven to be an effective tool to model and solve problems. Each application begins with a couple of keywords, is followed by an example which can be adapted to the activities of Fortum as a utility company, and ends with some suggestions for adaptation possibilities.

3.1 University timetabling

Keywords: vertex colouring, chromatic number

An administrator needs to book classrooms for all the students having exams in a day. There are m student groups g1, g2, · · · , gm and n available classrooms r1, r2, · · · , rn. Not all groups start or finish at the same time point, and the exam times for some groups may overlap partially or completely. It is also required that at most one group can be in a classroom at any time. How should the administrator book as few classrooms as possible? A motivation can be that the fewer rooms booked, the less fee paid to cleaning companies at the end of the day.

This situation can be modelled with a graph, where the vertex set is V = {g1, g2, · · · , gm}. An undirected edge is drawn between two vertices if the exam times of the two corresponding groups overlap. The administrator's task then becomes to vertex colour this graph with as few colours as possible, where a classroom is considered as a colour and two adjacent vertices in a proper vertex colouring cannot have the same colour [4]. The least number of necessary classrooms is equal to the chromatic number of the graph.

In reality, there may be more parameters to take into consideration. Some classrooms may have too few seats for some student groups, and some classrooms may have different booking prices even if they have the same number of seats. The aforementioned graph model would require more parameters to solve the problem.
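The model above can be sketched in Python with hypothetical exam times (all names and hours below are invented for illustration): overlapping exams become edges, and a greedy colouring assigns classrooms as colours:

```python
# Hypothetical exam times (start hour, end hour) for five student groups.
exams = {"g1": (8, 10), "g2": (9, 11), "g3": (10, 12),
         "g4": (8, 12), "g5": (12, 14)}

def overlaps(a, b):
    # Two half-open intervals [s, e) overlap iff each starts before the
    # other one ends.
    return a[0] < b[1] and b[0] < a[1]

groups = list(exams)
edges = [(g, h) for i, g in enumerate(groups) for h in groups[i + 1:]
         if overlaps(exams[g], exams[h])]

# Greedy vertex colouring; each colour corresponds to one classroom.
room = {}
for g in groups:
    taken = {room[h] for h in room
             if (g, h) in edges or (h, g) in edges}
    room[g] = next(r for r in range(len(groups)) if r not in taken)

print(max(room.values()) + 1)   # 3 classrooms suffice for these times
```

With these hypothetical times, three classrooms suffice, and three are indeed necessary since g1, g2 and g4 overlap pairwise and form a clique in the graph.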

Adaptation: A utility company with many employees, such as Fortum, can use graph colouring to solve timetabling tasks effectively. Power systems, such as hydropower and electric power systems, can also be monitored using graph colouring methods. For example, the connectivity information [29] and the maintenance scheduling [42] can be formulated as a vertex colouring of the system elements and transmission lines, respectively.

3.2 Staff assignment

Keywords: perfect matching

A company has n workers w1, w2, · · · , wn and n tasks t1, t2, · · · , tn which need to be done. The goal is to assign each worker to exactly one task, where each worker is qualified for at least one task. Is it possible to come up with such an assignment?

This is a brilliant opportunity to make use of matching in a graph. Build a graph with 2n vertices representing all the workers and the tasks, meaning that the vertex set is V = {w1, w2, · · · , wn} ∪ {t1, t2, · · · , tn}. An edge in this graph goes between a worker wi and a task tj, but not between two workers or two tasks, if wi is qualified for tj. The staff assignment problem is solved when a perfect matching is found.
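A compact way to search for such a perfect matching is Kuhn's augmenting-path algorithm; a Python sketch with a hypothetical qualification table (workers and tasks below are invented):

```python
# Kuhn's augmenting-path algorithm for bipartite matching: worker w may be
# assigned task t only if w is qualified for t.  A perfect matching exists
# exactly when every worker ends up assigned.
def assign(workers, qualified):
    match = {}                            # task -> worker

    def try_assign(w, seen):
        for t in qualified[w]:
            if t not in seen:
                seen.add(t)
                # t is free, or its current worker can be moved elsewhere
                if t not in match or try_assign(match[t], seen):
                    match[t] = w
                    return True
        return False

    return all(try_assign(w, set()) for w in workers)

qualified = {"w1": ["t1", "t2"], "w2": ["t1"], "w3": ["t2", "t3"]}
print(assign(["w1", "w2", "w3"], qualified))   # True: a perfect matching exists
```

The recursion reassigns already-matched workers along an augmenting path; if even that fails for some worker, no perfect matching exists (in line with Hall's theorem).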


Adaptation: Since a smart grid involves distributed energy sources and bidirectional energy flows, among other things, its data are highly vulnerable, making the power system exposed to security risks [12]. It is interesting to note that graph matching is not the only approach to detect anomalies in smart grid problems. For instance, normative subgraphs can also be considered [16].

3.3 Cost-effective railway building

Keywords: weighted graph, minimum spanning tree

An engineer has got a mission to design a railroad network connecting the n neighbourhoods h1, h2, · · · , hn, where an underground station is located at each neighbourhood. The cost of building a railroad between every pair of stations {si, sj} is given as a positive number cij = cji. Help the engineer design a railroad network which costs as little as possible.

If h1, h2, · · · , hn are the vertices in a graph G, one can draw all the conceivable edges to visualise all the potential costs of construction between two stations. Each edge is a weighted undirected edge with weight cij > 0, where i, j ∈ {1, 2, · · · , n}. Since the railroad network needs to connect all the neighbourhoods, one wants to determine a spanning tree in G with the minimum total weight. In other words, one wants to find a minimum spanning tree in the graph.

A good question here is why one would look for a spanning tree, which is a connected spanning subgraph of G with no cycles. The "connected spanning" part is about connecting all neighbourhoods, but why "no cycles"? The reason is simple. If one has a subgraph which goes through all the vertices in G and contains a cycle, one or some edges in the cycle can always be removed to create a new subgraph which still goes through all the vertices in G but has a lower total weight, since each individual edge weight is positive. This explains why a spanning tree is better than any other connected spanning subgraph with cycles.
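The minimum spanning tree itself can be computed with, for example, Kruskal's algorithm; a Python sketch with hypothetical costs (the neighbourhood names and numbers below are invented):

```python
# Kruskal's algorithm: scan edges by increasing weight and keep an edge
# whenever it joins two different components (union-find detects cycles).
def kruskal(V, weighted_edges):
    parent = {v: v for v in V}

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    tree, total = [], 0
    for w, x, y in sorted(weighted_edges):
        rx, ry = find(x), find(y)
        if rx != ry:                 # keeping this edge creates no cycle
            parent[rx] = ry
            tree.append((x, y))
            total += w
    return tree, total

# Hypothetical building costs c_ij between four neighbourhoods.
costs = [(4, "h1", "h2"), (1, "h1", "h3"), (3, "h2", "h3"), (2, "h3", "h4")]
tree, total = kruskal(["h1", "h2", "h3", "h4"], costs)
print(total)   # 6 = 1 + 2 + 3
```

Skipping any edge that would close a cycle is exactly the argument made above: such an edge can never lower the total cost.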

This kind of cost-effective railway network may, however, not be time-effective for the residents in some neighbourhoods.


Adaptation: The minimum spanning tree is a widely applicable concept, since it is essentially about reduction in networks to minimise cost, decrease complexity, extract the most important links and so on. In a power distribution network with numerous switches, power flow paths can be more easily monitored using minimum spanning trees, combined with Kruskal's algorithm to identify them [33]. When studying financial markets, such as stock markets [40] and energy commodity markets [41], a minimum spanning tree can be a simple, yet powerful, tool to uncover the characteristics and dynamics of financial instruments and institutions.

3.4 Logistic network and optimal routing

Keywords: shortest path

One of the most typical and important questions each logistic network faces is to find the shortest path between two stations, which can be train stations, petrol filling stations or pick-up and delivery stations. Take an undirected weighted graph G = (V, E), where V is the set of all stations and each edge between two vertices represents a physical road between two stations. There can be multiple edges between two vertices in this case, and an edge weight measures how far two stations are from each other.

Three elementary problems about shortest path in a logistic network are as follows:

◦ The single-pair shortest path problem: find the shortest path between two given vertices x, y ∈ V .

◦ The single-source shortest path problem: find the shortest path from a given source x ∈ V to every other vertex in V.

◦ The all-pairs shortest path problem: find the shortest path between each pair of vertices in V .
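The single-source problem is classically solved with Dijkstra's algorithm when all edge weights are non-negative; a Python sketch with a hypothetical station network (the station names and distances are invented):

```python
import heapq

# Dijkstra's algorithm for the single-source problem on a graph with
# non-negative edge weights (adjacency-list representation).
def dijkstra(adj, source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue                      # stale queue entry, skip it
        for w, weight in adj[v]:
            nd = d + weight
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return dist

# Hypothetical station network; weights are road lengths.
adj = {"s1": [("s2", 7), ("s3", 2)],
       "s2": [("s1", 7), ("s3", 3), ("s4", 1)],
       "s3": [("s1", 2), ("s2", 3), ("s4", 8)],
       "s4": [("s2", 1), ("s3", 8)]}
print(dijkstra(adj, "s1")["s4"])   # 6, via s1 -> s3 -> s2 -> s4
```

Running the function from every vertex answers the all-pairs problem as well, although dedicated algorithms such as Floyd-Warshall exist for that case.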


If the graph is known to have almost no cycles, Takaoka's algorithm can be used to accelerate the search [38].

Adaptation: Fortum has a business unit called Charge & Drive which manages charging infrastructure for electric vehicles [26]. How to guide vehicles to charging stations is a relevant and non-trivial optimisation problem. Not only the geographic distance but also the driving time, waiting time, charging time, battery longevity et cetera need to be considered. Logistic graphs and shortest path algorithms can come into the picture and offer efficient solutions [37].

Despite not being about optimal routing, another related problem about charging infrastructure is siting and sizing of charging stations. A seemingly natural choice for reasonable placements is the vicinity of fast food stores or petrol stations. However, more insights and motivated decisions to minimise costs for both the company and the users can be gained with a mathematical study using graph-based models [2, 30].

3.5 Planar embedding of graphs

Keywords: planar graph

In design problems for radioelectronic circuits, utility lines and underground passageways, one may find it important to come up with a graph representation whose edges do not intersect each other, or intersect each other as little as possible. Two intersecting edges in a circuit may for example lead to destructive interference. If some underground stations are placed at disadvantageous locations, one may need to build their connecting tunnels on different levels, which causes a higher construction expense.

There are theorems in graph theory which reveal in advance whether a graph is planar or not, meaning that it can be drawn (synonym: embedded) without intersecting edges. Loosely speaking, if a graph has in some way too many edges, there is a high probability that the graph is non-planar. A famous example is the utility graph K3,3, which is a complete bipartite graph G = (V, E) = ({V1, V2}, E), where each of the vertex sets V1 and V2 consists of three vertices and there is an edge connecting each vertex in V1 with each vertex in V2.


Figure 3.1: The complete bipartite graph K3,3 is not planar. A real-life interpretation is as follows: Given three households V1 = {a, b, c} and three utility sources V2 = {E, G, W} (electricity, gas, water) on a plane ground, draw a line connecting each household with each utility source (nine lines in total). No matter how one places the households or the utility sources and bends the lines, there will always be at least two lines crossing each other.

Adaptation: Planar graphs have appeared in several applications involving the design of networks, namely both physical networks and communication networks, and assisting generators for creating planar embeddings have been of interest [27, 31]. However, there does not seem to be much research about the design of such networks that is relevant for Fortum.

On the other hand, as pointed out in [19, 28], knowledge about planar graphs and their properties can help in developing customised problem-solving algorithms which are highly suitable for planar graphs but not for graphs in general. In that case, graph planarity can be considered a valuable supporting concept for further studying network models and developing algorithms, rather than a way to solve drawing problems.

3.6 Winter road maintenance

Keywords: graph traversal, Eulerian trail

A technician, who happens to be a prominent graph theoretician, is asked to design a route plan for a snowplough to clear all the local streets on a winter day. The task is to find an optimal route where the snowplough visits every street in a district, such that the total travel distance is as short as possible.


The district can be modelled as a graph where every street corresponds to an edge with a weight representing the geographic length of the corresponding street. Under the assumption that the snowplough is allowed to go both ways along a street, the edges are undirected. The original task is now equivalent to finding a walk in this graph such that every edge is visited and the total edge weight is minimised.

The fact is that if the street graph is connected and at most two vertices have odd degree, there exists an Eulerian trail which passes every edge exactly once, and we have our solution. Otherwise, the snowplough needs to visit some edges more than once if all the streets are to be traversed. It turns out that it is then possible to come up with a shortest trail which passes each edge once or at most twice [11].
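The degree condition above can be checked directly from an edge list. A minimal Python sketch, assuming the street graph is connected (the edge lists in the usage example are made up):

```python
from collections import defaultdict

def has_eulerian_trail(edges):
    """A connected undirected graph has an Eulerian trail, i.e. a walk that
    passes every edge exactly once, iff at most two vertices have odd degree."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    odd_vertices = sum(1 for d in degree.values() if d % 2 == 1)
    return odd_vertices in (0, 2)

# a square block with one diagonal street: two odd-degree vertices, trail exists
square_with_diagonal = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
# the complete graph K4: four odd-degree vertices, no Eulerian trail
k4 = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
```

In the second case the snowplough is forced to traverse some streets twice.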

Adaptation: The related concept of a Hamiltonian cycle, a walk visiting every vertex exactly once before returning to its starting vertex, can be useful in the monitoring of smart grids, or distribution networks in general. State estimation is significant for sustaining distribution management systems. In [13], a calculation scheme based on Hamiltonian cycles is proposed to quickly obtain the network states.

From a wider perspective, traversing in graphs can be a powerful technique for searching in graph databases where data is represented and stored as a graph structure [17], if Fortum wishes to model their data sets this way. To give a concrete example, graph traversals can be used to perform a recovery sequence in automated fault management for medium voltage direct current shipboard power systems [22].

3.7 Social network

Keywords: vertex degree, centrality

The degree of a vertex in a personal network can be directly interpreted as how many relationships a person x has, and to some extent the popularity of that person. A simple graph model consists of vertices corresponding to people and an undirected, unweighted edge between two vertices means that those people know each other. The higher degree a vertex has, or in other words the higher the degree centrality of a vertex is, the more people x has a relationship with.


may be weak and x may not have much influence on others. A politician may in this case be interested in another useful graph metric called closeness centrality, which is defined as the reciprocal of the distance sum Σ_{y∈V, y≠x} dist(x, y). Here, dist(x, y) denotes the shortest distance between two people x and y, if one assigns each edge in the graph a weight which indicates the relationship distance between two people, meaning how well they know each other.

In general, it is often desirable to detect which vertices have the most important role in a graph, or in other words which vertex is the most central. There are different ways to mathematically define what "most central" means, depending on the real-life interpretation. Besides degree centrality and closeness centrality, some other useful types of centrality are betweenness centrality and eigenvector centrality. A famous variant of eigenvector centrality is PageRank centrality 1, used by Google for ranking web pages in their search engine [35]. A small but illustrative example is given in Figure 3.2 and Table 3.1. Note that the most central vertices are different depending on what centrality measure is used.
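As an illustration, a minimal Python sketch of closeness centrality in an unweighted graph, where every edge counts as distance 1 and distances are found with breadth-first search (the adjacency list in the usage example is made up):

```python
from collections import deque

def closeness_centrality(adj, x):
    """Closeness centrality of vertex x: the reciprocal of the sum of
    shortest-path distances from x to every other vertex (unweighted graph)."""
    dist = {x: 0}
    queue = deque([x])
    while queue:                       # breadth-first search from x
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(d for v, d in dist.items() if v != x)
    return 1.0 / total if total > 0 else 0.0

# a small path graph a - b - c: the middle vertex b is the most central
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
```

In the thesis formulation the edges carry relationship-distance weights; here all weights are 1 for simplicity.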

Adaptation: Developers working with strategies involving customers can use graph-based social networks, with centrality among other concepts, as a powerful decision support tool. A company can gain more customers and profits by for example identifying influential customers and creating suitable marketing campaigns [23]. Some interesting ideas for analysing consumer behaviours and purchasing patterns are proposed by [10]. A graph-based approach for valuing the explicit financial worth of paying customers as well as the implicit social value of non-paying customers is suggested by [44]. An application of centrality measures worth mentioning for a utility company is management of power systems. A plan for efficient placement of security sensors in a large network can be achieved with the notion of graph centrality [14]. Hydraulic centrality metrics can be derived from the conventional centrality metrics to extract useful information of a hydropower plant or a water distribution network [8].

1 The word "PageRank", spelled with a capital P, a capital R and with no space, was named after Google's co-founder Larry Page.

Figure 3.2: The vertices in a graph can be most central in different aspects.

Centrality   |  a   |  b   |  c   |  d   |  e   |  f   |  g   | Most central vertices
Degree       |  2   |  3   |  1   |  1   |  3   |  2   |  2   | b, e
Closeness    | 0.10 | 0.09 | 0.06 | 0.06 | 0.09 | 0.07 | 0.07 | a
Betweenness  |  9   |  9   |  0   |  0   |  8   |  0   |  0   | a, b
PageRank     | 0.14 | 0.23 | 0.09 | 0.09 | 0.19 | 0.13 | 0.13 | b

Table 3.1: The centrality measures of the vertices in Figure 3.2.

3.8 Tournament ranking system

Keywords: directed graph, Hamilton path, spectral ranking

A judge group has received a list of n participants (players) p1, p2, …, pn in a tournament, where one participant competes against another at a time, together with the results of the matches (win, lose, draw). Assume that each player competes in exactly n − 1 matches, one against each of the other n − 1 players. The judges need to suggest a ranking over all the participants, from the best to the least good one.

The results can easily be visualised with a directed graph with n vertices p1, p2, …, pn. If player pi wins against player pj, one draws a directed edge from vertex pi to vertex pj, and no directed edge from pj to pi. If the match concludes in a draw, two directed edges between pi and pj are drawn: one from pi to pj and one from pj to pi. This way, if a directed Hamilton path exists, a ranking is obtained by assigning the best position to the first vertex in the directed Hamilton path.

The hard part is that a tournament graph may not have a unique directed Hamilton path, meaning that this approach can give contradictory rankings where a player pi is considered better than pj in one Hamilton path but worse than pj in another Hamilton path [36]. A more reasonable ranking procedure is spectral ranking, which applies eigenvalues and eigenvectors from linear algebra to the adjacency matrix of the tournament graph [39].

Adaptation: A similar directed graph model can work as a general-purpose decision support tool to rank a number of choices from the most to the least favourable [7, 34]. In data mining, spectral ranking can also be useful for anomaly detection. The main advantage of an anomaly ranking compared to a binary classification is that even a measure of relative abnormality is provided, which forms a basis for cost-benefit analysis [43].
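A minimal NumPy sketch of spectral ranking, approximating the dominant eigenvector of the tournament's adjacency matrix with power iteration; intuitively, a player scores highly by beating players who themselves score highly (the three-player tournament in the example is made up):

```python
import numpy as np

def spectral_ranking(A, iters=200):
    """Rank players by the dominant eigenvector of the adjacency matrix A,
    where A[i, j] = 1 if pi beat pj (draws give edges in both directions)."""
    v = np.ones(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):               # power iteration
        w = A @ v
        norm = np.linalg.norm(w)
        if norm == 0:                    # degenerate (nilpotent) tournament matrix
            break
        v = w / norm
    return v

# hypothetical tournament: p0 beats p1 and p2, while p1 and p2 draw
A = np.array([[0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
scores = spectral_ranking(A)             # p0 ranks first; p1 and p2 tie
```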

3.9 Image segmentation

Keywords: graph partitioning, minimum cut

Image segmentation is a part of digital image processing where an image is divided into segments, or in other words pixel clusters, which often depict essentially different entities in the image. When contemplating a classical fruity still-life painting, our eyes can easily segment the whole painting of hundreds of shades (or thousands of pixels) into three main regions: the fruit bowl, the background and the tablecloth. Other viewers may sometimes only want to study a specific interesting part of the whole image for professional identification purposes: sharp objects (airport security control), human faces (criminal recognition), breast cancer (mammography), vehicle registration plates (traffic control), letters (optical character recognition) or rare birds (birdwatching).


A graph-based approach to image segmentation can make use of so-called graph cuts [24]. Given a graph G = (V, E), a cut C = (S1, S2) is a partition of V into two disjoint subsets S1 and S2, i.e. S1 ∪ S2 = V and S1 ∩ S2 = ∅. In a weighted graph, a minimum cut, or simply min-cut, is defined as a cut such that the sum of the weights of the edges between the subsets S1 and S2 is minimised. Such a min-cut can be used to segment an image into two essentially different regions, and the process can be repeated to obtain more segments if desired. Although the idea is simple, the optimisation problem of finding a min-cut can turn out to be computationally intractable, meaning that it can be solved in theory but would take too much time (or other resources) in practice to be cost-effective.
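To make the definition concrete, the sketch below enumerates every bipartition of a small weighted graph and returns the cheapest one. The example graph is made up, and the enumeration is exponential in the number of vertices, which illustrates why practical min-cut computations rely on smarter algorithms (e.g. max-flow based ones):

```python
from itertools import combinations

def min_cut(vertices, weights):
    """Brute-force minimum cut: try every proper bipartition (S1, S2) of the
    vertex set and sum the weights of the edges crossing between S1 and S2."""
    best_cut, best_weight = None, float("inf")
    vs = list(vertices)
    rest = vs[1:]                       # fix vs[0] in S1 to skip mirror partitions
    for r in range(len(rest) + 1):
        for subset in combinations(rest, r):
            s1 = {vs[0], *subset}
            s2 = set(vs) - s1
            if not s2:                  # S2 must be non-empty
                continue
            w = sum(wt for (u, v), wt in weights.items()
                    if (u in s1) != (v in s1))
            if w < best_weight:
                best_cut, best_weight = (s1, s2), w
    return best_cut, best_weight
```

On two triangles joined by a single light edge, the min-cut separates the triangles.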

Adaptation: Graph segmentation using minimum cuts can be useful in electric power networks. A power network experiencing high loads or various types of interruptions can suffer serious cascading failures. A preventive solution is to partition the network into smaller networks called islands, where the imbalance between generation and load in each island is to be minimised [3, 32]. On the other hand, a security index problem about false data injection attacks can also be formulated as a graph partitioning problem, providing an efficient tool for security analysis of power transmission networks [15].


Summary

Graph theory is a part of mathematical combinatorics with many interesting concepts and applications highly connected to reality in general and energy technology in particular. With proper investment in exploring this discipline, a utility company can likely gain business value from profound insights into its operations. Whether it concerns energy production, database management or administrative decision making, graph theory can find a way and provide novel perspectives to model and assess possibilities. It is important to understand that graphs in the area of graph theory are not merely delightful visual representations. A graph is a mathematical combinatorial structure for describing pairwise relations and solving engineering problems effectively with specialised methods.


You always have a choice.

Harvey Specter

Part II:


CHAPTER 4

Introduction to feature selection

4.1 Background

Large-scale data sets with many variables, also called features, come with both rich information and challenges in engineering modelling problems. Not only do high-dimensional data cause long computational times and memory errors for computing systems, they also lead to a problem called overfitting. This means that the mathematical model is too closely fitted to the many variables, some of which are noisy, inaccurate or irrelevant [57]. A consequence is that the model becomes too complicated to analyse and generalise. It is therefore suitable to reduce the number of features before the modelling phase, for example by removing the redundant features and keeping the most representative ones.


Although feature selection can be done by empirical expert knowledge in the respective fields, this approach can be loaded with time-consuming manual work and uncertainty when discriminating the importance of different features. Various methods for automatic feature selection have been proposed to solve this challenge. Some typical methods in traditional regression analysis are stepwise regression, Akaike information criterion and principal component analysis [57]. In information theory, features can be selected using informative fragments [65], conditional infomax learning [48] and fast correlation-based filter [54], among many other methods.

In the task of feature selection, a huge contribution of graphs is to capture the inherent geometric structure of the data, also known as the data manifold [64]. In many real-life applications, high-dimensional data points are actually samples from a low-dimensional manifold which is a subspace of a high-dimensional space [63]. In other words, it is usually assumed that a data point given by hundreds of features can be described by far fewer features while preserving the structure of the original data manifold. This is the intuition behind several graph-based methods for feature selection. A feature is considered good if it in some sense respects the data manifold.

4.2 Purpose and outline

In this thesis, five graph-based feature selection methods, proposed in earlier research, are studied to understand how a feature selection problem can be formulated and solved as an optimisation problem. The methods start with constructing a graph where each vertex represents a data sample and the edge weights show the similarities between the samples. Criteria for good features are then formulated as a minimisation problem which ultimately results in a ranking from which the top features are selected.


CHAPTER 5

Preliminaries

In this chapter, some mathematical notions recurrently used in the work of feature selection are briefly presented. How these notions are related to selecting features and assessing the feature selections will be described in detail in Chapter 6 and Chapter 7.

5.1 Laplacian matrix

Given an undirected graph G = (V, E, W) with m vertices v1, …, vm, the degree matrix D of G is the m × m-matrix defined as

Dij = deg(vi) if i = j, and Dij = 0 if i ≠ j

where deg(vi) is the degree of the vertex vi. In other words, D = diag(deg(v1), …, deg(vm)), where diag(v) denotes the diagonal m × m-matrix with the elements of v ∈ Rm on the main diagonal. Recall that the adjacency matrix A of G is the m × m-matrix defined as

Aij = 1 if the vertices vi and vj are adjacent, and Aij = 0 otherwise.


Then, the unweighted Laplacian matrix L of G is the m × m-matrix defined according to

L = D − A

If the edge weights given by the weight matrix W are considered, the formula above is modified to

L = diag(W1) − W

where 1 = [1 1 ⋯ 1]ᵀ ∈ Rm.

By construction, L is a symmetric matrix. It has also been shown that L is positive semidefinite and therefore has real, non-negative eigenvalues. In spectral graph theory, the spectrum of L can be used to understand the underlying data structure of a graph, for example its connectivity [62] and cluster structure [47].
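The construction above can be written in a few lines of NumPy; a minimal sketch with a small path graph, which also demonstrates the zero row sums and the non-negative spectrum:

```python
import numpy as np

def laplacian(W):
    """Weighted Laplacian L = diag(W 1) - W of an undirected graph with a
    symmetric, non-negative weight matrix W."""
    return np.diag(W.sum(axis=1)) - W

# a small path graph v1 - v2 - v3 with edge weights 1 and 2
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])
L = laplacian(W)

# L is symmetric, its rows sum to zero, and it is positive semidefinite
eigenvalues = np.linalg.eigvalsh(L)
```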

5.2 Nearest neighbour graph

Assume a graph with m vertices representing the points x1, …, xm ∈ Rn, where some edges between the vertices need to be defined. In applications where one wishes to link a vertex to its most similar vertices, given some measure for similarity, it is reasonable to connect two vertices vi and vj with an edge if the corresponding points xi and xj are close to each other according to some distance metric. A common metric is the Euclidean distance

‖xi − xj‖ = √( (xi1 − xj1)² + (xi2 − xj2)² + ⋯ + (xin − xjn)² )

There are two versions of defining edges, as suggested by [59]:

1. Let vi and vj be connected if the distance between xi and xj is small enough, meaning if ‖xi − xj‖ < ε for some chosen threshold ε > 0. Although this version seems intuitive, it is difficult to choose ε for different sets of points and the resulting graph is often non-connected.

2. Let vi and vj be connected if vi is among the p nearest neighbours of vj or if vj is among the p nearest neighbours of vi. Although this version is less intuitive to parametrise, the resulting graph is almost always connected. In the rest of this paper, this kind of graph will be called the p-nearest neighbour graph 2.
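A minimal NumPy sketch of version 2, using the "or" rule to symmetrise the neighbourships (the three one-dimensional points in the usage example are made up):

```python
import numpy as np

def p_nearest_neighbour_graph(X, p):
    """Adjacency matrix of the p-nearest neighbour graph: vi and vj are
    connected if vi is among the p nearest neighbours of vj or vice versa
    (Euclidean distance)."""
    m = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)           # a point is not its own neighbour
    A = np.zeros((m, m), dtype=int)
    nn = np.argsort(d, axis=1)[:, :p]     # p nearest neighbours of each point
    for i in range(m):
        A[i, nn[i]] = 1
    return np.maximum(A, A.T)             # symmetrise: the "or" rule

# three points on a line: 0 and 1 are close, 10 is far away
A = p_nearest_neighbour_graph(np.array([[0.0], [1.0], [10.0]]), p=1)
```

Note how the "or" rule connects the far-away point to its nearest neighbour even though the feeling is not mutual, which is what tends to keep the graph connected.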

5.3 Graph clustering

In some applications in data analysis, the m vertices in a graph G representing m data vectors need to be grouped into κ disjoint clusters C1, …, Cκ. It is desirable that the vertices assigned to the same cluster are more similar to one another than to the vertices assigned to other clusters, given some measure of similarity.

A clustering of a graph can be described with a cluster label vector y ∈ Rm such that yi = Cj if the vertex vi is assigned to the cluster Cj, for i = 1, …, m and j = 1, …, κ. A clustering can alternatively be described with a cluster indicator matrix Y ∈ {0, 1}m×κ such that

Yij = 1 if vertex vi is assigned to cluster Cj, and Yij = 0 otherwise

There are different ways to perform a clustering on a set of vertices without necessarily considering what the corresponding data stand for in real life. Some common methods are k-means clustering 3, mean shift clustering and expectation maximisation using Gaussian mixture models [1, 9].

5.4 Comparison of two clusterings

A typical situation where two clusterings of the same vertex set with m vertices need comparing is when one clustering is manually defined and another is computed automatically by an algorithm. Two commonly used comparison metrics are clustering accuracy (ACC) and normalised mutual information (NMI). In this paper, a third metric called adjusted mutual information (AMI) is also included. The range of all three metrics is [0, 1]. In general, the higher ACC, NMI and AMI are, the higher the clustering quality.

2That is to say not “p-nearest neighbours graph”.

3 The k in "k-means clustering" is an inherent part of the name and is not the same as the number k of selected features used elsewhere in this thesis.


5.4.1 Clustering accuracy

Assume two cluster label vectors y and ỹ describing two clusterings C and C̃ respectively. The corresponding clustering accuracy is defined as

ACC(C, C̃) = (1/m) Σ_{i=1}^{m} δ(ỹi, Map(yi))

where Map is the permutation mapping function which maps each cluster label yi to an equivalent label in ỹ [45], and the Kronecker delta function δ is defined as

δ(a, b) = 1 if a = b, and δ(a, b) = 0 if a ≠ b

The latter function is named after the German mathematician Leopold Kronecker (1823–1891) 4.
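A minimal Python sketch of ACC, where the permutation mapping Map is found by brute force over all label permutations. This is feasible for a handful of clusters; in practice the optimal mapping is usually found with the Hungarian algorithm. The sketch assumes both clusterings use the same number of clusters:

```python
from itertools import permutations

def clustering_accuracy(y, y_tilde):
    """ACC: the best fraction of agreeing labels over all one-to-one
    relabellings of the first clustering's label alphabet."""
    labels = sorted(set(y))
    labels_tilde = sorted(set(y_tilde))
    best = 0.0
    for perm in permutations(labels_tilde):
        mapping = dict(zip(labels, perm))        # one candidate Map function
        hits = sum(1 for a, b in zip(y, y_tilde) if mapping[a] == b)
        best = max(best, hits / len(y))
    return best
```

For example, the clusterings [0, 0, 1, 1] and [1, 1, 0, 0] are identical up to relabelling and score 1.0.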

5.4.2 Normalised mutual information

In order to define normalised mutual information, two fundamental concepts called entropy and mutual information from the field of information theory need introducing first.

Assume a random variable X with outcome space ΩX and probability mass function pX(x). The entropy, also known as information entropy, of X is denoted H(X) and defined as

H(X) = − Σ_{x∈ΩX} pX(x) log2 pX(x)

A way to interpret this quantity is how unexpected, uncertain and unpredictable the information contained in X is [49]. Interestingly enough, the expression for entropy does not depend on the values of X but rather on the probability distribution of X.

For two random variables X and Y, the mutual information between them, I(X, Y), is defined as

I(X, Y) = Σ_{y∈ΩY} Σ_{x∈ΩX} p(X,Y)(x, y) log2 [ p(X,Y)(x, y) / (pX(x) pY(y)) ]

4 Kronecker is famous for having said "Die ganzen Zahlen hat der liebe Gott gemacht, alles andere ist Menschenwerk" ("God made the integers, all else is the work of man").


where ΩY is the outcome space of Y, pY(y) the probability mass function of Y and p(X,Y)(x, y) the joint probability mass function of X and Y. Intuitively, the mutual information of two random variables tells how much one can know about one variable given the information about the other variable [52]. The normalised mutual information of two clusterings C and C̃ with cluster label vectors y and ỹ respectively can now be defined as

NMI(C, C̃) = 2 I(y, ỹ) / (H(y) + H(ỹ))

where I(y, ỹ) is the mutual information between y and ỹ and H(y) the information entropy of y.
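The three quantities above can be computed directly from two label vectors by using their empirical probability mass functions; a minimal Python sketch:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Information entropy (base 2) of a cluster label vector."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def mutual_information(y1, y2):
    """Mutual information (base 2) between two label vectors of equal length."""
    n = len(y1)
    joint = Counter(zip(y1, y2))           # empirical joint distribution
    p1, p2 = Counter(y1), Counter(y2)      # empirical marginals
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / ((p1[a] / n) * (p2[b] / n)))
    return mi

def nmi(y1, y2):
    """Normalised mutual information: 2 I(y1, y2) / (H(y1) + H(y2))."""
    return 2 * mutual_information(y1, y2) / (entropy(y1) + entropy(y2))
```

Identical clusterings (up to relabelling) give NMI = 1, while independent ones give NMI = 0.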

5.4.3 Adjusted mutual information

This version of mutual information contains an adjustment for the situations where two clusterings agree by coincidence, which is not considered in the normalised mutual information [56]. The adjusted mutual information of two clusterings C and C̃ is defined as

AMI(C, C̃) = [ I(y, ỹ) − E[I(y, ỹ)] ] / [ max(H(y), H(ỹ)) − E[I(y, ỹ)] ]

where E[I(y, ỹ)] denotes the expectation of the mutual information between C and C̃.

5.5 Similarity comparison of two sets

Assume two non-empty sets A and B with equally or unequally many elements. A measure for how similar these sets are is the Jaccard index J, defined as

J(A, B) = |A ∩ B| / |A ∪ B|
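A one-line Python sketch of the definition:

```python
def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two non-empty sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

Identical sets score 1, disjoint sets score 0, and partially overlapping sets land in between.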


CHAPTER 6

Feature selection methods

Five graph-based methods for feature selection with different points of view are presented in this chapter. Given a data set with m samples and n features, the aim is to select the best k ≪ n features according to certain criteria. In chronological order, the five methods are as follows:

1. Laplacian score (LS), proposed 2005 [60]: This method favours the features which have large variances and preserve the local manifold structure of the data.

2. Multi-cluster feature selection (MCFS), proposed 2010 [68]: This method favours the features which preserve the multiple cluster structure of the data. The optimisation problem involves spectral regression with ℓ1-norm regularisation.

3. Non-negative discriminative feature selection (NDFS), proposed 2012 [69]: This method favours the most discriminative features. The optimisation problem involves non-negative spectral analysis and regression with ℓ2,1-norm regularisation.

4. Feature selection via non-negative spectral analysis and redundancy control (NSCR), proposed 2015 [70]: This method is an extension of NDFS which also controls the redundancy between features. The optimisation problem involves non-negative spectral analysis and regression with ℓ2,q-norm regularisation. Abbreviation: "NS" for "non-negative spectral analysis".

5. Feature selection via adaptive similarity learning and subspace clustering (SCFS), proposed 2019 [67]: This method favours the most discriminative features which also preserve the similarity structure of the data. The optimisation problem involves regression with ℓ2,1-norm regularisation. Abbreviation: "SC" for "subspace clustering".

Of all five methods, Laplacian score is the only one which contains neither iterations, regression, regularisation nor clustering of data.

For simplicity, some notations will be used consistently in the descriptions of all methods according to Table 6.1.

X ∈ Rm×n — the data matrix
m — number of data samples
n — number of original features
k — number of selected features
f1, f2, …, fn — the n original features
x1, x2, …, xm ∈ Rn — the m sample vectors
f1, f2, …, fn ∈ Rm — the n feature vectors

Table 6.1: Notations associated with a given data set.

Furthermore, the bold uppercase letters A, B, C, … denote matrices and the bold lowercase letters a, b, c, … denote vectors. For an arbitrary matrix A, the notation with two subscript indices, Aij, means the element at row i and column j in A; the notation with one subscript index, ai, means the row vector at row i; and the notation with one superscript index, At, means A at iteration step t.

The notation ‖x‖ always refers to the Euclidean norm of the vector x, while ‖A‖F means the Frobenius norm of the matrix A. For an arbitrary matrix A ∈ Rm×n with real elements, its Frobenius norm is defined as

‖A‖F = √( Σ_{i=1}^{m} Σ_{j=1}^{n} Aij² ) = √( tr(AᵀA) )
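Both expressions for the Frobenius norm are easy to verify numerically; a minimal NumPy sketch with an arbitrary made-up matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                      # an arbitrary 3 x 2 matrix

# element-wise definition: square root of the sum of squared entries
frob_elementwise = np.sqrt((A ** 2).sum())

# trace identity: ||A||_F = sqrt(tr(A^T A))
frob_trace = np.sqrt(np.trace(A.T @ A))
```

NumPy's `np.linalg.norm` computes exactly this norm for matrices by default.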


6.1 Laplacian score (LS)

The main idea of this approach is to rank the features according to their power of preserving locality, measured by a number called the Laplacian score. Selecting the k best features is equivalent to selecting the k features with the lowest Laplacian scores [60].

6.1.1 Preparation

Create the p-nearest neighbour graph G = (V, E, W) with m vertices corresponding to the m sample vectors x1, x2, …, xm ∈ Rn, for some choice of p. There are two ways to define the edge weight matrix W [59]:

1. Let Wij = 1 if the vertices vi and vj are adjacent and Wij = 0 otherwise.

2. Let Wij = ψ(xi, xj) = e^(−ω‖xi−xj‖²) if vi and vj are adjacent, and Wij = 0 otherwise, for some chosen parameter ω > 0 which needs tuning in practice.

The latter suggested edge weight matrix reflects the similarity between the data samples. The lower the distance ‖xi − xj‖, the higher Wij and the more similar the data points xi and xj are. As explained in [59], this particular choice of W is advantageous due to a related partial differential equation about heat distribution, where ψ is also called the heat kernel. A motivated choice for the parameter ω is ω = (ln 2)/d̄, where d̄ is the arithmetic mean value of the distances {‖xi − xj‖} [53]. By construction, W is a symmetric matrix.

6.1.2 Optimisation problem

The locality-preserving power of a feature fr is assessed by the quotient

(⋆) = [ Σ_{i,j=1}^{m} ((fr)i − (fr)j)² Wij ] / Var(fr)

which shall be minimised. The weight Wij in the expression ((fr)i − (fr)j)² Wij can be interpreted as a kind of penalty to ensure the locality preserving ability of a feature. To be more concrete, the closer xi and xj are to each other, meaning the larger Wij, the smaller ((fr)i − (fr)j)² should be. With Var(fr), denoting the weighted variance of the feature fr, in the denominator, the quotient (⋆) is minimised by maximising Var(fr). This also allows the most representative features to be selected.

Algebraic simplification: To simplify (⋆), use the re-writings

Σ_{i,j=1}^{m} ((fr)i − (fr)j)² Wij = Σ_{i,j=1}^{m} [ ((fr)i)² + ((fr)j)² − 2 (fr)i (fr)j ] Wij
= 2 Σ_{i,j=1}^{m} ((fr)i)² Wij − 2 Σ_{i,j=1}^{m} (fr)i Wij (fr)j
= 2 frᵀD fr − 2 frᵀW fr = 2 frᵀL fr

and

Var(fr) ≈ Σ_{i=1}^{m} ((fr)i − μr)² Dii = Σ_{i=1}^{m} [ (fr)i − frᵀD1/(1ᵀD1) ]² Dii

shown in [59], where 1 = [1 1 ⋯ 1]ᵀ, D = diag(W1) and L = D − W is the Laplacian matrix of the graph G.

Further improvement: As pointed out in [60], there is a risk that the vector fr is a non-zero constant vector such as 1. This leads to

frᵀL fr = 1ᵀL1 = 0  ⇒  (⋆) = 2 frᵀL fr / Var(fr) = 0

which gives no information about the feature fr. This problem can be avoided by removing the weighted mean from each feature vector, i.e. replacing fr with

f̃r = fr − [ frᵀD1/(1ᵀD1) ] 1

which gives frᵀL fr = f̃rᵀL f̃r and Var(fr) ≈ f̃rᵀD f̃r. Hence, the optimisation problem is about minimising the quotient 2 f̃rᵀL f̃r / (f̃rᵀD f̃r), which is equivalent to minimising the objective function

LS(fr) = f̃rᵀL f̃r / (f̃rᵀD f̃r)

where LS(fr) stands for the Laplacian score of the feature fr. At this point, one can note that a good feature fr has a low Laplacian score. An algorithm for ranking the features according to their Laplacian scores is now ready to be formulated.

6.1.3 Feature selection algorithm

Given a data matrix X ∈ Rm×n with m samples and n features, the best k features can be found as follows:

1. Construct a p-nearest neighbour graph G with m vertices corresponding to the m data vectors as described in 6.1.1.

2. Compute the weight matrix W ∈ Rm×m, in this case also known as the similarity matrix or affinity matrix, with the heat kernel as described in 6.1.1.

3. For each feature fr with corresponding feature vector fr, define f̃r as

f̃r = fr − [ frᵀD1/(1ᵀD1) ] 1

where 1 = [1 1 ⋯ 1]ᵀ ∈ Rm and D = diag(W1).

4. Compute the Laplacian matrix L = D − W of the graph G and the Laplacian score LS of each feature fr as

LS(fr) = f̃rᵀL f̃r / (f̃rᵀD f̃r)

5. Select the k features with the lowest Laplacian scores.
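The whole algorithm fits in one short NumPy function; a minimal sketch, assuming the heat-kernel weights from 6.1.1, a fixed ω, and no constant features (a constant feature would make the denominator zero). The two-feature demo data set is made up, with feature 0 separating two clusters and feature 1 being noise:

```python
import numpy as np

def laplacian_scores(X, p=2, omega=1.0):
    """Laplacian score of every feature (lower is better), following steps
    1-4 above: p-NN graph, heat-kernel weights W, D = diag(W 1), L = D - W."""
    m, n = X.shape
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=2)                 # squared Euclidean distances
    order = d2 + np.diag(np.full(m, np.inf))     # exclude self-neighbourship
    A = np.zeros((m, m))
    nn = np.argsort(order, axis=1)[:, :p]
    for i in range(m):
        A[i, nn[i]] = 1.0
    A = np.maximum(A, A.T)                       # "vi is a p-NN of vj or vice versa"
    W = A * np.exp(-omega * d2)                  # heat-kernel edge weights
    D = np.diag(W.sum(axis=1))
    L = D - W
    one = np.ones(m)
    scores = np.empty(n)
    for r in range(n):
        f = X[:, r]
        f_tilde = f - (f @ D @ one) / (one @ D @ one) * one
        scores[r] = (f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde)
    return scores

# made-up data: feature 0 separates two clusters, feature 1 is pure noise
X_demo = np.array([[0.0, 0.9], [0.1, 0.1], [0.2, 0.5],
                   [5.0, 0.3], [5.1, 0.7], [5.2, 0.2]])
demo_scores = laplacian_scores(X_demo)
```

The cluster-separating feature varies little between neighbouring points while having a large variance, so it receives the lower (better) score.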


6.2 Multi-cluster feature selection (MCFS)

This approach ranks the features according to their power of preserving the cluster structure of the data as well as covering all the possible clusters [68]. The higher the score MCFS(fr), the better the feature fr.

6.2.1 Preparation

A graph here is constructed in the same way as described in 6.1.1 for the Laplacian score method. After obtaining a p-nearest neighbour graph and the Laplacian matrix L = D − W, solve the generalised eigenvalue problem

Lu = λDu

to obtain a flat embedding of the data points. According to [59], the equation Lu = λDu is derived from an optimisation problem of constructing a representation for data lying on a low-dimensional manifold embedded in a high-dimensional space, where local neighbourhood information is preserved.

Let U = [u1, …, uκ] be a matrix containing the eigenvectors of the above-mentioned eigenvalue problem corresponding to the κ smallest eigenvalues. Here, κ can be interpreted as the intrinsic dimensionality of the data and each eigenvector reflects how the data is distributed along the corresponding dimension, or in other words the corresponding cluster [68]. If the number of clusters of the data set is known, κ can be chosen to be this number.

6.2.2 Optimisation problem

For each vector ui in U, where i = 1, …, κ, find a vector ai ∈ Rn which minimises the fitting error

‖ui − Xai‖² + β|ai|

where X is the data matrix, |ai| denotes the sum Σ_{r=1}^{n} |(ai)r| and β is some parameter. This optimisation problem is about assessing the importance of each feature for differentiating the clusters. The solution ai contains the combination coefficients for the n different features in approximating ui. Then, one can select the most relevant features with respect to ui, namely those with the largest coefficients in magnitude.

Using the ℓ1-norm regularisation on the last term β|ai| as a penalty, a sparse vector ai can be obtained if the parameter β is chosen to be large enough. The larger β is, the more elements in ai are forced to shrink to zero. The sparsity of ai helps prevent the case where too many features get selected, some of which may be noisy or irrelevant.

For each feature fr, define its multi-cluster feature selection score as

MCFS(fr) = max_i |(ai)r|

The k of the n features which should be selected according to this approach are now those with the highest multi-cluster feature selection scores.

6.2.3 Feature selection algorithm

Given a chosen number of clusters κ, by default κ = 5, the following algorithm returns the k best features according to their multi-cluster feature selection scores.

1. Construct a p-nearest neighbour graph as described in 6.1. Compute the matrices D and W as described in the same section and the Laplacian matrix L = D − W.

2. Solve the generalised eigenvalue problem

Lu = λDu

and obtain the κ eigenvectors u1, …, uκ corresponding to the κ smallest eigenvalues.

3. Solve the minimisation problem

min_{ai} ‖ui − Xai‖² + β|ai|

for each ui and obtain κ vectors {ai}.

4. Compute the score of each feature as

MCFS(fr) = max_i |(ai)r|

5. Select the k features with the highest scores.
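A minimal NumPy sketch of the algorithm. Two implementation choices here are assumptions, not part of the method description above: the generalised eigenvalue problem is solved via the equivalent symmetric problem on D^(−1/2) L D^(−1/2), and the ℓ1-regularised regressions are solved with ISTA (a standard proximal gradient method), whereas the original paper uses a different solver:

```python
import numpy as np

def soft_threshold(a, t):
    """Soft-thresholding operator, the proximal map of the l1 penalty."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def mcfs_scores(X, W, kappa, beta=0.01, iters=500):
    """MCFS scores: eigenvectors of L u = lambda D u for the kappa smallest
    eigenvalues, one lasso regression u_i ~ X a_i per eigenvector (via ISTA),
    and finally MCFS(f_r) = max_i |(a_i)_r|."""
    D = W.sum(axis=1)
    L = np.diag(D) - W
    Dms = np.diag(1.0 / np.sqrt(D))
    # L u = lambda D u  <=>  (D^-1/2 L D^-1/2) v = lambda v with u = D^-1/2 v
    eigvals, eigvecs = np.linalg.eigh(Dms @ L @ Dms)
    U = Dms @ eigvecs[:, :kappa]                 # kappa smallest eigenpairs
    step = 0.5 / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant of grad
    A = np.zeros((X.shape[1], kappa))
    for i in range(kappa):
        a = np.zeros(X.shape[1])
        for _ in range(iters):                   # ISTA iterations
            grad = 2.0 * X.T @ (X @ a - U[:, i])
            a = soft_threshold(a - step * grad, step * beta)
        A[:, i] = a
    return np.abs(A).max(axis=1)                 # MCFS(f_r) = max_i |(a_i)_r|

# made-up demo: six samples in two clusters, a block-structured weight matrix
X_demo = np.array([[0.0, 0.9], [0.1, 0.1], [0.2, 0.5],
                   [5.0, 0.3], [5.1, 0.7], [5.2, 0.2]])
W_demo = np.kron(np.eye(2), np.ones((3, 3)))
np.fill_diagonal(W_demo, 0.0)
demo_scores = mcfs_scores(X_demo, W_demo, kappa=2)
```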

6.3 Non-negative discriminative feature selection (NDFS)

This approach aims to select the most discriminative features with the aid of the cluster labels of the data [69]. In practice, a data set may have well-defined clusters, commonly known as categories, considering the real-life interpretation of the data. If this is not the case, pseudo cluster labels can be generated using cluster detection methods such as the k-means clustering algorithm.

6.3.1 Preparation

A p-nearest neighbour graph G is created as described in6.1. In this proposed method, instead of the Laplacian matrix L = D − W, the corresponding normalised Laplacian matrix

ˆ

L = D−1/2(D − W)D−1/2

is considered. This particular chosen version of the Laplacian matrix for the task of feature selection is not motivated in [69]. However, it is pointed out in [62] that ˆL in general has its theoretical advantages where many results can be generalised to all graphs and not only regular graphs.

Also define a cluster indicator matrix $Y \in \{0, 1\}^{m \times \kappa}$ based on a clustering of the vertices in $G$, where $\kappa$ is the number of clusters. Then, the corresponding scaled cluster indicator matrix $F \in \mathbb{R}^{m \times \kappa}$ is defined as
\[
F = Y \left( Y^T Y \right)^{-1/2}
\]
According to this definition, the columns of $F$ are orthonormal since
\[
F^T F = \left( Y^T Y \right)^{-1/2} Y^T Y \left( Y^T Y \right)^{-1/2} = I
\]


6.3.2 Optimisation problem

The idea is to minimise the sum
\[
\operatorname{tr}(F^T \hat{L} F) + \alpha \|XZ - Y\|_F^2 + \beta \|Z\|_{2,1}
\]
over $Y$ and $Z \in \mathbb{R}^{n \times \kappa}$ for some parameters $\alpha$ and $\beta$, where the scaled cluster label matrix $F$ and the feature selection matrix $Z$ are optimised simultaneously. The first term $\operatorname{tr}(F^T \hat{L} F)$ models the local structure of the data manifold, in a similar manner to what is done in LS. This structure is important for clustering the data points. The second term $\alpha \|XZ - Y\|_F^2$ minimises the fitting error when the original data $X$ is embedded in a low-dimensional subspace with a transformation matrix $Z$. In the third term, the $\ell_{2,1}$-norm ensures the row sparsity of $Z$ to filter out noisy features. This is due to the fact that the $\ell_{2,1}$-norm of an arbitrary matrix $A \in \mathbb{R}^{n \times m}$ is defined as
\[
\|A\|_{2,1} = \sum_{j=1}^{n} \|a_j\|_2 = \sum_{j=1}^{n} \left( \sum_{i=1}^{m} A_{ji}^2 \right)^{1/2}
\]
where $a_j$ denotes the $j$-th row of $A$. If the feature $f_j$ is not highly correlated to the pseudo cluster labels described by $F$, it is likely that the row $z_j$ will be shrunk to the zero vector if $\beta$ is large enough.
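A short numerical illustration of why this penalty promotes row sparsity: rows that are exactly zero contribute nothing to the norm, so shrinking whole rows to zero is the cheapest way to reduce it. The matrix below is an assumed toy example.

```python
import numpy as np

A = np.array([[3., 4.],
              [0., 0.],
              [1., 0.]])

# l2,1-norm: sum of the Euclidean norms of the rows
# row norms are [5, 0, 1], so the l2,1-norm is 6
l21 = np.sum(np.linalg.norm(A, axis=1))
```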

Since $F$ is a cluster indicator matrix, it is necessary that $F \geq 0$. This gives the optimisation problem
\[
\begin{aligned}
\min_{F, Z} \quad & \operatorname{tr}(F^T \hat{L} F) + \alpha \|XZ - F\|_F^2 + \beta \|Z\|_{2,1} \\
\text{s.t.} \quad & F = Y(Y^T Y)^{-1/2} \ \text{and} \ F \geq 0
\end{aligned}
\]
Relaxation: This problem is, however, intractable because the constraint $F = Y(Y^T Y)^{-1/2}$ forces the elements of $F$ to take discrete values. A tractable continuous optimisation problem can be obtained by relaxing this to the constraint $F^T F = I$ for orthogonality of $F$ [61]:
\[
\begin{aligned}
\min_{F, Z} \quad & \operatorname{tr}(F^T \hat{L} F) + \alpha \|XZ - F\|_F^2 + \beta \|Z\|_{2,1} \\
\text{s.t.} \quad & F^T F = I \ \text{and} \ F \geq 0
\end{aligned}
\]


After one additional relaxation, the problem is simplified further to
\[
\begin{aligned}
\min_{F, Z} \quad & \phi(F, Z) = \operatorname{tr}(F^T \hat{L} F) + \alpha \|XZ - F\|_F^2 + \beta \|Z\|_{2,1} + \gamma \|F^T F - I\|_F^2 \\
\text{s.t.} \quad & F \geq 0
\end{aligned}
\]
where $\gamma$ is in practice a large constant that makes sure the orthogonality constraint $F^T F = I$ is respected.

Finding optimal Z: The matrix $Z$ for which $\phi(F, Z)$ becomes minimal can now be found by differentiating $\phi$ partially with respect to $Z$ and setting the derivative to zero. For simplification, introduce the auxiliary diagonal matrix $G$ with the main diagonal elements
\[
G_{ii} = \frac{1}{2 \|z_i\|}
\]
for $i = 1, \dots, n$, where $z_i$ denotes the $i$-th row of $Z$. Then, some algebraic rewriting gives
\[
\frac{\partial \phi}{\partial Z} = 0
\;\Rightarrow\; 2\alpha X^T (XZ - F) + 2\beta G Z = 0
\;\Rightarrow\; Z = \left( \alpha X^T X + \beta G \right)^{-1} \alpha X^T F
\;\Rightarrow\; Z = \Big( \underbrace{X^T X + \tfrac{\beta}{\alpha} G}_{:= E} \Big)^{-1} X^T F = E^{-1} X^T F
\]

Since both $X^T X$ and $G$ are symmetric, $E$ is also symmetric and thus $Z^T = F^T X E^{-1}$. Substituting this $Z^T$ into $\phi(F, Z)$, while using the properties $\|Z\|_{2,1} = \operatorname{tr}(Z^T G Z)$ and $\|A\|_F^2 = \operatorname{tr}(A^T A)$ for an arbitrary matrix $A$, yields
\[
\begin{aligned}
\alpha \|XZ - F\|_F^2 + \beta \|Z\|_{2,1}
&= \alpha \operatorname{tr}\!\big((XZ - F)^T (XZ - F)\big) + \beta \operatorname{tr}(Z^T G Z) \\
&= \alpha \operatorname{tr}\!\big(Z^T X^T X Z - Z^T X^T F - F^T X Z + F^T F\big) + \beta \operatorname{tr}(Z^T G Z) \\
&= \underbrace{\alpha \operatorname{tr}(F^T F) - 2\alpha \operatorname{tr}(Z^T X^T F)}_{\text{depends on } F} + \underbrace{\alpha \operatorname{tr}(Z^T X^T X Z) + \beta \operatorname{tr}(Z^T G Z)}_{\text{does not depend on } F}
\end{aligned}
\]
