Modelling Hierarchical Structures in Networks Using Graph Theory


UPTEC F 20038

Degree project 30 credits, June 2020

Modelling Hierarchical Structures in Networks Using Graph Theory

With Application to Knowledge Networks in Graph Curricula

Emil Wengle


Faculty of Science and Technology, UTH unit

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract

Modelling Hierarchical Structures in Networks Using Graph Theory

Emil Wengle

Community detection is a topic in network theory that involves assigning labels to nodes based on some distance measure or centrality index. Detecting communities within a network can be useful for condensing information. In this thesis we explore how to use this approach for pedagogical purposes, more precisely to condense and visualise the networks of facts, concepts and procedures (also called Knowledge Components (KCs)) that are offered in higher education programmes.

In detail, we consider one of the most common quantities used to evaluate the goodness of a community classification: modularity. Detecting communities by maximising modularity directly is usually desired, but generally infeasible, because the associated optimisation problem is NP-complete.

This is why practitioners use algorithms that, instead of computing the optimum, rely on various heuristics to find communities: some use modularity directly, some start from the entire graph and divide it repeatedly, and some contain random elements.

This thesis investigates the trade-offs of using different community detection algorithms and variations of the concept of modularity, first in general terms, and then for the purpose of identifying communities in knowledge graphs associated with higher education programmes, which can be modelled as directed graphs of KCs.

By tweaking these algorithms and applying them to both synthetic and field data, we find that the Louvain algorithm is among the best of those we considered, mostly thanks to its efficiency. It does not produce a full hierarchy, however, so we recommend Fast Newman if hierarchy is important.

ISSN: 1401-5757, UPTEC F 20038
Examiner: Tomas Nyberg
Subject reader: Steffi Knorn
Supervisor: Damiano Varagnolo


Popular Science Summary

Higher education programmes and other study programmes at academic institutions can cover almost anything. They all share a common denominator: they consist of a number of courses, which in turn can be described by the facts, concepts and procedures, or knowledge components, that they involve. Knowledge components can be specified either as prerequisites or as learning outcomes.

A programme can be modelled as a mathematical graph, a construction built from a number of vertices connected by a number of edges. Each vertex represents a knowledge component and each edge a prerequisite. An edge going from "division" to "fractions" says, for example, that one needs to understand division in order to learn about fractions. A model of this kind can become hard to read, however, since a programme may cover thousands of concepts.

We therefore investigated algorithms that can condense the information in such graphs into considerably smaller graphs. Such algorithms can be described as methods for finding community structure, or "community detection". Several algorithms search for communities in graphs, among them the "Louvain" algorithm by Blondel et al., the Girvan-Newman algorithm and the fast Newman algorithm.

The strength of the communities can be measured with a quantity called modularity, proposed by Newman and Girvan. This measure is higher if the communities are internally connected by more edges than would be expected if the edges had been drawn at random between all vertices of the graph. Other authors have proposed other measures: Kim, Son and Jeong propose a variant of modularity that they call "LinkRank", and Li et al. a fundamentally different measure that they call "modularity density".

We used the algorithms to find communities in the frequently used network "Zachary's karate club". The aim was to form a first impression of how the algorithms behave in practice. We then applied a selection of them to a graph that could describe a series of courses in linear algebra. Here the goal was to study how different modularity-based algorithms behave when used with Newman's, Kim, Son and Jeong's, and Li et al.'s measures of community credibility. From these observations we then wanted to determine which measure is most suitable for judging whether the communities in a knowledge graph make sense.

Through these experiments we concluded that Louvain was the most efficient of the algorithms we used. Its only problem is that it does not show the full hierarchy, since the algorithm may move vertices from one community to another. Newman's fast algorithm instead merges two communities into one at each step, which lets it produce a complete hierarchy. The same algorithm is less efficient than Louvain, however, so it cannot be applied to very large graphs if results are wanted within a reasonable time.

Our results could be used to simplify communication between different parties in the education system. The programme board of a five-year programme, for example, does not want a complicated and convoluted model when discussing changes to the study plan; they want a simple and readable one. A simplified model of their programme can be produced with, for example, Louvain. Students would rather have a detailed model as an aid in their studies; they can therefore pick a low level in the hierarchy that Newman's fast algorithm produces for them.

(4)

Contents

1. Introduction
   1.1. Problem Statement
   1.2. The Role of This Thesis
   1.3. Prior Work
2. Background and Theory
   2.1. Motivation
   2.2. Graph Theory
      2.2.1. The Graph
      2.2.2. Structural Properties
      2.2.3. Graph Quantities
      2.2.4. Measures of Importance
   2.3. Community Detection
      2.3.1. Community Structure: A Primer
      2.3.2. Need for Community Detection Algorithms
      2.3.3. NP and the Turing Machine
      2.3.4. The Ordo Notation
   2.4. Modularity
      2.4.1. The Definition of Modularity
      2.4.2. LinkRank: Alternative Modularity for Directed Graphs
      2.4.3. The Pedagogical Interpretation of Modularity
      2.4.4. The Resolution Limit of Modularity
      2.4.5. Alternatives to Modularity: Modularity Density
   2.5. Algorithms for Community Detection
      2.5.1. Louvain
      2.5.2. Fast Newman
      2.5.3. Kernighan-Lin
      2.5.4. Girvan-Newman
      2.5.5. Duch-Arenas Extremal Optimisation
      2.5.6. Leicht-Newman
      2.5.7. Pairwise Agglomeration Induced by Sampling
      2.5.8. Improved Shared Neighbors Graph Clustering
3. Data and Algorithms
   3.1. Algorithm Classification
   3.2. Zachary's Karate Club
   3.3. Knowledge Component Graphs
      3.3.1. Linear Algebra Knowledge Components
      3.3.2. Systems Theory and Analysis Knowledge Components
4. Experiments
   4.1. EigenMerge: Our Proposed Algorithm
      4.1.1. Levels of Detail
      4.1.2. Our Algorithm
   4.2. Community Structure and Zachary's Karate Club
   4.3. Exploring the Resolution Parameter
   4.4. Fast Newman, Modularity and Its Alternatives
5. Discussion
   5.1. Algorithm Performance
   5.2. Modularity Versus Its Alternatives
      5.2.1. Comparing Newman and LinkRank
      5.2.2. Explaining the Tie-Breaking Rule
      5.2.3. Comments on Modularity Density
   5.3. Reflections about the KFGs
6. Conclusion
A. Figures


1. Introduction

Any network can be modelled as a graph structure which contains nodes and edges. Some groups of nodes may have unusually many edges between each other compared to the remainder of the graph, which might indicate that those nodes form a community. Modularity (Section 2.4) is a common measure of community strength that was introduced by Newman and Girvan. It is defined as the fraction of edges that go within communities minus the expected fraction of such edges in the corresponding null model. Higher values indicate that the communities are less likely to have formed by chance [2]. Finding a community division that maximises an objective function like modularity is an NP-complete problem [5], which means that for large instances it cannot be solved on a computer in reasonable time.
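As a concrete illustration, Newman-Girvan modularity for an undirected graph can be computed directly from an edge list. The following is a minimal sketch; the edge list, the `modularity` function name and the two-community division are illustrative and not taken from the thesis data.

```python
# Minimal sketch of Newman-Girvan modularity for an undirected graph:
# Q = sum_c (e_c / m - (d_c / 2m)^2), where e_c is the number of edges
# inside community c and d_c is the total degree of community c.

from collections import defaultdict

def modularity(edges, community):
    m = len(edges)
    internal = defaultdict(int)   # e_c: edges with both ends in c
    degree = defaultdict(int)     # d_c: summed node degrees in c
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2
               for c in degree)

# Two triangles joined by a single edge: the obvious division scores high,
# while putting every node in one community scores zero.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
division = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, division), 4))  # → 0.3571
```

Note that the trivial division with all nodes in one community always yields Q = 0, since the internal-edge fraction then equals its expectation exactly.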

This makes people turn to designing algorithms that use various heuristics to find a "good enough" division. Most such algorithms support ordinary, undirected graphs, but some support directed graphs, where the edges go from one node to another but not necessarily back. Examples of directed graphs include citation networks, where each paper is a node and each citation is a directed edge starting in the citing paper; the internet; and skill trees in video games, such as the technology tree in the "Sid Meier's Civilization" strategy games.

A university programme or the like, from here on referred to as a Higher Education Programme (HEP), can be modelled as a directed graph, where the nodes represent facts, concepts or procedures, and the edges represent dependencies. Communities in a HEP can be interpreted as topics of knowledge, such as "organic chemistry", "solid state physics", "control engineering" and "digital electronics". But why would it be interesting to use community detection to find these topics by machine? This section aims to answer this question.

1.1. Problem Statement

A programme in higher education is built on a set of courses that cover a number of fields of science. Any programme can be described as a set of Knowledge Components (KCs) that spans a variable number of fields of science. A KC can be anything that a student can learn, including, but not limited to, a fact, a concept or a procedure [6].

All courses in the programme have prerequisites: they require understanding of a set of KCs, and teach a different set of KCs in return. This is why we can model any programme as a directed graph of KCs. One approach to modelling a programme is to use specialised spreadsheets that Wengle, Knorn, and Varagnolo call Course Flow Matrices (CFMs) [7]; each course requires one CFM.

There are a few potential shortcomings with this approach to modelling programmes, however. One is that the CFMs must be constructed by hand, which requires input from teachers. The same course may be given by different teachers, and these teachers might have different views on what the course contents should be. Varagnolo et al. have also pointed out that students might disagree with the teachers [8–10].

Another is that the resulting graph model might become rather big and detailed. The finalised model of a programme that consists of 50 courses might contain thousands of KCs and thousands, if not tens of thousands, of dependencies. Graphs of this size are difficult to visualise effectively without compressing them in some way. The programme usually spans a number of fields of science, so if we could find a way to categorise the KCs, we could obtain a simpler model of the programme.

Doing so would be useful for several stakeholders. The programme board is not interested in discussing course contents in terms of lessons or lectures, but rather in a way that lets them see the bigger picture. In contrast, students would be interested in the details, since these would help them understand their learning and plan their studies.

Figure 1: Greatly simplified model of a learning tree (left) and a section of a detailed model of one (right).

1.2. The Role of This Thesis

This thesis investigates methods for finding suitable categorisations of the KCs, or "community detection".

The objective of this thesis is to investigate how community detection can be employed to categorise KCs by topic of science. The graph model should define the KCs as nodes and dependencies as directed edges.

Given the application to knowledge networks in higher education programmes, the questions that we seek to answer with this thesis are the following.

• How do we know if the detected communities make sense?

• Which algorithms are better suited to detect communities among KCs?

• What are the benefits and drawbacks of modelling a higher education programme as a graph?

The purpose of this thesis is to gain a better understanding of community detection. To be specific, we want to learn the principles of community detection, some algorithms, how to identify a “good” community division, and limitations of quality measures.

Using community detection to categorise KCs is not unlike designing a simplified ontological domain model. In this context, a domain model is a model of a knowledge domain, such as signals and systems, that spans KCs, rules, individuals and the like.

Learning analytics is a field of science in which data produced by a learner is used with analysis models to predict and support their learning. It is related to machine learning, artificial intelligence and statistical analysis.

With this thesis, the student can get a better picture of how their learning progresses at different levels. The level of detail can be controlled by the student: they can view a condensed learning tree, which could be as simple as that in Figure 1 (left), or a complete learning tree, which could be as detailed as is suggested by Figure 1 (right). The teacher can also make use of this sliding resolution to gauge the students' state of learning. This information is useful for determining when to repeat KCs and which KCs the students need help learning or remembering.

1.3. Prior Work

Others have explored and researched community detection before this thesis was written. There exist hundreds, if not thousands, of papers on community detection, graph clustering et cetera in the academic literature. To the best of our knowledge, none of them has considered the application toward knowledge networks, but a handful of prior works are reported here.

Fortunato has written an exhaustive 100-page paper on community detection in graphs. It lists numerous approaches to finding communities, including traditional methods, top-down algorithms, statistical methods and modularity-based methods. It also presents various alternatives to the commonly used Newman modularity [11] (introduced by Newman and Girvan in [2]).

Blondel et al. propose a computationally inexpensive algorithm as a bottom-up approach of community detection. The idea is to move single nodes across communities as long as modularity increases, then aggregate the communities, and repeat until modularity ceases to improve [1].
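The local-moving idea behind the Louvain method can be sketched in a few lines. The following is a toy, single-level sketch under stated simplifications: it recomputes modularity naively after each tentative move (the real algorithm uses an incremental gain formula), it omits the aggregation phase, and the graph and function names are illustrative.

```python
# Toy sketch of the local-moving phase of the Louvain method: move a node
# to a neighbouring community whenever that strictly increases modularity,
# and repeat until no move improves it.

from collections import defaultdict

def modularity(edges, comm):
    m = len(edges)
    internal, degree = defaultdict(int), defaultdict(int)
    for u, v in edges:
        degree[comm[u]] += 1
        degree[comm[v]] += 1
        if comm[u] == comm[v]:
            internal[comm[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def local_moving(edges):
    nodes = sorted({n for e in edges for n in e})
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    comm = {n: n for n in nodes}          # start: one community per node
    improved = True
    while improved:
        improved = False
        for n in nodes:
            best_q, best_c = modularity(edges, comm), comm[n]
            for c in {comm[x] for x in neighbours[n]}:
                old, comm[n] = comm[n], c  # tentative move, then restore
                q = modularity(edges, comm)
                if q > best_q:
                    best_q, best_c = q, c
                comm[n] = old
            if best_c != comm[n]:
                comm[n] = best_c
                improved = True
    return comm

# Two triangles joined by one edge settle into two communities.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
comm = local_moving(edges)
print(len(set(comm.values())))  # → 2
```

Because moves happen only on a strict modularity increase, and modularity is bounded above, the loop always terminates.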

Bonald et al. use node pair sampling to cluster graphs hierarchically. The algorithm is applied to a graph representation of streets in central Paris and to a map of airports around the world, among others. The performance is compared to a spectral algorithm and to the algorithm by Blondel et al., on which their algorithm is based [12].

Girvan and Newman propose a top-down algorithm for detecting communities in an undirected graph. The idea behind this algorithm is that edges that connect different communities usually are part of a larger number of shortest paths than edges that go within one [13]. Leicht and Newman also propose a top-down algorithm for community detection. Unlike Girvan and Newman, however, the proposed algorithm considers directed graphs rather than undirected graphs and uses a spectral method [14].

Kernighan and Lin wrote a paper on graph partitioning that dates back to 1970. The method they propose treats partitioning as an optimisation problem, in the sense that it attempts to minimise the total weight of the edges that connect different partitions. Given an initial partitioning, the method looks for the pairs of nodes that decrease the cost the most if they were to trade positions. If no such pair exists, a local optimum has been reached [15].

Duch and Arenas propose a stochastic Extremal Optimisation algorithm as a top-down approach to finding communities in a graph. It first divides the graph into two partitions of not necessarily equal size, then moves nodes across the partitions until modularity ceases to improve. Nodes that contribute less are more likely to be moved. The algorithm then repeats the procedure for each resulting partition [16].

Huijuan, Shixuan, and Yichen present the Improved Shared Neighbors Graph Clustering (ISNGC) algorithm as a bottom-up approach to graph clustering. It is based on the assumption that nodes that have more neighbours in common are more likely to be in the same cluster. It first finds the k-nearest neighbour graph, then merges the clusters that have the highest closeness until the target number of clusters remains [17].

Jaromczyk and Toussaint cover the Relative Neighborhood Graph (RNG), the Gabriel Graph (GG) and a few other types of neighbourhood graphs, giving the reader a basic understanding of the topic [18]. However, the neighbourhood graphs in their paper use geographical information (coordinates, geometry, etc.). Since the graphs of this thesis lack such information, neighbourhood graphs are not relevant here. Derrible and Kennedy use graph theory to model public transport networks. They note that applying graph-theoretic concepts to such networks became reality in the 80's and 90's, and provide different ways to visualise public transportation networks [19].

Anders proposes an unsupervised bottom-up method for clustering nodes in a graph. It uses neighbourhood graphs of different granularity: it computes a neighbourhood graph, from which the paths that are substantially longer than their neighbours are removed [20]. While the clustering algorithm is hierarchical, it depends on positional data that does not exist in our graphs of KCs.

Cheng, Kawano, and Scherpen propose an algorithm for node clustering in linear network systems. The algorithm combines the fields of control theory and graph theory to recursively merge the pair of nodes that has the shortest distance in an H2 or H∞ sense, until sufficiently few nodes remain [21]. The networks in this thesis are not linear network systems, but community detection and graph clustering are similar problems.


2. Background and Theory

This section gives the necessary background. It begins with a motivation of why community detection algorithms help with classifying KCs, then covers the fundamentals of graph theory, explains the concept of modularity in detail (as this thesis relies on it), and finally describes selected algorithms that others have proposed.

2.1. Motivation

Community structure can be found in anything that can be modelled as a network. This includes electronic circuits, public transport and citation networks. A large number of electronic components connected to each other can be thought of as a community, as can authors that frequently cite each other’s papers, or a suburb in a city that has many bus stops. These networks could potentially be very large, so it might not be feasible to have humans find the communities without additional information.

The difficulty of manually establishing a community division is one reason why people turn to community detection. A community division can be found in reasonable time by using community detection algorithms, which use structural properties of mathematical graphs to find a suitable division. There exist quality measures that give an indication of how strong the found communities are; modularity is the most widely used. The natural question, given the intended application, is: "how does this help with modelling hierarchical structures in HEPs?"

An educational program at a university, college or the like can be modelled as a directed graph, where the set of nodes represents the KCs that are taught, developed and/or required in the program, and the set of edges represents the dependencies between KCs: the KC where an edge starts is required to learn the KC that the edge points to.

Such programs tend to cover several fields of science, for example physics and computer science. If a directed graph were constructed as a model of such a program, the ontological structure of the program might not be visible by just looking at the resulting graph. Applying a community detection algorithm to the graph suggests a structure that is semantically sensible to some extent.

2.2. Graph Theory

The seven bridges of Königsberg, shown in Figure 2, is a well-known mathematical problem where the solver is asked to find a path that crosses each bridge exactly once. The mathematician Leonhard Euler showed that this problem has no solution [22]. His work led to the emergence of the field of mathematics known as graph theory.

2.2.1. The Graph

As stated before, graph theory is a field of mathematics where the graph is the fundamental concept.

Definition 1. A graph G is mathematically defined as

G = (V, E), (1)

where V is a set of points, or nodes, that spans the graph, and E is a set of pairs of points, or edges, that defines connections between nodes. The number of nodes, also known as the order of the graph, is denoted n = |V|, and the number of edges, or the size of the graph, is denoted m = |E|.

Figure 2: Model of the seven bridges of Königsberg.

A weighted graph is a graph as in (1) where each edge in E has a weight w > 0. A graph is said to be directed if its edges connect nodes in one direction but not necessarily the other, like the flow of traffic on a one-way street.
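Definition 1 maps directly onto a simple data structure: a node set V and an edge set E, here with weights. This is a minimal sketch; the KC names and weights are illustrative, not taken from the thesis data.

```python
# Definition 1 in code: a directed, weighted graph G = (V, E), stored as
# a node set and a mapping from edge (u, v) to its weight w > 0.

V = {"division", "fractions", "ratios"}
E = {("division", "fractions"): 1.0,   # understand division before fractions
     ("division", "ratios"): 1.0,
     ("fractions", "ratios"): 2.0}     # a weighted prerequisite

n = len(V)   # order of the graph, n = |V|
m = len(E)   # size of the graph, m = |E|
print(n, m)  # → 3 3
```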

2.2.2. Structural Properties

Let G be a graph as in (1). G is connected if and only if there exists a path between each pair of nodes. If G is directed, there exist three levels of connectivity.

1. G is weakly connected if and only if the undirected version of it is connected.

2. G is connected if and only if there exists a path between each pair of nodes in either direction.

3. G is strongly connected if and only if there exists a path between each pair of nodes in both directions.

An example of each level of connectivity follows.

Example 1. Consider the graphs in Figure 3.


Figure 3: Left: a weakly connected graph. Middle: a connected graph. Right: a strongly connected graph.

The left graph is weakly connected because there exists no directed path from node 2 to node 3, but there exists one if the direction of the edges is ignored.

The middle graph is connected because there exists a path from node 2 to node 3, which in turn contains paths from node 2 to node 1 and from node 1 to node 3.

The right graph is strongly connected because there exists a cycle between all nodes in the graph. Any node can reach any other node.
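The three levels of connectivity from Example 1 can be checked mechanically with breadth-first search. The sketch below is illustrative (the function names and the three edge lists, chosen to mirror the description of Figure 3, are assumptions, not thesis code).

```python
# Classifying a directed graph into the three connectivity levels of
# Section 2.2.2 using plain breadth-first search on adjacency lists.

from collections import deque

def reachable(start, adj):
    """Return the set of nodes reachable from `start` in `adj`."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def connectivity(nodes, edges):
    fwd, und = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)       # directed adjacency
        und.setdefault(u, []).append(v)       # undirected version
        und.setdefault(v, []).append(u)
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    if all(b in reachable(a, fwd) for a, b in pairs):
        return "strongly connected"           # paths in both directions
    if all(b in reachable(a, fwd) or a in reachable(b, fwd) for a, b in pairs):
        return "connected"                    # a path in either direction
    if all(b in reachable(a, und) for a, b in pairs):
        return "weakly connected"             # undirected version connected
    return "disconnected"

nodes = [1, 2, 3]
print(connectivity(nodes, [(1, 2), (1, 3)]))           # → weakly connected
print(connectivity(nodes, [(2, 1), (1, 3)]))           # → connected
print(connectivity(nodes, [(1, 2), (2, 3), (3, 1)]))   # → strongly connected
```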


A connected component of G is a subgraph G′ ⊂ G where all nodes in V′ are connected to each other, but not to any node in V ∩ (V′)^C, where A^C is the complement of A.

A shortest path p12 between nodes v1 ≠ v2 is a subset of E containing the edges that connect v1 and v2 with the fewest edges. If G is weighted, p12 is the path between v1 and v2 that minimises the total weight of the edges in the path.

A node vi is a neighbour of node vj if and only if (i, j) ∈ E and G is undirected. The concept of neighbours is not applicable to directed graphs, because an edge does not connect a pair of nodes in both directions. Instead, a node in a directed graph may have predecessors and successors: vi precedes vj iff (i, j) ∈ E, and vi succeeds vj iff (j, i) ∈ E.

G is a simple graph if and only if the following two conditions are satisfied.

1. G contains no self-loops, that is, (i, i) ∉ E for all i ∈ V.
2. There exist no repeated edges in G.

If Item 2 is not satisfied, G can be reduced to a simple graph by replacing the repeated edges with one edge. If the graph is weighted, the weight of this edge equals the sum of the weights of the repeated edges.

G is a Directed Acyclic Graph if and only if G is directed and there exist no cycles in G.

A complete graph is a graph in which each distinct pair of nodes is connected by an edge. A complete graph with m nodes has m(m − 1)/2 edges [23].

A clique is a subgraph G′ ⊂ G in which (i, j) ∈ E′ for all i, j ∈ V′, i ≠ j; equivalently, G′ is a complete graph. In layman's terms, each node has an edge to all other nodes in V′.

2.2.3. Graph Quantities

The adjacency matrix A of G is an n × n matrix, where n = |V|: Aij = 1 if the edge (i, j) ∈ E and 0 otherwise. For weighted graphs, there is a weighted variant where Aij equals the weight of (i, j) ∈ E. If G is undirected, A is symmetric; if G is directed, A is asymmetric in general. Also, if G is simple, then Aii = 0 for all i ∈ V.
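The adjacency matrix of a small directed graph can be built directly from an edge list, as in this sketch (the example graph is illustrative):

```python
# Building the adjacency matrix A of a small directed graph from an edge
# list, using nothing but nested lists. Rows index source nodes, columns
# index target nodes, so A is asymmetric in general.

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]

n = len(nodes)
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1   # directed: row i -> column j

print(A)  # → [[0, 1, 1], [0, 0, 1], [1, 0, 0]]
```

Here A[0][1] = 1 while A[1][0] = 0, showing the asymmetry of a directed graph; the zero diagonal reflects that the graph is simple.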

2.2.4. Measures of Importance

Some nodes may be more important in one way or another. For instance, one node might have many edges connecting it with the rest of the graph, or one other node could be part of many shortest paths. The numbers that quantify the importance of the nodes in a graph are called “centrality indexes”.

There exist a few types of centrality indexes. Two such have been hinted at in the pre- ceding paragraph. The following list describes measures that are applicable to undirected graphs.

Degree is a centrality index that measures how connected a node is to other nodes in the graph. The degree of a node equals the number of edges that connect to it. If the graph is weighted, the degree equals the sum of the weights of those edges.

Betweenness is an index that measures how much a node is “between” other nodes in the graph. The betweenness index is given by the number of shortest paths that pass through a node, but do not start or end there.

Closeness is an index that measures how close a node is to the rest of the graph. Multiple definitions of closeness centrality exist. One is the reciprocal of the average shortest distance to the other nodes; another is the sum of the inverse shortest distances.


Eigenvector is an index that measures how influential a node is in a graph. The principle is that nodes are influential if they are connected to other nodes that are influential. Starting from an initial distribution, nodes repeatedly distribute their influence score to their neighbouring nodes until the scores reach an equilibrium; the eigenvector centrality index is the resulting score. This procedure is like that of the power method for finding the largest eigenvalue and its corresponding eigenvector.

PageRank is an index by Page et al. that is based on the eigenvector index. In addition to distributing influence at each time step, PageRank adds some influence to each node [24]. The interested reader may skip ahead to Definition 9 for more information.
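The power-iteration idea behind the eigenvector and PageRank indexes can be sketched compactly. The sketch below assumes every node has at least one successor (so no dangling-node handling is needed), and the damping factor 0.85 is a common default, not a value prescribed by the thesis.

```python
# PageRank via power iteration on a successor-list representation of a
# directed graph: at each step every node keeps (1 - d)/n base influence
# and receives a d-damped share of each predecessor's rank.

def pagerank(succ, n_iter=100, d=0.85):
    nodes = sorted(succ)
    n = len(nodes)
    rank = {v: 1 / n for v in nodes}
    for _ in range(n_iter):
        new = {v: (1 - d) / n for v in nodes}
        for u in nodes:
            out = succ[u]
            for v in out:
                new[v] += d * rank[u] / len(out)
        rank = new
    return rank

# A 3-node directed cycle 0 -> 1 -> 2 -> 0: by symmetry every node ends
# up with rank exactly 1/3.
succ = {0: [1], 1: [2], 2: [0]}
rank = pagerank(succ)
print(all(abs(r - 1 / 3) < 1e-9 for r in rank.values()))  # → True
```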

The betweenness and PageRank indexes are applicable to directed graphs without modification. The other indexes are split in two variants: one for incoming edges (in-degree, in-closeness and authorities) and one for outgoing edges (out-degree, out-closeness and hubs). We have developed a tool called COnCUR that can compute these centrality indexes in a HEP, given a model that Wengle, Knorn, and Varagnolo describe in [7].

Lightfoot has used centrality indexes for directed graphs as a means of finding assessment points in a curriculum. Their model uses courses as nodes and prerequisite dependencies as directed edges. Courses that have high out-degree are better suited to introduce new material or conduct baseline assessments. A course with high in-degree is well suited to conduct final assessments of the material that is taught in the preceding courses. Courses with high betweenness serve as bridges between parts of the curriculum, which makes them good candidates for assessing content in nearby communities of courses [26]. Lightfoot also uses eigenvector centrality as a means of finding significant courses, but this index is not applicable to directed graphs.

2.3. Community Detection

Community detection is central in this thesis, because we are interested in finding hierarchical structures in networks of knowledge components, and through this improve our capability of modelling learning in higher education. This subsection introduces the community detection problem and explains why the reported algorithms are useful for finding communities in networks.

2.3.1. Community Structure: A Primer

A graph may have an inherent community structure, which means that there exists some classification that can categorise the nodes as part of a community. A good example of a network that exhibits community structure is a model of a social network. In such a network, each node represents a person and each edge a social relationship, intimate or friendly. Groups of friends tend to be tightly connected with each other, forming a community. Members of one community may know members of other communities, so a community need not be a connected component of the network. A mathematical definition of a community division follows.

Definition 2. Community division c for a graph G. Let G = (V, E) be a graph with n = |V| nodes. Assume a set of community labels L = {l1, . . . , lp} to be specified a priori. Then, a community division c : V → L is the mapping of the nodes into their corresponding community labels. In other words, node i belongs to community ci.

(Note: the hubs and authorities centrality indexes mentioned in the previous subsection can be interpreted as out-eigenvector and in-eigenvector centrality indexes, respectively. See [25] for more information.)

Example 2. Consider the graph in Figure 4. A community division has been defined for this graph; nodes with the same colour are assigned the same community label. From Definition 2, L = {l1, l2, l3} and the community division is given by

    ci = l1 if i is yellow,
         l2 if i is blue,
         l3 if i is red.

Figure 4: A graph with a community division.
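In code, a community division as in Definition 2 is simply a mapping from nodes to labels. The node ids below are illustrative; the colours mirror Example 2.

```python
# Definition 2 as a data structure: a community division c : V -> L is a
# dict mapping each node to its community label.

division = {
    1: "yellow", 2: "yellow",            # community l1
    3: "blue", 4: "blue", 5: "blue",     # community l2
    6: "red", 7: "red",                  # community l3
}

labels = set(division.values())          # the label set L actually used
print(len(labels))  # → 3
```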

2.3.2. Need for Community Detection Algorithms

Real-world networks can be substantially larger than that in Figure 4. For instance, the social media platform Facebook has on the order of 10^9 users, which is far too big for humans to find communities in reasonable time without using a computer.

Finding the optimal community division in an arbitrary graph is an NP-hard problem [5]. An NP-hard problem is a problem that is at least as hard as an NP-complete problem, which in turn is an NP problem X that any other NP problem Y can be polynomially reduced to. A problem that is in NP can be solved in polynomial time on a non-deterministic Turing machine. A short introduction to the Turing machine is given in Section 2.3.3.

2.3.3. NP and the Turing Machine

All problems have an inherent cost in terms of computational time. This cost increases in general with the problem size. Computational complexity theory is about investigating the cost of solving a given problem in terms of computational resources. This theory explains why we cannot find an “easy” solution to the community detection problem [27].

The previous section mentioned the Turing machine, which is a cornerstone of computational complexity theory [27]. The Turing machine is an abstract computer that was postulated by the British mathematician Alan Turing. It consists of a tape that can fit an arbitrary number of symbols from a finite alphabet, a control unit and a head for reading and writing symbols. The control unit is in one state of a finite set of states, which includes a final state. At each step, the Turing machine reads a symbol from the tape at its current position, writes a symbol to the tape (which may be the same symbol), and may move the tape one step left or right depending on its state.

³Community detection on a 10⁹-node graph costs considerable time, even with a state-of-the-art computer running a linear-time algorithm. The author dares not make claims about quantum computers, however.


A problem that is in NP is verifiable in polynomial time on a Turing machine, but need not be solvable in polynomial time on one. An NP problem can, however, be solved in polynomial time on a non-deterministic Turing machine. A non-deterministic Turing machine is similar to the Turing machine mentioned earlier, but it can have more than one transition for each state–symbol pair. This allows the non-deterministic Turing machine to explore multiple computation paths in parallel, whereas the deterministic Turing machine cannot [27, 28].
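To make the tape-head mechanics concrete, a deterministic Turing machine can be simulated in a few lines of Python. The machine below is a hypothetical toy example (not taken from the cited references) that performs unary increment: it scans right over 1s and writes a 1 on the first blank cell.

```python
def run_tm(tape, transitions, state="q0", halt="halt", blank="_"):
    """Simulate a deterministic Turing machine.

    transitions maps (state, symbol) -> (new state, written symbol, head move).
    """
    cells = dict(enumerate(tape))  # tape as a sparse map position -> symbol
    pos = 0
    while state != halt:
        symbol = cells.get(pos, blank)
        state, write, move = transitions[(state, symbol)]
        cells[pos] = write
        pos += 1 if move == "R" else -1
    return "".join(cells[i] for i in sorted(cells)).strip(blank)

# A toy machine for unary increment: skip 1s, write a 1 on the first blank.
increment = {
    ("q0", "1"): ("q0", "1", "R"),
    ("q0", "_"): ("halt", "1", "R"),
}
print(run_tm("111", increment))  # → 1111
```

A non-deterministic machine would instead allow several entries per (state, symbol) key and accept if any branch reaches the halting state.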

For community detection, this means that the quality of any proposed community division can be verified in polynomial time, but that finding the optimal division is generally intractable: an exhaustive search over all divisions is infeasible even on a computer. Instead, people have developed algorithms that use various heuristics, such as optimising an objective function, to find “good” community divisions in polynomial time. Section 2.4 explains a common measure of the goodness of a community division: modularity. Maximising modularity is an NP-complete problem; that is, it is both NP-hard and in NP [5].

2.3.4. The Ordo Notation

The execution time of a community detection algorithm scales with the size of the graph.

A common way to express how the execution time of an algorithm scales with the problem size is the ordo notation O (f(n)). It means that the time cost is on the order of f(n), where f : R^m → R is the fastest-growing term as a function of the problem size n, and m is the number of dimensions of n.

Example 3. Consider a sorted list with n comparable elements (numbers, characters, etc.). A selection of operations and their cost follows.

Accessing the first element costs O (1) because it can be done in constant time.

Searching for an element costs O (log n) because the list is sorted.

Computing the sum of all elements costs O (n) because each element must be accessed once.

If the list is not sorted, it usually costs O (n log n) or O (n²) to sort it, depending on the algorithm.
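For illustration, the operations in Example 3 map directly onto Python's list operations and the `bisect` module; this is a minimal sketch whose list contents are arbitrary.

```python
from bisect import bisect_left

data = sorted([42, 7, 19, 3, 88, 61])  # sorting costs O(n log n) with Timsort

first = data[0]            # O(1): constant-time access
i = bisect_left(data, 42)  # O(log n): binary search, valid because data is sorted
total = sum(data)          # O(n): every element is visited exactly once
print(first, data[i], total)  # → 3 42 220
```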

2.4. Modularity

Suppose that a community detection algorithm has found a community division, i.e., a splitting of the original network into subgraphs, each to be considered a community. The goodness of this division usually cannot be obtained by just looking at the division — a quality measure is needed. The most popular measure of goodness of a given community division is modularity [3, 11, 14, 16, 29], which was first introduced by Newman and Girvan [2].

Modularity is a scalar that typically lies between 0 and 1. A high modularity indicates a strong community structure that is unlikely to have been formed by chance.

2.4.1. The Definition of Modularity

Newman and Girvan define modularity in their paper on community detection, using frac- tions of edges that connect nodes inside communities and fractions of edges that connect to a community [2]. Later work uses an equivalent, but slightly different, definition. Rather


than explicitly using fractions of edges inside communities, Leicht and Newman use the ad- jacency matrix of the graph together with the degrees of the nodes and community labels.

Their version of Newman and Girvan’s definition is given in Definition 3.

Definition 3. Modularity of a graph G given a community division c. Let the undirected graph G = (V, E) have n = |V| nodes, m = |E| edges, and an unweighted adjacency matrix A ∈ N^{n×n}. Let c be a given community division for G.

Then, the modularity of G given c is

Q = (1/2m) Σ_{i,j} (A_ij − k_i k_j / 2m) δ_{c_i,c_j},        (2)

where k_i = Σ_j A_ij = Σ_j A_ji is the degree of node i.

The term in parentheses inside the summation can be expressed as elements of a matrix that Newman introduced in [30]. Its definition is recollected here.

Definition 4. The modularity matrix. Let G = (V, E). Then, the modularity matrix B of G is given by

B = A − (1/2m) k k^T,        (3)

where k is the degree sequence of G.

By construction, the modularity matrix has an eigenvector with all elements equal, associated with the eigenvalue zero.

The mathematical interpretation of Definition 3 requires introducing the null model of a graph, which is stated here.

Definition 5. Null model of a graph G. Let G = (V, E) be an undirected and unweighted graph with degree sequence k (that is, the vector of node degrees). Then, the null model GN = (VN, EN) of G is the subset of the corresponding Erdős-Rényi random graph model [31] that satisfies the following conditions.

1. VN = V; that is, GN and G have an identical set of nodes.

2. E (kN) = k; that is, the expected value of the degree sequence kN of GN must equal the degree sequence k of G.

3. |EN| = |E|; that is, GN and G have the same number of edges.

All edges in EN are randomly chosen, subject to these conditions [11, §3.2.3]. This imposes a probability distribution on the number of edges that connect any two nodes (v_i, v_j) in the null model, with expected value

E (p(i, j)) = k_i k_j / 2m.        (4)

Given Definition 5, the modularity of G given c is the fraction of edges that connect the nodes in the same community minus the expected fraction of such edges in the null model.

Thus, if the fraction of edges that fall within the suggested communities is significantly greater than expected under the null model, the proposed community structure is strong.

For example, consider Figure 5. The graph to the left is the original graph G and the graph to the right is a realisation of its null model GN. G appears to have a community structure: nodes 1, 2, 3 have more connections between themselves than to others, and a similar observation can be made from the other nodes. There is only one edge between

(17)


Figure 5: A graph and a realisation of its corresponding null model.

these two groups of nodes: the one between nodes 3 and 4. Evaluating (2) with the communities {1, 2, 3} and {4, 5, 6} gives Q ≈ 0.36.
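Equation (2) can be evaluated directly for this six-node graph. The sketch below is a straightforward (not optimised) implementation of Definition 3, with the nodes relabelled 0–5.

```python
import numpy as np

def modularity(A, labels):
    """Undirected modularity, Eq. (2): Q = (1/2m) sum_ij (A_ij - k_i k_j / 2m) delta."""
    k = A.sum(axis=1)            # degree sequence
    two_m = A.sum()              # 2m for an undirected adjacency matrix
    same = np.equal.outer(labels, labels)  # delta_{c_i, c_j}
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

# Two triangles joined by the single edge (3, 4) of Figure 5, 0-indexed as (2, 3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

Q = modularity(A, [0, 0, 0, 1, 1, 1])
print(round(Q, 3))  # → 0.357
```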

In the special case where there are exactly two communities in the graph (as is the case with the left graph in Figure 5), Newman proposes an alternative method to compute the modularity. It is stated in Definition 6, where variables are as defined in Definition 3.

Definition 6. Let {g1, g2} be a bipartition of V such that g1 ∪ g2 = V and g1 ∩ g2 = ∅. Introduce s ∈ Z^{n×1}, whose elements are defined as

s_i = +1 if v_i ∈ g1;  s_i = −1 if v_i ∈ g2.

Then, the modularity is given by

Q = (1/4m) s^T B s,        (5)

where B ∈ R^{n×n} is the modularity matrix of the graph. Refer to Definition 4 for the modularity matrix. B_ij is the difference between the actual number of edges between nodes (i, j) and the expected number of such edges in the null model of the graph. Refer to Definition 5 for more information about the null model.
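For the six-node graph of Figure 5, (5) gives the same value as (2). The following sketch (nodes relabelled 0–5) verifies this numerically.

```python
import numpy as np

# Figure 5's graph: two triangles joined by the edge (3, 4), 0-indexed as (2, 3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

k = A.sum(axis=1)
two_m = A.sum()                         # 2m
B = A - np.outer(k, k) / two_m          # modularity matrix, Eq. (3)
s = np.array([1, 1, 1, -1, -1, -1])     # bipartition indicator vector
Q = s @ B @ s / (2 * two_m)             # Eq. (5): 1/(4m) = 1/(2 * 2m)
print(round(Q, 3))  # → 0.357
```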

A generalised version of Definition 3 exists for weighted graphs, where Aij is equal to the weight of the edge (i, j) if (i, j) ∈ E and zero otherwise. Similarly, ki is the sum of weights of all edges (i, j) ∈ E with i held fixed.

The problem with modularity as Newman and Girvan defined it is that it is limited to undirected graphs. Leicht and Newman use an extension of modularity that works with directed graphs [14].

Definition 7. Let G = (V, E) be a directed graph and c a community division. Then, modularity is given by

Q = (1/m) Σ_{i,j} (A_ij − k_i^out k_j^in / m) δ_{c_i,c_j},        (6)

where k_i^out = Σ_j A_ij is the out-degree of node i and k_j^in = Σ_i A_ij is the in-degree of node j.

(6) is similar to (2), with a few differences. The adjacency matrix of a directed graph counts each edge only once, so the sum of all elements in A equals the number of edges (or the sum of all weights); hence the prefactor is 1/m rather than 1/2m. Also, the null model of a directed graph preserves the expected out-degree and in-degree sequences, because a distinction is made between incoming edges and outgoing edges.
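A direct implementation of (6) follows. The two directed 3-cycles joined by one edge form a hypothetical toy graph chosen here for illustration, not an example from the cited papers.

```python
import numpy as np

def directed_modularity(A, labels):
    """Directed modularity, Eq. (6): Q = (1/m) sum_ij (A_ij - kout_i kin_j / m) delta."""
    m = A.sum()                     # each directed edge is counted once
    kout = A.sum(axis=1)            # out-degrees
    kin = A.sum(axis=0)             # in-degrees
    same = np.equal.outer(labels, labels)
    return ((A - np.outer(kout, kin) / m) * same).sum() / m

# Two directed 3-cycles joined by the edge 2 -> 3 (hypothetical toy graph)
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
    A[i, j] = 1

Q = directed_modularity(A, [0, 0, 0, 1, 1, 1])
print(round(Q, 3))  # → 0.367
```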

Leicht and Newman also define an alternative way to compute modularity when there are exactly two communities. It is stated as follows.


Definition 8. Let {g1, g2} be a bipartition of V. Introduce s as in Definition 6. Then, modularity is given by

Q = (1/4m) s^T (B + B^T) s,        (7)

where B is the modularity matrix of the directed graph. The elements of B are given by

B_ij = A_ij − k_i^out k_j^in / m.

(7) uses B + B^T rather than B because the spectral method by Newman assumes that B is symmetric. If B is asymmetric, adding B^T restores symmetry and enables use of the spectral algorithm [14].

Kim et al. propose a variant of modularity that will be the topic of the next section.

They claim that their definition takes the direction of the edges into greater account than the directed modularity used by Leicht and Newman [3], as explained in more detail below.

2.4.2. LinkRank: Alternative Modularity for Directed Graphs

The original definition of modularity (as in Definition 3) assumed an undirected graph.

Leicht and Newman used a variant of modularity that was extended to support directed graphs [14]. Kim, Son, and Jeong point out a problem with this extension of modularity, which is that the direction of the flow between two nodes might not be considered correctly.

This is demonstrated in the following example.

Example 4. Consider the directed graphs in Figure 6. Nodes 1 and 1′ both have equal in-degree and out-degree, as do nodes 2 and 2′. There is a clear flow in the top graph, but not in the bottom graph.


Figure 6: Illustration of the problem with generalised modularity in directed graphs. This figure is a TikZ reconstruction of Figure 1 in [3].

Since the summation in the directed modularity proposed by Leicht and Newman,

Σ_{i,j} (A_ij − k_i^out k_j^in / m) δ_{c_i,c_j},

can be rewritten as

(1/2) Σ_{i,j} ((A_ij + A_ji) − (k_i^out k_j^in + k_i^in k_j^out) / m) δ_{c_i,c_j},

it can be seen that the direction of the links might not be distinguishable [3].

To overcome this problem, Kim et al. define an alternative to the directed modularity that was used by Leicht and Newman.


Definition 9. LinkRank modularity. Let G = (V, E) be a graph, directed or not. Define the Google matrix G as the stochastic matrix⁴ with elements

G_ij = α A_ij / k_i^out + (1 − α)/n   if k_i^out > 0,
G_ij = 1/n                            if k_i^out = 0,        (8)

where 1 − α is the probability that a random walker jumps to a random node instead of following an edge to a (succeeding) neighbour. If a node lacks (succeeding) neighbours, the walker jumps to any node with equal probability. Then, given the stationary vector π that solves the PageRank equation

π^T = π^T G        (9)

with ‖π‖_1 = 1, LinkRank modularity is given by

Q_lr = Σ_{i,j} (L_ij − E (L_ij)) δ_{c_i,c_j},        (10)

where L_ij = π_i G_ij is the LinkRank of the transition (i, j) and E (L_ij) = π_i π_j is the expected value of LinkRank in the null model. This null model differs from that of Definition 5 [3].

(Recall that ‖u‖_1 = Σ_i |u_i|.)

Loosely speaking, LinkRank modularity is the fraction of time that a random walker spends moving inside any community given the teleportation probability 1 − α minus the expected value of this fraction in the null model. A thought experiment is given to gain an intuition of how a random walk works.

Imagine a pedestrian who is lost in a city. Each street intersection is a node and each edge connects adjacent intersections. The pedestrian begins at an intersection and follows a random street. Each time they reach an intersection, they continue in a random direction, possibly the one they came from, with every direction chosen with equal probability. In the directed case, the streets are one-way: the pedestrian cannot go against the direction of the streets, which also means that they can get stuck.

The choice of 1 − α determines how closely LinkRank should follow the graph structure.

Higher values make a random walker more likely to jump randomly, and vice versa. This thesis uses 1 − α = 0.15, which is in line with many other researchers [3]. In the lost pedestrian analogy, the pedestrian chooses to teleport to a random point of intersection instead of taking a random street with probability 1 − α.

As stated in Definition 9, Kim et al. use a different null model than Newman and Girvan do. Rather than preserving the expected degree sequence of the graph (see Definition 5), the LinkRank null model is an Erdős-Rényi random graph where the expected PageRank sequence π is preserved. Using the degree-preserving null model does not work because the degree centrality index is not related to the random walk process [3].
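As a sketch of Definition 9 (not the reference implementation of Kim et al.), the Google matrix can be built from (8), the PageRank vector obtained by power iteration of (9), and Q_lr evaluated from (10). The toy graph of two directed 3-cycles joined by one edge is a hypothetical example.

```python
import numpy as np

def linkrank_modularity(A, labels, alpha=0.85, iters=1000):
    """LinkRank modularity (Definition 9) via power iteration on the Google matrix."""
    n = len(A)
    kout = A.sum(axis=1)
    G = np.empty((n, n))
    for i in range(n):
        if kout[i] > 0:
            # follow an out-edge with prob. alpha, teleport with prob. 1 - alpha
            G[i] = alpha * A[i] / kout[i] + (1 - alpha) / n
        else:
            G[i] = 1.0 / n              # dangling node: teleport uniformly
    pi = np.ones(n) / n                 # power iteration for pi^T = pi^T G
    for _ in range(iters):
        pi = pi @ G
    L = pi[:, None] * G                 # LinkRank L_ij = pi_i G_ij
    E = np.outer(pi, pi)                # null-model expectation pi_i pi_j
    same = np.equal.outer(labels, labels)
    return ((L - E) * same).sum()

# Two directed 3-cycles joined by the edge 2 -> 3 (hypothetical toy graph)
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
    A[i, j] = 1

Qlr = linkrank_modularity(A, [0, 0, 0, 1, 1, 1])
print(Qlr > 0)  # → True
```

Because G is row-stochastic, the iterate stays normalised, so no renormalisation step is needed inside the loop.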

2.4.3. The Pedagogical Interpretation of Modularity

Let G = (V, E) be a simple graph and GN = (VN, EN) be the null model of G as given in Definition 5. Then, the term −k_i k_j / 2m in (2) is the negative of the expected number of edges between nodes i and j when G is unweighted and undirected.

⁴A stochastic matrix is a matrix with nonnegative elements in which every row sums to one.


Figure 7: The Knowledge Component Flow Graph of linear algebra. The names of the KCs are given in Table 1.

This Thesis’ Interpretation Consider the flow of KCs in a university programme in electrical engineering. In the first year, the students take courses on the basics of electronics, some on university-level mathematics, and perhaps a few courses on other topics. Over this first year, students should become familiar with a set of facts, concepts and procedures, or KCs, related to the fields of electronics and mathematics.

If we were to model the KC flow of this program as a directed graph and apply a community detection algorithm to it, the resulting communities could be interpreted as topics of knowledge within the program. Modularity would then be a measure of the goodness of this division into these topics. Values of Q near 0 would indicate that the detected knowledge topics make little to no sense semantically, while higher values of Q suggest that the detected classification of knowledge topics is more likely to be semantically reasonable.

Let us take a look at an example. Consider the KC flow graph in Figure 7. Each node corresponds to a KC in linear algebra and each edge says that the source KC is required to learn the target KC. This model was obtained from the Linear Algebra KC dataset in Section 3.3.1.

The professor that compiled this dataset provided a classification of the KCs, which can be interpreted as a community division. The communities are identified by their marker shapes in Figure 7. For instance, the classification places KC 22 in the same community as KC 27, but in a different community than KC 7. Table 1 shows the (possibly abbreviated) names of the KCs. This prior community division has Q = 0.471, taking into account that some communities (square and star markers) are not even weakly connected⁵.

The linear algebra dataset could be a description of the contents of a course in linear algebra. A community division with high modularity is a (potentially) good suggestion of how the KCs should be grouped together in the course syllabus. Aggregating the graph such that each community is one node then gives an idea of how these groups should be ordered. If the aggregated graph contains cycles, some groups may need to be taught together, at least in part.
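This aggregation step can be sketched as follows. The labelled toy graph below is hypothetical (not the linear algebra dataset); communities are collapsed into single nodes and the result is tested for cycles with Kahn's algorithm.

```python
from collections import defaultdict

def aggregate(edges, labels):
    """Collapse each community into one node, keeping inter-community edges."""
    return {(labels[u], labels[v]) for u, v in edges if labels[u] != labels[v]}

def has_cycle(nodes, edges):
    """Kahn's algorithm: a DAG can be fully peeled off in topological order."""
    indeg = {n: 0 for n in nodes}
    out = defaultdict(list)
    for u, v in edges:
        out[u].append(v)
        indeg[v] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for v in out[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen < len(nodes)  # some node was never peeled -> cycle

# Toy KC graph: community 0 feeds community 1, which feeds community 2
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]
labels = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}
agg = aggregate(edges, labels)
print(sorted(agg), has_cycle({0, 1, 2}, agg))  # → [(0, 1), (1, 2)] False
```

An acyclic aggregated graph admits a topological order of the topics; a cycle would indicate groups that must be taught together, at least in part.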

⁵Splitting them into weakly connected components would raise modularity to 0.503.


Table 1: The KCs in the Linear Algebra dataset.

 №  KC                       №  KC                          №  KC
 1  Vectors                 15  Least Squares              28  Compute P-Inverse
 2  Dot Product             16  Gaussian Elimination       29  Triangular Matrices
 3  Cross Product           17  Bilinear Forms             30  Symmetric Matrices
 4  Matrices                18  Positive Definiteness      31  Cayley-Hamilton Thm
 5  Matrix-Vector Mult.     19  Symmetric Pos. Def.        32  Linear Comb: Vectors
 6  Matrix Multiplication   20  Pos. Def. Matrices         33  Linear Subspace
 7  Matrix Inverse          21  Basis Change/Similarity    34  Linear Subspace Basis
 8  Matrix Inv. by G-J      22  Eigenvalues & -vectors     35  Vector Orthogonality
 9  Determinant             23  Diagonalisable Matrices    36  Matrix Orthogonality
10  Compute Det.            24  Jordan Form                37  Orthonormal Bases
11  Linear System           25  QR Decomposition           38  Matrix Kernel, Image
12  Rouché-Capelli Thm      26  Singular Value Decomp.     39  Matrix Rank
13  Solve Full-Rank Sys.    27  Pseudo Inverse             40  Kernel-Image Link
14  Solve Underdet. Sys.

2.4.4. The Resolution Limit of Modularity

As Bonald et al. mentioned, modularity is a widely used measure of community structure.

However, the community division obtained from community detection algorithms might have merged smaller communities into bigger ones, because doing so increases modularity. Fortunato and Barthélemy have investigated this resolution limit using two example graphs [32]. They are given in Figure 8.

The first example graph is a ring of n cliques with m nodes each, where n is an even number, as in Figure 8a. The entire graph has L = nm(m−1)/2 + n edges, where the term nm(m−1)/2 counts the edges that connect nodes in the same clique. Each clique connects to its two nearest neighbours with one edge. Fortunato and Barthélemy showed that maximising modularity is expected to find single cliques rather than pairs of cliques only if n < √(2L) [32]. Blondel et al. illustrated this by applying Louvain to one ring of 30 5th-order cliques. The first pass found the individual cliques with Q = 0.876, but the second pass found pairs of cliques with Q = 0.888 [1].
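Blondel et al.'s figures can be reproduced from Definition 3 alone. The construction below is a sketch that attaches each inter-clique edge to the first node of each clique; the modularity values do not depend on this choice.

```python
import numpy as np

def modularity(A, labels):
    """Undirected modularity, Eq. (2)."""
    k = A.sum(axis=1)
    two_m = A.sum()
    same = np.equal.outer(labels, labels)
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

# Ring of n = 30 cliques with m = 5 nodes each, joined by single edges
n_cliques, m_nodes = 30, 5
N = n_cliques * m_nodes
A = np.zeros((N, N))
for c in range(n_cliques):
    base = c * m_nodes
    for i in range(m_nodes):                      # clique-internal edges
        for j in range(i + 1, m_nodes):
            A[base + i, base + j] = A[base + j, base + i] = 1
    nxt = ((c + 1) % n_cliques) * m_nodes         # one edge to the next clique
    A[base, nxt] = A[nxt, base] = 1

singles = [i // m_nodes for i in range(N)]        # one community per clique
pairs = [i // (2 * m_nodes) for i in range(N)]    # neighbouring cliques merged
q_singles = modularity(A, singles)
q_pairs = modularity(A, pairs)
print(round(q_singles, 3), round(q_pairs, 3))  # → 0.876 0.888
```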

The second example graph is a connected pair of cliques with m nodes each and a connected pair of cliques with p < m nodes each as in Figure 8b. If p is sufficiently small, Q will be greater if the p-cliques are in the same community rather than in different ones [32].

To overcome the resolution limit, Fortunato and Barthélemy used a community detection algorithm to find an initial community division, and then applied the algorithm again, treating each community as a separate graph. While modularity decreased after this procedure, so did the overall size of the communities [32].

A different approach to overcome the resolution limit is to introduce a parameter called resolution, denoted γ. This parameter is a nonnegative real number and controls the

“penalty” that the null model imposes on community size. The resulting formulation of modularity becomes

Q = (1/2m) Σ_{i,j} (A_ij − γ k_i k_j / 2m) δ_{c_i,c_j},        (11)

where the variables are as in Definition 3.
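To see the effect of γ, the sketch below evaluates (11) on the ring of 30 5-node cliques from Section 2.4.4: at γ = 1 the merged pairs of cliques score higher, while γ = 2 favours the single cliques.

```python
import numpy as np

def modularity(A, labels, gamma=1.0):
    """Modularity with a resolution parameter, Eq. (11)."""
    k = A.sum(axis=1)
    two_m = A.sum()
    same = np.equal.outer(labels, labels)
    return ((A - gamma * np.outer(k, k) / two_m) * same).sum() / two_m

# Ring of 30 cliques with 5 nodes each, joined by single edges
n_c, m_n = 30, 5
N = n_c * m_n
A = np.zeros((N, N))
for c in range(n_c):
    b = c * m_n
    for i in range(m_n):
        for j in range(i + 1, m_n):
            A[b + i, b + j] = A[b + j, b + i] = 1
    nxt = ((c + 1) % n_c) * m_n
    A[b, nxt] = A[nxt, b] = 1

singles = [i // m_n for i in range(N)]
pairs = [i // (2 * m_n) for i in range(N)]
print(modularity(A, pairs, 1.0) > modularity(A, singles, 1.0))   # → True
print(modularity(A, singles, 2.0) > modularity(A, pairs, 2.0))   # → True
```

Raising γ increases the null-model penalty per pair of nodes, so large communities pay more and smaller communities become optimal.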

References
