Analytical and Iterative Methods of Computing PageRank of Networks

ISBN 978-91-7485-482-4 ISSN 1651-4238

Address: P.O. Box 883, SE-721 23 Västerås. Sweden Address: P.O. Box 325, SE-631 05 Eskilstuna. Sweden E-mail: info@mdh.se Web: www.mdh.se

Pitos Seleka Biganda

2020

School of Education, Culture and Communication

Copyright © Pitos Seleka Biganda, 2020

ISBN 978-91-7485-482-4

ISSN 1651-4238

Printed by E-Print AB, Stockholm, Sweden

Academic dissertation

which, for the award of the degree of Doctor of Philosophy in Mathematics/Applied Mathematics at the School of Education, Culture and Communication, will be publicly defended on Friday, 20 November 2020, at 10.15 in Kappa (+ Zoom), Mälardalen University, Västerås.

Faculty opponent: Professor Oleg Seleznjev, Umeå University

case, a corresponding formula for each of the two variants of PageRank is provided.

Chapter 3 explores the relationships between three known variants of PageRank (ordinary PageRank, lazy PageRank and random walk with backstep PageRank) in terms of their convergence and the consistency of their rank scores across different graph structures, with reference to the PageRank parameters: the damping factor c and the backstep parameter β.

In Chapter 4, we discuss numerical methods used in solving the PageRank problem as a linear system and evaluate some stopping criteria that can be employed in such methods.

Finally, in Chapter 5, we address the PageRank problem as a first order perturbed Markov chain problem and study the perturbation analysis for stationary distributions of Markov chains with damping component. We illustrate our results on asymptotic perturbation analysis by using different computational examples.

I dedicate this thesis to Dismas Seleka Kibika, my brother who built a great foundation for my entire education.

Acknowledgements

First, I would like to express my warm thanks to my supervisor Professor Sergei Silvestrov and co-supervisor Dr. Christopher Engström, both of whom introduced me to PageRank, which was then a very new research area to me. Your valuable comments and guidance helped to shape my ideas as I was writing this thesis. Many thanks to my other co-supervisor Professor emeritus Dmitrii Silvestrov for your tireless advice, comments and timely feedback, and for your lectures on Markov chains and related topics, which have contributed much to the success of this work. I am also grateful to my co-supervisors Dr. Milica Rančić and Professor Anatoliy Malyarenko for the motivational lectures, seminars, advice and guidance they have given me during this journey.

I especially thank my fellow doctoral student Benard Abola (with support from his supervisors Prof. John Mango and Dr. Godwin Kakuba) for your valuable cooperation. To all other current or former fellow students, I thank you all for your kind support and being there for me especially during the difficult times.

I extend my deepest gratitude to my family, my lovely wife Monica Dani and our three children Rebecca, Meshack and Shadrack for being supportive and for encouraging me not to give up. I would like to express my gratitude to the Swedish International Development Cooperation Agency (Sida), International Science Programme (ISP) in Mathematical Sciences (IPMS) and Sida Bilateral Research Programme for research and education capacity building in Mathematics in Tanzania for the financial support. Special thanks to Dr. Bengt-Ove Turesson, Linköping University, Sweden and Dr. Sylvester Rugeihyamu, University of Dar es Salaam, Tanzania for your collaboration to bring into action the Mathematics sub-programme of the Tanzania-Sida Bilateral Research Programme, a project that has supported me.

I am also grateful to the research environment Mathematics and Applied Mathematics (MAM), Division of Applied Mathematics, Mälardalen University for providing an excellent and inspiring research environment throughout my studies.

Above all, I thank the Almighty God for the many blessings upon my life and for opportunities that motivate me to strive to be the best that I can be.

Västerås, October 2020

Pitos Seleka Biganda

thesis, which gives ranks to webpages in the world wide web in order of their importance. It does so by assuming that webpages linked by many other important pages are themselves likely to be important. Thus, PageRank is thought to correlate well with human concepts of importance.

The world wide web is a huge data and information repository that contains different formats and types of data. Thus, it may be considered an information network that can be described by Markov chain models (stochastic models describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event). In such models, a Markov chain associated with the corresponding web-link graph represents the information network, and the ranking of the webpages (nodes) is given by the stationary distribution of that Markov chain (here also called the PageRank vector).
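To make the idea of ranking by a stationary distribution concrete, the following is a minimal sketch of power iteration on a tiny link graph. The four-node graph, the value c = 0.85 for the damping factor, the uniform jump to any node, the tolerance and the function name `pagerank` are all assumptions chosen for illustration; they are not taken from the thesis.

```python
def pagerank(links, c=0.85, tol=1e-12, max_iter=1000):
    """Approximate the stationary distribution of a damped random walk.

    links: dict mapping each node to the list of nodes it links to
           (a hypothetical input format chosen for this sketch).
    c:     damping factor; 0.85 is an assumed, commonly quoted value.
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # start from the uniform distribution
    for _ in range(max_iter):
        # Probability mass on nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links[v])
        new = {v: (1.0 - c) / n + c * dangling / n for v in nodes}
        # Each node passes a share c of its rank equally to the nodes it links to.
        for v in nodes:
            for w in links[v]:
                new[w] += c * rank[v] / len(links[v])
        diff = sum(abs(new[v] - rank[v]) for v in nodes)  # 1-norm of the change
        rank = new
        if diff < tol:
            break
    return rank


# Hypothetical web of four pages: C is linked to by A, B and D.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
```

In this toy graph the page with the most incoming links, C, receives the highest rank, while D, which no page links to, keeps only the share (1 - c)/n that comes from the uniform jump.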

PageRank has gained great fame, as it is the basic method of how a search engine weighs the information value of different websites against each other. For instance, when a person is interested in getting certain information from the internet, he/she is most likely going to use a search engine to look for such information. Moreover, the person will be interested in getting the most relevant information. The PageRank algorithm is used to sort out and list the most relevant pages after the search.

Apart from the primary use of PageRank to analyse and rank search results in a search engine, several other applications of PageRank exist to date. To mention but a few, they include online advertisement ranking, clustering and classification of webpages, word sense disambiguation, community detection, image tagging and tracking extinctions in the study of food webs, which are the complex networks of who eats whom in an ecosystem. Other applications are ranking messages and suggesting friends in social networks, and recommendation systems (e.g. finding new movies that match one's taste).

Considering the many emerging applications of PageRank, this thesis is dedicated to the study of methods of PageRank computation with emphasis on some networks, mainly consisting of the so-called simple line (an example of a simple information network where information flows in one direction) and complete graphs (information networks where information flows uniformly in all directions in a closed system). The thesis also analyses the information value (PageRank value) of each node of the network when there are branches of information flow to or from the main network.

This work was financially supported by the Swedish International Development Cooperation Agency and International Science Programme at Uppsala University, Sweden, under the Sida bilateral programme between Sweden and Tanzania in research capacity building in Mathematics with University of Dar es Salaam in Dar es Salaam, Tanzania and Mälardalen University in Västerås, Sweden.

Contents

List of Papers

1 Introduction
1.1 Preliminaries
1.1.1 Definitions from graph theory
1.1.2 Overview of Markov process and stationary distribution
1.1.3 Random walks on networks
1.2 PageRank and its computation
1.2.1 Common techniques of computing PageRank
1.3 Summaries of main chapters
1.3.1 Chapter 2
1.3.2 Chapter 3
1.3.3 Chapter 4
1.3.4 Chapter 5
References

2 Analytical Approach of Computing PageRank of Some Graph Structures
2.1 Introduction
2.2 Preliminaries
2.3 Changes in traditional and lazy PageRanks when connecting the simple line with multiple outside nodes
2.3.1 Connecting the simple line with multiple links from outside nodes to one node in the line
2.3.2 Connecting a simple line with multiple links from multiple outside nodes to the line
2.3.3 Connecting the simple line with two links from the line to two outside nodes
2.4 Changes in traditional and lazy PageRanks when connecting a simple line with two links from the line to two complete graphs
2.4.1 A comparison of traditional PageRank and lazy PageRank for the line connected with complete graphs
2.5 Conclusions
References

3 PageRank of Different Random Walks on Graphs
3.1 Introduction
3.2 Notations and basic concepts
3.3 Mathematical relationships between variants of PageRank
3.3.1 Ordinary PageRank $\vec{\pi}^{(t)}$
3.3.2 Generalized lazy PageRank $\vec{\pi}^{(g)}$
3.3.3 Random walk with backstep PageRank $\vec{\pi}^{(b)}$
3.4 Rates of convergence for the variants of PageRank
3.5 Comparison of ranking behaviour for the variants of PageRank

3.5.1 Comparing PageRank of simple networks
3.5.2 Numerical experiments for large network
3.6 Conclusion
References

4 Evaluation of Stopping Criteria for Ranks in Solving Linear Systems
4.1 Introduction
4.2 Preliminaries
4.3 PageRank problem as a linear system of equations
4.4 Iterative methods of solving linear systems
4.4.1 Jacobi method
4.4.2 Successive overrelaxation (SOR) method
4.4.3 Power series method
4.5 Stopping criteria for solving linear systems
4.5.1 Geometric distance measures
4.5.2 Rank distance measures
4.5.2.1 Kendall’s τ rank correlation coefficient
4.6 Numerical experimentation of stopping criteria
4.6.1 Convergence of stopping criterion
4.6.2 Quantiles
4.6.3 Kendall’s correlation coefficient as stopping criterion
4.7 Conclusions
References

5 PageRank Problem and Perturbation Analysis for Stationary Distributions of Markov Chains with Damping Component
5.1 Introduction
5.1.1 PageRank as a stationary distribution of perturbed Markov chains
5.2 PageRank problem of first order perturbed Markov chains
5.3 Markov chains with damping component (MCDC)
5.3.1 Stochastic modelling of Markov chains with damping component
5.3.2 Regenerative properties of Markov chains with damping component
5.3.3 Renewal type equations for transition probabilities of Markov chains with damping component
5.4 Stationary distributions of Markov chains with damping component
5.4.1 Stationary distributions of Markov chains $X_{\varepsilon,n}$ and $Y_{\varepsilon,n}$
5.4.2 Stationary distribution of Markov chain $X_{0,n}$
5.5 Asymptotic expansions for stationary distributions of perturbed Markov chains with damping component
5.5.1 Asymptotic expansions for stationary distributions of regularly perturbed Markov chains with damping component
5.5.2 Asymptotic expansions for stationary distributions of singularly perturbed Markov chains with damping component
5.5.2.1 Case I: Phase space without transient states
5.5.2.2 Case II: Phase space with transient states
5.6 Numerical examples of asymptotic expansions
5.6.1 Example 1

5.6.2 Example 2
5.6.3 Example 3
5.6.4 Conclusion
References

List of Papers

This thesis is based on the following papers:

Paper A. Pitos Seleka Biganda, Benard Abola, Christopher Engström, Sergei Silvestrov, PageRank, connecting a line of nodes with multiple complete graphs, Proceedings of the 17th Applied Stochastic Models and Data Analysis International Conference with the 6th Demographics Workshop. London, UK, 2017 (C. H. Skiadas, ed.), ISAST: International Society for the Advancement of Science and Technology, 2017, 113–126.

Paper B. Pitos Seleka Biganda, Benard Abola, Christopher Engström, John Magero Mango, Godwin Kakuba, Sergei Silvestrov, Traditional and lazy PageRanks for a line of nodes connected with complete graphs, Stochastic Processes and Applications (S. Silvestrov, M. Rančić, A. Malyarenko, eds.), Springer Proceedings in Mathematics & Statistics, 271, Springer, Cham, 2018, Chapter 17, 391–412.

Paper C. Pitos Seleka Biganda, Benard Abola, Christopher Engström, John Magero Mango, Godwin Kakuba, Sergei Silvestrov, Exploring the relationship between ordinary PageRank, lazy PageRank and random walk with backstep PageRank for different graph structures, Data Analysis and Applications 3 (A. Makrides, A. Karagrigoriou, C. H. Skiadas, eds.), Computational, Classification, Financial, Statistical and Stochastic Methods, 5, ISTE, Wiley, 2020, Chapter 3, 53–73.

Paper D. Benard Abola, Pitos Seleka Biganda, Christopher Engström, Sergei Silvestrov, Evaluation of stopping criteria for ranks in solving linear systems, Data Analysis and Applications 1 (C. H. Skiadas, J. R. Bozeman, eds.), Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, 2, ISTE, Wiley, 2019, Chapter 10, 137–152.

Paper E. Benard Abola, Pitos Seleka Biganda, Dmitrii Silvestrov, Sergei Silvestrov, Christopher Engström, John Magero Mango, Godwin Kakuba, Perturbed Markov chains and information networks, arXiv:1901.11483v3 [math.PR], 2 May 2019, 60 p. (2019).

Paper F. Dmitrii Silvestrov, Sergei Silvestrov, Benard Abola, Pitos Seleka Biganda, Christopher Engström, John Magero Mango, Godwin Kakuba, Perturbation analysis for stationary distributions of Markov chains with damping component, Algebraic Structures and Applications (S. Silvestrov, A. Malyarenko, M. Rančić, eds.), Springer Proceedings in Mathematics and Statistics, 317, Springer, Cham, 2020, Chapter 38, 903–933.

Paper G. Benard Abola, Pitos Seleka Biganda, Sergei Silvestrov, Dmitrii Silvestrov, Christopher Engström, John Magero Mango, Godwin Kakuba, Nonlinearly perturbed Markov chains and information networks, Proceedings of the 18th Applied Stochastic Models and Data Analysis International Conference with the Demographics 2019 Workshop. Florence, Italy, 2019 (C. H. Skiadas, ed.), ISAST: International Society for the Advancement of Science and Technology, 2019, 51–79.

Paper H. Dmitrii Silvestrov, Sergei Silvestrov, Benard Abola, Pitos Seleka Biganda, Christopher Engström, John Magero Mango, Godwin Kakuba, Perturbed Markov chains with damping component, Methodology and Computing in Applied Probability, 31 p. (2020). https://doi.org/10.1007/s11009-020-09815-9

Reprints were made with permission from the respective publishers.

Parts of this thesis have been presented in communications given at the following international conferences:

1: ASMDA2017 - 17th Applied Stochastic Models and Data Analysis International Conference with the 6th Demographics Workshop, 6 - 9 June 2017, London, UK.

2: SPAS2017 - International Conference on Stochastic Processes and Algebraic Structures - From Theory Towards Applications, 4 - 6 October 2017, Västerås and Stockholm, Sweden.

3: SMTDA2018 - 5th Stochastic Modeling Techniques and Data Analysis International Conference with Demographics Workshop, 12 - 15 June 2018, Chania, Crete, Greece.

4: ASMDA2019 - 18th Applied Stochastic Models and Data Analysis International Conference with the Demographics 2019 Workshop, 11 - 14 June 2019, Florence, Italy.


The following papers are not included in the thesis:

• Benard Abola, Pitos Seleka Biganda, Christopher Engström, John Magero Mango, Godwin Kakuba, Sergei Silvestrov, PageRank in evolving tree graphs, Stochastic Processes and Applications (S. Silvestrov, M. Rančić, A. Malyarenko, eds.), Springer Proceedings in Mathematics & Statistics, 271, Springer, Cham, 2018, Chapter 16, 375–390.

• Benard Abola, Pitos Seleka Biganda, Christopher Engström, John Magero Mango, Godwin Kakuba, Sergei Silvestrov, Updating of PageRank in evolving tree graphs, Data Analysis and Applications 3 (A. Makrides, A. Karagrigoriou, C. H. Skiadas, eds.), Computational, Classification, Financial, Statistical and Stochastic Methods, 5, ISTE, Wiley, 2020, Chapter 2, 35–51.

• Dmitrii Silvestrov, Sergei Silvestrov, Benard Abola, Pitos Seleka Biganda, Christopher Engström, John Magero Mango, Godwin Kakuba, Coupling and ergodic theorems for Markov chains with damping component, Theory of Probability and Mathematical Statistics, 101 (2019), no. 2, 212–231.


Chapter 1

Introduction

Information retrieval has been at centre stage for more than 60 years [48]. In recent years, the demand for reliable and accurate information has seemed to overwhelm information retrieval systems [2, 3]. This is because people increasingly delegate their tasks to computers and expect results in real time. It is well known that search engines can handle this challenge, but one should be aware that the primary goal of such systems is to retrieve all items related to the query, while judging the quality of relevance is left to the user. Many search engines exist today, but the largest ones (based on worldwide search engine market share) are Google, Bing, Baidu, Yahoo!, Yandex, Ask, DuckDuckGo, Naver, AOL, and Seznam [30]. Other search engines include Lycos, WebCrawler and many more [45].

With the exception of Google, many search engines use keyword-based methods and rank pages according to how often the search terms occur in a page, or how strongly the search terms are associated within each resulting page. The Google search engine, on the other hand, uses the PageRank algorithm, which analyses human-generated links under the assumption that web pages linked from many important pages are themselves likely to be important. In this way, PageRank is thought to correlate well with human concepts of importance, and it is this fact that makes Google the leading search engine in the world [47, 48]. However, despite this creativity, spammers have been trying to manipulate Google and other search engines for their own gain. Essentially, web spam refers to hyperlinked pages on the World Wide Web developed intentionally to mislead search engines. This alone has created research interest in the search engine domain, with the aim of obtaining authentic rankings. Examples of such studies include TrustRank [23], topical TrustRank [60] and anti-TrustRank [33]. For other studies on anti-web-spam algorithms, we refer the reader to [11, 12, 27, 39, 43, 62].

This thesis is dedicated to the study of PageRank computations, focusing on both analytical and numerical methods for some representative networks (or graphs). Specifically, the following are the objectives of the thesis.

1. To develop analytical formulae for PageRank of a line graph subject to addition of some number of nodes (or other types of graph structures) to the graph by considering two variants of PageRank.

2. To study relationships between variants of PageRank in terms of their numerical computations.

3. To evaluate stopping criteria on some iterative methods used in solving the PageRank problem.

4. To formulate the PageRank problem as a first order perturbed Markov chain problem and study the perturbation analysis for stationary distributions of Markov chains with damping component.

The thesis is organised as follows. In the remaining part of this chapter, we will give a short summary of some mathematical background that addresses the main topic of the thesis. This includes a summary of some key and well-known definitions and results from graph theory, random walks on graphs and matrix theory. Then follows a short introduction to the main topic, PageRank: historical background, definitions and methods of calculation. The chapter is concluded with summaries of the main chapters of the thesis.

In Chapter 2, which is based on Papers A and B, we will give closed form formulae of ordinary PageRank and lazy PageRank for some specific simple line graphs. We will consider different cases of changes made to the simple line graph and provide corresponding formulae for the two variants of PageRank in each case. In addition, we will compare the ranking behaviour of the two variants of PageRank.

In Chapter 3, which is based on Paper C, we will explore the relationships between three known variants of PageRank: ordinary PageRank, lazy PageRank and random walk with backstep PageRank, in terms of their convergence and consistency in rank scores for different graph structures with reference to PageRank parameters, the damping factor c and backstep parameter β.


In Chapter 4, which is based on Paper D, we will discuss some stopping criteria that can be employed in numerical methods used in solving the PageRank problem as a linear system.

Finally, Chapter 5, which is based on Papers E, F, G and H, addresses the PageRank problem as a first order perturbed Markov chain model and discusses perturbation analysis for the stationary distributions corresponding to this Markov chain. In addition, computational examples illustrating results on asymptotic perturbation analysis are presented.

1.1 Preliminaries

1.1.1 Definitions from graph theory

In this section, we highlight some key definitions and notation that the reader will need in order to follow the rest of this thesis. The material presented here is partly based on [56] and [58], but any textbook on graph theory covers these topics. We start by defining a graph as given below.

Definition 1.1.1. A graph, denoted by $G = (V, E)$, is defined as a set $V$ of vertices (or nodes) and a set $E$ of pairs of vertices called edges that represent relations between vertices. A single edge between two vertices $v_a$ and $v_b$ is written as $(v_a, v_b)$, and the total numbers of vertices and edges in $G$ are denoted by $|V|$ and $|E|$, respectively, where $|A|$ denotes the cardinality of a set $A$.

• Directed graph: A graph is called a directed graph if the edge set contains ordered pairs of vertices such that $(v_a, v_b)$ represents an edge starting from vertex $v_a$ and ending at vertex $v_b$.

• Loop: A loop is an edge whose endpoints are equal.

• Simple graph: A simple graph is a graph with no loops or multiple edges, where multiple edges are edges with the same ordered pair of endpoints.

• Weighted graph: A weighted graph is a graph whose every edge is assigned a positive value called a weight.

Many problems can be modelled using graphs. Examples of such models include physical networks such as road networks and pipe or electricity networks. Others arise in epidemiology, where they show the spread of diseases. Further examples range from networks of interactions between genes in the human body and networks of linguistic relations between words to the internet [17]. The largest graph is the network of homepages in the World Wide Web [41].

Note that in this thesis the following pairs of terms are used interchangeably: graph and network, node and vertex, and link and edge.

In this work, we are mainly interested in directed weighted graphs. In a few cases, graphs having loops will be considered as well.

A graph is typically drawn on paper by representing vertices as dots or circles and edges as lines (or arrows in the case of directed graphs) between pairs of vertices. Figure 1.1 gives an example of a simple directed and weighted graph with five vertices and ten edges.

Figure 1.1: An example of a weighted graph with 5 vertices ($v_1, \ldots, v_5$) and 10 edges.

Among other ways, graphs are most commonly represented by using adjacency matrices as defined below.

Definition 1.1.2. Let $G := (V, E)$ be a directed graph with $|V|$ vertices and $|E|$ edges. The adjacency matrix $A = (a_{ij})$ of $G$ is the $|V| \times |V|$ matrix with elements

$$a_{ij} = \begin{cases} 1, & \text{if } (v_i, v_j) \in E, \\ 0, & \text{otherwise.} \end{cases}$$

For an undirected graph, the adjacency matrix is obtained by considering a directed graph where every edge in the undirected graph is represented by an edge in both directions. On the other hand, if the graph is weighted, its adjacency matrix $A$ is called a distance matrix or modified adjacency matrix, whose elements $a_{ij}$ take the form

$$a_{ij} = \begin{cases} w, & \text{if } (v_i, v_j) \in E, \\ 0, & \text{otherwise,} \end{cases}$$

where $w$ is the weight of the edge $(v_i, v_j)$.

In this thesis, we are mainly interested in modified adjacency matrices of directed graphs, such as transition matrices in the theory of Markov chains, where the weights are the transition probabilities. An example of a modified adjacency matrix is matrix B, which is derived from Figure 1.1.

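$$B = \begin{pmatrix} 0 & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ 1 & 0 & 0 & 0 & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}.$$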
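As a small illustration of Definition 1.1.2 and of the matrix B, the following Python sketch builds both the adjacency matrix and the modified adjacency matrix from an edge list. The edge list `EDGES` is our reading of Figure 1.1 through the entries of B, and the function names are hypothetical helpers introduced only for this example. Exact fractions are used so that row sums can be checked without rounding.

```python
from fractions import Fraction

# Weighted edges (tail, head, weight); vertices are numbered 1..5 as in
# Figure 1.1. This list is an assumption read off the entries of B.
EDGES = [
    (1, 2, Fraction(1, 3)), (1, 3, Fraction(1, 3)), (1, 4, Fraction(1, 3)),
    (2, 3, Fraction(1, 3)), (2, 4, Fraction(1, 3)), (2, 5, Fraction(1, 3)),
    (3, 1, Fraction(1)),
    (4, 1, Fraction(1, 2)), (4, 3, Fraction(1, 2)),
    (5, 4, Fraction(1)),
]


def adjacency_matrix(n, edges):
    """Return the n x n 0/1 adjacency matrix: a[i][j] = 1 iff (v_i, v_j) is an edge."""
    a = [[0] * n for _ in range(n)]
    for i, j, _weight in edges:
        a[i - 1][j - 1] = 1
    return a


def modified_adjacency_matrix(n, edges):
    """Return the n x n weighted matrix: b[i][j] = weight of edge (v_i, v_j), else 0."""
    b = [[Fraction(0)] * n for _ in range(n)]
    for i, j, weight in edges:
        b[i - 1][j - 1] = weight
    return b


A = adjacency_matrix(5, EDGES)
B = modified_adjacency_matrix(5, EDGES)

# Each row of B sums to one, so B can serve as a matrix of transition probabilities.
row_sums = [sum(row) for row in B]
```

Here `A` contains ten ones, one per edge, and every entry of `row_sums` equals one.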

The adjacency matrix (or modified adjacency matrix) $A$ is very useful in network analysis. It tells whether two nodes of a network have an edge (or a path of length 1) between them. It follows that the matrix product of $k$ copies of $A$, that is $A^k$, has the interpretation that its entry $(i, j)$ gives the number of walks (traversing edges of the graph) of length $k$ from vertex $v_i$ to $v_j$; for a modified adjacency matrix, the entry is positive exactly when such a walk exists. In other words, if $k$ is the smallest non-negative integer such that, for some $i, j$, the entry $(i, j)$ of $A^k$ is positive, then $k$ is the distance from vertex $v_i$ to $v_j$. This, in turn, has an application in determining whether the graph is connected or not. A graph is said to be connected if it has at least one vertex and there is a path between every pair of vertices. An example of a connected graph is a complete graph, which is defined as the graph in which every pair of distinct vertices is connected by a pair of unique edges (one in each direction). See, for example, Figure 1.2.
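The interpretation of matrix powers can be checked on a small case. The sketch below, in plain Python, counts walks with powers of a 0/1 adjacency matrix and finds the distance between two vertices as the smallest power with a positive entry. The matrix `A` is our reading of the graph in Figure 1.1 (an assumption), with vertices indexed from 0, and the function names are hypothetical.

```python
# 0/1 adjacency pattern assumed for the graph of Figure 1.1 (row i lists
# the out-neighbours of vertex v_{i+1}).
A = [
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
]


def identity(n):
    """Return the n x n identity matrix, i.e. A^0."""
    return [[int(i == j) for j in range(n)] for i in range(n)]


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(a, k):
    """Return A^k; entry (i, j) counts the walks of length k from i to j."""
    result = identity(len(a))
    for _ in range(k):
        result = mat_mul(result, a)
    return result


def distance(a, i, j):
    """Smallest k >= 0 with (A^k)[i][j] > 0, or None if j is unreachable from i."""
    n = len(a)
    power = identity(n)
    for k in range(n):  # a shortest walk uses at most n - 1 edges
        if power[i][j] > 0:
            return k
        power = mat_mul(power, a)
    return None


def is_strongly_connected(a):
    """True if every vertex can be reached from every other vertex."""
    n = len(a)
    return all(distance(a, i, j) is not None for i in range(n) for j in range(n))
```

For this matrix, `mat_power(A, 2)[0][2]` equals 2, corresponding to the two walks from $v_1$ to $v_3$ through $v_2$ and through $v_4$, and `distance(A, 0, 4)` equals 2.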

Walks between vertices of a graph have an important application in the iterations of a Markov chain. Each state of the Markov chain is represented by a vertex visited by the random walk, and edges represent possible transitions, with positive probabilities, between different states. This theory is detailed in subsection 1.1.3.


Figure 1.2: An example of a complete graph (vertices $v_1$, $v_2$, $v_3$, $v_4$).

Strong connectivity of a directed graph, on the other hand, implies that the adjacency (or modified adjacency) matrix corresponding to the graph is irreducible (see Definitions 1.1.6 and 1.1.7).

Before we look at some key concepts from matrix theory in the next section, let us define two important types of vertices in graph theory.

Definition 1.1.3. Consider two vertices $u, v \in V$ of a directed graph $G := (V, E)$. The vertex $u$ is called a dangling vertex if there is no edge from $u$ to any other vertex, and $v$ is called a root vertex if it has no incoming edge.

Examples of dangling vertices are webpages (such as image and pdf files) with no outgoing links in the World Wide Web, whereas an example of a root vertex is a reference to the root of a tree graph.
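Definition 1.1.3 translates directly into row and column tests on the adjacency matrix. The following Python sketch is one possible reading: a dangling vertex has no non-zero off-diagonal entry in its row, and a root vertex has an all-zero column. The small tree `TREE` and the function names are hypothetical and serve only as an illustration.

```python
def dangling_vertices(a):
    """Indices of vertices with no edge to any other vertex (loops are ignored)."""
    n = len(a)
    return [i for i in range(n)
            if not any(a[i][j] for j in range(n) if j != i)]


def root_vertices(a):
    """Indices of vertices with no incoming edge."""
    n = len(a)
    return [j for j in range(n) if not any(a[i][j] for i in range(n))]


# Hypothetical directed tree with edges 0 -> 1, 0 -> 2 and 1 -> 3.
TREE = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

leaves = dangling_vertices(TREE)  # vertices 2 and 3 have no outgoing edge
roots = root_vertices(TREE)       # vertex 0 has no incoming edge
```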

1.1.2 Overview of Markov process and stationary distribution

In this section, we describe the basic concepts of Markov processes that are necessary throughout this thesis. We begin by looking at the Markov property of Markov processes. Later, we will give properties of transition probabilities that ensure a unique stationary distribution of a Markov chain.

Stochastic processes are vital in understanding spatial (state space) and temporal (time) random processes that occur in nature and in engineering systems. In particular, our focus is on Markov processes, which are known to be endowed with rich mathematical properties.

One of the important properties of Markov processes is the ‘memoryless property’ (also called the forgetfulness property or the Markov property), which means that the probability distribution of the future of the process does not depend on its history. More precisely, conditional on the present state, the future state of the process is independent of the past [54]. That is, if one has knowledge of the current state, then the future state can be predicted without information from the history.

Four classes of Markov processes exist in the literature, namely discrete-time Markov chains, continuous-time Markov chains, discrete-time Markov processes and continuous-time Markov processes [29]. In this thesis, discrete-time Markov chains are considered. This class has many applications in science and engineering, for instance, web analysis, branching processes, Markov decision processes, intergenerational social mobility and many others [9, 32, 44, 49, 57].

A discrete time Markov chain is defined as follows.

Definition 1.1.4. Let $\mathbb{X}$ be a discrete (finite or countable) phase space. A sequence $X_0, X_1, \ldots$ of a stochastic process $\{X_n\}$ is called a discrete time Markov chain if, for arbitrary times $n - 1 < n < n + 1$ [15],

$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i), \tag{1.1}$$

holds for all $i, j, i_0, \ldots, i_{n-1} \in \mathbb{X}$ and all $n \geq 0$.

Relation (1.1) states that the conditional probability of the future state $j$ at time moment $n + 1$ depends only on the current state $i$ at time moment $n$. Thus, relation (1.1) is the Markov property. In addition, if

$$P(X_{n+1} = j \mid X_n = i) = p_{ij}, \quad \text{where } 0 \leq p_{ij} \leq 1, \tag{1.2}$$

for all $i, j \in \mathbb{X}$, with $p_{ij}$ not depending on the time moment, then the Markov chain is called time homogeneous. The probability $p_{ij}$ for $i, j \in \mathbb{X}$ is called the transition probability, and $P = [p_{ij}]$, $i, j = 1, 2, \ldots, n$, where $n = |\mathbb{X}|$, is the matrix of transition probabilities, which satisfies the following condition:

$$\sum_{j \in \mathbb{X}} p_{ij} = 1, \quad i \in \mathbb{X}. \tag{1.3}$$


In other words, $P$ is a stochastic matrix. Its elements are non-negative, and further results associated with it rely on the Perron-Frobenius theorem, which provides the eigenspace properties that ensure a unique limiting distribution of a Markov chain. This theorem will be stated explicitly in due course.
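The definitions above can be illustrated with a small simulation. The following is a minimal sketch, not taken from the thesis: the 3-state matrix `P`, the helper names `is_stochastic` and `simulate_chain`, and the seed are all assumptions chosen for illustration. It checks condition (1.3) and then samples a path in which each step uses only the current state, as in relation (1.1).

```python
import random

# Hypothetical 3-state transition matrix (illustrative values only).
# Row i holds the probabilities p_ij of moving from state i to state j.
P = [
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.0, 0.4, 0.6],
]

def is_stochastic(matrix, tol=1e-12):
    """Check condition (1.3): non-negative entries and every row summing to 1."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )

def simulate_chain(matrix, start, steps, rng):
    """Sample X_0, ..., X_steps; the next state depends only on the current one."""
    states = list(range(len(matrix)))
    path = [start]
    for _ in range(steps):
        current = path[-1]
        # Row `current` of the matrix is the distribution of the next state.
        nxt = rng.choices(states, weights=matrix[current])[0]
        path.append(nxt)
    return path

rng = random.Random(0)  # fixed seed so that a run is reproducible
path = simulate_chain(P, start=0, steps=10, rng=rng)
print(is_stochastic(P), path)
```

Because the same matrix is used at every step, the simulated chain is time homogeneous in the sense of relation (1.2).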

Definition 1.1.5. A distribution $\vec{\pi}$ is an invariant distribution of the transition matrix $P$ if

\[ \vec{\pi} = P\vec{\pi}. \tag{1.4} \]

In dealing with Markov chains, much attention is paid to the invariant distribution or the limiting distribution $\lim_{n\to\infty} P^n$.
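An invariant distribution can be approximated numerically by repeatedly applying the transition matrix to a starting distribution. The sketch below is an illustration under stated assumptions, not the method of the thesis: it assumes a row-stochastic matrix as in condition (1.3), so the update is carried out through the transpose, $\vec{\pi} \leftarrow P^{\top}\vec{\pi}$ (for a column-stochastic matrix this is relation (1.4) as written). The function name `invariant_distribution`, the tolerance and the example matrix are hypothetical.

```python
def invariant_distribution(P, tol=1e-12, max_iter=10_000):
    """Iterate pi <- P^T pi from the uniform distribution until it stops changing.

    Assumes P is row-stochastic (each row sums to 1).
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # Component j of the new distribution: sum over i of pi_i * p_ij.
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("no convergence within max_iter iterations")

# Hypothetical 2-state example (illustrative values only).
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
pi = invariant_distribution(P)
print(pi)
```

For this example the balance condition $0.1\,\pi_1 = 0.5\,\pi_2$ together with $\pi_1 + \pi_2 = 1$ gives $\vec{\pi} = (5/6, 1/6)$, which the iteration approaches.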

Definition 1.1.6. A non-negative square matrix $A$ is irreducible if there exists no permutation matrix $T$ such that

\[ T A T^{\top} = \begin{pmatrix} V & W \\ 0 & X \end{pmatrix}, \tag{1.5} \]

where $V$ and $X$ are square matrices, $W$ need not be a square matrix and $0$ is a matrix with zero entries.

If a graph has a modified adjacency matrix $A$ that can be written as in relation (1.5), then the graph is not strongly connected. In other words, a graph is strongly connected if and only if its adjacency matrix is irreducible. This plays an important role in the existence of the limiting behaviour of Markov chains, provided that the corresponding matrix of transition probabilities is also primitive [36].
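Since irreducibility is equivalent to strong connectivity, it can be tested with two graph searches instead of searching over permutation matrices. The following is a minimal sketch under that equivalence; the helper names `reachable` and `is_irreducible` and the two example matrices are assumptions made for illustration, and the matrix is assumed to have at least one row.

```python
from collections import deque

def reachable(adj, source):
    """Return the set of vertices reachable from `source` along directed edges."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, weight in enumerate(adj[u]):
            if weight > 0 and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def is_irreducible(adj):
    """A non-negative square matrix is irreducible iff its graph is strongly connected.

    The graph is strongly connected iff vertex 0 reaches every vertex and
    every vertex reaches vertex 0 (checked on the transposed matrix).
    """
    n = len(adj)
    transpose = [[adj[j][i] for j in range(n)] for i in range(n)]
    return len(reachable(adj, 0)) == n and len(reachable(transpose, 0)) == n

# Hypothetical examples: a directed 3-cycle and a block upper-triangular matrix.
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
block = [[1, 1], [0, 1]]
print(is_irreducible(cycle), is_irreducible(block))
```

The matrix `block` already has the form on the right-hand side of relation (1.5) with $T$ equal to the identity, so it is reducible.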

Definition 1.1.7. A Markov chain is said to be primitive if for some integer $n > 0$ we have $a_{ij} > 0$ for every pair $(i, j)$ of states in $X$, where $a_{ij}$ is an element of $A^n$ and $X$ is the finite phase space.
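Primitivity can be checked by inspecting the zero pattern of successive powers of the matrix. The sketch below is an illustration, not part of the thesis: it relies on Wielandt's bound, a standard result stating that a primitive $n \times n$ matrix already has all entries positive at the power $n^2 - 2n + 2$, so only finitely many powers need to be examined. The names `bool_matmul` and `is_primitive` and the example matrices are hypothetical.

```python
def bool_matmul(A, B):
    """Multiply two square Boolean matrices (True marks a positive entry)."""
    n = len(A)
    return [
        [any(A[i][k] and B[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

def is_primitive(A):
    """Return True if some power A^k has all entries positive.

    Only the zero pattern matters, so Boolean matrices are used. By
    Wielandt's bound it suffices to check k = 1, ..., n^2 - 2n + 2.
    """
    n = len(A)
    pattern = [[a > 0 for a in row] for row in A]
    power = pattern
    for _ in range(n * n - 2 * n + 2):
        if all(all(row) for row in power):
            return True
        power = bool_matmul(power, pattern)
    return False

# Hypothetical examples: a 2-cycle (irreducible but periodic) and the same
# cycle with a self-loop added at the second state.
two_cycle = [[0, 1], [1, 0]]
with_loop = [[0, 1], [1, 1]]
print(is_primitive(two_cycle), is_primitive(with_loop))
```

The 2-cycle alternates between the identity pattern and the swap pattern, so no power is strictly positive, while adding a self-loop makes the second power strictly positive.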

The following theorem tells us more about non-negative irreducible matrices.

Theorem 1.1.1 (The Perron-Frobenius Theorem [15]). Suppose $A \ge 0$ is irreducible. Then

(i) $A$ has a real positive eigenvalue $\lambda_1$ with the following properties;

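(ii) corresponding to $\lambda_1$ there is an eigenvector $\vec{\pi}$ all of whose elements may be taken as positive, that is, there exists a vector $\vec{\pi} > 0$ such that

\[ A\vec{\pi} = \lambda_1 \vec{\pi}; \tag{1.6} \]

(iii) if $\alpha$ is any other eigenvalue of $A$, then $|\alpha| \le \lambda_1$.

(iv) $\lambda_1$ increases when any element of $A$ increases.

(v) $\lambda_1$ is a simple root of the determinantal equation $|\lambda_1 I - A| = 0$.

(vi) $\lambda_1 \le \max_i \left( \sum_j p_{ij} \right)$, $\lambda_1 \le \max_j \left( \sum_i p_{ij} \right)$.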
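The statements of Theorem 1.1.1 can be checked by hand on a small case. The following worked example is an illustration only; the two $2 \times 2$ matrices $A$ and $A'$ are hypothetical and do not come from the thesis.

```latex
% Hypothetical positive (hence irreducible) matrix.
A = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix},
\qquad
|\lambda I - A| = (\lambda - 1)^2 - 4 = (\lambda - 3)(\lambda + 1).

% (i), (v): \lambda_1 = 3 is real, positive and a simple root.
% (ii):     A (1, 1)^{\top} = 3\,(1, 1)^{\top}, a strictly positive eigenvector.
% (iii):    the other eigenvalue is \alpha = -1, and |\alpha| = 1 \le 3.
% (vi):     every row sum and every column sum equals 3, so both bounds
%           hold with equality.

% (iv): increasing one element increases \lambda_1.
A' = \begin{pmatrix} 1 & 3 \\ 2 & 1 \end{pmatrix},
\qquad
|\lambda I - A'| = (\lambda - 1)^2 - 6,
\qquad
\lambda_1' = 1 + \sqrt{6} \approx 3.449 > 3.

% (vi) for A': the largest row sum and the largest column sum are both 4,
% and 1 + \sqrt{6} \le 4.
```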

According to Theorem 1.1.1, a stochastic matrix $A$ is primitive if its largest eigenvalue $\lambda_1$ is the only eigenvalue of that modulus, is bounded above, and its corresponding eigenvector, which is the unique solution to relation (1.6), is non-negative. It turns out that irreducibility and aperiodicity (or primitivity) are the necessary conditions to guarantee uniqueness of a solution to the eigenvalue equation (1.6), that is, of the invariant distribution. Alternatively, to guarantee uniqueness of the stationary distribution of a Markov chain, an ergodic theorem for primitive Markov chains holds [15], that is,

\[ \lim_{n\to\infty} A^n = \vec{\pi}\vec{e}^{\top}, \tag{1.7} \]

where $\vec{\pi}$ is the stationary distribution with positive entries and $\vec{e}$ is a vector with all elements equal to 1.
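Relation (1.7) can be observed numerically by raising a small matrix to a high power. The sketch below is an illustration under an explicit assumption: the matrix is taken to be column-stochastic (each column sums to 1), which is the convention under which $A\vec{\pi} = \vec{\pi}$ and every column of the limit $\vec{\pi}\vec{e}^{\top}$ equals $\vec{\pi}$. The helper names, the example matrix and the exponent 200 are hypothetical choices.

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

def matrix_power(A, exponent):
    """Compute A^exponent by repeated multiplication, starting from the identity."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = matmul(result, A)
    return result

# Hypothetical column-stochastic, primitive matrix (illustrative values only).
A = [
    [0.9, 0.5],
    [0.1, 0.5],
]
limit = matrix_power(A, 200)
for row in limit:
    print(row)
```

Solving $A\vec{\pi} = \vec{\pi}$ with $\pi_1 + \pi_2 = 1$ by hand gives $\vec{\pi} = (5/6, 1/6)$, so both columns of the computed power should be close to this vector.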

Not every system is irreducible and primitive. For instance, one might encounter a graph made up of multiple sub-graphs. In such a case, irreducibility is violated and the invariant distribution of the chain is not unique.
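The loss of uniqueness can be seen on a small reducible case. The following worked example is an illustration only; the matrix $P$ is hypothetical and describes a graph with two sub-graphs, states $\{1, 2\}$ and state $\{3\}$, with no links between them. The matrix is symmetric, so the row and column conventions coincide.

```latex
P = \begin{pmatrix}
  \tfrac12 & \tfrac12 & 0 \\
  \tfrac12 & \tfrac12 & 0 \\
  0 & 0 & 1
\end{pmatrix},
\qquad
\vec{\pi}_a = \begin{pmatrix} a/2 \\ a/2 \\ 1 - a \end{pmatrix},
\quad 0 \le a \le 1.

% Every \vec{\pi}_a is a distribution and satisfies relation (1.4):
P\vec{\pi}_a
= \begin{pmatrix}
    \tfrac12 \cdot \tfrac{a}{2} + \tfrac12 \cdot \tfrac{a}{2} \\
    \tfrac12 \cdot \tfrac{a}{2} + \tfrac12 \cdot \tfrac{a}{2} \\
    1 - a
  \end{pmatrix}
= \vec{\pi}_a .

% Hence there is a whole family of invariant distributions, one for each a.
```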

1.1.3 Random walks on networks

Random walks on networks have several applications in various scientific areas. For concrete examples of applications, the works [10, 16, 53] are recommended.

In this thesis, four different random walks on graphs are considered. The first is the classical simple random walk, also called traditional random walk on graphs, where the walk is such that a starting (source) vertex is picked
