Analytical and Iterative Methods of Computing PageRank of Networks


ISBN 978-91-7485-482-4 ISSN 1651-4238

Address: P.O. Box 883, SE-721 23 Västerås. Sweden Address: P.O. Box 325, SE-631 05 Eskilstuna. Sweden E-mail: info@mdh.se Web: www.mdh.se


ANALYTICAL AND ITERATIVE METHODS OF

COMPUTING PAGERANK OF NETWORKS

Pitos Seleka Biganda 2020

School of Education, Culture and Communication



Printed by E-Print AB, Stockholm, Sweden


ANALYTICAL AND ITERATIVE METHODS OF COMPUTING PAGERANK OF NETWORKS

Pitos Seleka Biganda

Academic dissertation

which, for the award of the degree of Doctor of Philosophy in Mathematics/Applied Mathematics at the School of Education, Culture and Communication, will be publicly defended on Friday, 20 November 2020, at 10.15 in Kappa +(Zoom), Mälardalen University, Västerås.

Faculty opponent: Professor Oleg Seleznjev, Umeå University

School of Education, Culture and Communication


case, a corresponding formula for each of the two variants of PageRank is provided.

Chapter 3 explores the relationships between three known variants of PageRank (ordinary PageRank, lazy PageRank and random walk with backstep PageRank) in terms of their convergence and the consistency of their rank scores across different graph structures, with reference to the PageRank parameters: the damping factor c and the backstep parameter β.

In Chapter 4, we discuss numerical methods used in solving the PageRank problem as a linear system and evaluate some stopping criteria that can be employed in such methods.

Finally, in Chapter 5, we treat the PageRank problem as a first-order perturbed Markov chain problem and carry out a perturbation analysis for stationary distributions of Markov chains with a damping component. We illustrate the results of the asymptotic perturbation analysis with several computational examples.
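As a rough illustration of the Chapter 4 theme, the sketch below writes PageRank as the linear system (I - c P^T) x = ((1 - c)/n) 1 and solves it by a simple fixed-point iteration under two different stopping rules. The normalised form of the system, the three-node matrix P and the two rules (`small_step`, `stable_order`) are assumptions made for this example only; the abstract does not list which numerical methods or stopping criteria the thesis actually evaluates.

```python
def iterates(P, c=0.85, max_iter=1000):
    """Yield (k, previous, current) for the update x <- c * P^T x + (1 - c) / n.

    P is a row-stochastic matrix given as a list of lists (hypothetical input).
    """
    n = len(P)
    x = [1.0 / n] * n
    for k in range(1, max_iter + 1):
        x_new = [(1.0 - c) / n + c * sum(P[i][j] * x[i] for i in range(n))
                 for j in range(n)]
        yield k, x, x_new
        x = x_new


def ordering(x):
    """Node indices sorted from highest to lowest score."""
    return sorted(range(len(x)), key=lambda i: -x[i])


def small_step(prev, cur, tol=1e-10):
    """Stop when the 1-norm of the change between iterates drops below tol."""
    return sum(abs(a - b) for a, b in zip(prev, cur)) < tol


def stable_order(prev, cur):
    """Stop as soon as the ranking order is unchanged between two iterates."""
    return ordering(prev) == ordering(cur)


def run_until(P, stop):
    """Iterate until the given stopping rule fires; return (iterations, scores)."""
    for k, prev, cur in iterates(P):
        if stop(prev, cur):
            return k, cur
    raise RuntimeError("stopping rule not met within max_iter")


# Assumed example graph: node 0 links to 1 and 2, node 1 links to 2, node 2 links to 0.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

k_norm, x_norm = run_until(P, small_step)
k_order, x_order = run_until(P, stable_order)
print("norm-based stop after", k_norm, "iterations")
print("order-based stop after", k_order, "iterations")
```

On this small example the order-based rule fires earlier than the norm-based one. An unchanged order between two consecutive iterates does not guarantee that the order is final in general, which is one reason the choice of stopping criterion deserves evaluation.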


I dedicate this thesis to Dismas Seleka Kibika, my brother who built a great foundation for my entire education.


Acknowledgements

First, I would like to express my warm thanks to my supervisor Professor Sergei Silvestrov and co-supervisor Dr. Christopher Engström, both of whom introduced me to PageRank, which was then a very new research area to me. Your valuable comments and guidance helped to shape my ideas as I was writing this thesis. Many thanks to my other co-supervisor Professor emeritus Dmitrii Silvestrov for your tireless advice, comments and timely feedback, and for your lectures on Markov chains and related topics, which have contributed much to the success of this work. I am also grateful to my co-supervisors Dr. Milica Rančić and Professor Anatoliy Malyarenko for the motivational lectures, seminars, advice and guidance they have given me during this journey.

I especially thank my fellow doctoral student Benard Abola (with support from his supervisors Prof. John Mango and Dr. Godwin Kakuba) for your valuable cooperation. To all other current and former fellow students, thank you for your kind support and for being there for me, especially during the difficult times.

I extend my deepest gratitude to my family, my lovely wife Monica Dani and our three children Rebecca, Meshack and Shadrack for being supportive and for encouraging me not to give up. I would like to express my gratitude to the Swedish International Development Cooperation Agency (Sida), International Science Programme (ISP) in Mathematical Sciences (IPMS) and Sida Bilateral Research Programme for research and education capacity building in Mathematics in Tanzania for the financial support. Special thanks to Dr. Bengt-Ove Turesson, Linköping University, Sweden and Dr. Sylvester Rugeihyamu, University of Dar es Salaam, Tanzania for your collaboration to bring into action the Mathematics sub-programme of the Tanzania-Sida Bilateral Research Programme, a project that has supported me.

I am also grateful to the research environment Mathematics and Applied Mathematics (MAM), Division of Applied Mathematics, Mälardalen University for providing an excellent and inspiring research environment throughout my studies.

Above all, I thank the Almighty God for the many blessings upon my life and for opportunities that motivate me to strive to be the best that I can be.

Västerås, October 2020 Pitos Seleka Biganda




thesis, which gives ranks to webpages in the world wide web in order of their importance. It does so by assuming that webpages linked by many other important pages are themselves likely to be important. Thus, PageRank is thought to correlate well with human concepts of importance.

The world wide web is a huge data and information repository that contains different formats and types of data. Thus, it may be considered an information network that can be described by Markov chain models (stochastic models describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event). In such models, a Markov chain associated with the corresponding web-link graph represents the information network, and the ranking of the webpages (nodes) is given by the stationary distribution of the Markov chain (here also termed the PageRank vector).

PageRank has gained great fame, as it is the basic method of how a search engine weighs the information value of different websites against each other. For instance, when a person is interested in getting certain information from the internet, he/she is most likely going to use a search engine to look for such information. Moreover, the person will be interested in getting the most relevant information. The PageRank algorithm is used to sort out and list the most relevant pages after the search.

Apart from the primary use of PageRank to analyse and rank search results in a search engine, several other applications of PageRank exist to date. To mention but a few, they include online advertisement ranking, clustering and classification of webpages, word sense disambiguation, community detection, image tagging and tracking extinctions in the study of food webs, which are the complex networks of who eats whom in an ecosystem. Other applications are ranking of messages and suggesting friends in social networks, and applications in recommendation systems (e.g. finding new movies that match one's taste).

In view of the many emerging applications of PageRank, this thesis studies methods of PageRank computation with emphasis on particular networks, mainly consisting of the so-called simple line (an example of a simple information network where information flows in one direction) and complete graphs (information networks where information flows uniformly in all directions in a closed system). The thesis also analyses the information value (PageRank value) of each node of the network when a number of branches carry information flow to or from the main network.
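To make the Markov chain description in this summary concrete, here is a minimal sketch that approximates the stationary distribution (the PageRank vector) of a tiny web-link graph by repeated application of the random-surfer update. The graph, the function name `pagerank`, the value c = 0.85 for the damping factor and the uniform treatment of pages without outgoing links are all assumptions made for illustration; the thesis itself may use a different normalisation.

```python
def pagerank(links, c=0.85, tol=1e-12, max_iter=1000):
    """Approximate PageRank by power iteration.

    links maps each node to the list of nodes it links to (hypothetical input).
    Returns a dict mapping each node to its score; the scores sum to 1.
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Assumption: rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links[v])
        new = {v: (1.0 - c) / n + c * dangling / n for v in nodes}
        # Each page passes a share c of its rank equally to the pages it links to.
        for v in nodes:
            for w in links[v]:
                new[w] += c * rank[v] / len(links[v])
        change = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if change < tol:
            break
    return rank


# Assumed example: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

In this example page C ranks highest because it is linked from both other pages, and page A comes second because it receives all of C's outgoing weight, which matches the idea that links from important pages confer importance.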


This work was financially supported by the Swedish International Development Cooperation Agency and the International Science Programme at Uppsala University, Sweden, under the Sida bilateral programme between Sweden and Tanzania in research capacity building in Mathematics with the University of Dar es Salaam in Dar es Salaam, Tanzania and Mälardalen University in Västerås, Sweden.


List of Papers

This thesis is based on the following papers:

Paper A. Pitos Seleka Biganda, Benard Abola, Christopher Engström, Sergei Silvestrov, PageRank, connecting a line of nodes with multiple complete graphs, Proceedings of the 17th Applied Stochastic Models and Data Analysis International Conference with the 6th Demographics Workshop. London, UK, 2017 (C. H. Skiadas, ed.), ISAST: International Society for the Advancement of Science and Technology, 2017, 113–126.

Paper B. Pitos Seleka Biganda, Benard Abola, Christopher Engström, John Magero Mango, Godwin Kakuba, Sergei Silvestrov, Traditional and lazy PageRanks for a line of nodes connected with complete graphs, Stochastic Processes and Applications (S. Silvestrov, M. Rančić, A. Malyarenko, eds.), Springer Proceedings in Mathematics & Statistics, 271, Springer, Cham, 2018, Chapter 17, 391–412.

Paper C. Pitos Seleka Biganda, Benard Abola, Christopher Engström, John Magero Mango, Godwin Kakuba, Sergei Silvestrov, Exploring the relationship between ordinary PageRank, lazy PageRank and random walk with backstep PageRank for different graph structures, Data Analysis and Applications 3 (A. Makrides, A. Karagrigoriou, C. H. Skiadas, eds.), Computational, Classification, Financial, Statistical and Stochastic Methods, 5, ISTE, Wiley, 2020, Chapter 3, 53–73.

Paper D. Benard Abola, Pitos Seleka Biganda, Christopher Engström, Sergei Silvestrov, Evaluation of stopping criteria for ranks in solving linear systems, Data Analysis and Applications 1 (C. H. Skiadas, J. R. Bozeman, eds.), Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, 2, ISTE, Wiley, 2019, Chapter 10, 137–152.

Paper E. Benard Abola, Pitos Seleka Biganda, Dmitrii Silvestrov, Sergei Silvestrov, Christopher Engström, John Magero Mango, Godwin Kakuba, Perturbed Markov chains and information networks, arXiv:1901.11483v3 [math.PR] 2 May 2019, 60 p. (2019).

Paper F. Dmitrii Silvestrov, Sergei Silvestrov, Benard Abola, Pitos Seleka Biganda, Christopher Engström, John Magero Mango, Godwin Kakuba, Perturbation analysis for stationary distributions of Markov chains with damping component, Algebraic Structures and Applications (S. Silvestrov, A. Malyarenko, M. Rančić, eds.), Springer Proceedings in Mathematics and Statistics, 317, Springer, Cham, 2020, Chapter 38, 903–933.

Paper G. Benard Abola, Pitos Seleka Biganda, Sergei Silvestrov, Dmitrii Silvestrov, Christopher Engström, John Magero Mango, Godwin Kakuba, Nonlinearly perturbed Markov chains and information networks, Proceedings of the 18th Applied Stochastic Models and Data Analysis International Conference with the Demographics 2019 Workshop. Florence, Italy, 2019 (C. H. Skiadas, ed.), ISAST: International Society for the Advancement of Science and Technology, 2019, 51–79.

Paper H. Dmitrii Silvestrov, Sergei Silvestrov, Benard Abola, Pitos Seleka Biganda, Christopher Engström, John Magero Mango, Godwin Kakuba, Perturbed Markov chains with damping component, Methodology and Computing in Applied Probability, 31 p. (2020). https://doi.org/10.1007/s11009-020-09815-9

Reprints were made with permission from the respective publishers.

Parts of this thesis have been presented in communications given at the following international conferences:

1: ASMDA2017 - 17th Applied Stochastic Models and Data Analysis International Conference with the 6th Demographics Workshop, 6 - 9 June 2017, London, UK.

2: SPAS2017 - International Conference on Stochastic Processes and Algebraic Structures - From Theory Towards Applications, 4 - 6 October 2017, Västerås and Stockholm, Sweden.

3: SMTDA2018 - 5th Stochastic Modeling Techniques and Data Analysis International Conference with Demographics Workshop, 12 - 15 June 2018, Chania, Crete, Greece.

4: ASMDA2019 - 18th Applied Stochastic Models and Data Analysis International Conference with the Demographics 2019 Workshop, 11 - 14 June 2019, Florence, Italy.
