
Automatic modeling and analysis of corporate communication through multiple mediums


Academic year: 2021


Linköpings universitet

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2018 | LIU-IDA/LITH-EX-A--18/032--SE

Automatic modeling and analysis of corporate communication through multiple mediums

Swedish title: Automatisk modelering och analys av företagskommunikation via flera medium

Henning Nåbo

Supervisor: Christer Bäckström
Examiner: Peter Jonsson



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

This thesis describes the process of modeling and analysis of corporate communication through chat and telephone with data taken from the Briteback enterprise communication application. Phone communication is measured by the number and duration of calls, and chat communication by the minimum number of messages sent from one person to another. The measurements are used to calculate a communication score; different methods are tested and a version using principal component analysis is chosen. Different centrality measurements are performed on the graph model, and each tested measure is found to be useful in some way: eigenvector centrality fits the data best, PageRank is easy to understand and can be adapted for different situations, and betweenness centrality points out users in critical positions in the communication graphs. Personalized PageRank ‘focused’ on a user or a group of users is tested and shows potential to be of use for social network service companies in many different ways, such as when ordering search results or when suggesting new members to a chat channel.


Acknowledgments

I would like to thank Briteback for letting me perform this thesis, and my dog for always greeting me with loving eyes when I get home.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Aim:
  1.2 Research questions:
  1.3 Delimitations
  1.4 Structure
2 Briteback
  2.1 The Briteback Application
3 Graph Theory
  3.1 Visualisation
  3.2 Adjacency matrix
  3.3 Incidence matrix
4 Social Network Analysis
  4.1 Social Network
  4.2 The small-world network
5 Statistics
  5.1 Standard Deviation and Variance
  5.2 Pearson Correlation Coefficient
  5.3 Skewness
  5.4 Principal Component Analysis
6 Graph measures
  6.1 Types of Measures
  6.2 Betweenness
  6.3 Eccentricity
  6.4 Radius and Diameter
  6.5 Degree, 2degree and Weighted Degree
  6.6 Density
  6.7 Eigenvector Centrality
  6.8 PageRank
  6.10 Status and Closeness
7 Method
  7.1 Communication measures
  7.2 Phone Communication
  7.3 Conference calls
  7.4 Chat Communication
  7.5 Communication Score
  7.6 Analysis types
8 Result
  8.1 Conference calls
  8.2 Communication Score
  8.3 Graph Measures
  8.4 Roles and Teams
  8.5 Chat channels
  8.6 Temporal changes
9 Discussion
  9.1 Results
  9.2 Method
  9.3 Measures and methods not studied
  9.4 The Work in a Wider Context
10 Conclusion
  10.1 Future studies
A Appendix
  A.1 Conference Calls
  A.2 Weights
  A.3 Graph measurements


List of Figures

2.1 An example view of the mobile version of the Briteback application.
3.1 An example of a disconnected graph with two subgraphs.
3.2 The three graphs that can be created from the incidence matrix given by Table 3.1a.
8.1 The distribution of normalized weights in logarithmic scale.
8.2 The distribution of scores in logarithmic scale.
8.3 Largest component with edges colored by which communication type has the highest normalized value.
8.4 Correlation plot for eccentricity and status.
8.5 PPR focused on the white-filled vertex. Colored by organizations (edges between organizations are grey).
8.6 Bipartite graph with teams and users.
8.7 Bipartite graph over channels and users for one week, weighted by the number of messages written in the channel.
8.8 Total number of calls and messages and total duration of calls for each week.
8.9 A comparison between number of vertices (|V|), its 2-logarithm and mean distance, positioned on top of each other for comparison.
9.1 A comparison between the largest component in the communication graph and a random graph with an equal number of vertices and the same average degree.
A.1 Histogram over the number of conference call participants.
A.2 Correlation between different weights.
A.3 Eigenvector centrality for the largest component.
A.4 PageRank for the largest component.
A.5 Betweenness for the largest component.
A.6 Total weight (weighted degree) for the largest component.
A.7 Degree for the largest component.
A.8 2Degree for the largest component.
A.9 Eccentricity for the largest component.
A.10 Status for the largest component.
A.11 PPR focused on a team (white-filled vertices).
A.12 Relationship between channels over one week. The weight between two channels is the number of people who have written in both channels.
A.13 All temporal measurements scaled and centered such that they have a mean of 0 and variance of 1.


List of Tables

3.1 An incidence matrix and the three adjacency matrices created from it.
8.1 Different measurements on graphs with only direct calls, only conference calls and all calls. Proportion is the fraction of the value for a graph using only conference calls and a graph using all calls (with the specified conference model).
8.2 Statistics for duration of direct and conference calls using different methods for modeling conferences. Direct and Conference are statistics from only the specified type of call. Here each call is counted once, no matter the number of participants.
8.3 Summary of statistics for the different communication measures.
8.4 The Pearson correlation coefficients of the different communication weights and communication scores.
8.5 Proportion of weight types used with the max function.
8.6 The Pearson correlation coefficients between centrality measurements.
A.1 Statistics of wn based on graph type. Direct and Conference mean that only this type of call has been measured.
A.2 Statistics of duration of calls weight based on graph type.

1 Introduction

The use of Social Network Services (SNS) for corporate communication is growing [25]. According to enterprise reporter Ron Miller, there is a new wave of such services, with Slack as the standard bearer, trying to replace email as the main communication tool [17]. Other companies are also emerging on the market; one such company is Briteback. Their product, also called Briteback, gathers all corporate communication in a single client: chat, email, phone and calendar.

All communication between humans forms social networks, and by sampling these networks it is possible to understand and analyse the communication in what is called Social Network Analysis (SNA). When this communication happens through the internet, the network is easy to analyse, as the communication data is itself a sample of the network [22]. The analysis is performed by modeling the communication as a graph and studying it using different measurements and algorithms. SNS companies improve their users' experience through social network analysis and graph theory, for example with friend suggestions [21] and user ranking [15]. To compete with other services, Briteback must implement these types of features, but they also have the opportunity to surpass the competitors by providing features directly tailored for corporate communication.

1.1 Aim:

With communication being spread out over chat rooms, discussion threads, direct messages, conference calls and email, it is hard for a person to evaluate the efficiency and stability of communication. Social Network Analysis (SNA), graph theory and statistics are used to find possible ways for a corporate SNS to help its users communicate by analysing communication logs. The aim is to suggest features that can later be implemented in Briteback.

1.2 Research questions:

This thesis will try to answer the following questions:

• How can communication using different mediums be measured using a single communication score?

• How can communication graphs be automatically analysed to provide benefits to the users of a social medium?

1.3 Delimitations

Features that analyse existing communication will be studied, not features that provide new ways to communicate. Only communication through chat and telephone will be studied. The differences between corporate communication and general communication will be discussed, but not studied. This thesis will try to find possible ways to improve the Briteback product; it will not try to find what kinds of features the users want or need.

1.4 Structure

The rest of the thesis is structured as follows. In Chapter 2 the structure of the Briteback product and the data is explained. This is followed by three theory chapters: Chapter 3 explains the basics of graph theory, Chapter 4 introduces the basics of Social Network Analysis and Chapter 5 describes some key statistical terms that are used in the evaluation. This is followed by Chapter 6, which describes and evaluates the different graph measures used in the thesis. In Chapter 7 the method of the thesis is described, followed by Chapter 8, which presents the result of the thesis. Finally, in Chapter 9 the results and method are discussed and in Chapter 10 the final conclusions are presented.

2 Briteback

This is a master’s thesis performed at Linköping University for the company Briteback. Briteback develops a unified SNS for companies. The SNS integrates calls, chat, email and calendar in a single application. This SNS provides Briteback with a lot of communication data about who talks to whom and how often. They now want to study this data in the hope of analysing the communication and providing helpful features for its users. To study the communication of a social network, one must first know how the social network is constructed.

2.1 The Briteback Application

Briteback’s product is an application called Briteback, a Social Network Service (SNS) that supports phone calls, chat, email and calendar. The users are members of organizations, teams and chat channels and can be assigned roles. The general idea is for internal communication to use chat and telephone while external communication uses email and telephone. Figure 2.1 shows an example view of the mobile application. Following is a short description of the different parts of the application.

Organization

Each company that uses the Briteback application has an organization of which its employees are members. A user can only access users and chat groups within their organization unless they are invited as a guest to another organization.

Role

The users in an organization can be given one or multiple roles, such as developer and supervisor. These generally correspond to the user’s role within the company.

Team

Users can be assigned to multiple teams. These might be short-term work groups of users with different roles, or long-term groups of users such as boards.


Chat Channels

Chat channels can be created under an organization or a team and can be public or private. Users can also start a one-to-one chat with anyone within their organization. A chat channel can be connected to a role, such that all members with the role automatically become members of the channel.

Phone calls

The application supports direct calls and conference calls. Additionally, a conference call can easily be created for all users in a team, organization or chat channel.

3 Graph Theory

A graph (denoted G) is a mathematical model representing relationships between pairs of objects. There are many types of graphs, but only ‘simple’ and ‘undirected’ graphs are explained here, as these are the ones used in this thesis. An example of a graph can be seen in Figure 3.1. All information in this chapter is based on ‘Introduction to graph theory’ by West [27]. A graph consists of a set of vertices (or nodes), denoted V, that represent the agents of the network, and a set of edges, denoted E, that each connect two vertices in the graph ($E \subseteq V^2$). The sets of vertices and edges belonging to a specific graph G are denoted V(G) and E(G). The edges of a graph can have weights (denoted w), which correspond to the strength or bandwidth of a relationship or to the distance between two vertices. Following is a list of important terms used in this thesis.

Adjacent: Two vertices connected by an edge are adjacent. In Figure 3.1 v1 and v2 are adjacent, but v1 and v3 are not.

Incident: The edges that are connected to a vertex are incident to this vertex.

Path: A sequence of adjacent vertices is called a path. In Figure 3.1, a path between v1 and v3 is shown in red. The shortest path between two vertices is the path between the vertices with the lowest distance. The distance used can either be the geodesic distance (explained later) or the sum of the edge weights along the path.

Connected graph: A graph is connected if there is a path between each pair of vertices in the graph. A graph that is not connected is disconnected. In Figure 3.1, G1 is a disconnected graph, while G2 and G3 are connected.

Complete graph: A complete graph (or fully connected graph) is a graph where all vertices are adjacent to each other. A complete graph has

$$\frac{|V|(|V| - 1)}{2} \tag{3.1}$$

edges. In Figure 3.1, G3 is a complete graph.

Geodesic distance: The minimum number of edges along any path between two vertices is the geodesic distance between the vertices. The distance between two unconnected vertices is infinite. In Figure 3.1, the geodesic distance between v1 and v3 is 2.

Figure 3.1: An example of a disconnected graph with two subgraphs.

Mean distance: The mean distance of a graph is the mean of the geodesic distances between all pairs of vertices.

Subgraph and supergraph: For two graphs G′ and G, if the sets of edges and vertices of G′ are subsets of the sets of edges and vertices of G, then G′ is a subgraph of G and G is a supergraph of G′.

$$G' \subseteq G \Leftrightarrow V(G') \subseteq V(G),\ E(G') \subseteq E(G) \cap V(G')^2 \tag{3.2}$$

Note that the edges of a graph must connect vertices in the graph, so $E(G') \subseteq V(G')^2$. In Figure 3.1, G2 and G3 are subgraphs of G1, and G1 is a supergraph of G2 and G3.

Clique: A subgraph that is a complete graph is called a clique with a size equal to the number of vertices it contains. In Figure 3.1 G3 is a clique in G1 of size 4. Note: it is not the only clique in G1 as every pair of connected vertices form a clique of size 2.

Component: A component is a maximal connected subgraph. A non-connected graph can be partitioned into its components, which will be a set of connected subgraphs (the subgraphs are connected, they are not connected to each other) with no shared vertices. In Figure 3.1 G1 has two components (the subgraphs G2 and G3). The graphs G2 and G3 have a single component each as they are connected.

Bipartite graph: A graph whose vertices can be partitioned into two sets V1 and V2, where each edge only connects a vertex from V1 with a vertex from V2, is called a bipartite graph.
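Several of the terms above (geodesic distance, component, mean distance) can be made concrete with a short sketch. The graph below is a hypothetical example, not the graph from Figure 3.1, and the function names are chosen for illustration only. Geodesic distances are found with a breadth-first search, which is one common way of computing them in an unweighted graph.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected graph with two components (not the graph in Figure 3.1).
# Each vertex maps to the set of vertices it is adjacent to.
GRAPH = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b"},
    "x": {"y"},
    "y": {"x"},
}

def geodesic_distances(graph, source):
    """Breadth-first search: minimum number of edges from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def components(graph):
    """Partition the vertices into maximal connected vertex sets."""
    seen, parts = set(), []
    for v in graph:
        if v not in seen:
            part = set(geodesic_distances(graph, v))
            seen |= part
            parts.append(part)
    return parts

def mean_distance(graph, vertices):
    """Mean geodesic distance over all vertex pairs of one connected vertex set."""
    pairs = list(combinations(sorted(vertices), 2))
    total = sum(geodesic_distances(graph, u)[v] for u, v in pairs)
    return total / len(pairs)

print(components(GRAPH))
# Unreachable vertices are missing from the result, i.e. their distance is infinite.
print(geodesic_distances(GRAPH, "a").get("x", float("inf")))
print(mean_distance(GRAPH, {"a", "b", "c"}))
```

Running one search per vertex is enough for small graphs; larger graphs would normally be handled by a graph library.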


3.1 Visualisation

In general, when a graph is visualised the positions of vertices and lengths of edges have no meaning. The only information that is given by the visualisation is the connections between vertices. The size, shape and color of vertices and edges can represent attributes in the data.

3.2 Adjacency matrix

A useful way of representing a graph is with an adjacency matrix. This is a matrix where each row and column corresponds to a vertex in the graph. For an adjacency matrix A, where w(i, j) is the weight between the vertices i and j and 0 if the vertices are not adjacent:

$$A_{i,j} = w(i, j), \quad \{i, j\} \in V^2 \tag{3.4}$$

For undirected graphs (the type explained in this chapter) it holds that w(i, j) = w(j, i) which means that A will be a symmetric matrix. The adjacency matrix belonging to a specific graph G is denoted A(G).
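As a minimal sketch of the definition above, the following builds a weighted adjacency matrix from an edge list and checks that it is symmetric. The vertex names and weights are hypothetical and only serve as an illustration.

```python
# Hypothetical vertices and weighted undirected edges: (vertex, vertex, weight).
VERTICES = ["u1", "u2", "u3"]
EDGES = [("u1", "u2", 2.0), ("u2", "u3", 0.5)]

def adjacency_matrix(vertices, edges):
    """Return A with A[i][j] = w(i, j), and 0 where the vertices are not adjacent."""
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    A = [[0.0] * n for _ in range(n)]
    for u, v, w in edges:
        i, j = index[u], index[v]
        A[i][j] = w
        A[j][i] = w  # undirected graph: w(i, j) = w(j, i)
    return A

A = adjacency_matrix(VERTICES, EDGES)
n = len(VERTICES)
is_symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))
print(A, is_symmetric)
```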

3.3 Incidence matrix

An incidence matrix (denoted B) is a rectangular matrix that shows the relationship between two sets of objects. In this case, the rows will be called cases and the columns affiliations. From an incidence matrix the adjacency matrices for case-to-case relations (Equation 3.5) or affiliation-to-affiliation relations (Equation 3.6) can be created (assuming that cases are related if they share an affiliation and vice versa). An example of an incidence matrix and corresponding unweighted adjacency matrices can be seen in Figure 3.2.

$$A_c = BB^T \tag{3.5}$$

$$A_a = B^TB \tag{3.6}$$

These adjacency matrices will have non-zero elements on their diagonal for all rows/columns with non-zero values in B (a case with at least one affiliation shares an affiliation with itself and vice versa). To create adjacency matrices for simple graphs the diagonal elements must be set to zero. An example of this can be seen in Tables 3.1a, 3.1c and 3.1d.

If the incidence matrix contains unweighted relations ($B_{i,j} \in \{0, 1\}$), the adjacency matrices will be weighted by the number of shared affiliations (or cases). If the incidence matrix contains weighted relations, the adjacency matrix will be weighted by the sum of the products of the cases’ weights in the affiliations they share ($A_{i,j} = A_{j,i} = \sum_k B_{i,k} B_{j,k}$). To get an unweighted adjacency matrix, all non-zero elements must be set to 1.
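The construction in Equations 3.5 and 3.6 can be sketched with the incidence matrix from Table 3.1a. The helper names below are hypothetical, and plain nested lists are used instead of a matrix library to keep the sketch self-contained.

```python
# Incidence matrix from Table 3.1a: rows are cases 1-3, columns are affiliations a-d.
B = [
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def zero_diagonal(A):
    """Remove self-relations so that the matrix describes a simple graph."""
    return [[0 if i == j else a for j, a in enumerate(row)] for i, row in enumerate(A)]

def unweighted(A):
    """Set all non-zero elements to 1."""
    return [[1 if a != 0 else 0 for a in row] for row in A]

A_c = unweighted(zero_diagonal(matmul(B, transpose(B))))  # case-to-case, Equation 3.5
A_a = unweighted(zero_diagonal(matmul(transpose(B), B)))  # affiliation-to-affiliation, Equation 3.6

print(A_c)
print(A_a)
```

Before the diagonal is zeroed, the diagonal of $BB^T$ holds the number of affiliations of each case, which is why it is non-zero for every case with at least one affiliation.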

It is possible to create a bipartite graph from an incidence matrix with cases and affiliations as vertex sets. This graph has the adjacency matrix:

$$A_b = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix} \tag{3.7}$$

If B is an n × m matrix, $A_b$ will be an (n + m) × (n + m) matrix. This can be seen in Table 3.1, where B is a 3 × 4 matrix and $A_b$ is a 7 × 7 matrix.

(a) An incidence matrix

B    a  b  c  d
1    1  0  1  1
2    1  1  0  0
3    0  1  1  0

(b) Adjacency matrix for bipartite graph

Ab   1  2  3  a  b  c  d
1    0  0  0  1  0  1  1
2    0  0  0  1  1  0  0
3    0  0  0  0  1  1  0
a    1  1  0  0  0  0  0
b    0  1  1  0  0  0  0
c    1  0  1  0  0  0  0
d    1  0  0  0  0  0  0

(c) Adjacency matrix for case-by-case graph

Ac   1  2  3
1    0  1  1
2    1  0  1
3    1  1  0

(d) Adjacency matrix for affinity-by-affinity graph

Aa   a  b  c  d
a    0  1  1  1
b    1  0  1  0
c    1  1  0  1
d    1  0  1  0

Table 3.1: An incidence matrix and the three adjacency matrices created from it

Figure 3.2: The three graphs that can be created from the incidence matrix given by Table 3.1a: (a) bipartite graph, (b) case-by-case graph, (c) affinity-by-affinity graph.

4 Social Network Analysis

Social Network Analysis (SNA) is the analysis of social networks using graphs. It is an interdisciplinary field using social psychology, sociology, statistics and graph theory. It can for example be used to understand human interaction, measure the influence corporations have on each other or predict the spread of disease [26].

4.1 Social Network

Humans interact with other humans around them in different ways, and these in turn interact with others. This network of interactions is called a social network. Social networks are not limited to humans; they also exist in other species, like apes and bacteria, and for more abstract things like corporations and organizations [22]. Connections in a social network can symbolize many different things. For a network between humans, agents can for example be connected if they are friends, relatives, lovers or coworkers.

Social networks are not something that a scientist creates; they are abstract networks that exist in the real world and that can be analysed with SNA.

To analyse a social network it must first be sampled. This is generally done by studying the communication through specific mediums in a set of agents. A digital medium for human social interaction and relations is called a Social Network Service (SNS) or social media. Analysing communication through an SNS makes this process easy as the data already exists [22]. Modeling direct communication can be done with an adjacency matrix or an incidence matrix.

4.2 The small-world network

In 1967, Milgram [26] studied the phenomenon that at a cocktail party, two strangers often discover that they have mutual acquaintances. This was called The small-world problem and Milgram performed an experiment where packages were sent from Nebraska to Boston through social contacts. He found that on average, the social separation between Nebraska and Boston was six steps [26].

An observation that anyone can make regarding social networks is that two friends of yours are likely to be friends with each other. More formally, two people that share an acquaintance have a higher probability of being friends with each other than two people chosen at random. This effect is called clustering and occurs in many types of networks. A network that has a short distance between vertices and a high degree of clustering is called a small-world network [26]. Small-worldness can be quantified by comparing the clustering coefficient (a measurement of how clustered a graph is) and the average path length of a graph to those of a random graph with the same average degree (see Section 6.5) [14]. A small-world network should have a similar average path length and a much higher clustering coefficient [14].
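The two quantities compared in a small-worldness test can be sketched as follows. The text does not state which definition of the clustering coefficient is used, so this sketch assumes the average local clustering coefficient (the fraction of a vertex's neighbour pairs that are themselves adjacent, averaged over all vertices). The graph and function names are hypothetical.

```python
from collections import deque
from itertools import combinations

# Hypothetical connected graph: a triangle (a, b, c) with one extra vertex d attached to c.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

def local_clustering(graph, v):
    """Fraction of pairs of neighbours of v that are adjacent to each other."""
    neighbours = graph[v]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, w in combinations(neighbours, 2) if w in graph[u])
    return 2 * links / (k * (k - 1))

def average_clustering(graph):
    return sum(local_clustering(graph, v) for v in graph) / len(graph)

def average_path_length(graph):
    """Mean geodesic distance over all ordered vertex pairs (assumes a connected graph)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

print(average_clustering(GRAPH), average_path_length(GRAPH))
```

To assess small-worldness, the same two functions would be applied to a random graph with the same number of vertices and the same average degree, and the two pairs of values compared.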

5 Statistics

This chapter will briefly explain some key statistical terms used in this thesis. X and Y denote random variables, µ denotes the mean value and E is the expectation operator ($\mu_X = E(X)$).

5.1 Standard Deviation and Variance

Standard deviation (denoted σ or std) is the most commonly used measure of the spread of a set of observations [8]. It is a measure of how much the data deviates from the mean, so data with a low standard deviation will have most data points close to its mean.

The standard deviation is given by the square root of the variance.

$$\sigma = \sqrt{E[(X - \mu)^2]} \tag{5.1}$$

5.2 Pearson Correlation Coefficient

Pearson correlation coefficient (denoted ρ) is a measure of correlation, the linear relationship, between a pair of variables. There are multiple types of correlation coefficients, with Pearson being the most common [23]. A correlation coefficient can be between -1 and 1. The magnitude is the strength of the correlation and the sign is the direction, i.e., for two negatively correlated variables, when one is below its mean the other tends to be above its mean, while two positively correlated variables tend to be on the same side of their means [28].

Pearson correlation coefficient is given by the covariance between the variables divided by the product of their standard deviations.

$$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \tag{5.2}$$

A correlation matrix is a matrix where rows and columns correspond to random variables and where the elements are the correlation between pairs of variables. This will always be a square, symmetric matrix with all diagonal elements equal to 1 (as the correlation between a variable and itself is 1) [8].
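Equations 5.1 and 5.2 translate directly into code when a list of observations is treated as the whole distribution (so expectations become plain averages, dividing by n). That treatment, the variable names and the numbers below are assumptions made for illustration; they are not data from the thesis.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    """Equation 5.1: square root of the mean squared deviation from the mean."""
    mu = mean(xs)
    return math.sqrt(mean([(x - mu) ** 2 for x in xs]))

def pearson(xs, ys):
    """Equation 5.2: covariance divided by the product of the standard deviations."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (std(xs) * std(ys))

# Hypothetical observations of three variables.
calls = [1, 2, 3, 4, 5]
messages = [2, 4, 6, 8, 10]   # rises with calls
duration = [5, 4, 3, 2, 1]    # falls as calls rise

variables = [calls, messages, duration]
correlation_matrix = [[pearson(a, b) for b in variables] for a in variables]
for row in correlation_matrix:
    print([round(r, 3) for r in row])
```

The resulting matrix is symmetric with ones on the diagonal, as described above.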

Figure 5.1: (a) A data matrix in two dimensions (axes x and y) and the two right singular vectors v1 and v2. (b) UΣ in the basis of V (axes v1 and v2). (c) The data matrix reduced to rank one through PCA (axes x and y).

5.3 Skewness

Skewness is a measure of the lack of symmetry in a distribution. A distribution with positive skewness has a long tail to the right of the mean, and with negative skewness a long tail to the left of the mean [8].

Skewness is given by:

$$s = \frac{E[(X - \mu)^3]}{(E[(X - \mu)^2])^{3/2}} \tag{5.3}$$
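A minimal sketch of Equation 5.3, again treating the list of observations as the whole distribution so that expectations become plain averages. The data values are hypothetical.

```python
def skewness(xs):
    """Equation 5.3: third central moment divided by the second central moment to the power 3/2."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 1, 1, 1, 10]      # one large value: long tail to the right
left_tailed = [-x for x in right_tailed]
symmetric = [1, 2, 3]

print(skewness(right_tailed), skewness(left_tailed), skewness(symmetric))
```

Mirroring the data around zero flips the sign of the skewness, and symmetric data has a skewness of zero.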

5.4 Principal Component Analysis

Principal Component Analysis (PCA) is a statistical procedure used in many fields such as statistics, machine learning and SNA. In SNA it is used for multidimensional scaling and dimension reduction [22] and is associated with the Singular Value Decomposition (SVD).

The singular value decomposition of an m × n matrix M is $M = U \Sigma V^*$ ($^*$ denotes the conjugate transpose; for a real matrix, $V^* = V^T$), where U (m × m) and V (n × n) are unitary (meaning $U^*U = UU^* = I$) and Σ (m × n) is diagonal. The columns of U are the left singular vectors of M and the columns of V are the right singular vectors of M. Σ contains the singular values of M, where $\Sigma_{i,i}$ is the singular value corresponding to the singular vectors $u_i$ and $v_i$. The singular values are denoted $\sigma_i$ and are generally in descending order [12].

In PCA, the singular values and singular vectors are analysed with the goals of extracting the most important aspects in the data, compressing the data, simplifying the data and understanding the structure of the observations [1].

For a data matrix M where the rows correspond to coordinates in two dimensions (Figure 5.1a), the right singular vectors form a new basis corresponding to a rotation (Figure 5.1b). UΣ contains the coordinates of the data points in this new basis, and $\sigma_i$ measures the spread of the data in the direction $v_i$ (for centred data, the variance in that direction is proportional to $\sigma_i^2$); in Figure 5.1b the x coordinates are given by the first right singular vector and the y coordinates by the second. To reduce the rank of M from two to one, one can simply remove the second singular vectors and singular value, as these correspond to the smallest variance (Figure 5.1c). This matrix is given by $M_1 = u_1 \sigma_1 v_1^*$.
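The rank-one reduction can be sketched for two-dimensional data without an SVD routine. This sketch relies on the fact that the right singular vectors of M are the eigenvectors of $M^TM$ and the squared singular values are its eigenvalues; for a 2 × 2 symmetric matrix these have a closed form. The data points are hypothetical and already centred, and the function name is chosen for illustration.

```python
import math

# Hypothetical centred data matrix: one (x, y) point per row.
M = [[2.0, 2.0], [-2.0, -2.0], [1.0, -1.0], [-1.0, 1.0]]

def leading_singular_pair(M):
    """Largest singular value and its right singular vector for an n x 2 matrix."""
    # Entries of the symmetric 2 x 2 matrix M^T M = [[a, b], [b, c]].
    a = sum(x * x for x, _ in M)
    b = sum(x * y for x, y in M)
    c = sum(y * y for _, y in M)
    # Largest eigenvalue of M^T M equals the largest singular value squared.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
    if b != 0:
        vx, vy = b, lam - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return math.sqrt(lam), (vx / norm, vy / norm)

sigma1, v1 = leading_singular_pair(M)

# M v1 equals sigma1 * u1, so the rank-one matrix u1 sigma1 v1^T is (M v1) v1^T.
scores = [x * v1[0] + y * v1[1] for x, y in M]
M1 = [[s * v1[0], s * v1[1]] for s in scores]

print(sigma1, v1)
print(M1)
```

In this example the points on the diagonal are kept unchanged, while the two points orthogonal to the first singular vector collapse to the origin.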

6 Graph measures

This chapter will explain the different measures and algorithms used in this thesis. It will cover how they work, some known ways of using them, some pros and cons, and how well they are expected to fit the purpose.

Note that the following are measures applied to graphs, not to be confused with the measures for communication explained in Section 7.1, which are represented as the weights of edges.

6.1 Types of Measures

A way to analyse large graphs is to perform different types of measurements. There are many different types of measures.

Centrality is a measure of how central a vertex or edge is in a graph. What it means to be a central vertex can differ between centrality measures. A vertex with a higher centrality is more central. There are two types of centrality: local and global.

Local centrality: This type measures the centrality of a vertex based on its immediate proximity [22]. Some examples of local centralities are the different types of degree. They can be used to find ‘peaks’ in graphs: vertices with a higher local centrality than their neighbours. These peaks correspond to locally influential vertices in the graph.

Global centrality: This type measures the centrality of a vertex based on the entire graph [22]. Examples of global centralities are betweenness, PageRank and eigenvector centrality. A global centrality can be used to find an absolute center of the graph (if the graph has a center). It can also be used to rank the vertices of a graph. For instance, Google orders search results using a global centrality measure called PageRank [10].

Eccentricity and status measure the distance of a vertex to other vertices. Radius and diameter, on the other hand, measure the graph as a whole and can be used to compare different graphs with each other or to see how a graph changes over time.


6.2 Betweenness

Betweenness is a measure of global centrality that also measures the importance of vertices. It assumes that a central vertex is a vertex that many paths connecting other vertices pass through.

Betweenness of a vertex v is the number of vertex pairs whose shortest connecting path goes through v [22]. When calculating the shortest paths, edge weights are treated as distances between vertices. To calculate betweenness where weights represent strength, it is possible to use the inverted weights (1/w) as distances during the calculation. It is also possible to use geodesic distances.

Edge betweenness is the same as vertex betweenness but for edges: the number of pairs of vertices whose shortest connecting path passes through the edge.

Betweenness is a measure of how vertices are keeping the graph connected. Betweenness measures the extent to which a vertex can act as a broker or gatekeeper with potential control over others [22].

Betweenness can also be used to calculate a betweenness clustering where vertices are partitioned into clusters. This is done by removing the edges with highest betweenness to split the graph into components [18]. A person with a high betweenness in a corporate social network would be an important person as he/she keeps the network connected.

A problem with betweenness is that it is costly to compute (time complexity $\Theta(|V|^3)$, space complexity $\Theta(|V|^2)$) [7]. There is a cheaper way of calculating it, but it only works on very sparse graphs [6]. Using the geodesic distance will cause vertices in a dense area to receive low betweenness, since the number of possible shortest paths there is high, which makes it a poor measure of centrality in such areas.
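As an illustration of vertex betweenness with geodesic distances, the sketch below uses Brandes' algorithm, which runs one breadth-first search per vertex. It is offered as one well-known way to compute the measure and is not claimed to be the method of the cited references. One assumption differs slightly from the definition above: when a pair of vertices has several shortest paths, each vertex on them receives a fractional share rather than a full count. The example graph is hypothetical.

```python
from collections import deque

# Hypothetical path graph a - b - c - d.
GRAPH = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}

def betweenness(graph):
    """Vertex betweenness on an unweighted, undirected graph (Brandes' algorithm)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []                               # vertices in order of discovery
        preds = {v: [] for v in graph}           # predecessors on shortest paths from s
        n_paths = {v: 0 for v in graph}          # number of shortest paths from s
        dist = {v: -1 for v in graph}
        n_paths[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest vertices back towards s.
        dependency = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != s:
                score[w] += dependency[w]
    # Each unordered pair was counted from both of its endpoints.
    return {v: b / 2 for v, b in score.items()}

print(betweenness(GRAPH))
```

In the path graph the two inner vertices each lie between two pairs of vertices, while the end vertices lie between none.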

6.3 Eccentricity

For a vertex, its eccentricity is the geodesic distance to the vertex furthest away from it [27]. If the geodesic distance from a vertex v to all other vertices is calculated, the eccentricity of v is the largest of these distances.

The general definition says that the eccentricity for all vertices in a disconnected graph is infinite (as the distance between two disconnected vertices is infinite) [27]. A variant of eccentricity that works on disconnected graphs is to only check the distance between connected vertices.

The vertex with the lowest eccentricity is part of the center of a graph. Additionally, if there is a single vertex with lower eccentricity than any other vertex in the component it is the absolute center [22].

As social networks can be disconnected, the variant that only measures eccentricity within components needs to be used. When using this type of eccentricity, each component should be analysed separately, as vertices in small components will have a smaller eccentricity than vertices in large components. There are ways of working around this, such as adding the diameter of all other components to a vertex’s eccentricity, but this removes the simplicity of eccentricity, which is its biggest strength. Another issue with eccentricity is that it ignores edge weights.

6.4 Radius and Diameter

Radius and diameter use eccentricity to describe the size of a graph. The diameter of a graph is the highest eccentricity of any vertex in the graph. Similarly, the radius of a graph is the lowest eccentricity of any vertex in the graph. The diameter and radius of a disconnected


The measures are similar, but not equal, to the diameter and radius of a circle. From the definition we can conclude that (with r for radius and d for diameter) r ≤ d ≤ 2r (unlike d = 2r for the radius and diameter of a circle) [27].

Radius and diameter can be used to describe graphs. A graph with a large difference between radius and diameter is long and slim (the most extreme being d = 2r, for example a graph that is a single line of an odd number of connected vertices). A graph where the diameter and radius are equal is more circular (for example a complete graph, where d = r).

The radius and diameter of a disconnected graph (which social networks might be) are infinite, as the eccentricity of all vertices is infinite. They can be calculated for disconnected graphs with the variant of eccentricity that does not give infinite values, but the result will be the largest radius and diameter of any component in the graph, not the radius and diameter of the graph as a whole.
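The relationship between eccentricity, radius and diameter can be illustrated with a short sketch. It assumes an unweighted, undirected graph stored as a dictionary of neighbour sets and uses the component-restricted variant of eccentricity described above; the function names and the two example graphs are hypothetical.

```python
from collections import deque

def geodesic_distances(adj, source):
    """Breadth-first search: number of edges from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def eccentricity(adj, v):
    """Largest geodesic distance from v to a vertex in its own component."""
    return max(geodesic_distances(adj, v).values())

def radius_and_diameter(adj):
    """Lowest and highest eccentricity over all vertices."""
    ecc = [eccentricity(adj, v) for v in adj]
    return min(ecc), max(ecc)

# A single line of five vertices (0 - 1 - 2 - 3 - 4) and a complete graph on four.
line = {i: {j for j in (i - 1, i + 1) if 0 <= j < 5} for i in range(5)}
complete = {i: {j for j in range(4) if j != i} for i in range(4)}

line_r, line_d = radius_and_diameter(line)
complete_r, complete_d = radius_and_diameter(complete)
```

For the line the sketch gives d = 2r, the long and slim extreme, while for the complete graph d = r.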

6.5 Degree, 2degree and Weighted Degree

Degree is a simple measure of local centrality that is the number of vertices adjacent to a vertex (for simple graphs), or more generally, the number of edges that connect to the vertex [27]. In terms of communication data, degree is the number of people that a person has communicated with.

An extension of the degree measure is to count the number of vertices reachable within a set number of steps (geodesic distance lower than or equal to the set number). Distance one and two are the most informative [22] and will be used in this thesis.

Degree can be extended to be the sum of the weights of all edges connected to a vertex [2].

$$\mathrm{weightedDegree}(i) = \sum_j A_{i,j} \tag{6.1}$$

Degree with distance one and two will be called degree and 2degree respectively and the degree extended with weight will be called weighted degree.

All types of degrees can be used as a measure of local centrality. If a vertex has a higher degree than its neighbors it can be seen as a locally central vertex. Note that a local centrality measure will not be able to identify a central point of the network [22].

One use of degree is to find peaks and bridges. A vertex with a higher degree than all of its neighbours is considered a peak, and vertices adjacent to at least two peaks are considered bridges [22].

Degree is very simple to compute as one simply has to count the number of edges connecting each vertex. This can be done by calculating column sums in the unweighted adjacency matrix of the graph.

For a highly connected graph, or any graph with a low diameter, the degree and 2degree of all vertices will be the same or similar, meaning that measuring both provides little extra information.
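The three degree measures can be sketched as follows, assuming an undirected weighted graph stored as nested dictionaries where `weights[i][j]` plays the role of A_{i,j}. The function names and the small example network are hypothetical and only meant to show how the measures differ.

```python
def degree(weights, v):
    """Number of vertices adjacent to v."""
    return len(weights[v])

def two_degree(weights, v):
    """Number of vertices within geodesic distance two of v (v itself excluded)."""
    reachable = set(weights[v])
    for u in weights[v]:
        reachable.update(weights[u])
    reachable.discard(v)
    return len(reachable)

def weighted_degree(weights, v):
    """Sum of the weights of all edges connected to v (Equation 6.1)."""
    return sum(weights[v].values())

# Hypothetical communication graph: ann - bob - cat - dan.
weights = {
    "ann": {"bob": 3.0},
    "bob": {"ann": 3.0, "cat": 1.0},
    "cat": {"bob": 1.0, "dan": 2.0},
    "dan": {"cat": 2.0},
}
```

Here `bob` and `cat` have the same degree and 2degree, but their weighted degrees differ, which is the extra information the weighted variant adds.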

6.6 Density

Density is a measure of how connected a graph is. The density of a graph is the ratio between the number of edges in the graph and the number of possible edges in the graph [22]. It is a number between 0 and 1.

$$\mathrm{density}(G) = \frac{2|E|}{|V|(|V| - 1)} \tag{6.2}$$

A problem with density is that it is relative to the number of vertices. When the number of vertices in a graph increases, the mean degree must increase as well if the density is to stay constant. This means that it is ’easier’ for small graphs to have a high density than for large graphs.

The density can be expected to be different depending on what you measure. For example a graph of friendship relationships is expected to have a higher density than a graph of lover relationships [22].

Density is well defined for unweighted graphs, but there is little agreement on how it should be calculated for weighted graphs [22].
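The dependence of density on graph size can be made concrete with a small sketch for simple, unweighted, undirected graphs. The helper names are hypothetical; the second function just rearranges Equation 6.2 using the fact that the mean degree is 2|E|/|V|.

```python
def density(num_vertices, num_edges):
    """Density of a simple undirected graph (Equation 6.2)."""
    possible_edges = num_vertices * (num_vertices - 1) / 2
    return num_edges / possible_edges

def mean_degree_for_density(num_vertices, target_density):
    """Mean degree needed to reach target_density, since 2|E|/|V| = density * (|V| - 1)."""
    return target_density * (num_vertices - 1)

small = mean_degree_for_density(11, 0.5)   # mean degree needed with 11 vertices
large = mean_degree_for_density(101, 0.5)  # mean degree needed with 101 vertices
```

To keep a density of 0.5, the mean degree has to grow from 5 in the 11-vertex graph to 50 in the 101-vertex graph, which is why small graphs reach high densities more easily.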

6.7 Eigenvector Centrality

Eigenvector centrality (or eigen centrality) was suggested as a measure of centrality by Bonacich [4]. It is a global centrality measure based on the eigenvalue decomposition where a vertex receives a high score if it is adjacent to other vertices with high scores.

For an adjacency matrix A, the vector v of eigenvector centralities is the non-negative eigenvector that corresponds to the eigenvalue of A with the largest magnitude [3]. An eigenvector-eigenvalue pair (v, λ) of a matrix A is a vector and a scalar that satisfy the equation Av = λv. The Perron-Frobenius theorem states that the eigenvector corresponding to the largest eigenvalue will be non-negative for all non-negative matrices (matrices that only contain non-negative values).

Eigenvector centrality can be used to analyse many different types of graphs. Lohmann et al. used it to map connectivity patterns in the human brain [16] and Fowler and Christakis used it in their study of the spread of happiness in social networks to find central vertices [9].

Eigenvector centrality works for any positively weighted graph (for negative weights, the Perron-Frobenius theorem does not hold). It is also well suited for data where weights are considered the strength of relations.

The main problem with eigenvector centrality is that it cannot be explained in more concrete terms than that vertices connected to vertices with high centrality get a high centrality. Measurements can therefore be hard to interpret.
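One way to make the definition more tangible is to compute it by power iteration: start with equal scores and repeatedly replace each vertex's score with the sum of its neighbours' scores. The sketch below is an illustration under stated assumptions, not the method used by igraph: it assumes a connected graph with a symmetric, non-negative adjacency matrix, and it iterates with A + I rather than A (same eigenvectors, but no oscillation on bipartite graphs). The function name and the star-shaped example are hypothetical.

```python
import math

def eigenvector_centrality(A, iterations=1000, tol=1e-12):
    """Dominant eigenvector of a symmetric, non-negative matrix A, scaled to unit length."""
    n = len(A)
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # One step of power iteration with A + I: y = x + A x.
        y = [x[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(value * value for value in y))
        y = [value / norm for value in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x

# Star graph: vertex 0 is connected to vertices 1, 2 and 3.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
centrality = eigenvector_centrality(star)
```

The hub of the star receives the highest score and the three leaves receive equal scores, as expected from the symmetry of the graph.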

6.8 PageRank

PageRank is a global centrality measure that, similarly to eigenvector centrality, gives a high score to vertices that are connected to vertices with a high score.

It can be explained with something called the “random surfer”. The name comes from it initially being used to describe a person surfing the internet. Here, the random surfer analogy will instead be explained in terms of a package. Given a workplace, a package is given to a random person. This person will pass the package to one of his/her contacts with a probability proportional to their communication (a person he/she communicates with 50% of the time has a 50% chance of receiving the package). Every time the package moves there is a small chance that it is instead sent to a random person. The PageRank is the probability that a person has the package at a specific time, or the number of times it has visited a person divided by the total number of steps it has taken [10].

It can be seen as the eigenvector centrality of a modified adjacency matrix where all weights have been scaled into probabilities and weak edges are added between all vertices, making the graph fully connected.

In terms of the random surfer, PageRank for a vertex is the sum of its neighbours’ PageRank times the chance of the surfer moving to this vertex from the neighbour, plus the chance of it moving there randomly [20].

Let c be the probability that the surfer follows an edge instead of randomly jumping, and let p be the probability vector from which the target of a random jump is drawn. The PageRank is a probability vector R [20] that satisfies

$$R(i) = c \sum_j \frac{A_{i,j} R(j)}{\sum_k A_{k,j}} + (1 - c) p_i \tag{6.3}$$

It can be seen as the steady state of the Markov matrix that describes the surfer’s movements. To construct this Markov matrix, a version of A is needed such that each column sums to one:

$$M_{i,j} = \frac{A_{i,j}}{\sum_k A_{k,j}} \tag{6.4}$$

From the p vector a matrix E is created by multiplying p with a row vector containing only ones. E corresponds to an adjacency matrix of a fully connected graph with loops (each vertex is connected to itself) where the weights of all edges leading to vertex j are given by pj.

$$E = p \begin{pmatrix} 1 & \cdots & 1 \end{pmatrix} \tag{6.5}$$

If p is uniformly distributed, E is symmetric and thus corresponds to an undirected graph. The Markov matrix M′ for the random surfer is given by:

$$M' = cM + (1 - c)E \tag{6.6}$$

The stable state of this matrix is given by the eigenvector

$$x = M'x = (cM + (1 - c)E)x \tag{6.7}$$

which is the positive eigenvector corresponding to the matrix’ largest eigenvalue 1 (will always exist for Markov matrices) [20].
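The steady state can be approximated by repeatedly applying Equation 6.3, which is the same as power iteration on the random surfer's Markov matrix. The sketch below is illustrative only: it assumes a small dense matrix where `A[i][j]` is the weight of the edge from vertex j to vertex i and where every column has a non-zero sum. The function name `pagerank`, the iteration count and the example matrix are hypothetical choices.

```python
def pagerank(A, c=0.85, p=None, iterations=200):
    """Approximate the probability vector R satisfying Equation 6.3."""
    n = len(A)
    if p is None:
        p = [1.0 / n] * n  # uniform random jump
    # Column-normalize A so that each column sums to one (Equation 6.4).
    col_sums = [sum(A[k][j] for k in range(n)) for j in range(n)]
    M = [[A[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    R = list(p)
    for _ in range(iterations):
        # E R equals p because R sums to one, so M' R = c M R + (1 - c) p.
        R = [c * sum(M[i][j] * R[j] for j in range(n)) + (1 - c) * p[i]
             for i in range(n)]
    return R

# Hypothetical symmetric weighted graph: vertices 0 and 1 communicate the most.
A = [
    [0, 2, 1],
    [2, 0, 1],
    [1, 1, 0],
]
ranks = pagerank(A)
```

Because each step shrinks the remaining error by roughly the factor c, a couple of hundred iterations are more than enough for a graph of this size with c = 0.85.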

The main use of PageRank is, as the name suggests, to rank vertices. Google uses it to order search results, based on a network where webpages are vertices and links are edges. As PageRank is commonly used, there is plenty of research on its effects and how to calculate it efficiently. It was initially defined for directed, unweighted graphs and most research on PageRank focuses on these types of graphs. On weighted graphs it ignores the weighted degree of vertices. When using it to analyse human communication, an assumption is made that the total communication from each person is the same. This assumption is clearly not accurate, so PageRank is not a perfect fit for SNA.

As PageRank is used for search engines, there are plenty of legitimate methods to affect one’s rank, a process called search engine optimisation, and illegitimate methods to manipulate the rank of oneself and others, called search engine poisoning.

6.9 Personalized PageRank

Personalized PageRank (PPR) is a variant of PageRank where the p vector is no longer uniformly distributed [20]. It creates a centrality from a specific perspective.

A simple version of PPR is where the PageRank is ’focused’ on a set of vertices W ⊂ V

$$p_i = \begin{cases} 1/|W| & \text{if } v_i \in W \\ 0 & \text{otherwise} \end{cases} \tag{6.8}$$

This means that each time the random surfer does a random jump, it only jumps to a vertex in W. By setting c to a large value the PageRank becomes less focused around W and vice versa.

In the original paper where PageRank was first introduced, Page et al. [20] suggested that PPR could be used to tailor the search results of each user by giving each user a p vector based on their search history and bookmarked pages.


PPR ’focused’ around a set of people can be used to see the influence that the group has in the network. This can for example be used to see who is well connected with different parts of an enterprise.

According to the small-world hypothesis (see Section 4.2) a user is more likely to communicate with people that he/she has a mutual acquaintance with. This means that when a user searches for other users, ordering the result by their PPR focused on the user should give a higher probability of the sought user being among the top results than if a random order was used.

The variable c can be modified to change the amount of focus on W, from only looking at the focused vertices (c = 0) to not focusing on them at all (c = 1). Choosing c = 0 will result in a PPR equal to the p vector.

A problem with PageRank in general is that the variable c can have a large impact on the result but has to be chosen arbitrarily. Google uses c = 0.85 [10] but any value between 0 and 1 can be used.
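The effect of the focus set W and of c can be tried out with a small sketch. It assumes the same column-normalized iteration as Equation 6.3 with the p vector from Equation 6.8, on a dense matrix whose columns all have a non-zero sum. The function name, the iteration count and the four-person chain are hypothetical.

```python
def personalized_pagerank(A, focus, c=0.85, iterations=200):
    """PageRank where random jumps only land on the vertices in `focus`."""
    n = len(A)
    # Equation 6.8: uniform over the focus set, zero elsewhere.
    p = [1.0 / len(focus) if i in focus else 0.0 for i in range(n)]
    col_sums = [sum(A[k][j] for k in range(n)) for j in range(n)]
    R = list(p)
    for _ in range(iterations):
        R = [c * sum(A[i][j] / col_sums[j] * R[j] for j in range(n))
             + (1 - c) * p[i] for i in range(n)]
    return R

# Hypothetical chain of contacts 0 - 1 - 2 - 3, focused on person 0.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
focused = personalized_pagerank(chain, focus={0})
only_focus = personalized_pagerank(chain, focus={0}, c=0)
```

With c = 0 the result is exactly the p vector, and with c = 0.85 the scores of persons 1, 2 and 3 fall off with their distance from person 0.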

6.10 Status and Closeness

Status (or total distance) and closeness are measures based on how far it is from a vertex to all other vertices in the graph.

The status of a vertex is the sum of the shortest distance from it to all other vertices in the graph. This is given by:

$$\mathrm{status}(u) = \sum_{v \in V} d(u, v) \tag{6.9}$$

where d(u, v) is the geodesic distance between u and v. Closeness is the inverse of status.

$$\mathrm{closeness}(v) = \frac{1}{\mathrm{status}(v)} \tag{6.10}$$

Closeness can be used as a global centrality measure while status can be used as a more detailed eccentricity. While many vertices will have the same eccentricity (as it is usually small integers), status will be more detailed (two vertices with the same status in a large graph will be very rare).

The mean status can be used as a measure similar to diameter and radius, but it is very sensitive to changes in the number of vertices.

Status and closeness are completely dependent on the size of the graph. It is therefore not useful to compare these measures between differently sized graphs.

With the standard definition of distance, the status of all vertices in a disconnected graph will be infinite. It is possible to work around this by only measuring the distance to vertices in the same component. If using this version, each component should be analysed separately.
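Status and closeness can be sketched with a breadth-first search. The sketch assumes an unweighted, undirected graph stored as a dictionary of neighbour sets and uses the component-restricted variant (only reachable vertices are counted), so it never returns infinity. The function names and the example line graph are hypothetical.

```python
from collections import deque

def status(adj, u):
    """Sum of geodesic distances from u to every vertex reachable from u (Equation 6.9)."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return sum(dist.values())

def closeness(adj, u):
    """Inverse of status (Equation 6.10); undefined for an isolated vertex."""
    return 1 / status(adj, u)

# A single line of five vertices: 0 - 1 - 2 - 3 - 4.
line = {i: {j for j in (i - 1, i + 1) if 0 <= j < 5} for i in range(5)}
statuses = [status(line, v) for v in line]
```

The end vertices have the highest status and the middle vertex the lowest, so closeness ranks the middle vertex as the most central.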


7 Method

This chapter will describe the method used to create and analyse the communication data. First, three graphs were created, one for each type of communication measurement. From these three graphs a single graph was created weighted by a ’communication score’ calculated from the three communication measurements. Lastly, centrality measurements were performed and temporal changes were measured on the final graph.

The model and analysis were created in the programming language R, a language mainly built for statistics. Graph models were built from log data using igraph, an open source graph library available in R, Python and C/C++. This package also includes calculation of most of the chosen measures. Plots were created using ggplot2, a plotting library for R.

7.1 Communication measures

To create communication graphs, the communication between users using phone and chat communication was measured.

Communication was measured by number of calls, total duration of calls in minutes and number of chat messages. These measures were used as weights in different graph models that were merged into a single graph weighted by a communication score. These weights will be denoted wd for duration of calls, wn for number of calls, wc for chat messages.

The decision to use both number of calls and duration of calls was based on a study by Onnela et al. [19]. The study used these two measures to analyse the phone communication between 7 million users.

7.2 Phone Communication

The log of a phone call contains the following:

• The ID of the caller

• A time stamp for when the call started

• The total duration of the call

• A list of events happening during the call, such as users joining or leaving the call or being invited to join.

For each call, every user that joined the call was added to a list of participants. An incidence matrix was created with participants as cases and calls as affiliations, i.e. one column for each call. Only calls that had at least two active users in the user database (calls between registered users), and that lasted for more than five seconds (an arbitrary cut-off to remove misdials or calls that did not provide any communication) were used.

From this incidence matrix an adjacency matrix was created as explained in Section 3.3. For communication weighted by duration, each non-zero entry of the incidence matrix was set to the square root of the duration of the corresponding call. This ensured that relationships were weighted by the total duration of calls when calculating the adjacency matrix, and not by the duration of calls squared (see Equations 7.1 and 7.2). Note that this is not a general method; it only works for incidence matrices where all non-zero elements in the same column have the same value.

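$$f(i, k) = \begin{cases} 1 & \text{if } B_{i,k} \neq 0 \\ 0 & \text{if } B_{i,k} = 0 \end{cases} \tag{7.1}$$

$$A_{i,j} = \sum_k B_{i,k} B_{j,k} = \sum_k \sqrt{w(k)}\, f(i, k) \sqrt{w(k)}\, f(j, k) = \sum_k w(k) f(i, k) f(j, k) \tag{7.2}$$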
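The square-root trick can be checked with a small sketch. The call records, user names and function name below are hypothetical, and the analysis itself was done in R; the sketch also sets the diagonal to zero, which is an assumption made here because the diagonal of the product describes a user's own total rather than a relation between two users.

```python
import math

def adjacency_from_calls(calls, users):
    """Build A = B B^T (off-diagonal part) from (duration, participants) records."""
    index = {user: i for i, user in enumerate(users)}
    # Incidence matrix B: one row per user, one column per call, holding
    # sqrt(duration) for every participant of that call.
    B = [[0.0] * len(calls) for _ in users]
    for k, (duration, participants) in enumerate(calls):
        for user in participants:
            B[index[user]][k] = math.sqrt(duration)
    n = len(users)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # assumption: self-relations are left at zero
                A[i][j] = sum(B[i][k] * B[j][k] for k in range(len(calls)))
    return A

users = ["ann", "bob", "cat"]
calls = [
    (4.0, ["ann", "bob"]),         # a direct call of 4 minutes
    (9.0, ["ann", "bob", "cat"]),  # a conference call of 9 minutes
]
A = adjacency_from_calls(calls, users)
```

The weight between `ann` and `bob` comes out as 4 + 9 = 13 minutes, the total duration of their shared calls, rather than 16 + 81 as it would if the durations themselves were stored in the incidence matrix.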

7.3 Conference calls

Some of the phone calls were conference calls; i.e. calls between more than two participants. A user can join a conference call for only a small portion of its duration, and two users might have participated at separate times in the same call. As the calculation must be able to handle a large amount of data the simplification was made that all participants in a conference call are considered participants for the full duration, even if they only joined the call for a short duration.

The examples of many-to-many communication being studied given by Scott [22] all study incidence matrices instead of creating a one-to-one communication model. This was not really an option as there simply were not enough conference calls to analyse them separately. The options were then to ignore conference calls or to add them to the direct communication model, and I chose to add them in order to use as much of the data as possible. All participants in a conference call must then be considered to be communicating with all other participants, creating a clique in the graph. But how should the edges created by the conference calls be weighted? All edges must be given the same weight as it is not known who is talking.

Three models to calculate the edge weights were created, called ’simple’, ’fractional’ and ’shared’. In this section the number of participants of a conference call is denoted n and the duration of the call is denoted d. These values will later result in a weight w, either wn (number of calls) or wd (duration of calls), between the n participants. Any equation given for wd can be used to calculate wn by setting d = 1.

For each way of modeling conferences, the way the weights scale with the number of participants was compared with regard to:

Total communication: The sum of all edge weights created by the conference call. In all cases the number of edges created from a conference call with n participants is equal to the number of edges in a complete graph with n vertices, given by Equation 3.1. The total communication (t) is given by

$$t(w) = \frac{n(n - 1)}{2} w \tag{7.3}$$

Personal communication: The sum of all edge weights connected to a single user created by the conference call. The personal communication (p) is given by:

p(w) = (n − 1)w (7.4)

Simple model

The simplest way of converting many-to-many communication to direct communication is to give the full duration worth of communication between each pair of participants in the conference call.

As w in Equations 7.3 and 7.4 is equal to the duration of the call d, the total communication grows quadratically with the number of participants.

$$t(d) = \frac{n(n - 1)}{2} d \tag{7.5}$$

The personal communication is (n − 1)d and grows linearly with the number of participants.

Fractional model

A simple way of reducing the impact of conference calls is to say that a conference call is equal to a participant talking individually to each other person for an equal fraction of the call’s duration. That is given by wd = d/(n − 1).

With this method we get a total duration

$$t(w) = t\left(\frac{d}{n - 1}\right) = \frac{n(n - 1)}{2} \cdot \frac{d}{n - 1} = \frac{nd}{2} \tag{7.6}$$

It grows linearly with a factor n/2.

The personal communication is constant.

$$p(w) = p\left(\frac{d}{n - 1}\right) = (n - 1) \frac{d}{n - 1} = d \tag{7.7}$$

Shared model

The fractional method gives a personal communication equal to direct calls while the simple method gives a personal communication that grows linearly with the number of participants. A middle ground is where the duration is split into one part for each participant where that person talks. When a participant A talks it counts as direct communication between A and all other participants, and when another participant B talks it counts as direct communication between A and B.

In other words, in a conference call with three participants, each participant talks one third of the duration directly with each other participant and one third of the duration with both. This is given by wd = 2d/n.

The total duration is

$$t(w) = t\left(\frac{2d}{n}\right) = \frac{n(n - 1)}{2} \cdot \frac{2d}{n} = (n - 1)d \tag{7.8}$$

This grows linearly and slightly faster than the fractional model.

The personal communication starts at d for n = 2 and converges to 2d as n → ∞.

$$\lim_{n \to \infty} p(w) = \lim_{n \to \infty} (n - 1) \frac{2d}{n} = \lim_{n \to \infty} \left(\frac{2dn}{n} - \frac{2d}{n}\right) = 2d - \lim_{n \to \infty} \frac{2d}{n} = 2d \tag{7.9}$$
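The scaling of the three models can be compared side by side with a short sketch. The function names and the example call (five participants, 60 minutes) are hypothetical; the formulas are the edge weights d, d/(n − 1) and 2d/n given above, combined with the n(n − 1)/2 edges of the clique and the n − 1 edges per participant.

```python
def edge_weight(model, n, d=1.0):
    """Weight given to every edge of a conference call with n participants."""
    if model == "simple":
        return d
    if model == "fractional":
        return d / (n - 1)
    if model == "shared":
        return 2 * d / n
    raise ValueError(f"unknown model: {model}")

def total_communication(w, n):
    """Sum of the weights of all n(n - 1)/2 edges created by the call."""
    return n * (n - 1) / 2 * w

def personal_communication(w, n):
    """Sum of the weights of the n - 1 edges of one participant (Equation 7.4)."""
    return (n - 1) * w

n, d = 5, 60.0
summary = {
    model: (total_communication(edge_weight(model, n, d), n),
            personal_communication(edge_weight(model, n, d), n))
    for model in ("simple", "fractional", "shared")
}
```

For this call the simple model yields ten times the duration in total, the fractional model n/2 times and the shared model n − 1 times, while for n = 2 all three models reduce to the weight of an ordinary direct call.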


Evaluation

The goal of this analysis was to keep the conference calls in the communication model as they contained a lot of information, while preventing them from ’flooding’ out the rest of the direct communication data. The models were therefore evaluated by how much they affect the statistics of the phone communication weights in relation to how many calls are conference calls.

7.4 Chat Communication

Chat communication (denoted wc) is the measure of communication through chat between two users. Measuring one-to-one communication from chat logs required a few assumptions. Initially it was planned to model all chat communication (including communication in chat channels) as one-to-one communication, but this proved impractical.

Issues with chat communication

The following is a list of issues encountered when modeling chat communication and how these issues were solved.

Should the size of messages be considered? The first thing to decide was what to measure: the number of chat messages sent, the total number of words sent, or both.

In the given data the size of chat messages was not logged. As the modeling of chat communication started late there was no time to gather new data. Because of this the size of chat messages was not used.

Chat communication is directed. Chat communication is, unlike phone communication, directed. To be able to analyse both types of communication in one analysis either chat communication had to be modeled as undirected, or phone communication modeled as directed.

Modeling undirected communication as directed is trivial: you can simply let all communication go in both directions. Modeling directed communication as undirected is trickier as some data will be lost. To create an undirected edge from two directed edges, you can take the sum, mean, minimum or maximum of their weights. While some measures, such as PageRank, work well on directed graphs, others become more complex, such as degree, which must be split into indegree and outdegree. To keep the analysis simple, chat communication was modeled as undirected communication with edge weights being the minimum of the directed edge weights.

It is not known if someone has read a message. To model chat communication as an undirected graph we must ensure that communication goes both ways. For a phone call one participant must have initiated the call while another has responded. Therefore both participants have actively joined the conversation. This is not the case for chat communication. A message can be missed or ignored, and it is hard to know if it has been read as this was not recorded in the logs.

An assumption was made that the number of messages read from a person is correlated to the number of messages sent to this person. Based on this assumption, the undirected communication between two people should be proportional to the minimum number of messages sent from one to the other.

2. ’General’ chat rooms can occur where everyone within an organization is a member.

3. The resulting graphs become highly connected and are not small-world networks.

Because of this, chat channels were analysed separately from direct communication. A bipartite graph was created with users and chat channels as vertex sets. Additionally, a graph was created with chat channels connected to each other, weighted by the number of users who had written at least one message in both channels.

Chat channels with two members: There is a difference between direct channels and channels with two members; a direct channel can only ever have two specific members, while channels with two members might previously have had other members. It was not possible to guarantee that the two current members of a chat room were the only members when a message was written, and messages in these channels were very rare. They were therefore discarded from the communication analysis and instead included in the separate analysis of chat channels.

To calculate the chat communication weights, an adjacency matrix containing only zeroes was created with a row and a column for each user. For each message in a direct channel the value Ai,j was incremented, where i corresponds to the sender and j to the receiver.

The adjacency matrix was then updated to be symmetric, with values corresponding to the minimum number of messages sent in either direction between two users, using the equation

$$A' = \frac{A + A^T}{2} - \operatorname{abs}\left(\frac{A - A^T}{2}\right) \tag{7.10}$$

where the absolute value function is applied element-wise.
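That the symmetrisation really yields the element-wise minimum can be checked directly, since (a + b)/2 − |a − b|/2 = min(a, b). The sketch below applies the formula to a hypothetical message-count matrix where `A[i][j]` is the number of messages sent from user i to user j; the function name is also hypothetical.

```python
def symmetric_minimum(A):
    """Apply A' = (A + A^T)/2 - abs((A - A^T)/2) element by element."""
    n = len(A)
    return [[(A[i][j] + A[j][i]) / 2 - abs((A[i][j] - A[j][i]) / 2)
             for j in range(n)] for i in range(n)]

# Hypothetical direct-message counts between three users.
messages = [
    [0, 5, 0],
    [2, 0, 1],
    [4, 0, 0],
]
undirected = symmetric_minimum(messages)
```

User 0 sent five messages to user 1 but received only two, so the undirected weight becomes 2, and pairs where one side never wrote anything get weight 0.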

7.5 Communication Score

As the goal was to have an automated analysis of the data a single communication score (denoted s) was used so that only one graph had to be analysed. If three separate analyses were made the results would still have to be unified in the end. A communication score should be a measure of all types of communication between two users, not just communication through a single medium.

To create a communication score, the weights were first calculated for each communication measure. After normalizing each weight vector (dividing it by its 2-norm), a communication score was calculated using different unification functions.

Evaluation: The different communication scores were evaluated by their correlation to the communication weights and by how they were distributed. A good communication score should be highly correlated with, and similarly distributed to, the different communication measurements. Based on the evaluation, one of these communication scores was chosen to be used for the rest of the thesis.

Types of communication scores

Three different ways of calculating a communication score from the communication weights were evaluated.

Max: The max communication score (denoted smax) is the maximum of the normalized communication weights.

PCA: The PCA communication score (denoted spca) uses the left singular vector corresponding to the largest singular value of the matrix

$$M = \begin{pmatrix} | & | & | \\ N & D & C \\ | & | & | \end{pmatrix} \tag{7.11}$$

where C, N, D are vectors containing wc, wn and wd respectively. This is guaranteed to be a positive vector as M M∗ is a positive matrix (the Perron-Frobenius theorem). It will also be a unit vector, as the matrices of singular vectors are unitary.

As the weights have not been scaled to have the mean zero, the first right singular vector will correspond to the mean values of the weights.

PCA2: An alternative version of spca, called spca2, was calculated in the same way as spca but used the matrix

$$M = \begin{pmatrix} | & | & | & | \\ N & D & C & C \\ | & | & | & | \end{pmatrix} \tag{7.12}$$

This ensures that phone and chat communication have an equal number of dimensions.
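The scores can be sketched without a linear algebra library by using power iteration in place of a full singular value decomposition. This is an illustration under stated assumptions rather than the thesis code: the max score is taken to be the element-wise maximum of the normalized weight vectors, the edge weights in the example are invented, and all function names are hypothetical.

```python
import math

def normalize(w):
    """Divide a vector by its 2-norm."""
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]

def max_score(columns):
    """Element-wise maximum over the normalized weight vectors (assumed definition)."""
    return [max(values) for values in zip(*columns)]

def pca_score(columns, iterations=500):
    """First left singular vector of the matrix M whose columns are `columns`."""
    m = len(columns)
    # Gram matrix G = M^T M; its dominant eigenvector is the first right
    # singular vector v of M.
    G = [[sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(m)]
         for i in range(m)]
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        v = normalize([sum(G[i][j] * v[j] for j in range(m)) for i in range(m)])
    # u = M v / ||M v|| is the corresponding left singular vector.
    return normalize([sum(v[j] * columns[j][e] for j in range(m))
                      for e in range(len(columns[0]))])

# Invented weights for four edges: number of calls, call minutes, chat messages.
N = normalize([5.0, 1.0, 0.0, 2.0])
D = normalize([40.0, 3.0, 0.0, 10.0])
C = normalize([12.0, 0.0, 7.0, 1.0])

s_max = max_score([N, D, C])
s_pca = pca_score([N, D, C])
s_pca2 = pca_score([N, D, C, C])  # chat counted twice, as in Equation 7.12
```

Both PCA variants return non-negative unit vectors with one entry per edge; duplicating the chat column shifts the score towards edges that are dominated by chat.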

7.6 Analysis types

The measured communication was analysed in three different ways: centrality measurements, bipartite graphs and temporal changes.

Centrality measurements

After choosing a conference model and communication score a communication graph was created and the centrality measurements explained in Chapter 6 were performed.

Closeness was not studied further as there were already quite a few centrality measures. Closeness is entirely dependent on distances, and as weights do not correspond to distances between vertices in the model, only the geodesic distance could be used. This means that the weights would be ignored for this measure. Both types of betweenness centrality were tested, using geodesic distances and inverse weights, to see which performed better as a centrality measure. For PageRank and PPR the value c = 0.85 was used, as this is the value suggested in the original paper describing these measures [20]. The effect of changing this variable was not evaluated. Centrality measurements were performed on the entire graph while other measurements were performed on the largest component.

The correlation between measurements was examined: a very high correlation might be bad as one of the measurements then holds no additional information, and a low correlation might be bad as the measurements then do not measure the same thing.

As PPR is, as the name says, personal, it was analysed separately. It was calculated with focus on one specific person, or on all users in a team or an organization.

Bipartite graphs

Bipartite graphs were created for users and roles, users and teams and users and chat channels. These were unweighted graphs except for the graph with chat channels which was weighted by the number of messages written in the channel.


Temporal changes

To see how the graph measurements changed over time, a communication graph was created for each week. Different plots were created comparing the measures diameter, radius, mean distance, number of vertices, number of edges, number of calls, total call duration and number of chat messages. Density was not considered as it is not well defined for weighted graphs.


8 Result

In this chapter the result of the analysis of the communication data from Briteback is presented. Measurements were performed on phone and chat communication between 102 users during a period of 15 weeks. The data was mainly real communication between real people, but there were accounts that did not correspond to actual users that were removed. Not all ’fake’ communication was removed but it should have little effect on the results as it was a very small portion of the data and it behaved similarly to the ’real’ data.

8.1 Conference calls

| Measure | Direct | Conference | All | Proportion |
|---|---|---|---|---|
| Number of vertices | 93 | 44 | 95 | 46% |
| Number of edges | 183 | 122 | 236 | 52% |
| Number of calls | 1,301 | 129 | 1,430 | 9% |
| Duration of calls | 10,942 | 2,314 | 13,256 | 17% |
| sum wn (simple) | 1,301 | 608 | 1,909 | 32% |
| sum wn (shared) | 1,301 | 324 | 1,625 | 20% |
| sum wn (fraction) | 1,301 | 227 | 1,528 | 15% |
| sum wd (simple) | 10,942 | 11,746 | 22,688 | 52% |
| sum wd (shared) | 10,942 | 6,016 | 16,957 | 35% |
| sum wd (fraction) | 10,942 | 4,165 | 15,106 | 28% |

Table 8.1: Different measurements on graphs with only direct calls, only conference calls and all calls. Proportion is the ratio between the value for a graph using only conference calls and the value for a graph using all calls (with the specified conference model).

To evaluate the impact of conference calls, three different graphs were created, measured by phone communication through (1) direct calls, (2) conference calls and (3) all calls. Some simple measurements performed on these graphs are presented in Table 8.1. Note that in the row


| Type | mean | std | skewness | max |
|---|---|---|---|---|
| Direct | 8.41 | 15.7 | 4.33 | 163 |
| Conference, simple | 17.9 | 35.2 | 5.41 | 283 |
| Conference, shared | 10.5 | 20.8 | 5.72 | 189 |
| Conference, fraction | 7.54 | 15.2 | 5.91 | 141 |
| All, simple | 9.27 | 18.5 | 6.11 | 283 |
| All, shared | 8.60 | 16.3 | 4.67 | 189 |
| All, fraction | 8.33 | 15.7 | 4.46 | 163 |

Table 8.2: Statistics for duration of direct and conference calls using different methods for modeling conferences. Direct and Conference are statistics from only the specified type of call. Here each call is counted once, no matter the number of participants.

[Figure: density plot of log(weight) for the weights n, d and c.]

Figure 8.1: The distribution of normalized weights in logarithmic scale

as it has 9% of calls, with 17% of the total call duration, but is responsible for 52% of wd and 32% of wn.

It is also interesting to see how the distribution of weights differs between direct calls and conference calls with the different models. Some key statistics are presented in Table 8.2. As can be seen, the average conference call is more than twice as long as the average direct call. We can also see that using the fractional and shared models causes little change in the distribution of call durations.

The mean number of calls weight is 7.1 for direct calls and 8.9, 6.9 and 6.5 for the simple, shared and fractional model as can be seen in Table A.1. So the mean value increases for the simple model, and decreases slightly for the others. For duration of calls weights these numbers are 60, 96, 72 and 64, meaning that the mean value increases when adding conference calls for all models.

Even with the fractional model, which penalizes conference calls the most, conference calls have a big impact on the model, as can be seen in Table 8.1. To minimize the effect of conference calls on the distribution of wn and wd, the fractional model was used for the rest of the analysis.


Figure 8.2: The distribution of scores in logarithmic scale (x-axis: log(score); y-axis: density; legend: score max, pca, pca2).

weight  mean   std    skewness  max
wn      6.47   16.8   5.95      170
wd      64.0   189    6.39      1.70 × 10^3
wc      40.8   115    5.74      967

Table 8.3: Summary of statistics for the different communication measures.

        wn     wd     wc     smax   spca   spca2
wn      1      0.88   0.39   0.81   0.92   0.79
wd      0.88   1      0.46   0.79   0.94   0.83
wc      0.40   0.46   1      0.83   0.68   0.86
smax    0.81   0.79   0.83   1      0.94   0.97
spca    0.92   0.94   0.68   0.94   1      0.96
spca2   0.79   0.83   0.86   0.97   0.96   1

Table 8.4: The Pearson correlation coefficients of the different communication weights and communication scores.

8.2 Communication Score

It is a reasonable assumption that communication measurements should be similarly distributed, no matter the medium used. The distribution of the communication weights in logarithmic scale can be seen in Figure 8.1 and some statistics for the weights are summarized in Table 8.3. It is worth noting that skewness, which is scale invariant, is very similar for all weight types.
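The scale invariance of skewness can be checked numerically. The sketch below is illustrative only: it assumes the standardized third moment as the skewness estimator (the estimator used for Table 8.3 is not stated here), and the log-normal sample and the factor 60 are made-up stand-ins for a heavy-tailed weight and a change of unit.

```python
import numpy as np

def skewness(x):
    """Standardized third central moment (population form); an assumed estimator."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float((d ** 3).mean() / (d ** 2).mean() ** 1.5)

# Synthetic heavy-tailed sample, not data from the thesis.
rng = np.random.default_rng(1)
w = rng.lognormal(mean=1.0, sigma=1.0, size=10_000)

# Rescaling (for example minutes to seconds) changes mean and std,
# but leaves the skewness unchanged up to floating point error.
skew_original = skewness(w)
skew_rescaled = skewness(60.0 * w)
```

Because the estimator divides the third central moment by the cubed standard deviation, any positive scale factor cancels, which is why skewness can be compared directly across weights measured in different units.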

The communication score should be highly correlated to all types of communication measures and have a similar distribution. The distribution of the different communication scores in logarithmic scale can be seen in Figure 8.2. The max-score has been normalized (the other scores are already normalized). The gray lines are the distributions of communication weights shown in Figure 8.1. The correlation between communication weights and communication scores is given in Table 8.4.


Figure 8.3: Largest component with edges colored by which communication type has the highest normalized value (legend: edge color gives the max weight type c, d or n; edge size gives the max communication score, 0.1 to 0.5).

n     d     c
43%   16%   41%

Table 8.5: Proportion of weight types used with the max function

Max: The proportion of edges in which each weight type is the largest weight is given in Table 8.5. These proportions could be expected to be even, as the weights have been normalized. That the duration weight has a lower proportion is probably because it is heavily correlated with the number-of-calls weight (Table 8.4) while having a higher skewness (Table 8.3). As can be seen in Table 8.4, the max weight is almost equally correlated with all the weight types.
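The max score and the proportions of the kind reported in Table 8.5 can be sketched as follows. This is a minimal illustration, assuming each edge already carries three normalized weights in the column order (wn, wd, wc); the function name `max_score` and the four example edges are hypothetical and not taken from the thesis data.

```python
import numpy as np

def max_score(weights):
    """weights: array of shape (edges, 3) with normalized w_n, w_d, w_c per edge."""
    score = weights.max(axis=1)        # max communication score per edge
    winner = weights.argmax(axis=1)    # index of the weight type that is largest
    share = np.bincount(winner, minlength=weights.shape[1]) / len(weights)
    return score, winner, share

# Four made-up edges; columns are (w_n, w_d, w_c).
W = np.array([[0.10, 0.05, 0.00],
              [0.02, 0.03, 0.20],
              [0.00, 0.00, 0.40],
              [0.30, 0.25, 0.10]])
score, winner, share = max_score(W)
```

In this toy input the duration weight is never the largest because it closely follows the number-of-calls weight, which mirrors the explanation given above for its low proportion.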

Figure 8.3 shows, for each edge, which communication type is used most relative to how much that medium is used overall. The size of an edge is relative to the max communication score. The figure shows that chat is most common in the center of the network, while longer calls dominate between the dense areas and outliers are connected by shorter calls. This is probably not something that will be found in general; it could be that the central vertices are users who use the application as their main communication tool, while less central users are newer users who only use the application for phone communication.

PCA: As can be seen in Table 8.4, the correlation between the phone weights and the pca weight is very high, while the correlation between the pca weight and the chat weight is quite low. PCA minimizes the variance left unexplained by the principal component, and as calls contribute two highly correlated dimensions, unexplained phone weight variance is punished more heavily than unexplained chat weight variance.

A way to work around this is to add the chat weight vector a second time to the M matrix explained in Section 7.5. This gives the weight called pca2 in Table 8.4 and, as can be seen, it is more evenly correlated to all three measures (and highly correlated to the max weight).
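The effect of duplicating the chat column can be illustrated with a small sketch. It rests on several assumptions that are not confirmed by this section: that M holds one standardized column per weight, that the score is the projection onto the first principal component, and that synthetic Gaussian data (two strongly correlated call weights and one weakly correlated chat weight) is an adequate stand-in. The helper `first_pc_score` is hypothetical.

```python
import numpy as np

def first_pc_score(M):
    """Project the rows of M onto its first principal component."""
    X = (M - M.mean(axis=0)) / M.std(axis=0)   # center and scale each column
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    v = vt[0]
    if v.sum() < 0:                            # fix the arbitrary sign of the component
        v = -v
    return X @ v

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic weights, not thesis data: wn and wd share most of their variance.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(2, 2000))
wn = z1 + 0.3 * rng.normal(size=2000)
wd = z1 + 0.3 * rng.normal(size=2000)
wc = 0.4 * z1 + z2

s_pca = first_pc_score(np.column_stack([wn, wd, wc]))
s_pca2 = first_pc_score(np.column_stack([wn, wd, wc, wc]))  # chat column added twice

print(corr(s_pca, wn), corr(s_pca, wc))
print(corr(s_pca2, wn), corr(s_pca2, wc))
```

With the chat column present twice, chat carries as many dimensions as the two call weights, so the first component is pulled toward it and the resulting score correlates more evenly with all three weights.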
