
U.U.D.M. Project Report 2020:31

Degree project in mathematics, 15 credits. Supervisor: David Sumpter

Examiner: Veronica Crispin Quinonez. August 2020

Department of Mathematics Uppsala University

Analysis of Instagram Stories: A quantitative study of Instagram Stories using PCA

Linnea Eklund


Abstract

The online presence of human activity is constantly increasing, and with it the opportunity to investigate human psychology through big data frames containing user footprints. This offers a unique insight into human behaviour and has opened up a new research field, mainly dominated by big companies like Facebook, which use the user data to improve experiences on their sites. This thesis analyses three methods of data extraction and how well they work on a data frame containing the Instagram Stories of 28 users. The Instagram Stories posted by these 28 users were categorised into 17 categories in order to find patterns and correlations in the data. The three methods for analysing the data were general statistics, a correlation matrix and principal component analysis (PCA). The results of the study show that none of the investigated methods fully accounted for the diversity in the original data and that they consequently did not extract sufficient information from the given data frame.


Contents

1 Introduction

2 Background
 2.1 Instagram
 2.2 Singular Value Decomposition
 2.3 Principal Component Analysis
 2.4 Pearson correlation coefficient

3 Statement and research question

4 Method
 4.1 Collection of data
 4.2 Categorisation
 4.3 General statistics
 4.4 Correlation matrices
 4.5 PCA

5 Results
 5.1 General statistics
 5.2 Correlations within the data
 5.3 Results from PCA
 5.4 Effects of the pandemic

6 Discussion
 6.1 General Discussion
  6.1.1 General statistics
  6.1.2 Correlation matrix
  6.1.3 PCA
  6.1.4 What useful information could be extracted about users
  6.1.5 To what degree could we categorise people on the basis of this information?
 6.2 Conclusions

7 Appendix
 7.1 Codes

8 References


1 Introduction

Everything we do on the internet leaves a digital footprint. In today's digital era, the proportion of activities mediated by digital services is constantly growing. As a consequence, the amount of information held in our digital footprints is also increasing, as more data concerning our online activities is collected. Everything from paying a bill to posting a picture on Facebook ends up in our digital footprints (Kosinski et al. 2016). Social media apps form a big part of our online activity and influence the lives of many people. Since the launch in October 2010, more than one billion active users (an active user is defined as one who used the app at least once a month) have registered an account on Instagram, making it the 6th most popular social media site in 2020. That Instagram has a major impact on the lives of many people is no exaggeration: over 500 million people use Instagram every day (Omnicore 2020). However, the way we portray ourselves and how we use the functions that the site offers differ from person to person, which makes this an interesting medium to investigate.

As social media plays a bigger role in our daily lives as a platform for communication, a new channel for understanding human behaviour is opening up: examining the user data. A recent study of the digital footprint formed by a person's Facebook likes showed that this footprint could accurately predict the personality of the user in just over 50% of the studied cases, indicating a correlation between a person's digital footprint and their personality (Youyou et al. 2015). Many big companies, including Facebook, use these digital footprints to further understand their users and thereby improve the user experience on the platform. This has become a growing research field, with increasing interest in questions such as how much of a person's personality we can detect and illustrate through social media, and how this technology can help us understand ourselves and our online behaviour. An advantage of studying personality, or other areas related to human psychology, with big data frames is that it facilitates the process of discovering patterns while also reducing the risk of sampling errors, a recurring problem in the social sciences (Kosinski et al. 2016).

The goal of this thesis is to further investigate how we can increase our knowledge and understanding of people by analysing the digital footprint consisting of posted Instagram Stories. This will be done through general statistics, a correlation matrix and principal component analysis (PCA), a dimension reduction technique that presents the most relevant information from the data.

2 Background

2.1 Instagram

Instagram is a social media app owned by Facebook. It was created in 2010 and has since grown rapidly in popularity. At its core Instagram is an app for posting pictures, although today it has more features, including Instagram Stories (Instagram 2020a).

On Instagram there are two options for picture sharing: you can post a picture or video in the Instagram feed, or you can post it on Instagram Stories. Instagram Stories can be compared to Snapchat stories, which have the same functions and are typically used in the same way. They are meant to be a fast way of communicating in which you can easily share your everyday moments. The idea is to capture events and moments throughout the day and collect them on your Instagram Story for your followers to see. When posting a picture in the Instagram feed, everyone can see the number of likes and comments that the picture generates, which has increased the demands for perfection set by the users as they strive for the numbers of likes and comments to grow. Compared to pictures posted in the ordinary Instagram feed, pictures posted on Instagram Stories are not expected to have the same level of perfection; the Stories are rather an instrument for posting pictures in the moment. On Instagram Stories the picture is visible to your followers (or to the public, if you have a public account) for 24 hours, after which it disappears, something that does not happen to photos posted in the feed. Because of this feature the effort put into editing and perfecting the picture decreases, and the Stories become a fast track for online social interaction (Instagram 2020b). This tendency is reinforced by the fact that followers can only react to your Instagram Story in private conversations.

2.2 Singular Value Decomposition

SVD, singular value decomposition, is a dimension reduction technique that was established in the late 19th and early 20th century by mathematicians such as Beltrami, Jordan, Sylvester, Schmidt and Weyl (Stewart 1993). The idea behind the technique is to collect the main variation in a large data matrix and present it in a smaller dimension, thereby providing a more efficient tool for analysis. The foundation of SVD is that one matrix can be described as a product of three separate matrices. Given a real matrix A of size m × n, the following holds:

A = U S V^T,   (1)

where U, S and V^T are three different matrices. S is a diagonal matrix consisting of non-negative singular values; apart from the diagonal, all other elements are zero. The diagonal elements are ranked in declining order, with the most important element (the one containing the most variation) at the top of the diagonal. The matrix U contains the left singular vectors and V^T the right singular vectors; both U and V are orthogonal. The dimension wanted in the analysis is referred to as k, and the dimensions of the matrices are then U (m × k), S (k × k) and V (n × k). If k ≥ rank(A), the product USV^T reproduces the original matrix A exactly, while for k < rank(A) the product USV^T gives an approximation of the matrix, as the dimension has been reduced (Kosinski et al. 2016).

SVD and its subcategories are among the most used applications of linear algebra today. One advantage of SVD in contrast to other reduction techniques is that the SVD is guaranteed to exist for any matrix A with real entries (Peters 2019).
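To make equation (1) concrete, the decomposition and a rank-k approximation can be computed in a few lines of numpy. This is a minimal sketch, not part of the thesis code; the 6 × 4 matrix is an invented example:

import numpy as np

# Hypothetical 6 x 4 data matrix, used only to illustrate equation (1).
A = np.random.rand(6, 4)

# numpy returns U, the singular values s (in descending order) and V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: keep only the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# With all singular values kept (k = rank(A)) the product reproduces A exactly.
A_full = U @ np.diag(s) @ Vt
print(np.allclose(A, A_full))   # True
print(np.linalg.norm(A - A_k))  # approximation error for k = 2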

2.3 Principal Component analysis

Principal component analysis, PCA, was first invented by Pearson in 1901. Like SVD, PCA is a dimension reduction method that calculates the subspace in which the data approximately lies. PCA can be described as a special case of SVD in which the data is pre-processed in two steps before performing the SVD: first a mean subtraction is applied, and then the variance is set to unity. PCA is one of the most utilised dimension reduction techniques and is frequently used in a variety of research fields (Peters 2019).


Figure 1.1: Data measurements, with the vector u drawn.

PCA is calculated in the following steps. Given a collection of data x, where x ∈ R^n, and a required dimension k < n, the PCA algorithm starts with a two-step normalisation.

First, let the mean value be

μ = (1/m) ∑_{i=1}^m x^(i)   (2)

and replace each x^(i) with

x^(i) − μ.   (3)

Then let the squared standard deviation of each coordinate j be

σ_j^2 = (1/m) ∑_i (x_j^(i))^2   (4)

and replace each x_j^(i) with

x_j^(i) / σ_j.   (5)

In the first two steps (equations 2 and 3) the mean of the data is zeroed out, while in the last two steps (equations 4 and 5) the coordinates are rescaled to give them unit variance. This guarantees that all attributes are handled on the same scale. For example, the number of posted stories per day takes values between zero and 20 while the number of followers takes values in the low hundreds, so for these attributes to be comparable a normalisation is necessary. After this step, all values in Σ, the covariance matrix, lie between −1 and 1.

The second step in the PCA algorithm is to compute the major axis of variation, u. The vector u is the unit vector for which the variance of the projected data is maximised when the data is projected in the direction of u; in other words, u is the vector in the direction of the greatest variation of the data. The purpose of this step is to capture as much of the variance as possible in the subspace spanned by u. When the distance from all projected points x^(i) to the origin is maximised, the variance captured by the vector u is also maximised. With a two-dimensional data set we can visualise the process. In figure 1.1 we see a number of points, each representing a measurement. We want our line, the vector u, to maximise the distance from the projected points to the origin. Hence the line shown in figure 1.1 is a good example of u, since the projected points are far from the origin and have a relatively large variance. If the vector u had been rotated 90 degrees, the projected points would have been closer to the origin and the variance would have been smaller. So we want u to automatically point in the direction with the most variance.


To summarise this step: given the data points x^(i) and a unit vector u, we want to choose u so as to maximise the variance of the projections of the x^(i) onto u. The projection of x onto u lies at distance x^T u from the origin, and the square of x^T u is used to eliminate negative numbers, since we want to maximise the distance of all points from the origin. The unit-length vector u is selected by maximising equation 6:

(1/m) ∑_{i=1}^m ((x^(i))^T u)^2 = (1/m) ∑_{i=1}^m u^T x^(i) (x^(i))^T u = u^T ( (1/m) ∑_{i=1}^m x^(i) (x^(i))^T ) u,   (6)

where

(1/m) ∑_{i=1}^m x^(i) (x^(i))^T = Σ, the covariance matrix.   (7)

The inner factor in the last step of equation 6 is the covariance matrix, as shown in equation 7. As previously stated, the entries of the covariance matrix lie between −1 and 1, and the maximisation of equation 6 can therefore be written as in equation 8, where the constraint keeps u a unit vector:

max_u u^T Σ u, subject to u^T u = 1.   (8)

A maximiser of this constrained problem must satisfy

Σ u = λ u.   (9)

This shows that u is an eigenvector of the covariance matrix Σ; the maximising u is the eigenvector corresponding to the largest eigenvalue λ_1, and its norm ||u|| equals one because of the constraint.

In conclusion, to project the data into a k-dimensional space (k < n), the first k vectors u_1, ..., u_k are taken to be the top k eigenvectors of the matrix Σ. In PCA these eigenvectors are referred to as principal components, and they form a new, orthogonal basis for the data (Ng 2008). To calculate each remaining u, the data is projected onto the subspace orthogonal to the principal components already found, and the process is repeated.

The last step in the PCA algorithm is to compute the corresponding vector that represents x^(i) in this new basis:

y^(i) = ( u_1^T x^(i), u_2^T x^(i), ..., u_k^T x^(i) )^T.   (10)

The new matrix y provides a k-dimensional approximation of every point x^(i).

Each u_i^T represents a principal component that, combined with the original point, gives the new position in the approximated matrix (Ng 2008). The first principal component is the eigenvector (sometimes referred to as singular vector) that corresponds to the largest singular value (Peters 2019). The principal components are sorted hierarchically: PC1, the first principal component, holds the greatest variance and gives the most information about the original data. The second principal component, PC2, is orthogonal to PC1 and describes the greatest variance in the subspace orthogonal to PC1. There exist as many principal components as dimensions in the original data, and the analyst chooses how many PCs to use to represent the data. In this thesis only the first two principal components, u_1^T and u_2^T, will be studied.


An advantage of PCA is that it requires only a single eigenvector calculation, making it a highly usable tool for interpreting statistics (Ng 2008). PCA is usually applied to a data set in which each experiment or measurement provides one row vector of what becomes the data matrix x (Peters 2019). One of the most used applications of PCA is facial recognition, through the eigenface method and other image reconstruction techniques. It can also be used to de-noise data sets, and as a pre-processing step before a supervised learning algorithm in order to reduce the number of dimensions. Lastly, it is widely used across research fields to represent the approximated data in cluster plots (Peters 2019; Ng 2008).
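The algorithm described above can be summarised in a short numpy sketch. This is a hand-rolled illustration under the notation of this section, not the sklearn implementation the thesis itself uses (see section 4.5), and the 28 × 17 input matrix is an invented stand-in for the real data:

import numpy as np

def pca(X, k):
    # Step 1: normalisation -- zero out the mean (eqs. 2-3) and rescale
    # each coordinate to unit variance (eqs. 4-5).
    X = X - X.mean(axis=0)
    X = X / X.std(axis=0)

    # Step 2: covariance matrix (eq. 7) and its eigendecomposition;
    # the eigenvectors u_1, ..., u_k are the principal components.
    cov = (X.T @ X) / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort in descending order
    components = eigvecs[:, order[:k]]

    # Step 3: project every point onto the new basis (eq. 10).
    return X @ components

X = np.random.rand(28, 17)   # invented 28-user x 17-category matrix
Y = pca(X, k=2)
print(Y.shape)               # (28, 2): one two-dimensional point per user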

2.4 Pearson correlation coefficient

The Pearson correlation coefficient is a statistical measure of linear correlation. It is frequently used in scientific research and is the default choice in many software libraries, including Python's pandas. It measures the linear correlation between two variables, x and y.

p_xy = ( n ∑ x_i y_i − ∑ x_i ∑ y_i ) / ( √( n ∑ x_i^2 − (∑ x_i)^2 ) √( n ∑ y_i^2 − (∑ y_i)^2 ) )   (11)

The correlation coefficient is given by equation 11, where n is the number of measurements and p_xy is the Pearson correlation coefficient between x and y. The result is a value between −1 and 1: at 0 there is no linear correlation between the variables, 1 means a total positive linear correlation and −1 a total negative linear correlation.

p_xy = cov(X, Y) / (σ_x σ_y)   (12)

The Pearson correlation coefficient can also be described in terms of the covariance and the standard deviations of x and y, as in equation 12 (Mathworld.wolfram 2020).
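As a quick check of equation 11, the coefficient can be computed by hand and compared with numpy's built-in version, which uses the covariance form of equation 12. The two short series below are invented for the example:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

# Equation 11 written out directly.
p_xy = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (
    np.sqrt(n * np.sum(x ** 2) - np.sum(x) ** 2)
    * np.sqrt(n * np.sum(y ** 2) - np.sum(y) ** 2))

print(p_xy)                      # 0.7745966692414834
print(np.corrcoef(x, y)[0, 1])   # same value via the covariance form (eq. 12)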

3 Statement and research question

The aim of this thesis is to investigate how three different methods of data extraction work on a given data set. The analysis will be performed on a data frame describing the usage of Instagram Stories, in three steps: first, general statistics will be calculated in order to find which statistics best describe the Instagram users; secondly, a correlation matrix will be constructed with the intention of finding correlations between the types of posts the Instagram users make; and thirdly, we will investigate whether PCA can be used to understand different patterns in posting Instagram Stories. All these aspects will be taken into consideration during the investigation. The analysis will be made from the following research questions:

1. Can these methods extract useful information about users?

2. To what degree can we categorise people on the basis of this information?


4 Method

4.1 Collection of data

The data for this thesis was collected every day over a 31-day period. During this time the Instagram Stories of 28 persons were studied in relation to the categories defined beforehand (see section 4.2). The following requirements were taken into consideration before selecting the 28 accounts:

- All of the accounts were private, in the sense that they do not belong to public figures, institutions, companies or other public organisations.

- The person behind the account was a friend or acquaintance of the author of this thesis, so the account was already being followed by the author on Instagram.

- None of the accounts had a clear theme such as photography, art or food. The aim was to find accounts with general content which would represent the owner as a person.

- The accounts were active and used Instagram Stories at least once during the investigated period.

Out of the accounts that fulfilled these requirements, 28 were selected for the study through a process of randomisation. The data was collected twice a day, morning and evening, in order to fully categorise all posts.

4.2 Categorisation

Before the data collection, 17 categories were constructed in order to categorise the users. These categories were constructed with the intention of capturing as much variation as possible in the stories, so as to represent the full variation in our diverse reality as accurately as possible.

If a subject posted several stories concerning the same content, they were all categorised individually, to properly show the proportions between the categories for each subject. If a picture belonged to two categories, the one considered the main focus was chosen. If the picture had a caption that was contradictory in relation to the picture's category, the text was given higher priority in the categorisation process. The categories are: animals, humour, family, photography, food, nature, hobbies, design, interaction, work, friends, politics, travel, selfies, social events, exercise and nostalgia. In figure 1 a full overview of all categories with corresponding explanations is presented.
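The categorised data can be pictured as a 28 × 17 count matrix, with one row per user and one column per category. The following pandas sketch is purely illustrative (the real data frame, with Swedish column names, appears in the code in section 7.1):

import pandas as pd

categories = ["animals", "humour", "family", "photography", "food", "nature",
              "hobbies", "design", "interaction", "work", "friends", "politics",
              "travel", "selfies", "social events", "exercise", "nostalgia"]

# One row per studied user, one column per category; entries count stories.
data = pd.DataFrame(0, index=["Person %d" % i for i in range(1, 29)],
                    columns=categories)

# Recording one observed story (invented example): person 3 posts a selfie.
data.loc["Person 3", "selfies"] += 1

print(data.shape)         # (28, 17)
print(data.sum(axis=1))   # total number of stories per person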

4.3 General statistics

To get an overview of the data, some general statistics were calculated. During the period of data collection the results were inserted into Excel and later imported as a csv file into Python, using the JetBrains PyCharm IDE. To complete the calculations, the packages pandas, matplotlib.pyplot and numpy were used. This section is meant as an introduction to the data. The following calculations were made in this part: the number of posts per day, per week, per person and per category, as well as a closer look at two persons who stood out in regard to the number of posted stories.

4.4 Correlation matrices

To get a broader view of the data, a correlation matrix was calculated to find the strongest correlations between the categories in the studied data. This was done on three different data sets: the first data set contained the original data as it was; in the second data set the data was normalised per user posting the stories; and in the third data set the data was normalised per category. The results from the third data set did not add any new information to the analysis and were therefore not included in the thesis. The correlation matrix for that data set, along with its corresponding PCA plots, can be found in the appendix, figures 12, 13 and 14.

Figure 1: Table showing all of the categories used in the thesis along with a short explanation.

The correlation matrices were created in Python using the JetBrains PyCharm IDE. The packages pandas, matplotlib.pyplot and seaborn were used in the calculation. After loading, the data was converted into a correlation matrix and later visualised with a heatmap. The corr() method in pandas uses the Pearson correlation coefficient for the statistical calculations. For the full code for this process, see Code for correlation matrices in section 7.1.

4.5 PCA

In order to accomplish the goal of this thesis, principal component analysis was used to represent the data. Through PCA it is possible to calculate the most important correlations within a collection of data and represent them in a two-dimensional graph, which facilitates further analysis of the data; this is why the method was chosen. The data obtained by the study was collected in Excel and later, as with the correlation matrices, transferred into Python using the JetBrains PyCharm IDE. The PCA was calculated in Python using the class PCA from the package sklearn.decomposition. This class centres the data without changing the internal relations between the data features and then applies the SVD algorithm (Scikit-learn 2019). The PCA was then compiled into a two-dimensional graph, or cluster plot, to facilitate the interpretation of the results; for the full version of the code, see PCA code in section 7.1. Apart from the cluster plot, two other plots were made to illustrate the results: a heatmap of the component weight for each category (Component weight code) and a bar chart of the variance contained in each principal component (Explained variance code). These codes are also in section 7.1.

5 Results

5.1 General statistics

Figure 2: Descriptive statistics

The data acquired during the collection period reached a total of 1115 Instagram Stories.

For an overview of the data, see figure 2. Spread over our 28 studied accounts, this gives an average of 39.82 posts per person during the data collection period. Concerning the categories, the highest number of stories in a category is 120 and the lowest is 10; the mean value is 65 posted stories per category and the median is 61 stories. In figure 3 the distribution of the number of stories posted per person is shown. The lowest number of posted stories is 8 while the highest number is 193. The mean value of posted stories is 39.8 while the median is a bit lower, 34 posted stories. In figure 3 we see that about a third of the persons (10 people) posted slightly over 32 stories in total during the period.

Figure 3: To the left: Number of posted stories per person. To the right: Number of posted stories per day.


On average, 34.8 stories were posted each day of the collection period. The highest numbers of stories were posted on the first and the 20th day (96 and 57 stories), a Tuesday and a Sunday. The number on the first day can be explained by the fact that there may have been posts from the day before the data collection started that were still active (still visible, since stories disappear after 24 hours) and were therefore collected and categorised. The 20th was a Sunday, and in general the number of stories increases during the weekend, possibly as a consequence of people having more free time. During this day the categories politics and friends stood out, with 22.8% and 14% of the posted stories. The following categories did not have any posts this day: animals, interaction, travel, social events and selfies. In figure 3 the total number of stories posted each day during the data collection period is shown.

The lowest number of Instagram Stories, 17, occurred on the 29th day, a Tuesday. On this day the categories family, photography, design, friends, politics and social events all had zero stories. The most popular category was nature with 35.3% of the posted stories, followed by food, interaction, nostalgia, exercise and selfies, all with 23.5%. The highest number of posts stands out more from the data than the lowest, and there were in fact a number of days where the total number of posts was around 20. For a closer look at the distribution of the data, a count of the posts per week was made. During week one, 256 stories were posted; the respective numbers for the following weeks are 218, 265, 258 and 118 (the last week only consisted of 4 days, with no weekend, which explains the lower number), a relatively balanced distribution. In other words, even though the number of posts varied a bit throughout the collection period, the weekly average remained stable, indicating no larger differences in the data over the investigated time.

Figure 4: Distribution of categories

Out of the investigated categories, the most popular were food, with 10.8% of the total number of posts, friends (10.5%) and politics (10.5%). The proportions between the categories are illustrated in figure 4. The least popular categories were nostalgia (0.9%), travel (1.5%; in section 5.4 we discuss the effects of the pandemic on this specific category) and exercise (2.7%).

Person number 17 had the highest number of stories, with a total of 193, while person 20 had the lowest, with a total of 8 stories. As these people represent two extremes, they will be given a deeper analysis.

Figure 5: To the left: The distribution of posted stories by person 17. To the right: The distribution of posted stories by person 20.

When investigating the posted stories of person 17, we find that she/he has a preference for the categories politics (27.4%), nature (12.6%) and photography (12.6%), while rarely (or never) posting stories related to nostalgia (0%), travel (0.6%) or family (0.6%). In figure 5 the data on person 17 is shown. The least number of Instagram Stories was posted by person 20: in total, 8 stories were published during the time of the investigation. The biggest category in the stories of person 20 is friends, with 37.5%. After this main category the distribution among the other posted stories is even, with 12.5% of the posted stories belonging to each of the categories family, food, hobbies, work and travel. The following categories were absent from the data of person 20: animals, humour, photography, nature, design, interaction, politics, selfies, social events, exercise and nostalgia. In figure 5 the posted stories by person 20 are visualised. Comparing persons 17 and 20, we find that both have one main category (politics and friends respectively) and a number of smaller ones. However, due to the larger number of stories posted by person 17, this individual managed to include almost all of the categories and shows a larger spread in the variation of the Instagram Stories.

5.2 Correlations within the data

In the previous section the general statistics of the data matrix were presented; in this section the focus is on finding correlations within the data. A correlation matrix was created using the Pearson correlation coefficient in order to discover the strongest correlations between the categories. Two different correlation matrices will be presented: the first for data set one (the original data) and the second for data set two (the data normalised per user posting the stories).

Figure 6: Heatmap showing the results of the correlation matrix.

As shown in figure 6, the strongest correlations in data set one are between the categories photography and politics (0.85), followed by photography and nature (0.81), photography and selfies (0.78) and selfies and nature (0.72). The categories interaction, hobbies and food also have relatively high correlations with a number of other categories. Some correlations have negative values, which indicates a negative correlation between the categories. The greatest negative correlation is found between work and travel, with a value of −0.24. The categories animals and family otherwise have the highest numbers of negative correlations. Between many of the categories the correlation is close to zero, which indicates that there is no linear dependence between the categories and therefore no statistical relation. The diagonal of the matrix is the correlation between a category and itself, resulting in a perfect positive linear dependence with value 1.

Figure 7: Heatmap showing the results of the correlation matrix of data set 2

The correlation matrix of data set two, normalised per person, shows much lower correlations than the previous matrix. The strongest correlations are between work and interaction (0.45), nostalgia and design (0.42), family and interaction (0.36) and exercise and hobbies (0.34). Overall, the categories interaction and family occupy the largest number of positive correlations. It is interesting to note that many of these categories are small ones, with few posted stories. The negative correlations in the matrix are almost as strong as the positive ones, with the correlation between social events and family (−0.39) as the greatest negative correlation, followed by friends and humour (−0.38), exercise and selfies (−0.36) and exercise and humour (−0.35). Overall, the highest numbers of negative correlations are found for the categories humour and social events.

5.3 Results from PCA

Figure 8: Cluster plot of the two dimensional data approximation performed by the PCA algorithm on data set one

The PCA was calculated in order to find patterns in the data; in figure 8 the result of this method is presented. Each dot represents a different individual in the coordinate system given by the PC1 and PC2 axes. At this point it is essential to remember that the PCA axes are ranked according to their importance in representing the original data. Hence, given a point a = (0, 0) with equal distances to two other points b = (0, 1) and c = (1, 0), the original point a differs more from point c, which lies along the PC1 axis, than from point b, which lies along the PC2 axis. Figure 8 shows that the majority of the persons are located between the values −2 and 1 on PC1 and between −2 and 2 on PC2. However, person 17, who posted the most stories (see section 5.1), is located at (9.8, −2.5), and three other people, at (1.5, 4.7), (2.7, 0.2) and (3.0, 1.4), are located further out along the PC1 axis. These people show great differences in their usage of Instagram Stories compared to the remaining 24 persons.

It is worth mentioning here that all of these four people posted a lot of stories: 193, 78, 95 and 69 respectively. This is high compared to the average of 39.8 posted stories. The 7 dots in the bottom left corner (between values −2 and −1 on PC1 and −2 and −0.5 on PC2) represent 6 of the 7 people who posted the fewest stories (fewer than 20). So what we see is a spread from the persons who were the least active in posting stories, all clustered together in the bottom left corner, to the individuals represented by dots spread further out over the PC axes as the number of stories increases.


Figure 9: To the left: Scree plot of the percent variance captured in each principal component. To the right: Component weight of each category in the first three principal components.

In order to fully understand the PCA plot, it is essential to understand what information the different principal components contain. In figure 9 the contribution from each category to the first three principal components is visualised. The first principal component, PC1, contains the variance of the categories photography, nature, hobbies and politics. In other words, the people that differ a lot along the PC1 axis have bigger differences concerning these categories. In the second principal component, PC2, the categories social events, friends and food are highly influential, in contrast to photography, travel and nature. Lastly, the third principal component, PC3, expresses the variance of the categories family and travel, in contrast to work and interaction.

To evaluate the accuracy of the model given by the PCA, it is necessary to look at the percent variation: how much of the variation was captured by each principal component. In figure 9 the percent variation is represented in a bar chart. PC1, the principal component containing the most information about the variation of the data, accounts for 30.1% of the total information. PC2 accounts for 12% and PC3 for 11.5%. The last principal component, PC17, contains 0.14% of the information. In figure 8 the data is visualised by PC1 and PC2; however, these components only contain 41.6% of the information.

Lastly in this section we examine data set 2, where the data has been normalised per Instagram user. This data set gives a different formation in the PC plot than the first data set. Here the majority of the persons are located between values −1 and 1 on PC1 and between −2 and 1 on PC2. However, the number of outliers is higher here, which makes the distinction between clusters and individual outliers more difficult. The pattern found in data set one cannot be found here, as persons with both high and low activity are found throughout the component axes.

The component weights for the principal components of data set 2 are represented in figure 11. Here it is visible that PC1 contains most variance in the categories work, exercise and interaction, in contrast to humour and travel. Concerning PC2, we see that family, photography and interaction stand in contrast to social events and animals. Finally, design and nostalgia are contrasted by selfies, politics and animals in PC3.


Figure 10: Cluster plot of the two-dimensional data approximation performed by the PCA algorithm on data set 2.

After evaluating the results of the PCA of the second data set, including the amount of variance each category contributes to each principal component, the next step is to look at how much variance is captured by each principal component. In figure 11 the explained percent variation is demonstrated. In data set two, the first principal component stands for 14.6% of the total variance in the data, the second principal component for 11.6% and the third for 10.2%. The last principal component, PC17, contains 2.2% of the total variation. The cluster plot of the PCA therefore explains 26.2% of the variance within the data. In other words, almost 75% of the information in data set two disappeared in the PCA process when the first two components were selected to describe the data. Looking at the variance captured by the PCs, in data set two the variance is more evenly spread throughout the PCs, while in data set one the percent variance declines faster.

5.4 Effects of the pandemic

During the period of data collection the world was partially paralysed by the spread of the coronavirus Covid-19 (Joachim Kerpner 2020). This was an unforeseen event which has to be taken into consideration when interpreting the results. During the second week of the collection period the pandemic hit Sweden, the location of Uppsala University, and this may have affected the outcome of the study. Among the most important changes in society that may impact the study are the transition to distance education at all universities in Sweden, as well as an overall recommendation to work from home, which affected many of the persons in the study (Public Health Agency of Sweden 2020). From this we see an increase in the category work (which includes studies) as people change their work habits.

The category travel has also been affected by the pandemic, since the Ministry for Foreign Affairs advised against all non-essential travel (Swedish Ministry of Foreign Affairs 2020). This may also have affected other categories, as well as the total number of posted stories, but due to the short period of data collection before the outbreak of the pandemic it is not possible to draw any conclusions about changes in the pattern of usage of Instagram Stories.

Figure 11: To the left: Scree plot of the percent variance captured in each principal component for data set 2. To the right: Component weight of each category in the first three principal components for data set 2.

6 Discussion

6.1 General Discussion

6.1.1 General statistics

In the first stage of the investigation, the general statistics were calculated in order to find which statistics best describe the Instagram users. The general statistics provide a rather complete picture of our data, and just by studying the number of posts per person and per category we get a good idea of the general patterns in the data. From this section we see that one category in particular, politics, was more influential than the others. As stated in the method, accounts with a clear theme were not included in the research; however, the people who posted in politics generally posted more than one post in a row, and many times the discussed theme would reappear in the following days. In section 5.1, general statistics, we also saw that on the day with the most posts politics had a majority of the posts, while on the day with the fewest posts there were no posts in the category. This tendency remains when comparing the two people with the most and least posts respectively, indicating that this category has a great influence on the patterns in the data.

6.1.2 Correlation matrix

The second stage of the investigation was to calculate a correlation matrix to find relations between the types of Instagram Stories that the users make. In the first data frame the category photography holds the three strongest correlations, with politics, selfies and nature. The last two relations seem quite intuitive, but the first one, politics, has no clear connection with photography. When we compare this correlation matrix with the matrix normalised per person, the latter has much lower correlations. This indicates that a few people, the ones with the most posts, had a high influence on the first correlation matrix. The categories with the highest correlations in data set two (interaction and work, design and nostalgia, and interaction and family) are, apart from work, all relatively small categories; this is also true for photography, the category with the strongest correlations in data set one. This result shows that the smaller categories hold more correlations than the bigger ones, which contain more data. To summarise, from the correlation matrices a lot of patterns appear that all contribute to a better understanding of the data. However, the data from the correlation matrices alone is not sufficient to accurately describe a user based on their Instagram Stories.

6.1.3 PCA

In the last stage of the investigation, the PCA was calculated with the intention of understanding the different patterns in posting Instagram Stories. In order to evaluate the method, the level of captured variance in each principal component can be studied. These levels are very low for both data sets: 41.6% and 26.2% of the variance is captured in the first two principal components. This signifies that the PCA did not capture enough variance in the data to give a reliable analysis, and it is therefore not a suitable method for this particular investigation with its corresponding data frame. However, the information captured from the data can still give some indications about the patterns in the data.

In the correlation matrix computed on data set one we saw that the people who post most frequently had a great influence on the outcome of the correlations. This pattern is present in the PCA as well: in the PCA plot of data set one the people who posted the most stories tend to lie further out to the right on the PC1 axis, while the people with lower activity are situated in the bottom left corner. PC1 seems to give more information about the level of activity of the person than any other characteristic, indicating that this is the biggest difference in the data. But as we study the categories contributing the most variance to PC1 (photography, nature, hobbies and politics), we see that these correspond closely to the most popular categories of person 17. This person posted 17.3% of the total number of posts and seems to have had a great impact on the outcome of the PCA. So the PCA on data set one gives a twofold result: on one hand we see that one person had a great influence on the outcome of the procedure, and on the other hand the pattern of activity level along the PC1 axis reinforces that this does seem to be the biggest difference in the captured data. This pattern corresponding to the level of activity was not found in data set two, which was normalised per user.

The normalisation performed on data set two shows great differences compared to data set one, both in the correlation matrix and in the PCA. Here it is important to reflect on the effects of the normalisation on the outcome of the data. When we normalise the data, in this case per Instagram user, the influence of the persons with lower activity increases. When the activity is lower, the time period of the investigation has to be extended in order to give representative results of the Instagram activity. As this thesis had a 31-day collection period, the activity of the persons with the lowest Instagram activity is not fairly portrayed. Because of that, the normalisation process may give some posts, made by users with low activity, more influence on the outcome of the calculations. However, when no normalisation is made, one individual can greatly impact the calculation, which also leads to misleading results.

6.1.4 What useful information could be extracted about users

The information extracted with both the general statistics and the correlation matrices offers useful information about the data and provides a good foundation for analysis. The PCA was, as discussed in section 6.1.3, not a suitable method for this investigation. Many factors can lie behind this, but one is the size of the investigation: PCA is typically used by big companies like Facebook, with access to massive data frames, and our, in comparison, very small data frame may not contain enough data to properly calculate the patterns. Another aspect that may impact the analysis is the categories. There are 17 categories in the investigation, and these may not have been sufficient to capture the complexity of our posting patterns. For a better approximation of the data, more, and more specific, categories might help. The categories used in the thesis are quite wide, and a category such as politics can differ a lot in content depending on the user, so there may be a point in narrowing the categories. This would help the analysis, but the collection of data would become more difficult, as more personal judgement would be needed for the categorisation, which decreases the reliability of the investigation. A constant debate during the work on this thesis has been whether or not to normalise the data. The normalisation changed the results a lot, which makes the decision more difficult. One way to avoid this debate would have been to use a self-normalising method: for example, data could be collected from more people, but the people posting the most and the least could be removed from the results. This would normalise the data without altering the result in a too drastic way.
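To make the self-normalising idea concrete, here is a hedged sketch, assuming a pandas user × category count matrix like the one sketched in section 4.2; the 10% cut-offs are invented for illustration, not taken from the thesis:

import pandas as pd

def trim_extremes(data, frac=0.1):
    # Total number of posted stories per user.
    totals = data.sum(axis=1)
    # Drop users below the lower and above the upper activity quantile.
    lo, hi = totals.quantile(frac), totals.quantile(1 - frac)
    return data[(totals >= lo) & (totals <= hi)]

The remaining rows could then be fed unchanged to the correlation matrix and PCA steps, avoiding both the un-normalised case (one very active user dominating) and the per-user normalisation (inflating low-activity users).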

6.1.5 To what degree could we categorise people on the basis of this information?

None of the methods investigated in this thesis can accurately describe or categorise a user on its own. The general statistics show the general patterns, but in order to categorise the users the data frame corresponding to each user would have to be studied. In the case of the correlation matrix, we see how some categories have stronger correlations, but these correlations do not provide sufficient information to fully categorise the users. The PCA algorithm offers a clear categorisation, but on the data frame of this thesis the amount of variance captured in the principal components is not adequate to describe or categorise the users in a satisfactory way.

6.2 Conclusions

The three different methods for data extraction investigated in this thesis provide us with different sets of results. With the general statistics we get an overview of the data, and with only a few calculations we find the big tendencies that influence the data. The correlation matrices give an indication of the correlations between the categories in the data, which in turn gives a strong indication of the patterns behind the story posting. Through the PCA algorithm we receive information about all individuals and how they relate both to each other and to some of the most influential categories. However, since the PCA did not capture the majority of the variance in the original data, the results are not reliable. In conclusion, the methods of this thesis did not extract sufficient information to accurately describe the studied users.


7 Appendix

Figure 12: Heatmap showing the results of the correlation matrix of data set 3, normalised per category.

Figure 13: Cluster plot of the two-dimensional data approximation performed by the PCA algorithm on data set 3.


Figure 14: To the left: Scree plot of the percent variance captured in each principal component for data set 3. To the right: Component weight of each category in the first three principal components for data set 3.


7.1 Codes

Code for correlation matrices

""" Created april 15th 2020 by Linnea Eklund"""

import pandas as pd

import matplotlib.pyplot as plt import seaborn as sn

#originaldatan

i_data = pd.read_csv('datakopia_1.csv',header=0) #eg vanlig corrMatrix = i_data.corr()

#data normaliserad efter personerna

i_data_pers = pd.read_csv('datakopia_pers1.csv',header=0) corrMatrix_pers = i_data_pers.corr()

#data normaliserad efter kategorierna

i_data_kat = pd.read_csv('datakopia_kat2.csv',header=0) corrMatrix_kat = i_data_kat.corr()

sn.heatmap(corrMatrix, annot=True) plt.show()

PCA code

""" Created april 21th 2020 by Linnea Eklund"""

import pandas as pd

#ORGINALDATA

#X = pd.read_csv('datakopia_1.csv',header=0, names= ["Namn", " Djur"," Humor",

" Familj" , " Fotografi", " Mat", " Natur", " Hobbies", " Inredning",

" Interaktion"," Jobb", " Kompisar", " Politik", " Resor", " Selfies",

" Socialt"," Tr¨ aning", " Nostalgi"])

#DATA NORMALISERAD EFTER PERSONER

X = pd.read_csv('datakopia_pers1.csv',header=0, names= ["Namn", " Djur",

" Humor"," Familj"," Fotografi", " Mat", " Natur", " Hobbies", "Inredning",

" Interaktion"," Jobb", " Kompisar", " Politik", " Resor", " Selfies",

" Socialt", " Tr¨ aning", " Nostalgi"])

#DATA NORMALISERAD EFTER KATEGORIER

#X = pd.read_csv('datakopia_kat2.csv',header=0, names= ["Namn", " Djur",

" Humor", " Familj"," Fotografi", " Mat", " Natur", " Hobbies"," Inredning",

" Interaktion"," Jobb", " Kompisar", " Politik", " Resor",

" Selfies"," Socialt", " Tr¨ aning", " Nostalgi"])

(25)

from sklearn.preprocessing import StandardScaler

variables = [" Djur"," Humor", " Familj", " Fotografi", " Mat", " Natur",

" Hobbies", " Inredning", " Interaktion"," Jobb", " Kompisar", " Politik",

" Resor"," Selfies"," Socialt", " Tr¨ aning", " Nostalgi"]

x = X.loc[:, variables].values y = X.loc[:,['Namn']].values

x = StandardScaler().fit_transform(x) x = pd.DataFrame(x)

from sklearn.decomposition import PCA pca = PCA(n_components=2)

x_pca = pca.fit_transform(x) x_pca = pd.DataFrame(x_pca) x_pca['Namn']=y

x_pca.columns = ['PC1','PC2', 'Namn']

print(x_pca.head())

print(pca.explained_variance_ratio_) print(pca.components_)

import matplotlib.pyplot as plt fig = plt.figure()

ax = fig.add_subplot(1,1,1)

ax.set_xlabel('Principal Component 1') ax.set_ylabel('Principal Component 2') ax.set_title('2 component PCA')

targets = ['Person 1', 'Person 2', 'Person 3', 'Person 4', 'Person 5', 'Person 6', 'Person 7', 'Person 8', 'Person 9', 'Person 10', 'Person 11', 'Person 12', 'Person 13', 'Person 14', 'Person 15', 'Person 16', 'Person 17', 'Person 18', 'Person 19', 'Person 20', 'Person 21', 'Person 22', 'Person 23', 'Person 24', 'Person 25', 'Person 26', 'Person 27', 'Person 28']

#colors = ['r', 'g', 'b', 'darkgreen', 'yellow', 'orange', 'lightgreen', 'pink' , 'purple', 'violet', 'indigo', 'm', 'olive', 'k', 'plum', 'tan', 'lime',

'salmon','chocolate', 'brown', 'gold', 'teal', 'darkslategray','gray', 'sienna', 'navy', 'maroon','tomato']

for target, color in zip(targets,colors):

indicesToKeep = x_pca['Namn'] == target

ax.scatter(x_pca.loc[indicesToKeep, 'PC1'], x_pca.loc[indicesToKeep, 'PC2']

, c=color, s=50)

ax.legend(targets, bbox_to_anchor=(1.1, 1.05)) ax.grid()

plt.show()


Component weight code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed May 13 12:50:33 2020

@author: davsu428
"""

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Column names: "Namn" (name) followed by the 17 Swedish category labels.
names = ["Namn", " Djur", " Humor", " Familj", " Fotografi", " Mat", " Natur",
         " Hobbies", " Inredning", " Interaktion", " Jobb", " Kompisar",
         " Politik", " Resor", " Selfies", " Socialt", " Träning", " Nostalgi"]

# ORIGINAL DATA
#X = pd.read_csv('datakopia_1.csv', header=0, names=names)
# DATA NORMALISED PER PERSON
X = pd.read_csv('datakopia_pers1.csv', header=0, names=names)
# DATA NORMALISED PER CATEGORY
#X = pd.read_csv('datakopia_kat2.csv', header=0, names=names)

variables = names[1:]
x = X.loc[:, variables].values

# Standardise the categories to zero mean and unit variance.
x = StandardScaler().fit_transform(x)
x = pd.DataFrame(x)

pca = PCA(n_components=3)
x_pca = pd.DataFrame(pca.fit_transform(x))

# English plot labels, in the same order as the Swedish columns.
labels = ['Animals', 'Humor', 'Family', 'Photography', 'Food', 'Nature',
          'Hobbies', 'Design', 'Interaction', 'Work', 'Friends', 'Politics',
          'Travel', 'Selfies', 'Social', 'Exercise', 'Nostalgia']

# Heatmap of the component weight of each category in the first three PCs.
plt.matshow(pca.components_.T, cmap='coolwarm', vmin=-0.5, vmax=0.5)
plt.xticks([0, 1, 2], ['1st', '2nd', '3rd'], fontsize=10)
plt.yticks(range(17), labels, fontsize=10)
plt.colorbar()
plt.tight_layout()
plt.show()

Explained variance code

""" Created may 2nd 2020 by Linnea Eklund"""

import numpy as np import pandas as pd import seaborn as sn

import matplotlib.pyplot as plt

#ORGINALDATA

#X = pd.read_csv('datakopia_1.csv',header=0, names= ["Namn", " Djur"," Humor",

" Familj" , " Fotografi", " Mat", " Natur", " Hobbies", " Inredning",

" Interaktion"," Jobb", " Kompisar", " Politik", " Resor", " Selfies",

" Socialt"," Tr¨ aning", " Nostalgi"])

#DATA NORMALISERAD EFTER PERSONER

X = pd.read_csv('datakopia_pers1.csv',header=0, names= ["Namn", " Djur",

" Humor"," Familj"," Fotografi", " Mat", " Natur", " Hobbies", "Inredning",

" Interaktion"," Jobb", " Kompisar", " Politik", " Resor", " Selfies",

" Socialt", " Tr¨ aning", " Nostalgi"])

#DATA NORMALISERAD EFTER KATEGORIER

#X = pd.read_csv('datakopia_kat2.csv',header=0, names= ["Namn", " Djur",

" Humor", " Familj"," Fotografi", " Mat", " Natur", " Hobbies"," Inredning",

" Interaktion"," Jobb", " Kompisar", " Politik", " Resor",

" Selfies"," Socialt", " Tr¨ aning", " Nostalgi"]) from sklearn.preprocessing import StandardScaler

variables = [" Djur"," Humor", " Familj", " Fotografi", " Mat", " Natur",

" Hobbies"," Inredning", " Interaktion"," Jobb", " Kompisar", " Politik",

" Resor", " Selfies"," Socialt", " Tr¨ aning", " Nostalgi"]

x = X.loc[:, variables].values y = X.loc[:,['Namn']].values

x = StandardScaler().fit_transform(x) x = pd.DataFrame(x)

from sklearn.decomposition import PCA pca = PCA()

x_pca = pca.fit_transform(x) x_pca = pd.DataFrame(x_pca) x_pca['Namn']=y

x_pca.columns = ['PC1','PC2','PC3', 'PC4','PC5','PC6', 'PC7','PC8','PC9',

(28)

'PC10','PC11','PC12', 'PC13','PC14','PC15', 'PC16','PC17','Namn']

print(pca.explained_variance_ratio_)

#PLOT PERCENT VARIANCE EXPLAINED fig,ax = plt.subplots()

var_explained = np.round(pca.explained_variance_ratio_ ** 2 / np.sum (pca.explained_variance_ratio_ ** 2), decimals=3)

sn.barplot(x=list(range(1, len(var_explained) + 1)) ,y=pca.explained_variance_ratio_, color="limegreen") plt.xlabel('SVs', fontsize=16)

plt.ylabel('Percent Variance Explained', fontsize=16) plt.savefig('svd_scree_plot.png', dpi=100)

plt.title('Percent variation')

plt.show()


8 References

[1] Swedish Ministry of Foreign Affairs. UD avråder från alla icke nödvändiga resor. 2020. url: https://www.regeringen.se/uds-reseinformation/ud-avrader/utlandsresor--avradan-for-alla-lander/ (visited on 04/20/2020).

[2] Instagram. About us. 2020. url: https://about.instagram.com/about-us (visited on 05/20/2020).

[3] Instagram. Stories. 2020. url: https://about.instagram.com/features/stories (visited on 05/20/2020).

[4] Joachim Kerpner and Oscar Schau. Gymnasieskolor ska gå över till distansundervisning. 2020. url: https://www.aftonbladet.se/nyheter/a/mR44yl/gymnasieskolor-ska-ga-over-till-distansundervisning (visited on 04/20/2020).

[5] Michal Kosinski, Yilun Wang, Himabindu Lakkaraju, and Jure Leskovec. "Mining big data to extract patterns and predict real-life outcomes." In: Psychological Methods 21.4 (2016), p. 493.

[6] Mathworld.wolfram. Correlation Coefficient. 2020. url: https://mathworld.wolfram.com/CorrelationCoefficient.html (visited on 06/27/2020).

[7] Andrew Ng. CS229 Lecture Notes, Part XI: Principal Component Analysis. 2008.

[8] Salman Aslam. Instagram by the Numbers: Stats, Demographics & Fun Facts. Omnicore. 2020. url: https://www.omnicoreagency.com/instagram-statistics/ (visited on 05/02/2020).

[9] Thomas Peters. Review of Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control, by S. L. Brunton and J. N. Kutz. Cambridge: Cambridge University Press, 2019, 472 pp. ISBN 9781108422093.

[10] Scikit-learn. 2.5. Decomposing signals in components (matrix factorization problems). 2019. url: https://scikit-learn.org/stable/modules/decomposition.html#decomposition (visited on 05/10/2020).

[11] Gilbert W. Stewart. "On the early history of the singular value decomposition". In: SIAM Review 35.4 (1993), pp. 551-566.

[12] Public Health Agency of Sweden. Covid-19. 2020. url: https://www.folkhalsomyndigheten.se/smittskydd-beredskap/utbrott/aktuella-utbrott/covid-19/ (visited on 05/10/2020).

[13] Wu Youyou, Michal Kosinski, and David Stillwell. "Computer-based personality judgments are more accurate than those made by humans". In: Proceedings of the National Academy of Sciences 112.4 (2015), pp. 1036-1040.
