Profile-based evaluation of what different browsers and browser extensions may be able to learn about a user

Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 16 ECTS | Information Technology

2021 | LIU-IDA/LITH-EX-G--21/064-SE

Profilbaserad undersökning av den information som webbläsare och webbläsartillägg kan lära sig om en användare

Sofia Bertmar

Johanna Gerhardsen

Supervisor: Niklas Carlsson

Examiner: Marcus Bendsten

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

Information leakage online has become a marketplace where companies can profile users to gain revenue from personalized advertising. This thesis offers a deeper analysis of what part the browsers Chrome and Firefox play in this targeted advertising. Privacy-focused extensions have become a common way for users to avoid being exposed to third-party trackers, and two such extensions are Ghostery and CatBlock. By adding these to the browsers, this thesis examines how ads are affected during online sessions. Through development of a Selenium web crawler, several online profiles were built using personas based on specific interest categories. By performing daily sessions of Google search queries, data was collected in the form of ads and HTML text. The data collection lasted for a period of 21 days, using 29 virtual machines and 6 personas. This data was further used to analyze the extent of personalization of ads as profiles were built over time. The results obtained show similarities in how ads are targeted in the browsers, as well as the level of personalization that occurs when using an extension. Results show no major differences in level of targeting between the used browsers, but clearly show that a personalization of advertising has occurred. The usage of extensions proved to be efficient in reducing the number of ads that a user is exposed to. However, the usage of extensions did not decrease the percentage of targeting amongst ads.

Acknowledgments

We would like to thank our supervisor Niklas Carlsson for giving great feedback and being available through both email and meetings. Your insight has been of great relevance to the development of the thesis project. We would also like to thank Le Minh-Ha for being of great assistance when aiding with the creation of Azure accounts, as well as helping with the set up of the virtual machines. Finally, we would like to thank our group members Alice Ekblad, Anna Höglund, Julia Mineur and Isabell Öknegård Enavall for great teamwork when creating the framework, as well as being helpful during the data collection process.

Contents

1 Introduction
  1.1 Aim
  1.2 Research questions
  1.3 Contributions
  1.4 Delimitations
  1.5 Thesis outline
2 Background
  2.1 Selenium
  2.2 Browser extensions
  2.3 Virtual machine
  2.4 User modeling
3 Related work
4 Method
  4.1 Framework overview
  4.2 Persona design
  4.3 Framework design
  4.4 Set up
  4.5 Data collection
5 Results
  5.1 Browser performance
  5.2 Ads
  5.3 Word clouds
6 Discussion
  6.1 Results
  6.2 Method
  6.3 The work in a wider context
Bibliography
A Appendix
  A.1 Search queries
  A.2 List of filter words

List of Figures

4.1 An overview of the framework design.
4.2 Crawler navigation from search query to post-search behavior.
4.3 Probabilities of post-search user behavior and think times.
5.1 An overview of personalized ads for the base personas A, C and E for day 1, 7 and 14.
5.2 The results from the word cloud function with Firefox, for persona A.
5.3 The results from the word cloud function with Chrome, for persona A.
5.4 The results from the word cloud function with Ghostery added to Firefox, for persona C.
5.5 The results from the word cloud function with CatBlock added to Chrome, for

List of Tables

4.1 The six different personas.
4.2 The virtual machines with their set ups, where "-" entails no implemented extensions, while blank entails no analysis.
5.1 The average browser performance during the collection.
5.2 The crash rate experienced during the collection, where "Y" entails a successfully run session and "N" entails a crash.
5.3 The number of visited URLs and identified ads day 1, 7, 14, 15 and 21.
5.4 An overview of encountered ads.
5.5 Firefox heat maps of day 1, 7 and 14.
5.6 Chrome heat maps of day 1, 7 and 14.
5.7 Ghostery heat maps of day 1, 7 and 14, for 3 Chrome and 3 Firefox machines.
5.8 CatBlock heat maps of day 1, 7 and 14, for 4 Chrome and 3 Firefox machines.
5.9 Firefox heat map of day 15 and 21.
5.10 Chrome heat maps of day 15 and 21.
5.11 Ghostery heat map of day 15 and 21.

1 Introduction

In a society where many daily interactions are based around the usage of web browsers, awareness of user privacy is becoming a major priority. The leakage of private information about users has opened up a marketplace where companies can track and profile their users in order to create personalized ads. By tracking a user’s activities online, companies are able to learn characteristics and interests that can then be used for advertising [15], thus increasing company revenues. This form of user tracking is commonly performed on websites through the usage of cookies, which can track a user across several sites and thereby provide a wider context for the profiling. The web browser’s involvement in this process is something that a user is generally unaware of. From the perspective of upholding the integrity of every user online, this profiling could be considered a threat to the user’s privacy [17].

1.1 Aim

The main goal of this thesis is to gather information and evaluate results regarding the form and amount of information that an extension and a web browser are able to learn about a user. The focus is on browsers and their extensions, in order to assess the differences between various browsers as well as user profiles. To examine the privacy of user profiles with regard to extensions, multiple originally blank user profiles have been built and developed. The profiles interact with the browsers through a web crawler developed with Selenium. Essentially, the intention is to establish well-informed models to aid in the extraction and identification of what exactly the extensions and web browsers could obtain and learn from these various user profiles.

1.2 Research questions

The questions that will be answered in this thesis are centered around privacy of the user and how much different entities are able to learn about a user depending on web browser, extension as well as the characteristics of the user.

• Are there any differences between the information that the two different web browsers, Chrome and Firefox, obtain about a user?

• What differences are there in the degrees of privacy of the information that different extensions may alter in the browser?

1.3 Contributions

In this thesis, a framework was developed to imitate user behavior online. The framework was used for building profiles in web browsers and to gather data with regards to targeted advertisement. Furthermore, the framework acted in various environments, with differences such as location, persona, browser and extension or no extension in focus. The datasets collected in these environments show the level of personalization of advertising in the browser. The datasets also provide contributions in the form of differences in performance between privacy-focused extensions.

1.4 Delimitations

Several delimitations have been drawn in order to further increase the reliability of this thesis. The thesis will be centered around six different user profiles, which could potentially narrow down the final conclusion as opposed to examining hundreds of different profiles. When comparing different browsers and the information they may gather in regards to users, only two browsers have been included in the thesis. This limits the broadness of the conclusions that can be drawn from the gathered results.

The project suffers from time limitations, which entails that further testing over an extended period of time could have altered the data collection and thereby the results. The number of virtual machines that were created was limited by cost and time as well as by the capacity of the involved group members. Furthermore, the amount of collected data is massive, and the analysis has therefore been narrowed down. This, naturally, has affected the results presented later in this thesis.

1.5 Thesis outline

This thesis has been structured as follows. The background necessary to understand the data collection is described in Chapter 2. Chapter 3 presents relevant related work that has been crucial to the development of the framework as well as any conclusions drawn. An overview of the framework in combination with the methodology is described in Chapter 4, whereas Chapter 5 presents the results obtained during data collections. Finally, Chapter 6 discusses the obtained results in further depth.

2 Background

This section covers the technical details relevant to the thesis. Frameworks, user modeling, and further necessary information are explained in greater detail, in order to ensure a better understanding of the following sections.

2.1 Selenium

Selenium is a tool created with the purpose of automating and emulating a user’s interaction with various browsers. Interactions that can easily be performed with Selenium include clicking on buttons and entering text into search fields. For automated testing of desktop websites the Selenium WebDriver is of great relevance [21]. The WebDriver works through direct or remote communication with a chosen browser, which enables commands to be sent to the browser and information to be received in return. Essentially, WebDriver is responsible for enabling communication back and forth with the selected browser. For Chrome the WebDriver alternative is ChromeDriver [10], while Firefox uses GeckoDriver [23].

As for testing, a test framework is useful. The framework is in charge of executing the code through WebDriver and can collect the data needed for further analysis [22]. The advantage of this type of automated web testing is that data can be collected at a large scale without any interference from the developer.

Python

Although Selenium supports code in various languages, this thesis has been centered around the usage of Python. Python is an object-oriented programming language with a large standard library [9]. As Python is easy to use and practiced by a large community of developers, it is a good candidate for programming with the Selenium interface. A further reason for choosing Python is that it is commonly used for automated testing.

Crawler technology

The main purpose of most web crawlers is to, in various ways, collect data from web pages. Usually, more advanced crawlers are developed with the help of automated testing frameworks, such as Selenium. The reason for this is the ability to handle the newer, more dynamic environments online, such as the occurrence of cookies or caches [26].

Crawlers have proved efficient when examining online privacy issues such as browser fingerprinting and cookie syncing [1]. As the focal point of this thesis is to examine and determine the extent of a user’s privacy in regards to browsers, the technology of crawlers for web scraping is well suited for collecting the data needed.

2.2 Browser extensions

A browser extension is a small software program allowing a user to customize the experience of the browser [11]. These are constructed with web technologies and let the user decide which extensions the browser should use to fit the individual needs. Extensions can be added to the two web browsers used in this project, Google Chrome and Mozilla Firefox. These are used to add functions and different features to the web browsers.

With regard to privacy and security, risks can arise when using an extension because it is capable of modifying and observing all browsing activity of a user [5]. This capability also extends to the extension being able to insert code which can retrieve information such as cookies and form inputs from web pages [8]. However, a majority of extensions do not require any particular privilege to be added to the browser, despite having access to sensitive user information. This becomes a privacy risk considering that patterns of breaching user integrity have been discovered in multiple extensions which have been store-approved [14]. The browser extensions described below differ in the degrees of privacy of the information that they may request access to about a user, as well as in their capability to alter objects in the browser.

Ghostery

Ghostery [13] is an extension developed for browsers such as Chrome and Firefox. The extension is focused on user privacy, through blockage of ads and prevention of third-party trackers. Ghostery offers usage of VPN, which enables the user to encrypt the internet connection. Furthermore, Ghostery analyzes the active third-party trackers, as well as general website performance, which gives the user a chance to get an overview of who is tracking and where [18]. This makes it possible for users to control what information to give during browsing. Overall, the main purpose of the extension is for users to be able to take back online privacy.

CatBlock

CatBlock [7] is a browser extension available on both Chrome and Firefox. The extension works like an ad blocker, which alters or blocks online advertising on a web page, and instead shows pictures of cats. This is done by overlaying the ads with images of cats. Peckett and Taro developed this extension from code that already existed from AdBlock to help identify the ads on a web page.

Regulations

From January 2021 extensions for Google Chrome have to show “privacy properties” in order to make the user more aware of which data they collect and use [4]. Any developer can create a browser extension, which makes continuous security updates by Google necessary. These updates help identify harmful extensions and remove them before they enter the market.

In Mozilla Firefox, as well as in Google Chrome, extensions need to satisfy policies in order to be submitted [16]. One of the policies establishes that no unexpected features should be added to an extension, for example a feature that would deal with a user’s privacy or security. The extensions must state how they collect, use and store data from the user in the “privacy policy field” on their page. Mozilla expects an extension to limit its data collection as much as possible, and an extension is not allowed to collect any user data without explicitly saying so. CatBlock and Ghostery both implicitly state that the extension has permission to read the user’s browsing history.

2.3 Virtual machine

A virtual machine [19] is a computer resource that uses software to virtualize a physical computer. It gives the flexibility to run different operating systems simultaneously, where each virtual machine runs its own operating system separate from the others, without the user needing to maintain the physical hardware underneath. In this thesis virtual machines have been used to avoid collecting data from one’s own computer and to prevent the retrieval of information about the participants. A virtual machine allows for control of the computing environment and offers a clean environment. Microsoft Azure [19] is a cloud computing platform and the tool used during this project.

2.4 User modeling

There are multiple studies focused on the topic of user behavior. In order to imitate a user’s average behavior one has to take various factors into consideration, such as clickthrough frequency, think time and dwell time deviation [2]. This is where user modeling becomes relevant. By creating a general user behavior model, one can imitate the features of an average user’s behavior, and then integrate these behaviors into an automated framework or a bot.

In this thesis, the user behavior features that have been integrated into the Selenium framework are weighted randomized clickthrough from a search query, think time, click frequency and follow-up queries. These features are mostly centered around the aspects of clickthrough and browsing, but are of great relevance when imitating user behavior during the build-up stage of creating a profile.

Weighted clickthrough

The average user is biased towards clicking on the top results of a search query [2]. For this reason a combination of weighted and randomized clickthrough is central to the development of post-search clickthrough algorithms. Research by Breslau et al. [6] has shown that Zipf’s law is a good representation of post-search user interaction, as the relative frequency with which a user requests certain web pages after a search query follows a Zipf-like distribution. In general terms, Zipf’s law claims that the probability of a request for the n-th most popular page of a specific search query is proportional to 1/n. Further research presented by Freedman [12] also supports that clickthrough follows a Zipf-like distribution, as the head URLs prove to be particularly popular with regard to the number of requests.
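As an illustration of the 1/n rule, the following minimal Python sketch assigns each search result a weight proportional to the inverse of its rank and draws one result at random. The function names zipf_weights and pick_result are hypothetical and chosen for this example; the sketch does not reproduce the actual code of the framework.

```python
import random


def zipf_weights(n):
    """Return normalized weights proportional to 1/rank for ranks 1..n."""
    raw = [1.0 / rank for rank in range(1, n + 1)]
    total = sum(raw)
    return [weight / total for weight in raw]


def pick_result(links, rng=random):
    """Pick one link; the k-th listed link has probability proportional to 1/k."""
    if not links:
        raise ValueError("no links to choose from")
    # random.choices performs a weighted draw with replacement (k=1 here).
    return rng.choices(links, weights=zipf_weights(len(links)), k=1)[0]
```

With four results, the first link is twice as likely to be chosen as the second and four times as likely as the fourth, which gives the bias towards top results while still allowing lower-ranked links to be visited.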

Think time

When observing a user online, think time is the time spent between interactions. This, for example, occurs when the user is doing some form of navigation on the screen, entering data, choosing which link to click or simply just reading a web page. Ramakrishnan et al. [20] have performed a study of think time and have used a program to extract think times from web server logs. Analysis of this extracted data could be used in order to create realistic think times. Their conclusion was that the median think time is 28.653 sec.

User behavior post-search

The algorithm created to imitate a user’s typical behavior post-search has been divided into four extreme classes. Research by White and Drucker [25] suggests that a majority of users interact in a consistent manner across different websites regardless of query. This implies, and has motivated the assumption taken, that a user is prone to have a consistent pattern of interactive behavior post-search.

The four extreme classes of post-search user behavior are divided into the categories ’Forward-to-browse’, ’Backward-to-browse’, ’Forward-to-search’ and ’Backward-to-search’, classes which are concluded through the research by White and Drucker. The Forward-to-browse navigator behavior class is defined by a likeliness to continue towards a new domain, often through clicking on a hyperlink on the current page. As for the Backward-to-browse behavioral class, this category is defined by likeliness to revisit the previous domain. Forward-to-search behavior is defined through a likeliness to simply make an entirely new search query. Finally, the Backward-to-search class entails a likeliness to return to the previous search query result page and click on a new search result. Though these are extreme classes of interaction variance, most users tend to lean toward one class or the other during a session, or will mix between the classes with different likeliness.

3 Related work

Given the increasing amount of tracked users and their behavior in recent years, the need to protect a user’s privacy has grown to become an important topic. Many users are unaware of the leakage of their information over the web, data which may be collected by other parties. Malandrino et al. [17] present a study that observes the leakage of personal and sensitive information as well as the information that is collected by third parties. To analyze this, they set up accounts to create profiles and added detailed information to each account, which for example included full name, date of birth and general interests. Furthermore, Malandrino et al. implemented different extensions and saw that they helped the user by limiting the spread of their private information. This work is closely related to this thesis project due to the profiles that are being created and evaluated in the project. However, the profiles developed in this thesis do not contain the detailed personal information that the profiles in the paper had, such as phone number, home address and email address. Such information was used to register and log in with the corresponding first party site, which has not been done in this thesis.

Borgolte and Feamster [5] discuss how experts and privacy advocates consider that the collection of data about a user is often excessive and infringes on the user’s privacy. Furthermore, browser extensions can use their capabilities to modify and inspect all of a user’s browsing activity online. The paper is based around privacy-focused extensions and the performance of such when using Google Chrome and Mozilla Firefox. One of the privacy-focused extensions used in the paper has also been chosen for analysis in this thesis. However, this paper does not discuss what access the browser extension gets to user information or the level of personalization of advertisement during browsing sessions.

Further research presented by White and Drucker [25] examines typical user interaction post-search. This research brings forth data suggesting that user behavior after a search query can be divided into four subcategories. These four categories as well as typical patterns of interactive behavior have been of great importance when integrating post-search behavior into the crawler. However, the personas in this thesis have not been divided into one of the two extreme user behavior classes that White and Drucker present.

Finally, the analysis of online advertising is discussed in a research paper by Barford et al. [3]. This study is centered around displaying and categorising the ad-landscape, also known as adscape, online. When doing so, crawler technology has been used for the purpose of building profiles as well as for harvesting ads, which is similar to the work performed through this thesis. Barford et al. also use the method of heat maps to display data relating to profiles and categories, which has been of great relevance to the results presented in this thesis. The research presented by Barford et al. differs from this thesis project in the sense that no extensions have been implemented. Barford et al. collect data over a limited time period of two days, and do not take the time-of-day aspect into consideration, which has been highly relevant to the data gathered in this project.

4 Method

This section is centered around describing the methodology of the thesis project, as well as in detail explaining some of the components, features and choices that were important for gathering necessary data.

4.1 Framework overview

The framework developed in this thesis is a tool that automates the process of building user profiles. It was built in a joint group project and is shown in Figure 4.1. The background of each profile has been based on multiple persona interests and characteristics. These interests were compiled into a list of search phrases, which the crawler used to perform Google search queries. By automating this process, one could imitate the average behavior of a user online.

Figure 4.1: An overview of the framework design. Flowchart: with a list of search queries and a filter list as input, the started framework makes a Google search query, filters the Google search result page, clicks on a link according to Zipf’s law and lands on a website. With probabilities P1, P2, P3 and P4 it then returns to the Google search result page, makes a new Google search, clicks a link on the website, or returns to the previous page. HTML text, screenshots, URLs and stats are logged as output. Before each new search the framework checks whether it has run for 3 hours and stops if so.

After making a search query, the framework filtered the Google search result page through a list of filter words to remove unnecessary links from the search results. Thereafter the framework clicked on a link according to Zipf’s law, which took it to a new web page. There were four instances that could occur at this stage with different probabilities, P1, P2, P3 and P4. The crawler was intended to run a session for three hours, so if enough time had passed, the data scraping was stopped before making a new search.

Furthermore, the framework gathered relevant data and compiled this data into log files during every user session. The actual data logged by the crawler consisted of session performance statistics, screenshots of the visited pages, URLs and HTML text. Through this process, one could analyze and evaluate how content, such as ads, shown to the user differs over time along with how the user profile continued to develop.

4.2 Persona design

Six different personas have been created for the purpose of the research in this thesis. Every persona has been designed with various characteristics in mind, as well as varying ages, gender and interests shown in Table 4.1. Since every persona has their own profile, their interests have been developed in order to be of relevance to these specific browsers. This entails that the interests and characteristics chosen are something that, for example, Google would find relevant when tracking a user or cataloging a user’s interest. For the Chrome browser there is a specific page called Google’s Ad personalization that is updated based on a user’s activity online. The categories included in that page have been highly relevant to the process of persona design.

The persona information was considered when creating the search queries, where each persona had two hundred search queries stored in a text file. These phrases were created during the joint group project consisting of six persons, each responsible for one persona. As each person has their own trademark of writing search queries this motivates the differences in styles. Furthermore, all personas used English as their primary language. The reason for this was simply to ensure that all the resulting data collection, such as ads, were in English. Three personas were chosen to be the base personas, Mary Johnson, Patricia Jones and John Anderson. This entails that these personas were used more frequently than others, and were implemented on the majority of the virtual machines where extensions were added to the browser.

With the purpose of simplifying the data collection in mind, the personas used in this thesis have been created to be fairly “stereotypical” in the sense that their interests tend to be commonly linked to their age and gender. These are based on our own biases. The decision to develop personas in such a way is simply based around wanting to make the resulting data easier to interpret. However, other personal traits such as religion, sexuality or race have not been included in the design process. As for names of the personas, common names that are applicable for a wide range of people have been chosen.

Table 4.1: The six different personas.

Persona | Name | Gender | Age | Interest | Occupation | Civil status | Parent
A | Mary Johnson | Female | 18 | Horses, celebrities, gardening | High School | Single | No
B | Jennifer Brown | Female | 33 | Hair, fashion, DIY | Hairdresser | Single | No
C | Patricia Jones | Female | 51 | Movies, stock trading, interior design | Bank worker | Married | Yes
D | James Davis | Male | 21 | American football, baseball, medicine | University | In a relationship | No
E | John Anderson | Male | 37 | Cooking, traveling, technology | History teacher | Divorced | Yes
F | Robert Smith | Male | 68 | Birds, baking, crossword puzzles | Retired | Married | Yes

4.3 Framework design

The automated crawler session started by opening a browser window to Google Search, in either Chrome or Firefox. Here, the framework entered a search phrase. An important function in the framework was therefore to be able to read and write to text files. The search phrases related to each persona were shuffled every time the framework started running, with every row being a unique search query.
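Reading and shuffling the search phrases could be sketched as follows. The function name load_queries and the choice to skip blank rows are assumptions made for this example; the thesis only states that each row of the text file holds one search query and that the phrases are shuffled at start-up.

```python
import random


def load_queries(path, rng=random):
    """Read one search query per row, skip blank rows, and return them shuffled."""
    with open(path, encoding="utf-8") as handle:
        queries = [line.strip() for line in handle if line.strip()]
    # shuffle() reorders the list in place, so every run gets a new order
    rng.shuffle(queries)
    return queries
```

Passing a seeded random.Random instance as rng makes the order reproducible, which can be convenient when debugging a session.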

Another important task was for the framework to be able to handle cookies and pop-ups. For the sake of simplicity, the framework accepted all pop-ups before continuing with its further tasks. This step was crucial to ensure that a clean screenshot could be taken of every web page visited, with no pop-ups blocking the content.

Browser differences

One of the first tasks the framework handled was to open a browser window. Before the start of every session a Chrome debugging window had to be opened manually through the terminal. The reason for this was to avoid sessions running in an automated testing environment. This step was necessary to keep cookies and browser history consistent between sessions. For Chrome sessions, ChromeDriver was utilized. As for Firefox, the browser was opened directly through the Selenium framework, and cookies were loaded from and saved to a pickle file between sessions to maintain consistency. For Firefox sessions, GeckoDriver was utilized.
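The cookie handling for Firefox could be sketched as below. The helper names save_cookies and load_cookies and the file layout are assumptions for illustration. The only driver methods used are get_cookies() and add_cookie(), which Selenium WebDriver objects provide, so the helpers accept any object with those two methods.

```python
import pickle
from pathlib import Path


def save_cookies(driver, path):
    """Store the driver's current cookies in a pickle file."""
    Path(path).write_bytes(pickle.dumps(driver.get_cookies()))


def load_cookies(driver, path):
    """Add previously saved cookies to the driver; return how many were loaded."""
    cookie_file = Path(path)
    if not cookie_file.exists():
        return 0  # first session: nothing to restore yet
    cookies = pickle.loads(cookie_file.read_bytes())
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

Note that Selenium generally only accepts a cookie for the domain of the page that is currently loaded, so a real session would likely need to navigate to the relevant site before restoring its cookies.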

User behavior

After a search had been made, as seen in Figure 4.2a, the next function handled weighted as well as random clickthrough. In order to simulate a user in a way that would appear credible to the browser, the weighted clickthrough algorithm was based on the distribution of Zipf’s law, previously mentioned in Section 2.4. The algorithm extracted the links to the web pages resulting from the search query and then placed weights on these according to the order in which they were presented on the page. The filter function also removed the irrelevant links that regularly appear on the Google search result page, shown in Figure 4.2b, such as settings, tools and feedback.

(a) Pre search query. (b) Search results page. Figure 4.2: Crawler navigation from search query to post-search behavior.
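The weighted clickthrough described above can be illustrated with a minimal sketch. It assumes that the result at rank k receives a weight proportional to 1/k^s with s = 1, which is one common form of Zipf’s law; the exponent and the function names `zipf_weights` and `pick_result` are assumptions for this example, since the exact parameters of the framework are not stated here.

```python
import random


def zipf_weights(n, s=1.0):
    """Weight for result rank k (1-based) is proportional to 1 / k**s."""
    return [1.0 / (k ** s) for k in range(1, n + 1)]


def pick_result(links, rng=None, s=1.0):
    """Pick one link, favouring results that appear higher on the page."""
    if not links:
        raise ValueError("no links to choose from")
    rng = rng or random.Random()
    # random.choices normalises the weights, so they need not sum to 1.
    return rng.choices(links, weights=zipf_weights(len(links), s), k=1)[0]
```

With three filtered links, the weights are 1, 1/2 and 1/3, so the first result is chosen roughly three times as often as the third.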

Thereafter comes the post-search user behavior. Here, think time is an important factor. Every user spends a certain amount of time reading, scrolling or dwelling on a page. The think time was drawn from three values, 14 sec, 28 sec and 56 sec, with the median value of 28 sec mentioned in Section 2.4. To ensure a median of 28 sec, the probabilities chosen were 0.2 for 14 sec, 0.45 for 28 sec and 0.35 for 56 sec, as shown in Figure 4.3b.
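A short sketch can make the think time model concrete. The pairing of values and probabilities below follows Figure 4.3b (14 s: 20%, 28 s: 45%, 56 s: 35%); the names `THINK_TIMES`, `sample_think_time` and `distribution_median` are hypothetical and only serve this example.

```python
import random

# Think times in seconds mapped to their probabilities (pairing as in Figure 4.3b).
THINK_TIMES = {14: 0.20, 28: 0.45, 56: 0.35}


def sample_think_time(rng=None):
    """Draw one think time according to the configured probabilities."""
    rng = rng or random.Random()
    values = list(THINK_TIMES)
    weights = [THINK_TIMES[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def distribution_median(dist):
    """Smallest value whose cumulative probability reaches 0.5."""
    cumulative = 0.0
    for value in sorted(dist):
        cumulative += dist[value]
        if cumulative >= 0.5:
            return value
    raise ValueError("probabilities sum to less than 0.5")
```

The cumulative probability is 0.20 at 14 s and 0.65 at 28 s, so `distribution_median(THINK_TIMES)` gives 28, in line with the intended median.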

Another factor which came into play during this stage was the post-search behavior, further mentioned in Section 2.4. The post-search user behavior was divided into four outcomes. These consisted of (P1) Backward-to-search, (P2) Forward-to-search, (P3) Forward-to-browse, or (P4) Backward-to-browse. The probabilities of these outcomes were set to 0.08, 0.21, 0.5 and 0.21, motivated through research by White and Drucker [25], shown in Figure 4.3a.

Figure 4.3: Probabilities of post-search user behavior and think times. (a) Post-search user behavior: P1 8.0%, P2 21.0%, P3 50.0%, P4 21.0%. (b) Think time: 14 s 20.0%, 28 s 45.0%, 56 s 35.0%.
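How one of the four post-search outcomes could be selected from these probabilities is sketched below, using a single uniform random number and cumulative thresholds. This is an illustrative assumption about the mechanism, not the framework's code; the names `POST_SEARCH` and `choose_outcome` are hypothetical.

```python
import random

# Outcome labels and probabilities as given for P1 to P4.
POST_SEARCH = [("P1", 0.08), ("P2", 0.21), ("P3", 0.50), ("P4", 0.21)]


def choose_outcome(u):
    """Map a uniform draw u in [0, 1) to an outcome via cumulative probabilities."""
    cumulative = 0.0
    for name, probability in POST_SEARCH:
        cumulative += probability
        if u < cumulative:
            return name
    return POST_SEARCH[-1][0]  # guard against rounding at the upper edge


def next_outcome(rng=None):
    """Draw the next post-search outcome at random."""
    rng = rng or random.Random()
    return choose_outcome(rng.random())
```

Draws below 0.08 give P1, draws up to about 0.29 give P2, draws up to about 0.79 give P3, and the remainder gives P4.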

If the crawler ran into a faulty page, such as a page that would not load, HTML text that could not be saved, a button that could not be clicked or an unknown error, the recovery function simply returned to the first-level Google search page and continued with the next search query. In other words, whenever an exception occurred, the framework moved on to the next search phrase. Search phrases were repeated if necessary to ensure that 3 hours passed before the end of each session.
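The recovery behavior can be summarized in a minimal sketch of the session loop. The callable `visit` is a hypothetical stand-in for one search together with its follow-up browsing, and the injectable `clock` parameter is an assumption that only exists to make the example testable.

```python
import itertools
import time

SESSION_SECONDS = 3 * 60 * 60  # each session was meant to last 3 hours


def run_session(phrases, visit, duration=SESSION_SECONDS, clock=time.monotonic):
    """Cycle through the search phrases until the session time is used up.

    Any exception raised by `visit` is counted and skipped, so the loop
    continues with the next search phrase.
    """
    stats = {"searches": 0, "exceptions": 0}
    if not phrases:
        return stats
    start = clock()
    # itertools.cycle repeats the phrases when the list runs out early.
    for phrase in itertools.cycle(phrases):
        if clock() - start >= duration:
            break
        stats["searches"] += 1
        try:
            visit(phrase)
        except Exception:
            # Recovery: give up on this query and move on to the next one.
            stats["exceptions"] += 1
    return stats
```

Catching the broad `Exception` class mirrors the description that unknown errors were also handled by continuing with the next query.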

Browser extensions

Furthermore, an important part of the analysis was the incorporation of privacy-focused extensions into the data collection, to determine whether their presence altered the results. Here, one of the two extensions, Ghostery or CatBlock, was added to the browser. Sessions with active extensions were run on both Chrome and Firefox, as well as across the different profiles. On some profiles the extension was added from the first round of data collection, whereas on other profiles the extension was added after the first round of data collection.

4.4 Set up

With the help of Microsoft Azure, multiple virtual machines were created to ensure a sandboxed environment. In total, 29 virtual machines were set up in three main locations, east US, west US and south UK, for further regional analysis. The operating system used during the data collection was Microsoft Windows. The virtual machines had 2 vCPUs, 4 GiB memory and a standard SSD. One persona was implemented on every virtual machine, and every virtual machine used either Firefox or Chrome as its primary browser. Some virtual machines had no extension, whereas others had active extensions from the start. The virtual machine set up can be seen in Table 4.2.

The data collection took place in two rounds over a total of 21 days, referred to as the primary and secondary data collection. For the purpose of consistency, all sessions were started at the same time every day, 16.00 CET. The primary data collection lasted fourteen days. For the secondary data collection, the framework was run for an additional seven days in order to implement extensions on some of the virtual machines. The machines which had extensions implemented in the primary collection were run with these uninstalled in the secondary collection. The opposite applied to machines with no extension active during the primary collection. The collected data set was then used to compare sessions with each other in further analysis.


Collection Analysis

Browser Persona Region Primary Secondary Primary Secondary

Firefox A east US - - X

Firefox B UK - - X

Firefox C east US - Ghostery X X

Firefox D UK - - X

Firefox E east US - - X

Firefox F UK - CatBlock X X

Chrome A east US - - X

Chrome B UK - - X

Chrome C east US - Ghostery X X

Chrome D UK - - X

Chrome E east US - - X

Chrome F UK - - X

Chrome A east US Ghostery - X X

Chrome C east US Ghostery - X X

Chrome E east US Ghostery - X X

Firefox A east US Ghostery - X X

Firefox C east US Ghostery - X X

Firefox E east US Ghostery - X X

Chrome A east US CatBlock - X X

Chrome C east US CatBlock - X X

Chrome E east US CatBlock - X X

Firefox A east US CatBlock - X X

Firefox C east US CatBlock - X X

Firefox E east US CatBlock - X X

Chrome F UK CatBlock - X X

Chrome F UK - CatBlock X

Chrome C east US - Ghostery X

Chrome C UK - Ghostery X

Chrome C west US - Ghostery X

Table 4.2: The virtual machines with their set ups, where "-" denotes no implemented extension and a blank denotes no analysis.

4.5 Data collection

The collection of data was achieved directly through the automated framework and saved into log files. During each session, log files were created that contained information about the session statistics, the URLs visited, the HTML text as well as screenshots from the web pages visited. A scroll function was implemented to ensure that multiple screenshots were taken of each visited page, with the intention of capturing ads. For Firefox sessions, cookies were loaded from a file before every session and saved to it afterwards. Finally, the statistics of each session were saved into a separate log file, containing information such as exceptions, number of clicked links and number of screenshots.
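To illustrate what such a statistics log could look like, the sketch below appends one record per session to a text file. The JSON-lines format, the `SessionStats` fields and the function name `append_stats` are assumptions for this example; the actual log format of the framework is not specified here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SessionStats:
    """Counters mentioned in the text; the field names are hypothetical."""
    exceptions: int = 0
    clicked_links: int = 0
    screenshots: int = 0


def append_stats(path, stats):
    """Append one session's statistics as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(stats)) + "\n")
```

Appending one line per session keeps earlier sessions intact and makes it straightforward to compute averages over a whole collection afterwards.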

After the HTML texts of the web pages had been collected, the text files were put through a word cloud function, developed during the joint group project, to extract the words that appeared most frequently. The word cloud function turned the most frequent words into an image, creating an overview of the words that the browser, third parties and extensions would be able to pick up. Further, the screenshots of the URLs visited during a running session were saved in order to identify potential third parties on the different websites.
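The counting step behind a word cloud can be sketched as follows. This example only covers the word frequencies and not the rendering of the image, and it is not the function developed in the group project; the name `top_words`, the tokenization rule and the optional stop word list are assumptions.

```python
import re
from collections import Counter


def top_words(text, n=3, stop_words=frozenset()):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lower-case the text and keep runs of letters (and apostrophes) as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)
```

In a word cloud, the count of each word would then determine the size at which the word is drawn.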


5 Results

This section is focused on describing and presenting the results of the data collection in this thesis project. The data presented has been collected through active crawler sessions over the span of three weeks, with 29 VMs in usage.

5.1 Browser performance

As shown in Table 5.1, the Chrome framework had a higher average success rate than the Firefox framework, where a successful session is one in which the framework did not crash during its 3 hours of running. For the primary collection, Chrome’s average success rate was 71% whereas Firefox’s average success rate was 50%. The success rate during the secondary collection was higher than during the primary collection for both Chrome and Firefox machines.

With regard to exceptions, Chrome had an average of 119 exceptions per session in both the primary and the secondary collection. Firefox had an average of 126 exceptions per session in the primary collection and 127 in the secondary collection, which is nearly the same.

Browser | Success rate (primary) | Exceptions (primary) | Success rate (secondary) | Exceptions (secondary)
Chrome | 71% | 119 | 77% | 119
Firefox | 50% | 126 | 63% | 127

Table 5.1: The average browser performance during the collection.

Table 5.2 displays which sessions crashed, together with the number of crashes and the success rate for each day. The table shows that both the Chrome and Firefox frameworks experienced a large number of crashes during day 1. Furthermore, the Firefox framework continuously suffered from crashes on various virtual machines for the duration of the primary and secondary collection. The Firefox machines with the CatBlock extension crashed more often than the machines with Ghostery.


Primary collection: days 1 to 14. Secondary collection: days 15 to 21.

Browser | Persona | Extension (primary) | Days 1 to 14 | Extension (secondary) | Days 15 to 21
Chrome | A | - | N Y Y Y Y Y Y Y Y Y Y Y Y Y
Chrome | B | - | N Y Y Y N Y Y Y Y Y Y Y Y Y
Chrome | C | - | Y Y Y Y Y N Y N N Y Y Y Y N | Ghostery | Y Y N Y N Y Y
Chrome | D | - | Y Y Y Y N Y N Y Y Y Y Y N N
Chrome | E | - | N Y Y N Y Y Y Y Y Y Y Y N Y
Chrome | F | - | Y N N Y Y Y Y N Y Y Y Y N Y
Chrome | C | - | N N Y N Y Y Y Y Y Y Y Y Y Y | Ghostery | Y Y N Y Y Y N
Chrome | F | - | N Y Y N Y Y Y Y Y N N Y N Y | CatBlock | Y N Y N Y N Y
Chrome | C | - | N Y Y Y Y Y Y Y Y Y Y Y Y Y | Ghostery | Y Y Y Y N Y Y
Chrome | C | - | N Y N Y Y Y Y Y Y N Y N Y Y | Ghostery | Y Y N Y Y Y N
Chrome | A | Ghostery | Y Y Y Y Y N Y Y Y N Y Y Y Y | - | Y Y Y Y Y Y Y
Chrome | C | Ghostery | N Y Y Y Y Y Y Y Y Y Y Y N Y | - | N Y Y Y Y Y Y
Chrome | E | Ghostery | N Y Y Y N Y Y Y Y Y Y Y Y Y | - | Y Y Y Y Y Y N
Chrome | A | CatBlock | N Y Y Y Y Y Y Y Y Y Y Y Y N | - | Y N N Y N Y Y
Chrome | C | CatBlock | N Y Y Y Y Y Y Y Y Y Y Y Y Y | - | Y Y Y Y N N Y
Chrome | E | CatBlock | N Y Y Y Y Y Y Y Y N N Y Y Y | - | Y Y Y Y N Y Y
Chrome | F | CatBlock | Y Y Y Y N Y Y Y Y Y Y Y Y Y | - | Y Y Y N N Y Y
Firefox | A | - | N N N N N Y N N N Y Y N Y Y
Firefox | B | - | N N N Y Y N N N Y Y Y N Y Y
Firefox | C | - | N N N Y Y Y Y Y Y Y Y Y N Y | Ghostery | Y Y N Y Y Y Y
Firefox | D | - | N N N N N Y N Y N N N N N N
Firefox | E | - | N N N Y Y Y Y N Y Y Y Y N Y
Firefox | F | - | N N N Y N Y N Y N Y Y N N N | CatBlock | Y Y Y Y N Y Y
Firefox | A | Ghostery | N N N N N Y N Y N Y Y Y Y Y | - | Y Y Y Y N Y Y
Firefox | C | Ghostery | N N N N Y Y Y Y Y N Y Y Y Y | - | N Y Y Y Y Y N
Firefox | E | Ghostery | N N N N N Y Y N Y Y Y N N Y | - | Y Y N Y N Y N
Firefox | A | CatBlock | N N N Y N Y N N N N N Y Y Y | - | N N N N N Y N
Firefox | C | CatBlock | N N N Y N N Y Y N N N Y Y N | - | Y N Y Y N Y N
Firefox | E | CatBlock | N N N N N N Y N N N N Y N N | - | N N Y N N N N

Summary | Days 1 to 14 | Days 15 to 21
Chrome, crashes | 12 2 2 3 4 2 1 2 1 4 2 1 5 3 | 1 2 4 2 6 2 3
Chrome, no crashes | 5 15 15 14 13 15 16 15 16 13 15 16 12 14 | 11 10 8 10 6 10 9
Chrome, success rate | 24% 71% 71% 67% 62% 71% 76% 71% 76% 62% 71% 76% 57% 67% | 85% 77% 62% 77% 46% 77% 69%
Firefox, crashes | 12 12 12 6 8 3 6 6 7 5 4 5 6 4 | 3 3 3 2 6 1 5
Firefox, no crashes | 0 0 0 6 4 9 6 6 5 7 8 7 6 8 | 5 5 5 6 2 7 3
Firefox, success rate | 0% 0% 0% 50% 33% 75% 50% 50% 42% 58% 67% 58% 50% 67% | 63% 63% 63% 75% 25% 88% 38%
Total, crashes | 24 14 14 9 12 5 7 8 8 9 6 6 11 6 | 4 5 7 4 12 3 8
Total, no crashes | 5 15 15 20 17 24 22 21 21 20 23 23 18 22 | 16 15 13 16 8 17 12
Total, success rate | 15% 45% 45% 61% 52% 73% 67% 64% 64% 61% 70% 70% 55% 67% | 76% 71% 62% 76% 38% 81% 57%

Table 5.2: The crash rate experienced during the collection, where "Y" entails a successfully run session and "N" entails a crash.

Table 5.3 shows further statistics of visited URLs during the days analyzed, together with the total number of ads found on every machine during these days. As visible in the table, there is some correlation between the number of visited URLs and the number of ads: overall, more visited URLs during a session led to more displayed ads. The number of ads varies, however, and machines with a high number of visited URLs do not necessarily have a high number of found ads, as is the case for the machines with extensions.

5.2 Ads

Here, data will be presented showing differences in ads between sessions. Early sessions have been compared with later sessions, and the comparison is focused around total number of ads and total number of personalized ads. These two components will be measured against each other for several sessions, and between different virtual environments. These stats will be presented in several heat maps.

Every session captured a different number of screenshots. How many ads were captured depends on the number of exceptions and the ability to scroll on the visited domain. Not all screenshots contain ads, and those without ads are of little interest to the results. Some web pages are blocked by pop-ups or cookie windows, which makes screenshots of these difficult to analyze. Furthermore, the analysis has been narrowed down to the ads placed on the upper part of web pages. Ads situated further down on web pages have therefore not been included in the presented results.

Each day cell gives visited URLs / identified ads. Days 1, 7 and 14 belong to the primary collection, days 15 and 21 to the secondary collection.

Browser | Persona | Extension (primary) | Day 1 | Day 7 | Day 14 | Extension (secondary) | Day 15 | Day 21
Chrome | A | - | 38 / 25 | 240 / 45 | 223 / 49
Chrome | B | - | 16 / 2 | 199 / 32 | 203 / 9
Chrome | C | - | 237 / 32 | 216 / 48 | 17 / 0 | Ghostery | 225 / 4 | 235 / 3
Chrome | D | - | 235 / 14 | 217 / 20 | 109 / 31
Chrome | E | - | 60 / 7 | 229 / 60 | 223 / 36
Chrome | F | - | 235 / 49 | 217 / 87 | 238 / 78
Chrome | C | - | 63 / 15 | 229 / 45 | 216 / 39 | Ghostery | 200 / 4 | 220 / 3
Chrome | F | - | 33 / 2 | 249 / 62 | 222 / 18 | CatBlock | 214 / 1 | 189 / 1
Chrome | C | - | 17 / 1 | 228 / 17 | 216 / 30 | Ghostery | 210 / 3 | 235 / 0
Chrome | C | - | 23 / 5 | 211 / 50 | 241 / 69 | Ghostery | 228 / 4 | 219 / 3
Chrome | A | Ghostery | 246 / 24 | 214 / 2 | 222 / 8 | - | 244 / 66 | 225 / 46
Chrome | C | Ghostery | 69 / 7 | 233 / 3 | 235 / 2 | - | 97 / 18 | 237 / 54
Chrome | E | Ghostery | 15 / 0 | 233 / 5 | 230 / 3 | - | 205 / 55 | 110 / 13
Chrome | A | CatBlock | 41 / 2 | 216 / 1 | 208 / 10 | - | 242 / 53 | 208 / 58
Chrome | C | CatBlock | 227 / 6 | 231 / 1 | 226 / 2 | - | 141 / 45 | 215 / 26
Chrome | E | CatBlock | 16 / 1 | 220 / 2 | 178 / 0 | - | 211 / 34 | 226 / 40
Chrome | F | CatBlock | 236 / 6 | 229 / 1 | 221 / 2 | - | 248 / 69 | 219 / 48
Firefox | A | - | 62 / 21 | 167 / 46 | 187 / 63
Firefox | B | - | 16 / 4 | 199 / 31 | 198 / 19
Firefox | C | - | 53 / 26 | 212 / 75 | 207 / 73 | Ghostery | 207 / 4 | 204 / 4
Firefox | D | - | 36 / 7 | 161 / 12 | 186 / 31
Firefox | E | - | 13 / 3 | 212 / 33 | 208 / 62
Firefox | F | - | 59 / 30 | 126 / 35 | 204 / 60 | CatBlock | 181 / 0 | 189 / 2
Firefox | A | Ghostery | 13 / 1 | 88 / 2 | 180 / 5 | - | 179 / 72 | 176 / 60
Firefox | C | Ghostery | 42 / 2 | 217 / 6 | 216 / 3 | - | 110 / 28 | 146 / 38
Firefox | E | Ghostery | 13 / 1 | 188 / 5 | 180 / 3 | - | 175 / 46 | 62 / 22
Firefox | A | CatBlock | 23 / 1 | 115 / 0 | 187 / 3 | - | 185 / 47 | 154 / 25
Firefox | C | CatBlock | 9 / 0 | 190 / 1 | 47 / 0 | - | 161 / 39 | 150 / 35
Firefox | E | CatBlock | 1 / 0 | 178 / 1 | 5 / 0 | - | 3 / 0 | 28 / 11

Table 5.3: The number of visited URLs and identified ads day 1, 7, 14, 15 and 21.

Without an extension added to the browser, ads appeared very often. As seen in Table 5.4, the machines with no extension encountered considerably more ads than the Ghostery and CatBlock machines.

Each day cell gives the total number of ads, with the share of targeted ads in parentheses.

Day | Chrome, no extension | Chrome, Ghostery | Chrome, CatBlock | Firefox, no extension | Firefox, Ghostery | Firefox, CatBlock
Day 1 | 146 (31%) | 31 (45%) | 15 (8%) | 91 (34%) | 4 (100%) | 1 (0%)
Day 7 | 337 (49%) | 10 (70%) | 5 (25%) | 232 (47%) | 13 (74%) | 2 (33%)
Day 14 | 239 (20%) | 13 (96%) | 14 (13%) | 308 (38%) | 11 (82%) | 3 (11%)
Total ads/VM (primary collection) | 120 | 18 | 9 | 105 | 9 | 2
Day 15 | 340 (50%) | 15 (41%) | 1 (100%) | 232 (39%) | 4 (75%) | 0 (0%)
Day 21 | 285 (46%) | 9 (58%) | 1 (100%) | 191 (48%) | 4 (50%) | 2 (50%)
Total ads/VM (secondary collection) | 89 | 6 | 2 | 71 | 8 | 2

Table 5.4: An overview of encountered ads.

Furthermore, there was no significant difference in the number of ads or targeted ads encountered on Chrome versus Firefox. However, there was a difference in personalization depending on the day. Day 7 showed a higher level of personalization than both day 1 and day 14 on the majority of the machines. The only deviation from this pattern was found on the machines that had Ghostery implemented. For day 15 the level of personalization was similar to that of day 1. By the end of the secondary collection the percentage of targeted ads was similar for Firefox and Chrome, though they experienced different numbers of total ads.

Primary collection

Here we will present data from the primary data collection. This collection lasted fourteen days.

Firefox

The results of ads found in Firefox sessions have been compressed into heat maps. Three shades are used to display captured ads. The strongest red represents a share of ads of 25% or higher, the medium shade represents a share between 10% and 25%, and the weakest shade represents a share lower than 10%. The interest categories are taken from the persona design, displayed in Table 4.1, whereas the private life category is intended to capture ads related to aspects such as marital status and job.
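The relationship between categorized ads and the shades of a heat map row can be illustrated with a small sketch. It assumes that every ad has been assigned exactly one category label by hand and that the value 10% itself belongs to the medium shade; the function names `category_shares` and `shade` are hypothetical.

```python
from collections import Counter


def category_shares(ad_categories):
    """Share of ads per category for one machine and day."""
    total = len(ad_categories)
    if total == 0:
        return {}
    counts = Counter(ad_categories)
    return {category: n / total for category, n in counts.items()}


def shade(share):
    """Map a share of ads (0.0 to 1.0) to one of the three heat map shades."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    if share >= 0.25:
        return "strong"
    if share >= 0.10:
        return "medium"
    return "weak"
```

For example, a session with four ads labeled Horses, Other, Other and Gardening gives shares of 0.25, 0.5 and 0.25, which all fall in the strongest shade.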

Interest categories by persona: A: Horses, Celebrities, Gardening; B: Hair, Fashion, DIY; C: Movies, Stock trading, Interior design; D: American football, Baseball, Medicine; E: Cooking, Traveling, Technology; F: Birds, Baking, Crosswords. Further columns: Private life, Other, Personalization. Each row lists the non-empty cells for one persona.

Day 1, Firefox
A 0.143 0.095 0.095 0.667 0.000
B 0.500 0.500 0.500
C 0.385 0.308 0.115 0.192 0.692
D 0.286 0.286 0.143 0.286 0.429
E 0.333 0.667 0.333
F 0.067 0.033 0.200 0.033 0.067 0.067 0.533 0.067

Day 7, Firefox
A 0.130 0.239 0.130 0.043 0.022 0.043 0.022 0.065 0.304 0.435
B 0.161 0.516 0.097 0.032 0.194 0.677
C 0.027 0.107 0.307 0.107 0.013 0.013 0.040 0.053 0.333 0.573
D 0.083 0.083 0.250 0.083 0.167 0.333 0.583
E 0.061 0.030 0.061 0.061 0.061 0.152 0.182 0.030 0.364 0.364
F 0.029 0.171 0.143 0.171 0.029 0.029 0.143 0.286 0.171

Day 14, Firefox
A 0.159 0.063 0.095 0.190 0.048 0.048 0.048 0.016 0.032 0.048 0.254 0.365
B 0.211 0.263 0.053 0.053 0.421 0.579
C 0.014 0.055 0.082 0.219 0.192 0.027 0.014 0.041 0.096 0.260 0.534
D 0.161 0.032 0.065 0.065 0.065 0.032 0.032 0.129 0.065 0.355 0.226
E 0.081 0.016 0.048 0.016 0.016 0.226 0.113 0.097 0.387 0.435
F 0.017 0.167 0.033 0.017 0.067 0.017 0.050 0.017 0.100 0.033 0.483 0.150

Table 5.5: Firefox heat maps of day 1, 7 and 14.

For Firefox machines, the ads were most closely matched to the persona’s specific interest categories on day 7, as displayed in Table 5.5. In this heat map there is a clear diagonal pattern of ads being related to the interest categories of the persona, resulting in a stronger red color in these cells. By day 14, ads from more varied interest categories were discovered, and these were not as targeted as the findings of day 7. Day 1 did not show signs of any particular personalization, as the found ads were fairly spread across the multiple interest categories.


Chrome

The results obtained when looking closer at the ads from the Chrome machines have also been compressed into heat maps, shown in Table 5.6. These demonstrate, like previous heat maps, how the ads were categorized into different interest categories. When the categories were not applicable the ad was inserted into "other".

Column groups: Persona A, B, C, D, E, F; Private life. Each row lists the non-empty cells for one persona.

Day 1, Chrome
A 0.160 0.080 0.040 0.120 0.040 0.040 0.520 0.280
B 0.500 0.500 0.500
C 0.031 0.031 0.125 0.094 0.094 0.031 0.594 0.219
D 0.071 0.071 0.071 0.071 0.714 0.071
E 0.429 0.286 0.286 0.714
F 0.082 0.041 0.020 0.061 0.020 0.082 0.041 0.061 0.592 0.102

Day 7, Chrome
A 0.222 0.067 0.178 0.022 0.022 0.022 0.022 0.444 0.489
B 0.406 0.375 0.031 0.188 0.813
C 0.146 0.042 0.438 0.042 0.021 0.042 0.083 0.188 0.708
D 0.050 0.050 0.050 0.050 0.050 0.050 0.200 0.150 0.350 0.250
E 0.200 0.033 0.067 0.050 0.183 0.250 0.050 0.017 0.150 0.500
F 0.023 0.126 0.011 0.023 0.069 0.057 0.023 0.046 0.034 0.080 0.506 0.184

Day 14, Chrome
A 0.265 0.041 0.082 0.041 0.020 0.041 0.020 0.041 0.449 0.429
B 0.556 0.111 0.111 0.222 0.556
C 0.000
D 0.056 0.139 0.139 0.250 0.139 0.028 0.250 0.000
E 0.065 0.097 0.032 0.032 0.032 0.097 0.065 0.581 0.194
F 0.028 0.194 0.194 0.139 0.028 0.417 0.028

Table 5.6: Chrome heat maps of day 1, 7 and 14.

For Chrome machines, the results from day 1, day 7 and day 14 show that the ads were most personalized on day 7. This is visible through the diagonal of stronger colors, which indicates that the ads match the personas’ interests. For day 1 and day 14, the ads were more spread out over the different interest categories.

Ghostery

With Ghostery added to the browser, the extension blocked the majority of the ads from being visible to the user. This resulted in a significantly smaller number of ads to analyze.

On day 1, shown in Table 5.7, the found ads, though few, showed strong correlations to persona interests. For Firefox machines with Ghostery implemented the percentage of personalization was as high as 100%. For Chrome, the personalization was lower: around 63% for Persona A and 0% for Persona E. By day 7, the ads discovered were more personalized than on day 1, and there was no significant difference depending on which browser the extension was added to. Finally, day 14 presented the highest level of personalization. The heat map displays patterns of red cells, which indicates that the ads are strongly correlated to persona interests.


Interest categories by persona: A: Horses, Celebrities, Gardening; C: Movies, Stock trading, Interior design; E: Cooking, Traveling, Technology. Further columns: Private life, Other, Personalization. Each row lists the non-empty cells for one machine.

Day 1, Ghostery (3 Chrome + 3 Firefox)
A 0.417 0.042 0.083 0.083 0.375 0.625
C 0.714 0.286 0.714
E 0.000
A 1.000 1.000
C 1.000 1.000
E 1.000 1.000

Day 7, Ghostery (3 Chrome + 3 Firefox)
A 0.500 0.500 0.500
C 0.333 0.667 1.000
E 0.200 0.400 0.400 0.600
A 0.500 0.500 1.000
C 0.167 0.500 0.167 0.167 0.833
E 0.400 0.600 0.400

Day 14, Ghostery (3 Chrome + 3 Firefox)
A 0.250 0.125 0.250 0.250 0.125 0.875
C 1.000 1.000
E 1.000 1.000
A 0.600 0.200 0.200 0.800
C 0.667 0.333 0.667
E 1.000 1.000

Table 5.7: Ghostery heat maps of day 1, 7 and 14, for 3 Chrome- and 3 Firefox machines.

CatBlock

When CatBlock was added to the browser, images of cats appeared on the web pages instead of regular ads. CatBlock thereby removed most of the ads from the data collection and instead provided various images of cats. The number of cats displayed during sessions has not been included in the analysis of ads.

For the machines where the CatBlock extension was added to the two browsers, the resulting percentage of personalization was low. As can be seen in Table 5.8, few ads were found and these ads had little to no correlation to the persona interests. By day 7 the number of targeted ads had increased, though there was still a low percentage of targeting towards persona interests. Finally, for day 14 the personalization had decreased compared with day 7. For machines with CatBlock added to the browser, there was no significant difference between browsers with regard to the number of targeted ads.

Persona differences

Differences in percentages of targeted ads for the base personas are displayed in Figure 5.1. Here, it is visible for all three personas that the level of personalization is at its highest during day 7, for a majority of the browsers and browser-added extensions. When comparing the personas with each other, it is clear that persona C has experienced the overall highest level of targeted advertising.


Interest categories by persona: A: Horses, Celebrities, Gardening; C: Movies, Stock trading, Interior design; E: Cooking, Traveling, Technology; F: Birds, Baking, Crosswords. Further columns: Private life, Other, Personalization. Each row lists the non-empty cells for one machine.

Day 1, CatBlock (4 Chrome + 3 Firefox)
A 1.000 0.000
C 0.167 0.833 0.167
E 1.000 0.000
F 0.167 0.167 0.667 0.167
A 1.000 0.000
C 0.000
E 0.000

Day 7, CatBlock (4 Chrome + 3 Firefox)
A 0.000
C 1.000 0.000
E 0.500 0.500 1.000
F 1.000 0.000
A 0.000
C 1.000 1.000
E 1.000 0.000

Day 14, CatBlock (4 Chrome + 3 Firefox)
A 0.100 0.100 0.600 0.000
C 0.500 0.500 0.000
E 0.000
F 0.500 0.500 0.500
A 0.333 0.333 0.333 0.333
C 0.000
E 0.000

Table 5.8: CatBlock heat maps of day 1, 7 and 14, for 4 Chrome- and 3 Firefox machines.

Figure 5.1: An overview of personalized ads for the base personas A, C and E for day 1, 7 and 14. Panels: (a) Persona A, (b) Persona C, (c) Persona E. Each panel shows the level of personalization (0% to 100%) on day 1, day 7 and day 14 for Firefox, Chrome, Firefox+Ghostery, Chrome+Ghostery, Firefox+CatBlock and Chrome+CatBlock.

Secondary collection

Here we will present data from the secondary collection. This data collection lasted for seven days, following the first fourteen days. The machines which did not have an extension active during the primary data collection now had an extension added to the browser. The machines which did have an extension active during the primary data collection now ran without an extension.


Firefox

The following heat maps show the ad landscape of Firefox machines that had an extension active during the primary data collection, which was disabled during days 15 to 21. Day 15, displayed in Table 5.9, shows personalization to a degree varying between 34% and 63%. The highest degree of targeted ads was visible for both of the machines with persona E implemented, but for one of the machines no screenshots were found during Day 15. For Day 21, the degree of personalization varied between 24% and 73%.

Column groups: Persona A, B, C, D, E, F; Private life. Each row lists the non-empty cells for one machine.

Day 15, Firefox
A 0.097 0.125 0.014 0.014 0.014 0.042 0.083 0.167 0.444 0.389
C 0.036 0.036 0.357 0.071 0.036 0.036 0.036 0.393 0.429
E 0.087 0.022 0.022 0.217 0.174 0.152 0.087 0.239 0.630
A 0.064 0.043 0.085 0.191 0.021 0.021 0.021 0.021 0.043 0.085 0.064 0.149 0.191 0.340
C 0.103 0.154 0.179 0.128 0.051 0.051 0.128 0.026 0.077 0.103 0.538
E 0.000

Day 21, Firefox
A 0.033 0.017 0.167 0.167 0.083 0.050 0.017 0.050 0.033 0.383 0.250
C 0.053 0.316 0.079 0.079 0.053 0.026 0.132 0.263 0.526
E 0.182 0.273 0.227 0.045 0.273 0.727
A 0.040 0.040 0.120 0.040 0.040 0.040 0.080 0.040 0.040 0.080 0.080 0.080 0.040 0.040 0.200 0.240
C 0.143 0.057 0.286 0.086 0.029 0.086 0.057 0.029 0.229 0.457
E 0.182 0.273 0.273 0.273 0.727

Table 5.9: Firefox heat map of day 15 and 21.

Chrome

The following heat maps, Table 5.10, show the ad landscape of Chrome machines that had an extension active from the primary data collection, which during days 15 to 21 was disabled. These show strong tendencies of personalization in correlation to persona interests.


Column groups: Persona A, B, C, D, E, F; Private life. Each row lists the non-empty cells for one machine.

Day 15, Chrome
A 0.061 0.015 0.470 0.015 0.030 0.061 0.015 0.152 0.182 0.697
C 0.556 0.167 0.278 0.722
E 0.018 0.018 0.145 0.127 0.200 0.109 0.382 0.582
A 0.019 0.358 0.057 0.038 0.075 0.019 0.057 0.019 0.358 0.396
C 0.022 0.044 0.044 0.111 0.222 0.067 0.022 0.022 0.444 0.400
E 0.029 0.088 0.412 0.088 0.118 0.265 0.618
F 0.029 0.072 0.029 0.014 0.058 0.058 0.174 0.014 0.014 0.043 0.493 0.072

Day 21, Chrome
A 0.130 0.174 0.174 0.065 0.022 0.043 0.043 0.348 0.348
C 0.037 0.593 0.019 0.056 0.093 0.111 0.019 0.074 0.278
E 0.692 0.077 0.231 0.769
A 0.103 0.052 0.379 0.103 0.017 0.017 0.052 0.017 0.017 0.017 0.034 0.190 0.569
C 0.038 0.500 0.038 0.154 0.115 0.038 0.038 0.077 0.346
E 0.075 0.025 0.125 0.125 0.300 0.050 0.025 0.275 0.500
F 0.229 0.146 0.063 0.083 0.146 0.146 0.083 0.104 0.375

Table 5.10: Chrome heat maps of day 15 and 21.

Ghostery

The heat maps for the machines which had Ghostery implemented during the secondary data collection are presented in Table 5.11. Both heat maps show correlation between Persona C’s interests and the found ads, though the category "Stock trading" accounts for a majority of the ads.

Columns: Persona C: Movies, Stock trading, Interior design; Private life; Other; Personalization. Each row lists the non-empty cells for one machine.

Day 15, Ghostery (3 Chrome + 3 Firefox)
C 0.250 0.750 0.250
C 0.250 0.750 0.250
C 0.667 0.333 0.667
C 0.250 0.250 0.500 0.500
C 0.500 0.250 0.250 0.750

Day 21, Ghostery (3 Chrome + 3 Firefox)
C 0.333 0.333 0.333 0.667
C 0.333 0.333 0.333 1.000
C 0.667 0.333 0.667
C 0.000
C 0.250 0.250 0.500 0.500

Table 5.11: Ghostery heat map of day 15 and 21.

CatBlock

Results from the ad landscape found through machines which had CatBlock implemented during the secondary data collection are visible in Table 5.12. During Day 15, no targeted ads were found, whereas Day 21 showed a higher level of targeted ads.

Columns: Persona F: Birds, Baking, Crosswords; Private life. Each row lists the non-empty cells for one machine.

Day 15, CatBlock (1 Chrome + 1 Firefox)
F 1.000 0.000
F 0.000

Day 21, CatBlock (1 Chrome + 1 Firefox)
F 1.000 1.000
F 0.500 0.500 0.500


5.3 Word clouds

The data presented below is based on the personas A, C and F. These personas used (i) Chrome and Firefox, (ii) Firefox with Ghostery added to the browser, (iii) Chrome with CatBlock added to the browser. The active users are the previously mentioned personas Mary Johnson, Patricia Jones and Robert Smith. The presented data is taken from the sessions of day 1, 7, 14, 15 and 21, from multiple different virtual machines. The intention is for the word clouds to give an overview of the collected HTML texts of these profiles.

The results present all the gathered text in figures, where words are written in differently sized text, meaning that words of greater size appear more frequently.

Firefox and Chrome

As can be seen from Figures 5.2 and 5.3, the phrases that appear most frequently in the results of Mary Johnson’s search queries are "Horse", "Will" and "Plant". For Chrome, Mary’s interests appear larger on day 14 than on day 1, as seen in Figure 5.3.

(a) Day 1 (b) Day 7 (c) Day 14

Figure 5.2: The results from the word cloud function with Firefox, for persona A.

(a) Day 1 (b) Day 7 (c) Day 14

Figure 5.3: The results from the word cloud function with Chrome, for persona A.

Ghostery

With Ghostery added to Firefox, the phrases "Stock", "One" and "Best" appear most frequently for persona C, as displayed in Figure 5.4. Patricia Jones’s interests are stock trading, movies and interior design.

(a) Day 1 (b) Day 7 (c) Day 14 (d) Day 15 (e) Day 21


CatBlock

Looking closer at Figure 5.5, some of Robert Smith’s more common phrases were "Make", "Time" and "Bird". Day 14 displays these words just as often, though "Crossword Puzzle" appears larger.

(a) Day 1 (b) Day 7 (c) Day 14 (d) Day 15 (e) Day 21
