
DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

Automatic fingerprinting of websites
Using clustering and multiple bag-of-words models

Automatisk fingeravtryckning av hemsidor
Med användning av klustring och flera ordvektormodeller

ALFRED BERG
NORTON LAMBERG

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH


Automatic fingerprinting of websites
Using clustering and multiple bag-of-words models

Automatisk fingeravtryckning av hemsidor
Med användning av klustring och flera ordvektormodeller

Alfred Berg
Norton Lamberg

Examensarbete inom Datateknik, Grundnivå, 15 hp
Handledare: Shahid Raza
Examinator: Ibrahim Orhan
TRITA-CBH-GRU-2020:060
KTH, Skolan för kemi, bioteknologi och hälsa
141 52 Huddinge, Sverige


Abstract

Fingerprinting a website is the process of identifying which technologies the website uses, such as its web applications and JavaScript frameworks. Current fingerprinting methods use manually created fingerprints for each technology they look for. These fingerprints consist of multiple text strings that are matched against an HTTP response from a website. Creating these fingerprints for each technology can be time-consuming, which limits which technologies fingerprints can be built for. This thesis presents a potential solution by utilizing unsupervised machine learning techniques to cluster websites by the web applications and JavaScript frameworks they use, without requiring manually created fingerprints. Our solution uses multiple bag-of-words models combined with the dimensionality reduction technique t-SNE and the clustering algorithm OPTICS. Results show that some technologies, for example Drupal, achieve a precision of 0.731 and a recall of 0.485 without any training data. These results lead to the conclusion that the proposed solution could plausibly be used to cluster websites by the web application and JavaScript frameworks in use. However, further work is needed to increase the precision and recall of the results.

Keywords

Clustering, fingerprinting, OPTICS, t-SNE, headless browser, bag-of-words, unsupervised machine learning


Sammanfattning

Att ta fingeravtryck av en hemsida innebär att identifiera vilka teknologier som en hemsida använder, såsom dess webbapplikationer och JavaScript-ramverk. Nuvarande metoder för att göra fingeravtryckningar av hemsidor använder sig av manuellt skapade fingeravtryck för varje teknologi som de letar efter. Dessa fingeravtryck består av flera textsträngar som matchas mot HTTP-svar från hemsidor. Att skapa fingeravtryck kan vara en tidskrävande process vilket begränsar vilka teknologier som fingeravtryck kan skapas för. Den här rapporten presenterar en potentiell lösning genom att utnyttja oövervakade maskininlärningstekniker för att klustra hemsidor efter vilka webbapplikationer och JavaScript-ramverk som används, utan att manuellt skapa fingeravtryck. Detta uppnås genom att använda flera ordvektormodeller tillsammans med dimensionalitetreducerings-tekniken t-SNE och klustringsalgoritmen OPTICS. Resultatet visar att vissa teknologier, till exempel Drupal, får en precision på 0,731 och en recall på 0,485 utan någon träningsdata. Detta leder till slutsatsen att den föreslagna lösningen möjligtvis kan användas för att klustra hemsidor efter de webbapplikationer och JavaScript-ramverk som används. Men mera arbete behövs för att öka precision och recall av resultaten.

Nyckelord

Klustring, fingeravtryckning, OPTICS, t-SNE, huvudlös webbläsare, ordvektor, oövervakad maskininlärning


Acknowledgment

We would like to give a special thanks to Tom Hudson for offering technical counsel during the writing of this thesis, and for providing the basis of the data collection tool that we could continue to develop upon.


Table of contents

1 Introduction
1.1 Problem statement
1.2 Goal of the project
1.3 Scope of the project and limitations
2 Theory and background
2.1 Fingerprinting
2.2 Headless browser
2.3 JavaScript window object
2.4 HTML document
2.5 Supervised and unsupervised learning
2.6 Clustering
2.6.1 Partitional clustering algorithms
2.6.2 Hierarchical clustering algorithms
2.6.3 Density-based clustering
2.6.4 Clustering performance evaluation
2.7 Dimensionality Reduction and sample size
2.7.1 Sample Size
2.7.2 Curse of dimensionality
2.7.3 SVD and truncated SVD
2.7.4 t-SNE
2.8 Feature extraction
2.8.1 Bag-of-Words
2.8.2 N-Gram
2.9 Related Works
3 Methodology
3.1 Supervised learning or unsupervised learning
3.2 Data collection
3.3 Feature extraction
3.3.1 Dimensionality Reduction
3.3.2 Clustering algorithm
3.4 Alternative method
3.5 Architecture
3.5.1 Hardware
3.5.2 Software
4 Results
4.1 Labeling the data for comparison
4.2 Results of our method
4.2.1 Dataset
4.2.2 K-Means clustering
4.2.3 OPTICS clustering
4.3 Observations
4.3.1 Truncated SVD
4.3.2 Wappalyzer false negative
4.3.3 Empty and almost empty data from sites
4.4 Runtime
4.5 Evaluation results
4.5.1 Wordpress
4.5.2 jQuery
4.5.3 Drupal
4.5.4 ASP.NET
4.5.5 AddThis
5 Discussion
5.1 Clustering results
5.2 Accuracy and reliability of evaluation
5.2.1 Dataset inconsistencies
5.2.2 Significance of the size of dataset
5.3 Societal, ethical, environmental and economical impact
6 Conclusion
References
Image references
Appendix 1
Appendix 2
Appendix 3


1 Introduction

Today, there are a vast number of websites using a great variety of technologies. Being able to identify and categorize properties associated with these technologies is useful in several fields. One example of a property is the technology stack in use by a website, such as which websites use Wordpress or certain frameworks. The process of taking such a property and creating a unique identifier based on it is called fingerprinting. One field where this is valuable is security, where fingerprinting can be used to find groups of web applications utilizing similar technologies. Knowing the technologies in use by a company is helpful if a weakness is found within one of these technologies, as this information can be used to determine which web applications should be prioritized for patching. Similarly, finding out how widespread certain technologies are on the market is useful for market research.

1.1 Problem statement

Categorizing the technologies used by websites via fingerprinting is nothing new. However, in comparison to manual fingerprinting methods such as the ones used by Wappalyzer [1], an automated method could allow a larger part of the web to be fingerprinted while potentially reducing the time required to perform the fingerprinting. Currently, there is no readily available tool that fingerprints websites based on the web servers and web applications in use without relying on previously defined fingerprints for each technology.

Thanks to the development of clustering algorithms, it is possible to take unlabeled data and group it into clusters. These algorithms can handle large amounts of data and group samples based on their likeness to one another. Clustering could possibly be used to create an automated method for fingerprinting websites that groups web servers and web applications without needing training data about any technology. This would also open up the possibility of fingerprinting new or internal frameworks, such as when a company or service utilizes its own internal technologies, as well as speeding up the fingerprinting process by removing the need to build fingerprints for each technology.

1.2 Goal of the project

The goal of this thesis is to determine if it is possible to create a tool that can, without using previously defined fingerprints for each technology, fingerprint websites based on the technologies used. The types of technologies in focus are different web applications and web servers. The precision, recall, and silhouette coefficient of said fingerprints are determined and their usefulness compared to pre-existing fingerprinting technologies.


In order to achieve this, a tool needs to be created that can gather the relevant data from websites that the fingerprints are to be based on. After that, another tool has to be built that groups the websites, based on the collected data, by the web server and web application in use. Our results are compared with existing methods in a test on unlabeled data. A large number of websites are chosen to run the tool on, which tests the tool's ability to operate on live data but has the drawback that the ground truth of the technology stack is not known.

The results from our fingerprinting method need to be compared with pre-existing fingerprinting tools. The metrics used to compare the clustering to pre-existing tools are precision and recall. Additionally, the silhouette coefficient is used to determine how well separated the clusters are for the compatible clustering algorithms. To assist in the evaluation, the results are spot-checked.

1.3 Scope of the project and limitations

Fingerprinting in this thesis refers to the method of taking a list of websites and being able to determine parts of the technology stack behind the websites by analyzing the response to HTTP requests. Pre-existing tools such as Wappalyzer [1] do this by searching for specific strings in the responses, such as “href='/wp-content” for an indication that Wordpress might be used. These strings are built up by people manually analyzing pages that run the technology that the fingerprint is being built for.

This thesis focuses on ways to automatically group websites by the technology used, instead of creating specific fingerprinting strings for each technology as Wappalyzer does. Additionally, the focus is on grouping the websites by the web server in use (such as Apache, Nginx, or Tomcat) and the web application in use (such as Wordpress, Magento, Drupal, etc.). Applications that are not served over HTTPS on port 443 are not considered.

Another limitation is that only the root path of the chosen sites is fingerprinted. No crawling of paths or domains is involved in this thesis. This means that applications on the path www.example.com/ are fingerprinted, but applications on paths like www.example.com/blog are not. There is also a limit on how many different automatic fingerprinting techniques can reasonably be built, tested, and evaluated in the time span allotted for this thesis.


2 Theory and background

This chapter presents the theoretical framework for the background of the problem and the proposed solution. It also presents related works within fingerprinting and clustering.

2.1 Fingerprinting

In a broad sense, fingerprinting refers to creating a unique identifier for different items. If the items are the same, the fingerprint will also be the same [2]. In this thesis, fingerprinting refers to identifying the technologies in use by websites. Two sites will have the same fingerprint if they use the same technologies. There is a lack of readily available academic papers that discuss fingerprinting web applications and web servers based on the HTTP responses they return. However, there are open source tools such as Wappalyzer [1] and WhatWeb [3] that can achieve this.

Wappalyzer uses a large JSON file to define the fingerprints for the applications that it supports. In Figure 1, the Wappalyzer [1] JSON object containing the fingerprint for the Magento web application can be seen. This fingerprint is based on multiple different properties of the returned response from the web server. One such property is the name of the cookie. If the name of the cookie contains “frontend” then it is a sign that the site might be running Magento. All the supported properties are [1]:

● The cookie names and/or values

● Pattern matching (regex) against the HTML returned

● The favicon of the application (usually seen as the picture of each tab in the web browser)

● Method names from the JavaScript code in the page
● Pattern matching for the JavaScript URLs in the page
● The value of the HTML meta tag

Figure 1: Wappalyzer's fingerprint for the Magento web application. Picture taken from Wappalyzer's GitHub.

2.2 Headless browser

A headless browser is like a normal browser such as Chrome or Firefox, but without any GUI (graphical user interface) to interact with [4]. Instead of a GUI, it is possible to interact with it by writing code, for example, via the Golang framework chromedp [5]. An example of a program that makes use of a headless browser is Wappalyzer [1]. A headless browser makes it possible to get the full functionality of a browser, such as executing JavaScript and making XHR requests when visiting websites. This functionality is useful when, for example, testing or automating tasks on modern web applications that make heavy use of JavaScript [6]. An alternative to a headless browser is using a raw HTTP client, which can be achieved with a tool like curl [7] that sends one single HTTP request and outputs the response. Tools like this are generally faster, but the evaluation of JavaScript and the additional requests done by the page are lost.
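To make the contrast concrete, the following Python sketch fetches a page both with a raw HTTP client and with a headless browser. This is an illustration only and not the tool built in this thesis (which uses Golang and chromedp); Playwright and the placeholder URL are assumptions made for the example.

```python
import requests
from playwright.sync_api import sync_playwright

URL = "https://example.com"  # placeholder target

# Raw HTTP client: one single request, no JavaScript execution.
raw_html = requests.get(URL, timeout=10).text

# Headless browser: the returned DOM includes content added by JavaScript,
# and additional requests (images, XHR calls) are actually performed.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL)
    rendered_html = page.content()
    browser.close()

print(len(raw_html), len(rendered_html))
```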


2.3 JavaScript window object

Figure 2: A few of the available methods in the JavaScript window object.

The JavaScript object “window”, as seen in Figure 2, is a global object that exists on all pages that the browser visits. It contains many standard methods, variables, and constructors [8]. One of these methods is the window.alert() method, which creates a popup alert box in the browser. It is possible for scripts running on a website to extend the window object with new variables or methods. If all of the standard methods and variable names that are available on a website are removed, only the custom methods and variables added by the site would be left. The window object could possibly be used when fingerprinting websites and is further discussed in section 3.2 “Data collection”.
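As an illustration of this idea, the sketch below collects the keys of the window object on a target page and removes the keys that also exist on a blank page, leaving roughly the names added by the site's own scripts. It uses Playwright rather than the Golang/chromedp tool described later, so it is an assumption-level example, not the thesis implementation.

```python
from playwright.sync_api import sync_playwright

def custom_window_keys(url: str) -> list[str]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Keys on an empty page approximate the browser's standard names.
        page.goto("about:blank")
        standard = set(page.evaluate("() => Object.keys(window)"))

        # Keys on the real page also include names added by its scripts.
        page.goto(url)
        on_site = page.evaluate("() => Object.keys(window)")

        browser.close()
    return [key for key in on_site if key not in standard]

print(custom_window_keys("https://example.com"))  # e.g. names added by frameworks
```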

2.4 HTML document

Figure 3: A simplified example of an HTML document.

An HTTP request to a web server can respond with an HTML document in the body of the response, such as the one seen in Figure 3. The browser then parses the HTML and creates a Document Object Model (DOM) that JavaScript can interact with and edit. HTML5 and DOM are defined by the standards created by the “Web Hypertext Application Technology Working Group” [9]. In these standards, it is specified that any HTML element can have the unique identifier (ID) attribute, the ID value must be unique for the whole page, and an element cannot have multiple ID attributes. One use for the id attribute is for JavaScript to be able to query the document for a specific element with a particular id. The class attribute is similar to the ID attribute in that it can be on any element, but the class value does not have to be unique for the document. The values of the class attribute are space-separated, which means that the h1 element in Figure 3 has the values “fancy-title” and “main-title”. The class values can be used by JavaScript to query all elements that have a specific class value and style them in a specific way. Both the id and class attributes could be interesting when fingerprinting sites since they are well used by many frameworks such as Bootstrap [10].
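As a small illustration (not part of the thesis tooling), the id and class values of an HTML document can be pulled out with BeautifulSoup; the HTML string below mirrors the simplified document in Figure 3.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 id="text" class="fancy-title main-title">Hello</h1>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Every id value on the page (unique per document by the HTML standard).
ids = [tag["id"] for tag in soup.find_all(id=True)]

# Class values are space-separated; BeautifulSoup splits them into a list.
classes = [value for tag in soup.find_all(class_=True) for value in tag["class"]]

print(ids)      # ['text']
print(classes)  # ['fancy-title', 'main-title']
```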

2.5 Supervised and unsupervised learning

Supervised and unsupervised learning are two different types of machine learning. They are used to learn and predict the relations between data. In “An Introduction to Statistical Learning”, Gareth James et al. [11] describe supervised learning as requiring some form of training data that consists of labeled data. This labeled data is then used as a reference point for the supervised learning algorithm to determine what kinds of categories exist and what data belongs in which category. This training data makes the supervised learning method able to put new data into one of the previously seen categories.

In contrast to supervised learning, unsupervised learning does not utilize any training dataset. Unsupervised learning algorithms instead attempt to observe relations between data, independently of previous categorization or labeling. One field where unsupervised learning is commonly used is clustering, where relations between data are observed and the data is subsequently grouped based on these observations.

2.6 Clustering

Clustering is the method of taking in multiple objects, each consisting of a numeric vector, and grouping the objects that are most similar to each other into clusters with regard to the distance between the vectors [11]. This distance can be measured using Euclidean distances, pairwise distances between data points, or other similar metrics. The created clusters can then be utilized to gain further insight into the data. There are multiple clustering algorithms available, and they mainly belong to two groups: partitional and hierarchical. One of the main features of clustering algorithms that this thesis aims to use is their ability to work with unlabeled data.

2.6.1 Partitional clustering algorithms

Partitional clustering algorithms require the analyst to predefine a targeted number of clusters to be created in order to get started, generally denoted as K [11]. These algorithms then take a dataset of N data points, where N is equal to or greater than K, and create K clusters, where each cluster contains at least one data point and each data point belongs to exactly one cluster.

There exists a subset of partitional clustering algorithms, known as fuzzy partitioning. In fuzzy partitioning algorithms, data points can belong to more than one cluster [12], but this thesis does not cover fuzzy partitioning.

2.6.1.1 K-means algorithm

The partitional clustering algorithm K-means works by randomly assigning all data points to precisely one cluster. The K in K-means denotes the total number of clusters that will be formed.

In “An Introduction to Statistical Learning with Applications in R” by Gareth James et al. [11], the K-means algorithm is described as an algorithm that works by first defining a number K of centroids, a centroid being the center of a cluster. These centroids are at first randomly distributed, and each data point is allocated to a cluster with the goal of keeping each cluster as small as possible. This process is done iteratively, with each iteration attempting to distribute the centroids in such a way that the clusters shrink in size. This is repeated until there is no change in the centroids' positions, indicating that a local optimum based on the initial randomly distributed centroids has been reached.

The inherent drawback of this method is that the end result depends on the initial random distribution of the centroids, and a certain set of random distributions may be unable to find the global optimum, no matter the number of iterations [11]. This problem can be alleviated by running the method several times with different starting distributions, which is a feasible solution as one of the main benefits of K-means clustering is its speed [13] [14].
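A minimal scikit-learn sketch of this idea is shown below; n_init controls the number of random restarts, and the data is synthetic rather than taken from the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic blobs in two dimensions.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# n_init=10 runs K-means with ten different random centroid initializations
# and keeps the best result, mitigating the local-optimum problem.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels[:5], kmeans.cluster_centers_)
```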

2.6.2 Hierarchical clustering algorithms

Hierarchical clustering does not rely on a predefined K value, but instead creates a dendrogram, a tree-based representation of the data set [11]. Dendrograms show the relations between data points at all levels without first limiting the results to a particular number of clusters. In Figure 4, three depictions of the same dendrogram can be seen. Each depiction shows a different cut at a different height resulting in different clusterings, where the colors of the data points represent different clusters.


Figure 4: A dendrogram showing the clustering process of a hierarchical algorithm. Picture from “An Introduction to Statistical Learning”.

Clusters that are conjoined lower in the graph tend to be more similar, while clusters that are conjoined higher up in the graph might only be vaguely similar or, in extreme cases, have no similarities at all. In a dendrogram, only vertical proximity dictates the likeness of data points; their horizontal proximity is arbitrary for determining likeness.

The hierarchical clustering algorithm achieves these results by first defining a distance metric. The algorithm treats each data point as its own cluster and then iteratively joins the two most similar clusters until the entire graph consists of a single cluster [11]. The dendrogram makes it possible to determine the number of clusters after the algorithm has finished, allowing anywhere between one cluster and as many clusters as there are data points to be extracted from the same dendrogram.
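A hedged SciPy sketch of the procedure on synthetic data: linkage builds the dendrogram, and fcluster corresponds to cutting it at a chosen number of clusters after the fact.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# Agglomerative clustering: every point starts as its own cluster and the
# two most similar clusters are merged repeatedly, producing a dendrogram.
Z = linkage(X, method="ward")

# "Cutting" the dendrogram into two clusters after the algorithm has run.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```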

The inherent problem with hierarchical clustering comes from the ambiguity of determining what number of clusters accurately reflects the ground truth. Another problem is that, depending on the chosen features and an incorrectly chosen number of clusters, the results may end up nonsensical. An example of this is attempting to cut a binary dissimilarity into three distinct clusters.

2.6.3 Density-based clustering

Martin Ester et al. [15] describe density-based clustering algorithms as algorithms that create clusters by grouping areas with a dense concentration of data points, while classifying data points in sparse areas as noise or outliers. A visual example, as seen in Figure 5, depicts the density of data points in a separate 3-dimensional graph. The density-based algorithm creates a “cut” through the height of the graph to extract the number of separate islands or clusters. Only the data above the cutting point will be considered when deciding where to form a cluster. When the decision to create a cluster has been made, nearby data will be grouped into the cluster, even if the data would be below the cutting point. The density level required merely decides where to create a cluster, not exactly which data points belong to the cluster.

Figure 5: A visual representation of density-based clustering. The plots on the left show the result of the clustering, and the 3-dimensional graph on the right shows how the cutting point's placement affects the clustering process. Picture from “Density-based clustering”.

2.6.3.1 DBSCAN and OPTICS

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm based on density. The algorithm creates clusters by grouping together dense groups of data points. DBSCAN was created by Martin Ester et al. [16]. The algorithm has three different kinds of labels for the data points: core points, border points, and noise. For a point to be classified as a core point, it has to have N data points within the distance D. If a data point does not fulfill those criteria but is within distance D of a core point, it is classified as a border point. If neither of these applies, the data point is classified as noise. Both N and D are user-provided parameters to the algorithm.

Each core point then creates a cluster; if a core point is within distance D of another core point, the clusters are merged into one. All border points within the distance D of the formed cluster are then added to the cluster. All points not in a cluster after this are considered noise [16].
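The scikit-learn implementation exposes these parameters as eps (the distance D) and min_samples (the neighbor count N); the sketch below is illustrative only and uses synthetic data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),   # a dense group
               rng.normal(5, 0.3, (30, 2)),   # another dense group
               [[10.0, 10.0]]])               # an isolated point

db = DBSCAN(eps=1.0, min_samples=3).fit(X)

# Label -1 marks noise; core_sample_indices_ identifies the core points,
# and the remaining labeled points are border points.
print(set(db.labels_))
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True
print("noise points:", int((db.labels_ == -1).sum()))
```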

Figure 6 shows a dataset on which DBSCAN is run with a minimum neighbor count (N) of three. The three center data points surrounded by solid circles become core points, as they all have three or more data points within their surrounding solid circles. The two data points surrounded by dashed lines become border points, as they are within the distance D of a core point but do not have three neighboring points of their own to be considered core points. The final data point, surrounded by a dotted line, is classified as noise. This process results in a single cluster containing all data points except for the noise data point.

Figure 6: Shows how DBSCAN decides between core points and border points on a dataset with a minimum neighbor count of 3, where the core points are surrounded by solid lines and border points by dashed lines, with a single noise data point surrounded by a dotted line. Picture from “Study of Protein Interfaces with Clustering”.

One of the issues with density-based clustering, such as DBSCAN, is that clusters can have varying densities. If only one density is used for the whole algorithm, clusters can be missed [17]. For example, with DBSCAN, if any particular core data point neighbors another core data point, those core data points will be merged into a single cluster. Because of this, no distinction between the two core data points can be made as they are combined into a single cluster, whereas a solution closer to the ground truth could have been two separate clusters.


Looking at Figure 7, there are two potential distinct clusters in the red and blue circles, while the black ellipse shows another potential clustering. The black cluster is formed due to the connecting data points between the red and blue circles, whereas both the red and the blue circle would have formed clusters closer to the ground truth. Jörg Sander and Hans-Peter Kriegel, who helped develop DBSCAN, later created, together with others, a new algorithm built upon DBSCAN called OPTICS (Ordering Points To Identify the Clustering Structure) [17]. OPTICS aims to solve the varying density issue by retaining the core distance of objects and allowing for the creation of subclusters based on the density of core data points.

Figure 7: Shows how two potential distinct clusters can be grouped together by linked core data points. The black ellipse shows one potential clustering where the linking data points qualify the entirety as a single cluster. The blue and red ellipses show another potential clustering.

One drawback of the OPTICS algorithm is that in some of its implementations the time complexity, expressed in big O notation in terms of the number of data points n, can be [18]:

O(n²)    (1)

This time complexity limits the usability of the algorithm on large datasets, as the runtime scales quadratically with the number of data points.

2.6.4 Clustering performance evaluation

The variables used to compare different clustering algorithms in this thesis are their precision, recall, and, when applicable, the silhouette coefficient. Both precision and recall require knowledge of the ground truth to evaluate the clustering. TP stands for true positive, meaning that the clustering algorithm puts the sample in the correct cluster. FP stands for false positive and means that the algorithm incorrectly identified a sample as another type. FN stands for false negative and means that the sample should have been in a particular cluster, but it is not. The precision is used to determine how accurate an algorithm is with regard to how many false positives it generates. Precision is defined as follows [11]:

precision = TP / (TP + FP)    (2)

Recall operates similarly, but instead of determining accuracy based on how many false positives are created, it determines accuracy based on the number of false negatives generated. Recall follows a similar formula to precision but replaces the false positives with false negatives, meaning that the clustering algorithm incorrectly identified a sample as not belonging to a specific cluster [11]:

recall = TP / (TP + FN)    (3)

Lastly, the silhouette coefficient [19] can be used to determine how consistent the results of a partitional clustering algorithm are. A perfect score of 1 is achieved if all of the clusters are well separated and there are no points between the generated clusters, while the lowest score of -1 indicates that the clustering algorithm has mislabeled points. The ground truth does not need to be known to calculate the silhouette coefficient.
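For a single technology treated one-vs-rest, these metrics can be computed with scikit-learn; the arrays below are made-up placeholders, not results from the thesis.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, silhouette_score

# 1 = "site runs technology X" according to the reference labels / the clustering.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # e.g. Wappalyzer labels
y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 0])   # cluster interpreted as technology X

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)

# The silhouette coefficient needs the feature vectors and cluster labels,
# not the ground truth.
X = np.random.default_rng(0).normal(size=(8, 2))
labels = np.array([0, 0, 0, 1, 1, 1, 0, 1])
print(silhouette_score(X, labels))      # between -1 and 1
```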

2.7 Dimensionality Reduction and sample size

Care has to be taken when extracting the features to cluster on. If the extracted features have too much noise, it can lead to overfitting [11], where two similar samples may end up in entirely separate clusters. There is also the problem of the curse of dimensionality [20], where the volume of the space required to represent the data rises exponentially with the number of dimensions.

This section briefly touches on the ambiguity in determining the minimum sample size and explains the curse of dimensionality. It also describes the dimensionality reduction technique truncated SVD and the visualization technique t-SNE.

2.7.1 Sample Size

The sample size used when clustering can have an impact on its results [21], but there are no hard and fast rules for determining the sample size. In a report written by Sara Dolnicar [22], she mentions, “There are no rules-of-thumb about the sample size needed for cluster analysis.” Nonetheless, in her results, she concludes that with a low sample size it can be hard to find any clusters, especially in high-dimensional data.


2.7.2 Curse of dimensionality

The term curse of dimensionality was coined by Richard Bellman [20] and refers to phenomena in higher-dimensional datasets that are not present in lower dimensions. The curse of dimensionality can manifest itself in different ways; for example, having more feature dimensions than there are data points can lead to the problem that the Euclidean distances between data points become too similar [23]. Generally, the Euclidean distance is what is clustered on, which can result in difficulties creating clusters, as the distances between all samples are too similar to one another. This issue of similar distances between data points can lead to overfitting, which generally results in worse overall accuracy as clusters and generalizations become too tailored to the initial data used, resulting in new data points that should reasonably fit in a cluster being excluded. A visual example of this can be seen in Figure 8, where the green line represents the boundary an overfitted method would use to determine what belongs to a cluster. Meanwhile, the black line represents the boundary made by a non-overfitted method and is what would have been preferred, since this leads to a better generalization.

Figure 8: A visual representation of how overfitting can lead to unexpected results. The red and blue points represent the training data. It can be seen that the green line, which is the result of a model, follows the training data precisely, while the black line shows a better generalization that is expected to work better with new data. Picture by Ignacio Icke.

2.7.3 SVD and truncated SVD

An SVD (singular value decomposition) is a factorization of a matrix containing either real or complex numbers [24]. Truncated SVD is an approximation that keeps only the components corresponding to the largest values of one of the factors, a diagonal matrix with non-negative real numbers (the singular values) on its diagonal, and discards the rest. Truncated SVD can as such be used to perform a dimensionality reduction by reducing the number of columns of the original matrix.

2.7.4 t-SNE

To make the results distinguishable by human eyes, some technique that visualizes high-dimensional data is needed. One such technique is t-SNE (t-Distributed Stochastic Neighbor Embedding), developed by Laurens van der Maaten et al. [25]. t-SNE is a variation of SNE (Stochastic Neighbor Embedding). It works by taking the high-dimensional Euclidean distances between different data points and converting them into conditional probabilities that can be used to represent similarities between these data points in a lower dimension. First, it creates a probability distribution over pairs of high-dimensional objects to ensure that similar data points have a higher probability of being picked than dissimilar data points. Then it defines a similar distribution on a lower-dimensional map and reduces the divergence between the two probability distributions.

One of the main benefits of t-SNE over SNE or other techniques is that it is partially designed to resolve the so-called “crowding problem” that exists in SNE and many other reduction techniques [25]. This problem stems from the fact that a lower-dimensional representation cannot always exactly represent a higher-dimensional manifold, and can result in very dissimilar clusters of data points inheriting positions very far away from each other in a low-dimensional representation. This causes the second probability distribution to essentially crush together all data points near the center of the low-dimensional space, potentially resulting in a loss of global identity between the created clusters. By resolving this crowding problem, t-SNE tends to create clearer, easier to read representations compared to many other reduction techniques. It does this without placing different data centers so far away from one another as to limit readability, while still keeping them distinct.

Scikit-learn recommends using another dimensionality reduction method [26], such as PCA, to first reduce the number of dimensions to one better suited for t-SNE, for example 50, as was used by van der Maaten [27]. t-SNE is best at representing data in a 2- or 3-dimensional space. It is also a very processor-heavy technique and can take hours where other methods such as PCA finish in minutes or even seconds [26]. The technique can also give different results based on different initializations, and as such multiple restarts with different initializations are recommended.

There are, however, drawbacks with t-SNE. One of them is that random noise might look like clusters in t-SNE [28]. The parameters for t-SNE are perplexity, epsilon, and the number of steps. Changing these parameters can result in significant changes to the results. It is recommended to have a large steps value to ensure t-SNE reaches a stable state [28].

2.8 Feature extraction

As the input to the clustering algorithm is a vector of numbers, there needs to be a step before the clustering process where the data is processed and turned into a vector. This step is called feature extraction. In this section, various ways to extract these features are covered.

2.8.1 Bag-of-Words

Figure 9: A visual representation of how a bag-of-words model extracts how often terms are used.

Bag-of-words is a model that turns text into a vector by looking for multiple character sequences in the sample and counting their occurrences. The output of the bag-of-words model for a single document is a vector. In Figure 9, the input document “The quick brown fox jumps over the lazy dog” can be seen being put through a bag-of-words model. The bag-of-words vocabulary contained the words “the”, “fox”, and “cat”. After letting the model process the text, it creates the vector [2, 1, 0], indicating that “the” was used twice and “fox” was used once, whereas “cat” was not used at all in the text.

Bag-of-words is a common method [14] [29] to extract features. A risk of using the bag-of-words model is that the dimensionality can quickly get very high if each occurrence of a word is added to the vocabulary of the model. A solution to this is to define a minimum document frequency (MIN_DF). If a feature does not exist in enough of the input text documents, with regard to the MIN_DF value, it will not be used in the vocabulary. This selection process leads to the problem that the features left out will not be taken into account in the clustering. Another drawback is that the order of appearance of the features in the document is not taken into account.
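In scikit-learn this corresponds to CountVectorizer with a min_df threshold; the documents below are placeholders chosen only to illustrate the selection behaviour.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps",
    "a quick cat",
]

# min_df=2: a word must occur in at least two documents to enter the vocabulary.
vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the kept vocabulary
print(X.toarray())                         # one count vector per document
```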

2.8.2 N-Gram

An N-gram, also called a shingle, is a way to break down words or sentences into smaller parts [30]. The N in N-gram denotes the length of the new parts. N-grams can, among other things, be based on characters or words. For example, if the word “error” is broken down into character-based 3-grams, it becomes the parts “err”, “rro”, and “ror”. N-grams can be used to retain some of the context that the bag-of-words model loses. For example, a bag-of-words model that uses whole words as features would have “wp-content” and “wp-data” as different features, while a character-based 3-gram would be able to catch the “wp-” string being used in both. However, a potential drawback of character-based N-grams is that they can add noise, since the same 3-gram can occur in entirely different words.
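A character 3-gram variant can also be expressed with CountVectorizer; the sketch below is illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# analyzer="char" with ngram_range=(3, 3) produces character-based 3-grams,
# so "wp-content" and "wp-data" both contribute the shared "wp-" feature.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(["wp-content", "wp-data"])
print(vectorizer.get_feature_names_out())
```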

2.9 Related Works

In a study conducted by Kai Yang et al. [31], device detection was performed to identify and fingerprint IoT devices. The device detection was done by sending query packets to remote hosts and analyzing the responses, comparing devices by their IP and TCP/UDP packet headers as well as 20 other common protocols. They utilized web crawlers to crawl commercial websites in order to obtain device features. The input to the model was then converted into binary values in the preprocessing step. These binary values were then used by a neural network to generate device fingerprints, which would provide class labels for the IoT devices in three different categories: the device type, the vendor, and the type of product. With this method, they were able to achieve 94.7% accuracy when generating fingerprints based on only the IoT device types and 91.4% accuracy when generating fingerprints based on the type, the vendor, and the type of product.

Lotta Järvstråt’s master’s thesis on a functionality classification filter for websites [29] takes data from websites in the form of the site’s HTML and extracts the URLs and content. The websites were then classified according to their function, such as being a blog, a news site, or a forum. Multinomial logistic regression with Lasso was used as a means to reduce the number of variables. Infrequent terms in the dataset were removed to reduce noise in the results and make the process less computationally heavy. The thesis showed a potential 99.61% accuracy in classifying the function of a website using these methods. This method was then compared to another method using the topic model Latent Dirichlet Allocation (LDA). LDA was used to reduce the number of variables into a smaller number of topics, but utilizing only this method achieved a best-case accuracy of 97.62%. A point was made that overfitting could be a reason for inaccurate results. When the method was rerun with over three months between feature extraction and fetching the data, it resulted in an accuracy of 90.72%. This result shows that overfitting likely was an issue. Combining both methods, using both LDA and a multinomial logistic regression classifier, resulted in an impressive 99.70% accuracy.

These results show that the techniques can be used on a wide variety of base data, from values in an IP header to entire site contents. Feature extractions can then be run on the collected dataset to prepare them for further calculations in order to categorize the samples into groups based on their likeness.


3 Methodology

This chapter explains how the project was carried out. First, a literature study was conducted to discover what methods and models exist for fingerprinting and clustering, and how they could be used to solve the problem presented in this thesis. This literature study consisted of reviewing related works, research, and previous studies. Using the information gathered from the literature study, a prototype was built, which was then compared against the existing fingerprinting tool Wappalyzer [1]. Figure 10 shows an overview of the implementation and evaluation of our prototype, from deciding on the dataset used to the evaluation of the clustering results. An alternative method, as well as the architecture used, is also described in this chapter.

Figure 10: An overview of our method. Each box represents a step in the process. Clustering includes both dimensionality reduction and creating clusters.

3.1 Supervised learning or unsupervised learning

Supervised learning and unsupervised learning are two categories of machine learning. A supervised approach would require a labeled training dataset. The training dataset could be collected by using existing fingerprinting technologies such as Wappalyzer [1]. The main issue with the supervised approach is that each technology would have to be in the training dataset and be correctly labeled. Getting a training dataset consisting of all technologies on the web is unachievable due to the diverse and quickly changing nature of the web. An unsupervised approach, and in particular clustering, makes it possible to group technologies without the need to have a training dataset containing each technology. The drawback of an unsupervised approach is that there is no guarantee that the collected data will split into the expected groups. Since a supervised approach did not fit the goal of this thesis, an unsupervised approach was picked.

3.2 Data collection

The dataset of web servers used in this project was the Tranco list [32], which contains the 1 million most popular websites globally. The Tranco list is specifically made to be a dataset for internet-wide research and is hardened against manipulation. One of the benefits of working with the Tranco list, over creating a specific dataset by for example picking random websites through randomized IP addresses or looking at a specific span of IP addresses, is that the Tranco list contains real websites that are popular and in use. Picking random IP addresses might result in the web server not serving a page if no hostname is given. Another benefit is that the Tranco list contains a rather varied set of web servers from different areas of the world, which makes it a diverse set of web servers. The Tranco list also makes it easy to reproduce our results, since it is possible to download the same list used if the date or id is given. For our purposes, only web servers using HTTPS on port 443 were used; sites not using HTTPS were discarded. The reason was that many sites using HTTP on port 80 redirected to an HTTPS version of the same site, which meant that using both HTTPS and HTTP would result in some pages being collected twice.

In order to be able to perform any clustering, data first has to be collected from the list of web servers. To collect the data, a headless chrome browser was used. The feature types collected when the headless chrome browser visits a site were:

● The values from the HTML class attribute.
● The values from the HTML id attribute.
● The names of the methods and variables in the JavaScript window object.
● The names of the cookies used by the site.
● All of the additional requests, for example, images and API calls.

These feature types were picked from looking at how the Wappalyzer [1] fingerprints were built and from comparing a few sites manually to find what differentiated them.

The feature types collected were then put in separate files in a folder for each site. The reason for using a headless browser for collecting these values instead of a raw HTTP client was that the headless browser runs all of the JavaScript that the page uses. The JavaScript on the page can add content to the DOM and make additional requests. An HTTP client would not be able to gather this data without also evaluating the JavaScript. The headless browser will follow both HTTP header redirects and JavaScript redirects from the page. If the website uses redirects, the data will be collected on the final site at the end of the redirects.

In Figure 11, an overview of how our dataset was structured can be seen. Each folder represents one site from the Tranco list. In each folder, there were five files that contain the collected data from our headless browser tool. The “wappalyzer.data” file contains the fingerprints that Wappalyzer collected from the site and is further explained in chapter 4 “Results”.


Figure 11: An overview of the structure of our dataset. Each folder contains its own copies of the files to the right.

In Figure 3, an HTML document can be seen where the extracted list of IDs would be “text”, and the extracted list of class values would be “fancy-title” and “main-title”. This data would then be put in the files “htmlID.data” and “htmlClass.data” from Figure 11, respectively.

Figure 3: A simplified example of an HTML document.

The files “requests.data”, “htmlWindow.data”, and “cookieNames.data” are different from the other files since they do not contain data directly extracted from the HTML document. The file “requests.data” contains all the URLs of the requests that the browser makes when visiting the page. These URLs can include paths for JavaScript files, image files, or POST requests. An example of a URL from the file could be “https://example.com/images/image.jpg”. Each URL was tokenized by removing the domain and splitting on the ‘/’ and ‘?’ characters. This tokenization results in the tokens “images” and “image.jpg” from the previous example. The file “htmlWindow.data” contains the variables and methods extracted from the JavaScript window object, as explained in section 2.3 “JavaScript window object”. The “cookieNames.data” file contains all the names of the cookies that can be extracted with JavaScript from the page.
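A small Python sketch of the URL tokenization described above (an illustration, not the exact Golang code used in the data collection tool):

```python
import re
from urllib.parse import urlsplit

def tokenize_url(url: str) -> list[str]:
    parts = urlsplit(url)
    # Drop the scheme and domain, then split the rest on '/' and '?'.
    rest = parts.path
    if parts.query:
        rest += "?" + parts.query
    return [token for token in re.split(r"[/?]", rest) if token]

print(tokenize_url("https://example.com/images/image.jpg"))
# ['images', 'image.jpg']
```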


3.3 Feature extraction

To create the vectors for each website, which are needed for clustering, feature extraction was utilized. In order to create this vector for our model, a combination of 5 bag-of-words models was used. The bag-of-words approach was picked due to its relative simplicity and the fact that it is a well-used and documented method. It is also deterministic, which makes replicating and testing this method easier as it is predictable. One drawback of the bag-of-words model is that it loses the ordering of the features from the input document. This drawback is not applicable in this thesis since the ordering of the features used does not matter. Whole words were picked as features for the bag-of-words models. All the files from the data collection in our dataset first had to be tokenized and then processed to decide the vocabulary for the bag-of-words models. The selection of the vocabulary of a bag-of-words model was based on how many of the other documents contained each specific feature. For example, if the id “text” can be found on five sites and the minimum document frequency (MIN_DF) is ten sites, it will not be included in the model.

The result of the tokenization was then processed again to extract the feature vector for each document by using the vocabulary. An alternative to using whole words as features could have been to use n-grams, for example character-based 3-grams such as “aaa”, “aab”, and so on. 3-grams would have the advantage of removing the vocabulary-building step, since there are only 46656 possible 3-grams for all the alphanumeric characters. However, since we used a total of 5 bag-of-words models, the total dimensionality would be five times higher, at 233280. This many dimensions would likely run into the issues explained in section 2.7.2 “Curse of dimensionality”, and 3-grams were not picked for this reason. As it was not possible to add each new unique token from the tokenization step to the vocabulary of the bag-of-words models, a selection of which tokens to add had to be made. The reason for not being able to extend the vocabulary with each new feature was that the dimensionality of the final vector would get too big. Had this many dimensions been used, it would also lead to the issues of the curse of dimensionality. This part of the method was similar to how Lotta Järvstråt [29] picked the vocabulary for her bag-of-words model. However, a difference from her work is that our thesis uses multiple bag-of-words models and combines the results.


Each bag-of-words model has a separate vocabulary from the tokenization step. In Figure 12, a visualization of how the feature vector for “example.com” is created can be seen. It can be seen that example.com contains the classes “fancy-title” and “main-title” and the id “text” after the tokenization step. The class data was then passed to the bag-of-words model for classes. Each bag-of-words model contains a vocabulary, but while the class value “fancy-title” exists on “example.com”, it is not in the vocabulary for the class model, since not enough other sites had a class attribute with the value “fancy-title”. The class data was then turned into the vector [0, 1, 0], since “main-title” is in the vocabulary and exists once on the page example.com. This whole process is then repeated for the other bag-of-words models and data. The final vector for “example.com” was the combined vectors from each bag-of-words model. In this thesis, in order to make one feature comparable to another, each feature was scaled to a standard deviation of 1, while the result of the bag-of-words models was centered around 0.

Figure 12: An example of how the features from https://example.com are turned into a feature vector.
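A hedged sketch of how five such models could be combined and scaled with scikit-learn; the per-feature-type documents (one whitespace-separated string per site) and the MIN_DF value are hypothetical placeholders, not data from the thesis.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# One document per site and feature type; real data would come from the
# htmlClass, htmlID, htmlWindow, cookieNames and requests files.
corpus = {
    "htmlClass":   ["main-title fancy-title", "main-title", "navbar"],
    "htmlID":      ["text", "text header", "content"],
    "htmlWindow":  ["jQuery $", "angular", "jQuery $"],
    "cookieNames": ["frontend", "sessionid", "frontend"],
    "requests":    ["images image.jpg", "api data", "images logo.png"],
}

MIN_DF = 2  # placeholder threshold
parts = []
for name, docs in corpus.items():
    vec = CountVectorizer(min_df=MIN_DF, tokenizer=str.split, token_pattern=None)
    parts.append(vec.fit_transform(docs))

# Concatenate the five count matrices, then center to mean 0 and scale to
# standard deviation 1 so the feature types are comparable.
X = hstack(parts).toarray()
X = StandardScaler().fit_transform(X)
print(X.shape)  # (n_sites, total_vocabulary_size)
```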


3.3.1 Dimensionality Reduction

After the feature vectors have been obtained, they still need to be subjected to dimensionality reduction. For our model, we used truncated SVD followed by t-SNE to bring down the dimensionality. The reason for the dimensionality reduction was that the dimensionality from the bag-of-words models was still too large to cluster on, at upwards of 14000 dimensions.

The framework used to perform both truncated SVD and t-SNE was scikit-learn [33]. The reason that the vectors were first subjected to truncated SVD is that t-SNE works better with an already lower number of dimensions, as t-SNE is a slower algorithm. Scikit-learn recommends around 50 dimensions in order to speed up the process as well as to help reduce noise [34]. Truncated SVD was chosen over other alternatives due to its unsupervised nature, its simplicity, and its ease of setup. After truncated SVD had been used to bring down the dimensions, t-SNE was used to perform data visualization in order to make the results more comprehensible. t-SNE was picked for its ability to deal with the crowding problem. While the t-SNE algorithm loses some information compared to, for example, truncated SVD, solving the crowding problem was deemed more important and worth the loss of information; in some cases, the loss of information can even be beneficial. For example, when data points have a very large Euclidean distance from each other, t-SNE tends to create representations where the global distance between these groups of data points is reduced, making the results easier to cluster.
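A sketch of this two-step reduction with scikit-learn; the random matrix stands in for the combined, scaled bag-of-words matrix, and the parameter values are illustrative rather than the exact ones used in the thesis.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Placeholder for the combined, scaled feature matrix (n_sites x n_features).
X = np.random.default_rng(0).normal(size=(200, 1000))

# Step 1: truncated SVD down to roughly 50 dimensions (fast, reduces noise).
X_svd = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)

# Step 2: t-SNE down to 2 dimensions for clustering and plotting.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_svd)
print(X_2d.shape)  # (200, 2)
```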

3.3.2 Clustering algorithm

Once the feature vectors have been built for the sites, a clustering algorithm has to be picked that groups the most similar sites into clusters. This thesis focuses on the two clustering algorithms, K-means and OPTICS, both of which are explained in section 2.6 “Clustering”.

The K-means algorithm was chosen due to its speed and simplicity. However, an issue for this thesis is that the number of centroids must be specified when using K-means, which could not be done directly because the collected data was unlabeled and the correct number of clusters was therefore not known. This was solved by incrementing the K value, comparing the silhouette coefficients, and picking the K with the largest silhouette coefficient value.

Something that was taken into consideration when using the K-means algorithm was the initialization values of the centroids. It is possible for one initialization to reach a local optimum with clustering results that were far away from the global optimum. For this reason, the K-means algorithm was run multiple times with different initial values for the centroids.
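A sketch of the K selection loop described above, on placeholder 2-dimensional data standing in for the t-SNE output; the K range is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X_2d = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 5, 10)])  # stand-in for t-SNE output

best_k, best_score = None, -1.0
for k in range(2, 11):
    # n_init > 1 reruns K-means with different random centroid initializations.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_2d)
    score = silhouette_score(X_2d, labels)
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)  # the K with the largest silhouette coefficient
```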


The OPTICS algorithm was picked due to its ability to deal with clusters of uneven sizes and the fact that there is no need to specify the number of clusters, only the minimum cluster size [17]. The OPTICS algorithm can also handle clusters that are not circular. OPTICS was picked over DBSCAN since DBSCAN has issues separating smaller subclusters close to other clusters, due to the single density value that DBSCAN uses. The parameters N and D had to be chosen, where N is the number of neighboring data points required for a point to be considered a core point, and D is the distance within which a nearby data point needs to be from a core point to be considered neighboring. N was chosen as the smallest MIN_DF in use, since our method likely cannot cluster any technology that occurs fewer than MIN_DF times. D was chosen by running OPTICS multiple times with different values of D and picking the best result.
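A hedged scikit-learn sketch of this step; min_samples plays the role of N and max_eps bounds the distance D, with the values and data here chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X_2d = np.vstack([rng.normal(0, 0.3, (40, 2)),
                  rng.normal(4, 0.3, (40, 2)),
                  rng.uniform(-2, 6, (10, 2))])   # some scattered noise

# min_samples ~ the smallest MIN_DF in use; points in no cluster get label -1.
optics = OPTICS(min_samples=5, max_eps=2.0)
labels = optics.fit_predict(X_2d)
print(set(labels))
```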

A hierarchical clustering algorithm was not picked due to the issue of determining the number of clusters. This issue was difficult to solve as it was not known how many technologies there are in the dataset being used. A hierarchical algorithm would also not offer any tangible benefit over other clustering algorithms for our method.

3.4 Alternative method

This section describes a theoretical alternative method for data collection and feature extraction. This alternative method does not rely on the HTTP body, but on the HTTP headers. It works by sending various malformed requests to the site using a raw HTTP client instead of a headless browser. The reason for using a raw HTTP client is the ease of generating malformed requests in any specific configuration desired. An example of a malformed HTTP request is a request where the method, which usually is GET, POST, or another method defined in the HTTP standard [35], is changed to a non-standard HTTP method such as “AAA”. Two other examples of malformed requests are to send a nonexistent HTTP version in the request, such as “GET / HTTP/99”, or requests with unusually long paths of thousands of characters, like “GET /AAAAAAAAAAAAAA... HTTP/1.1”. The hypothesis is that different web servers and frameworks will handle these malformed requests differently and give different kinds of responses. Data from the responses to each request would then be saved, such as the HTTP status code, the header names used, and the content length. The collected data would then be clustered using the already mentioned clustering and dimensionality reduction methods.
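As an illustration only (this method was not pursued, and such probes should only ever target hosts that have given explicit permission, as discussed below), a malformed request with the non-standard method “AAA” could be sent over a raw TLS socket like this:

```python
import socket
import ssl

def probe(host: str, raw_request: bytes, timeout: float = 5.0) -> bytes:
    """Send a raw (possibly malformed) HTTP request over TLS and return the start of the response."""
    context = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(raw_request)
            return tls.recv(4096)

# Non-standard HTTP method "AAA"; the status line, header names and content
# length of the response could then be used as features.
request = b"AAA / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
# response = probe("example.com", request)  # only with explicit permission
```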

To use the malformed requests method, a different dataset than the Tranco list should be used. The dataset should consist of websites belonging to companies that have given explicit permission to have security tests performed on them. The reason behind this is that this method could be considered intrusive due to the malformed requests that the application might not be expecting, even though it is unlikely that the malformed requests in and of themselves would uncover any vulnerabilities. Due to this ethical factor and time constraints, it was decided not to pursue this method.

3.5 Architecture

This section gives an overview of the architecture of the tools that were built. This includes the hardware it was run on and the software used.

3.5.1 Hardware

Two different machines were used, as can be seen in Figure 13. One was a server in Google Cloud [36] of the type n1-standard-4 with 15GB of RAM and three vCPUs with a base frequency of 2.3 GHz each. The other machine used was a desktop computer with 16GB of RAM and one CPU with 3.40 GHz base frequency.


The Google Cloud machine, located in Iowa, USA, was used to collect the Wappalyzer data and the headless browser data. The desktop computer was used to extract the features from the dataset, perform the dimensionality reduction, and cluster the dataset.

3.5.2 Software

The tools in this project were written in Python [37] and Golang [38]. Wappalyzer [1] was used to collect the fingerprints that our results were compared to. The program that uses the headless browser was written in Golang and uses Chromedp [5] to control the headless Chrome browser. The base of this program was already built by Tom Hudson [39], but it was extended in this project with the functionality needed to collect the data used to cluster the websites.

The program for feature extraction and clustering was written in Python. Scikit-learn [33] was a central part of this script and was used for its pre-built clustering algorithms and data pre-processing tools. The plots were created with the Matplotlib [40] library.


4 Results

A few different metrics are used to evaluate the results of our method and to make comparisons possible. This chapter begins with an overview of the results, followed by observations made from the clustering, and finally an evaluation of our results compared to Wappalyzer.

4.1 Labeling the data for comparison

Our results are compared against the pre-existing fingerprinting tool Wappalyzer [1], which is run on the exact same sites as those clustered by our tool. Since the ground truth is not known, the Wappalyzer results are used as a baseline for comparison. The Wappalyzer fingerprints can themselves have false positives and false negatives; spot checks are therefore carried out on the results, both to detect flaws in our method and to detect such false positives and false negatives. It is also important to note that Wappalyzer can fingerprint multiple technologies for one site; for example, a Wordpress site can also run jQuery.

4.2 Results of our method

Figure 14 shows the distribution of the sites when plotted in two dimensions by first reducing the data to 50 dimensions with truncated SVD and then reducing those 50 dimensions to two with t-SNE. The color of each dot is determined by the Wappalyzer results. Blue dots run Wordpress, green dots Drupal, and cyan dots ASP.NET. Red dots are sites not running any of the aforementioned technologies according to Wappalyzer. Since Wappalyzer can report multiple fingerprints per site, it is important to note that the colors are applied in the order blue, green, and then cyan. This means that if a site runs both Wordpress and ASP.NET according to Wappalyzer, it will be colored blue for Wordpress.

It can clearly be seen that each technology has one main high-density cluster, but each technology also has outliers that fall outside its main cluster. These outliers are most notable for ASP.NET (cyan) and Wordpress (blue). ASP.NET in particular is quite spread out, with only a small high-density main cluster at the lower center of the image.
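The two-step reduction described above could be expressed with scikit-learn roughly as follows. This is a minimal sketch: the input file name and random seeds are placeholders, and the exact t-SNE settings used in the project are not reproduced here.

    from scipy import sparse
    from sklearn.decomposition import TruncatedSVD
    from sklearn.manifold import TSNE

    # The stacked bag-of-words matrix, one row per site (file name is a placeholder).
    X = sparse.load_npz("bag_of_words.npz")

    # First reduce the sparse feature matrix to 50 dimensions with truncated SVD,
    # then embed those 50 dimensions into two with t-SNE.
    svd_vectors = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)
    embedding = TSNE(n_components=2, random_state=0).fit_transform(svd_vectors)
    # embedding now holds one (x, y) coordinate per site, as plotted in Figure 14.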


Figure 14: The result of the dimensionality reduction using truncated SVD and t-SNE. Each point is a site from the Tranco list. The colors come from the Wappalyzer fingerprint results; blue Wordpress, green Drupal, cyan ASP.NET.

4.2.1 Dataset

The list of sites that our method was run on comes from the Tranco list [32]. Due to the long runtime of our method, the total number of sites the clustering method was run on needed to be limited. The process of gathering the websites from the Tranco list was allowed to run overnight, with half of the web servers being picked from the top of the Tranco list and the other half from the end of the list. The Tranco list was fetched on the 14th of April 2020. These web servers were then trimmed down by removing all HTTP results, as many of the HTTP servers redirected to an HTTPS version, resulting in a lot of duplicate results. Furthermore, web servers with invalid or expired SSL certificates were removed, which resulted in a total of 23 457 different sites.
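The exact filtering implementation is not described in this report; a minimal sketch of how HTTP-only sites and sites with invalid or expired certificates could be filtered out is shown below. The input file name is a placeholder.

    import requests

    def keep_site(url):
        """Keep only HTTPS sites that present a valid, unexpired certificate."""
        if not url.startswith("https://"):
            return False
        try:
            # requests verifies the certificate chain and expiry by default.
            requests.head(url, timeout=10)
            return True
        except requests.exceptions.SSLError:
            return False
        except requests.exceptions.RequestException:
            return False

    # tranco_sample.txt is a placeholder for the list of gathered URLs.
    with open("tranco_sample.txt") as f:
        candidate_urls = [line.strip() for line in f if line.strip()]

    sites = [url for url in candidate_urls if keep_site(url)]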

Different cutoff points for which features to include were set for each bag-of-words model. If a feature occurs in fewer sites than the minimum document frequency (MIN_DF), it is not included in the model. Each MIN_DF was chosen so that the total number of features in every bag-of-words model fell within the range 400-6000, to make sure that no particular bag-of-words model was given a large bias. The MIN_DF value and number of features for each bag-of-words model can be seen in Table 1. The total number of features used across all models is 14 246.

Table 1: Each column represents a bag-of-words model. The first row gives the minimum document frequency for each model and the second row how many features there are in each model.

                          Cookie   Class   Id   Path   Window
    MIN_DF                    30      90   30     30       70
    Number of features
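A minimal sketch of how five separate bag-of-words models with individual MIN_DF values could be built and stacked with scikit-learn is shown below. The sites.json file and its field names are assumptions about the headless-browser output format, not the project's actual schema.

    import json
    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import CountVectorizer

    # sites.json is a hypothetical dump from the headless browser: one entry per
    # site, with the collected cookie names, class names, ids, paths and window
    # properties joined into space-separated strings.
    with open("sites.json") as f:
        sites = json.load(f)

    # Minimum document frequency per bag-of-words model, as in Table 1.
    min_df = {"cookie": 30, "class": 90, "id": 30, "path": 30, "window": 70}

    matrices = []
    for source, threshold in min_df.items():
        corpus = [site.get(source, "") for site in sites]
        vectorizer = CountVectorizer(min_df=threshold)
        matrices.append(vectorizer.fit_transform(corpus))

    # Stack the five models side by side into one feature matrix per site.
    X = hstack(matrices)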


4.2.2 K-Means clustering

Figure 15: A plot of the silhouette score for different K values when using K-means on our result. The X-axis shows the K values and the Y-axis the silhouette score.

Picking the number of centroids (K) for the K-means algorithm proved to be difficult. The reason can be seen in Figure 15, which plots the silhouette score on the Y-axis for K values between 2 and 200 on the X-axis. The silhouette coefficient ranges from -1 to 1, where a higher value represents better-separated clusters. There was no significant difference in the silhouette coefficient as K changed, so it was not possible to pick any meaningful K value that separated the data into clusters. Figure 16 shows how the K-means algorithm clustered our data using a K of 42, which had the highest silhouette coefficient at 0.53. In Figure 16 it can be seen that clusters of data points that a human would classify as one cluster are split into several clusters.
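A sketch of how the silhouette sweep in Figure 15 could be computed is shown below. Which representation the clustering was run on (the 2-D embedding or the 50-dimensional SVD vectors) is not stated in this chapter, so the 2-D embedding is assumed here, and the input file name is a placeholder.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # The two-dimensional t-SNE coordinates (placeholder file name).
    embedding = np.load("tsne_embedding.npy")

    k_values = range(2, 201)
    scores = []
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
        scores.append(silhouette_score(embedding, labels))

    plt.plot(list(k_values), scores)
    plt.xlabel("K")
    plt.ylabel("Silhouette score")
    plt.show()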


Figure 16: An image of how K-means divided our results into clusters with a K (number of clusters) of 42. Each color is a separate cluster.

4.2.3 OPTICS clustering

Compared to the K-means clustering algorithm, OPTICS performed better on the vectors from our method. It can be seen in Figure 17 that the data points are generally well separated into clusters by the OPTICS algorithm. The light red points are classified as noise that, according to OPTICS, does not belong to any cluster. OPTICS was run with the scikit-learn parameter “cluster_method” set to “dbscan”, an “eps” (D) of 1.75, and a “min_samples” (N) set to our smallest MIN_DF (35). These settings resulted in OPTICS forming 67 different clusters.
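For reference, a sketch reproducing the reported OPTICS configuration with scikit-learn is shown below. The input file name and the assumption that clustering is run on the 2-D embedding are illustrative.

    import numpy as np
    from sklearn.cluster import OPTICS

    # The vectors being clustered (placeholder file; the 2-D embedding is assumed).
    embedding = np.load("tsne_embedding.npy")

    # The reported configuration: DBSCAN-style cluster extraction with
    # eps (D) = 1.75 and min_samples (N) = 35.
    labels = OPTICS(min_samples=35, cluster_method="dbscan", eps=1.75).fit_predict(embedding)

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    print(f"{n_clusters} clusters, {n_noise} points labelled as noise")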


Figure 17: An image of how OPTICS divided the results into clusters. Each color except light red is a separate cluster. The points colored light red are classified as noise. In total, there are 67 clusters.

4.3 Observations

This section highlights some observations that were made when spot checking the results of our method.

4.3.1 Truncated SVD

Figure 18 shows our dataset after it has been reduced to two dimensions by the dimensionality reduction algorithm truncated SVD. There are some noticeable extreme outliers on the right side of the graph, where X is around 500. These are 242 sites that are either the homepage of the Google search engine or Blogger, or sites that redirect to the Google search engine or Blogger. This is an example of overfitting and is further discussed in chapter 5 “Discussion”.


Figure 18: The results of reducing our dataset to two dimensions with truncated SVD. Note the outliers on the bottom right side.

4.3.2 Wappalyzer false negative

Figure 19 is a zoomed in image of the largest cluster, which predominantly contains Wordpress sites. Note that all of the red points inside this cluster are sites that Wappalyzer did not fingerprint as Wordpress. A spot check was carried out on 10 randomly chosen red points. The spot check included running Wappalyzer against the sites again, as well as manually visiting and inspecting them. It showed that 8 of the 10 sites were running Wordpress, while two were not. One of these two sites waited about 10 seconds before redirecting the browser to another site.


Figure 19: A zoomed in image of the Wordpress cluster. According to Wappalyzer, blue points are sites running Wordpress, while red points are not running Wordpress.

4.3.3 Empty and almost empty data from sites

In Figure 20 and Figure 21, zoomed in pictures of two clusters can be seen. The sites in Figure 20 have in common that the headless browser failed to extract any information from them; there are a total of 2011 sites in this cluster. Figure 21 contains the sites for which only a small amount of data was extracted (e.g., only one cookie name and no other data) and contains 633 sites. Chapter 5 “Discussion” discusses why these clusters might have formed and potential ways to avoid creating them.


Figure 20: Zoomed in picture of the round cluster of points where the headless browser failed to collect any data. It is on the top right side of the original dimensionality reduction results. There are 2011 points inside the round cluster.

References
