An analysis of reported phishing domains


Linköping University | Department of Computer and Information Science | Bachelor thesis, 16 ECTS | Information Technology

Spring term 2019 | LIU-IDA/LITH-EX-G--19/060--SE


En analys av rapporterade phishingdomäner

Daniel Keyvanpour and Tim Hellberg

Supervisor: Niklas Carlsson



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Students in the 5-year Information Technology program complete a semester-long software development project during their sixth semester (third year). The project is completed in mid-sized groups, and the students implement a mobile application intended to be used in a multi-actor setting, currently a search and rescue scenario. In parallel they study several topics relevant to the technical and ethical considerations in the project. The project culminates by demonstrating a working product and a written report documenting the results of the practical development process including requirements elicitation. During the final stage of the semester, students create small groups and specialize in one topic, resulting in a bachelor thesis. The current report represents the results obtained during this specialization work. Hence, the thesis should be viewed as part of a larger body of work required to pass the semester, including the conditions and requirements for a bachelor thesis.


Abstract

As society becomes more digitalized and relies more heavily on the internet, it becomes more important to protect ourselves against phishing attacks and other types of internet fraud. Users who fall for phishing attacks risk having sensitive information, such as bank account details, stolen. In this thesis we describe and analyze domains that use Hypertext Transfer Protocol Secure (HTTPS), an extension to the Hypertext Transfer Protocol (HTTP) used for secure communication, and the impact that these domains have on phishing.

We have analyzed and performed experiments that quantify how many of the phishing domains reported to PhishTank use HTTP and how many use HTTPS, and why phishing sites can use HTTPS and still fail to be safe. We have created a script in Java that takes a set of URLs and creates a dataset containing the domains and all certificates that have been issued to these domains, making it a useful tool for analyzing phishing domains. Furthermore, we present analyses and results describing how hashing algorithms are used in different certificates and their impact on securing the web.

Through analyses and experiments we gained an understanding of how easy it is to create a certificate and claim to be behind a website. Phishing domains being able to use HTTPS is a good example of this and our results have shown that many imposter websites use HTTPS. Thankfully, there are tools in place to secure the web and avoid phishing, such as browsers having a set of Certificate Authorities (CAs) that they trust, meaning that any HTTPS site that does not have a certificate from one of these CAs will be flagged as not secure. Another countermeasure is increasing people's knowledge about how to handle websites that seem to be secure and have the necessary parameters, such as HTTPS, but nevertheless are phishing sites.


Acknowledgement

We want to thank our supervisor Niklas Carlsson for helping and guiding us in the right direction throughout this thesis. We also want to thank Sokrates Jacobson for giving us useful feedback that contributed to pushing our report forward.

1. Introduction

Almost everyone you meet has in some way used the internet. The internet is a fantastic resource where you can do many things. With that said, the danger of being defrauded also increases with its popularity. One of the more common ways to be digitally defrauded is through phishing; e.g., when someone claims to be a bank, a government, your manager or a company in order to trick you into revealing sensitive information about yourself. This includes information such as login details, credit card details or personal information. There are roughly 4.5 billion internet users today [1]. However, most of these users have little or no knowledge of how phishing works. The most effective way to protect yourself is through knowledge about this type of attack: how it can be avoided and how it can be handled when it has occurred.

Internet users who have heard about phishing often believe that websites using an HTTP domain are not secure, while domains using HTTPS are secure and can be trusted with personal information [2]. Interestingly, almost a quarter of all phishing attacks take place on HTTPS domains [3], leaving many users vulnerable. The need for information is high, but how does one know if a website is a phishing site, and how can such sites be avoided?

1.1 Motivation

The need for deeper analyses of phishing is great. People cannot keep up with the speed of digital development, and new methods of deceiving people over the internet are introduced every day. Many websites today use HTTPS. The surprising part is that some of these websites, seemingly secure to the untrained eye, also turn out to be phishing sites. To better understand this observation, it is important to establish facts and methods that can help determine why and how many HTTPS domains are not secure, and to analyze their certificates to see who issued them. Our work is further motivated by a need to support further research and to create a way to find and analyze which certificate authorities (CAs) have issued certificates to these imposter sites and how such sites can be avoided.

1.2 Aim and research questions

Our primary aim in this thesis is to understand why and how some CAs issue legitimate certificates to phishing sites. We aim to increase the understanding of phishing for future research and to put a stop to these publishers so that phishing sites become easier to discover. Phishing has been a widespread problem for many years. While attacks used to be relatively easy to discover, this is no longer the case [2]. Lately, phishing attacks have become much more advanced and more authentic; e.g., by making the imposter sites (and e-mails) look more realistic and trustworthy. We want to counteract this trend and once again take a step closer to a more secure web, where everyone can safely use the web without worrying that a site may steal the user’s private information.

Our research questions are the following:

Why and how can CAs issue approved certificates to fraudsters' websites?

Why are increasingly many HTTPS domains not trustworthy?

These questions are answered in the report. By answering them, we support further research that can help the general public understand more about network fraud and phishing, so that the risk of falling for such fraud decreases.

1.3 Contributions

Our main contributions are the analysis of how phishing is affecting our society and the derivation of a few answers regarding why phishing has become more sophisticated recently. Additionally, we contribute a script that, if built upon, allows for deeper analysis of phishing domains and their certificates. Further, the contributions will help future work against phishing but should also help build an understanding of how phishing can happen on websites that are using HTTPS. In this thesis we:

Create a script that can be used to analyze domains and their certificates.

Use the script to analyze websites and, for example, reveal which CAs issue certificates to deceptive websites that trick people.

1.4 Delimitations

This work analyzes many different phishing websites that use HTTPS. However, due to time constraints, it was not feasible to manually analyze each of these websites in depth. Instead, the thesis focuses on analyzing the general problem in cases where sites using HTTPS have certificates but nevertheless turn out to be fraudulent websites.

1.5 Structure

The remainder of this thesis is structured as follows. First, we present some background and other research that touches on our topic. Second, we describe how we found the phishing domains and walk through the custom-made script, explaining what it does. Third, in the results chapter, we present what we found by analyzing the sites and explain why some of them use HTTPS but still have a high risk of being phishing sites. Finally, we discuss and analyze our results.

2. Background

This chapter describes what phishing is and discusses related research.

2.1 Difference between HTTP and HTTPS

HTTP stands for Hypertext Transfer Protocol, and it is a protocol used for transferring data over a network. There are HTTP requests and HTTP responses. A typical HTTP request might look like this:

GET /example.txt HTTP/1.1
User-Agent: curl/7.63.0 libcurl/7.63.0 OpenSSL/1.1.1 zlib/1.2.11
Host: www.example.com
Accept-Language: en

The problem with HTTP is that anyone who is monitoring the communication can read the text in the request or response and know exactly what information someone is asking for, sending or receiving. This information can be very sensitive, such as passwords or bank information. The difference with HTTPS is that it uses TLS or SSL to encrypt the HTTP requests and responses making them unreadable for people trying to monitor the communication. For the example above the message would, for the people monitoring the communication, look something like this:

t8Fw6T8UV81pQfyhDkhebbz7+oiwldr1j2gHBB3L3RFTRsQCpaSnSBZ78Vme+DpDVJPvZdZUZH pzbbcqmSW1+3xXGsERHg9YDmpYk0VVDiRvw1H5miNieJeJ/FNUjgH0BmVRWII6+T4MnDwm CMZUI/orxP3HGwYCSIvyzS3MpmmSe4iaWKCOHQ==

Essentially, HTTPS protects you from eavesdroppers. You can think of HTTP as having a conversation with someone in a public space where anyone who wants to can listen, while HTTPS is like having that conversation in a soundproof chamber. HTTPS does not, however, protect you from the person you are having the conversation with, which leads us to the next topic: phishing.

2.2 What is phishing and how does it work with CT logs?

Phishing is a form of manipulation and a social engineering attack that tries to lure you into giving out sensitive information; e.g., by directing users to fake websites where they are asked to reveal things such as user information and credit card details. Corporations such as banks and other businesses that hold this type of information are particularly exposed to phishing. One of the more common scenarios is that the phisher creates an identical copy of, for example, a bank's website. Everything on this fake website looks identical to the real one you are used to. The only difference is that the site is hosted on the phisher's server, so the information you type into the website is stored by the fake website's server [4].

As mentioned before, phishing attacks typically copy a website's appearance onto another server. How is this possible? While it is impossible to make an exactly identical copy (for example, there can be a tiny symbol in the domain name that separates the real website from the fake one), such a difference is difficult to observe. Nevertheless, one can often tell whether a website is a phishing site by looking at the domain name and how it is spelled.

Most websites today use HTTPS, and the intention is that websites with HTTPS domains should be secure. However, there are many phishing websites that also use HTTPS. One reason is that it has become easy to get such a domain, and many web users think that only legitimate sites use HTTPS. Certificate Transparency (CT) is an open system for logging and monitoring certificates, in which each log maintains a list of “trusted” roots. CT logs append new certificates to a hash tree that continues to grow [5],[6].

A log must fulfill certain conditions to be correct. These conditions are the following [5],[6]:

If there is no signature that is authentic, certificates cannot be published.

A submitted certificate should be verified, have a valid signature, and have a chain that leads back to the one who issued the certificate.

If there is some kind of verification chain of a new certificate, it should be possible to store it.

The chain mentioned above should be available if someone wants to review it.
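The growing hash tree mentioned above can be illustrated with a short Java sketch. It assumes the Merkle tree hash defined in RFC 6962 (SHA-256, with a 0x00 prefix for leaves and a 0x01 prefix for interior nodes). The class and method names are our own and are not taken from any CT log implementation, and the empty-tree case is left out for brevity.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Illustrative sketch of a Merkle tree hash in the style of RFC 6962.
// Class and method names are hypothetical.
final class MerkleSketch {

    // SHA-256 over a one-byte prefix followed by the given parts.
    static byte[] sha256(byte prefix, byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(prefix);
            for (byte[] part : parts) {
                md.update(part);
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Root hash of a non-empty list of log entries.
    static byte[] root(List<byte[]> entries) {
        int n = entries.size();
        if (n == 0) {
            throw new IllegalArgumentException("sketch handles non-empty lists only");
        }
        if (n == 1) {
            return sha256((byte) 0x00, entries.get(0)); // leaf hash
        }
        // Largest power of two strictly less than n.
        int k = Integer.highestOneBit(n - 1);
        byte[] left = root(entries.subList(0, k));
        byte[] right = root(entries.subList(k, n));
        return sha256((byte) 0x01, left, right); // interior node hash
    }
}
```

Because every interior hash depends on all entries beneath it, appending or altering a single certificate changes the root, which is what makes tampering with a log detectable.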

2.3 Different certificate authorities and their security level

From our dataset we can see that browsers differ in how they treat these sites. Google Chrome appears to hold a particularly high standard, as it only accepts websites that have their own certificates. Many websites also have both HTTPS and legitimate certificates but show no padlock, which can make one suspect that something is wrong.

Figure 2.1: Example of a phishing site (left, "False Yahoo") targeting Yahoo.com (right, "Real Yahoo").

Figure 2.1 illustrates an example of two websites, where the false one is a phishing site targeting the real Yahoo. The figure shows how websites that want to manipulate the user can obtain certificates and be seen as legitimate, although in this case the site is clearly a phishing copy of the original Yahoo site. We had many examples of this type in our dataset.

2.4 Targeting legit websites with phishing

Many phishing websites directly target legitimate hostnames (i.e., phishing targets): the fake website takes a well-known, established name but makes a small change to the domain name [7].

Figure 2.2: A visualization of the real Backman.com

Figure 2.3: A visualization of the phishing version of Backman.com

Figure 2.2 shows a visualization of the real Backman.com, and Figure 2.3 shows the false version, in this case called “Backkman”. This is a recurring phenomenon, where the names differ only slightly. Backman is not as popular as Yahoo in terms of number of users, but we nevertheless use it as an example.

Backkman, one of many sites that occurred in our dataset, is a phishing site that has all the certificates needed to appear legitimate. However, our experiments showed that even though Backkman used HTTPS, Chrome did not accept it. This is because Google Chrome does not accept a website merely because it uses HTTPS; the domain must also have a valid SSL certificate, which Backkman (with two k's) evidently does not have.

Figure 2.4: The certificates issued to Backkman.com

Figure 2.5: The issuer's name and information for Backkman.com

In Figure 2.2, which shows the real Backman.com, we can see the original page. Note that it is not secured, which is quite unusual because most websites that are not phishing sites are. The phishing domain in Figure 2.3 (the one with a small, negligible change in the name, Backkman with two k's) is also not secured. What we can see, however, is that the copied website has approved certificates, as shown in Figures 2.4 and 2.5, where the certificates are issued by a legitimate CA named cPanel, Inc. Furthermore, we can see that only Google Chrome blocks the phishing page, while the domain works without warnings in most other browsers, such as Safari and Internet Explorer. This is because Google has developed additional security against phishing sites, as it only accepts certificates from its own companies [8].

As in the example above, many of the sites in our dataset behave in the same way. What we can see is that many phishing sites copy the original site and make the copy appear legitimate by obtaining real certificates from, surprisingly enough, real certificate issuers. As mentioned before, those sites have no padlocks, but a padlock alone is no guarantee that a site is not a phishing site [7].

2.5 Ways of phishing

There are a few different ways for phishing attacks to trick a user. Listed below are four very common ways [8]:

Typosquatting: when a phishing site deliberately misspells the real domain name. An example of this: Apple.com (real) and Applle.com (fake).

Character substitution: when an alternative letter or digit that is similar to the real one is used. An example of this: Apple.com (real) and App1e.com (fake).

Using an established domain name as part of their own: when the full name of the target domain is used and something is added to it. An example of this: Apple.com (real) and Apple-buy.com (fake).

Homograph attacks: when characters that look quite alike are used; the characters are written with the Unicode alphabet. An example of this: Apple.com (real) and Aƿƿ1e.com (fake).

Using the four ways mentioned above, there are many different names for just Apple, each with some sort of disguise. Allow us to illustrate some examples.

App!e.com (Homograph attacks)

Aple.com (Typosquatting)

App|e.com (Character substitution)

ᴀpplle.com (Character substitution)

Apples.com (Typosquatting)

Apple-phone.org (Use an established domain name as part of their own)

Aplle.gq (Typosquatting)
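Typosquatting and simple character substitution, as listed above, can be flagged automatically by counting how many single-character edits separate a candidate name from a known brand name. The sketch below uses the Levenshtein distance for this. It is our own illustration rather than a method used in this thesis, the class and method names are hypothetical, and the threshold of two edits is an arbitrary assumption. Homograph attacks that use Unicode look-alikes would need an additional character-mapping step that is not shown.

```java
import java.util.Locale;

// Illustrative sketch: flag domain labels that are within a small edit
// distance of a known brand name. Names and threshold are hypothetical.
final class LookalikeSketch {

    // Dynamic-programming Levenshtein distance (insert, delete, substitute).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True if the label differs from the brand by one or two edits.
    static boolean looksLike(String label, String brand) {
        int d = editDistance(label.toLowerCase(Locale.ROOT),
                             brand.toLowerCase(Locale.ROOT));
        return d > 0 && d <= 2;
    }
}
```

For example, `LookalikeSketch.looksLike("Applle", "apple")` and `LookalikeSketch.looksLike("App1e", "apple")` both return true, while the exact name itself is not flagged.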


2.6 Related research

In this section, we present related research that has been a great help to us. Williams and Li [7] conducted an interesting study on simulating human detection of phishing sites and analyzed the applicability of the Adaptive Control of Thought-Rational (ACT-R) cognitive architecture to this task. Williams and Li discuss the effectiveness and occurrence of phishing at length. Although there are many technical defense mechanisms against phishing today, they argue that the human is the single greatest factor in being exposed to fraud on the web. In any system that attackers want to target, there is always a weakest link; in our case, it is the human [9].

The analysis by Williams and Li [7] is important for our research because it gives us insight into the psychology of users on the web. Further, the authors present a model that simulates human behavior using the ACT-R cognitive architecture. The model simulates how we review a site at first sight. They also reach an important conclusion about how humans perceive the HTTPS padlock as a security indicator and how this perception can be simulated. Their use of ACT-R is useful to us as well, since we also want to understand the human behavior behind phishing. They also studied how best to combine human judgment with technical defenses to ensure the safety of all users.

Dong et al. [10] present a new machine learning method for discovering phishing sites. Arguing that HTTPS alone is not enough, they propose a more secure approach in which a machine learning model takes features of the domains' public key certificates as input. Their research is in many ways pioneering, since it shows a way of detecting phishing sites quickly without using blacklists. The method may work as a complement to existing mechanisms such as blacklists. An interesting aspect of the research is that it can be applied not only to HTTPS but also to HTTP, which is relevant to us since we want to know how phishing can be prevented on all domains.

Another article, by Hu et al. [11], concerns the more psychological aspect of phishing. They propose a unique detection method that relies largely on information from websites' servers, which always keep logs. The method analyzes references found in those logs: it first collects the server logs of legitimate websites, then extracts the URLs and filters them against blacklists and whitelists, and finally validates the remaining URLs both automatically and manually to identify the phishing websites [11].

The method is not fast and relies on an understanding of phishing and its techniques. Nevertheless, it sheds new light on the phishing area. The whole method centers on the logs and references connected to the server. By analyzing the logs, the phishing sites can be found; however, the method depends on having access to the server logs of a legitimate website [11].

Furthermore, Cui et al. [12] have written a report that greatly helped our work, in which they track phishing attacks over time. They present two results that are important for our thesis. First, phishing pages take time to deal with. They argue that people have the illusion that phishing websites are taken down quickly, which is not the case. This shows that today's phishing countermeasures are not good enough. Second, a majority of these phishing pages are created and managed by a small number of people.

What is most interesting in this report is the observation that many of today's phishing pages are copies of real websites and, above all, the authors' method of detecting phishing pages. The authors believe that today's countermeasures against phishing do not meet the standard in terms of efficiency and speed [12].

They had a dataset of nearly 19,000 phishing pages, and they improved the countermeasures by finding tendencies and similarities between the phishing pages and the originals. Furthermore, they developed a system based on the Document Object Model (DOM). Using vectors derived from this model, one can quickly find direct links between websites and detect whether a website is an original or a fake by looking at parameters such as the number of certificates and the similarities between the phishing pages and the originals [12].

These articles are closely related to our work and have shed light on why and how certificate issuers issue approved certificates to fraudsters' websites and why there is a demand for monitoring domains and certificates. They have also given us ideas on how to use CT logs to combat phishing on secured domains and to understand the process behind it [12].

3. Gathering the data with webpage scraping using JSOUP

Our goal was to study the PhishTank domains and their certificates with the help of Certificate Search. To do this, we created a web-scraping script in Java using Jsoup. Below we summarize our general approach to gathering the data. The data was gathered mainly using the following three tools:

PhishTank, which is a website containing reported phishing URLs and data about them.

Certificate Search (crt.sh), which is a website that provides the certificates issued to a given domain.

A Java integrated development environment to make a script that scrapes data from Certificate Search using the data from PhishTank.

3.1 PhishTank

PhishTank is a free community site where phishing data can be submitted, verified, tracked and shared by anyone. PhishTank also provides an open API for developers and researchers, which is where all the phishing data used in this report comes from.

Making an account on the website allows one to register for a personal API key, which gives access to the data stored on the site; the data is then downloaded locally to the computer. The data can be retrieved in four different formats: XML, serialized PHP, CSV and JSON [13]. We chose the JSON format simply because it is the one we are most familiar with, and downloaded it using our personal API key. Figure 3.1 shows the structure of the JSON file provided by PhishTank.

Figure 3.1: Structure of the JSON file provided by PhishTank.

3.2 Certificate search

Certificate Search (crt.sh), created by Sectigo (formerly Comodo CA), is a website that provides information about the certificates issued to a given domain, such as the issuer name, when the certificate was logged, who trusts the certificate, for what purpose the certificate is trusted, and much more.

We considered other alternatives to using Certificate Search. As mentioned in the background, one alternative would be to download all entries from CT logs and then try to find the certificates issued to our domains. Another alternative was to use a service like Certstream [14] to check for newly issued certificates and their owners and find some way to make use of that data in connection with the PhishTank data.

We felt that Certificate Search was the best of these alternatives, since we are only interested in the reported phishing domains from the PhishTank database. This is a limited and relatively small number compared to all the domains that exist, so it should be the fastest of the three alternatives for finding the certificates issued to our phishing domains.

The pages on crt.sh that we scraped were the page shown when a domain is entered and the page shown when an individual certificate is viewed. Figures 3.2 and 3.3 show example screenshots of these pages for facebook.com. From the domain page we scraped how many certificates the domain has been issued and the issuer names; i.e., the CAs that have issued the certificates. From the certificate page we scraped which signature algorithm was used and the public key bit size.

Figure 3.2: Example of how a typical domain page on Certificate Search looks. The domain used in the example is facebook.com.


Figure 3.3: Example of how a typical certificate page on Certificate Search looks. The domain used in the example is facebook.com.

3.3 Our script

To scrape the data from Certificate Search we made a script in Java. To gather all the data, many small alterations were made to the code. One example of an alteration would be to only look at domains that were reported to PhishTank more than once. We will not go through all the alterations here. Instead, we show what one scrape could look like.

As mentioned above, we downloaded the PhishTank data to the computer as a JSON file and hence had to parse it. Figure 3.4 shows a simple example script in which we extract the URLs from the JSON file. Note that we also divided the URLs into two separate arrays, one for HTTP and one for HTTPS. We did this because only the HTTPS URLs will have any certificates for us to find on Certificate Search. By separating them, we could also see how many of the URLs were HTTP and how many were HTTPS by checking the size of the arrays.

Figure 3.4: Parsing the JSON-file from PhishTank
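As a rough illustration of the separation step described above, the following sketch splits a list of already-extracted URL strings into an HTTP list and an HTTPS list. It is not the code from Figure 3.4: the JSON parsing is omitted, and the class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch: separate reported URLs by scheme.
// Assumes the URL strings were already extracted from the JSON file.
final class SchemeSplitSketch {
    final List<String> http = new ArrayList<>();
    final List<String> https = new ArrayList<>();

    SchemeSplitSketch(List<String> urls) {
        for (String url : urls) {
            // Scheme names are case-insensitive, so compare in lower case.
            String lower = url.toLowerCase(Locale.ROOT);
            if (lower.startsWith("https://")) {
                https.add(url);
            } else if (lower.startsWith("http://")) {
                http.add(url);
            }
            // URLs with any other scheme are ignored in this sketch.
        }
    }
}
```

The sizes of the two lists (`http.size()` and `https.size()`) then give the HTTP and HTTPS counts directly.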

Next, we had to get the domains from the URLs, since the certificates are issued to the domains and not to the full URLs. For example, to see the certificates issued to YouTube on Certificate Search you have to input “Youtube.com” and not “https://www.youtube.com/watch?v=IlqSIxqhJx0”. We did this using Java's URL class. The code snippet for this is shown in Figure 3.5.

Figure 3.5: Getting the domains from the HTTPS URLs.
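The domain extraction can be sketched as follows. This is not the code from Figure 3.5: it uses `java.net.URI`, a close relative of the URL class, and the class and method names are hypothetical. Malformed URLs are skipped, and duplicates are removed by collecting the hosts in a set.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch: reduce full URLs to their unique host names.
final class HostSketch {

    static Set<String> uniqueHosts(List<String> urls) {
        Set<String> hosts = new LinkedHashSet<>(); // keeps first-seen order
        for (String url : urls) {
            try {
                String host = new URI(url).getHost();
                if (host != null) {
                    hosts.add(host.toLowerCase(Locale.ROOT));
                }
            } catch (URISyntaxException e) {
                // Skip URLs that cannot be parsed.
            }
        }
        return hosts;
    }
}
```

Note that `getHost()` returns the full host name, such as `www.youtube.com`. Reducing it further to a registrable domain would need an extra step that is not shown here.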

Lastly, we had to retrieve the certificates for all these domains from Certificate Search. This was done with the help of Jsoup [15], and the code snippet for this is shown in Figure 3.6.

Figure 3.6: Getting the certificates for all the phishing domains from Certificate Search.

In summary, we get a dataset of phishing domains from PhishTank.com, and using these domains as inputs for Jsoup we can scrape crt.sh to gather data about the certificates of these domains.
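The retrieval step can be sketched without external libraries. The sketch below is not the code from Figure 3.6: it uses the JDK's built-in `HttpClient` instead of Jsoup so that it is self-contained, and it only downloads the raw page, so the HTML parsing that Jsoup performs is not shown. The `?q=` query parameter is an assumption based on the search URL that crt.sh shows in the browser, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: build a crt.sh search URL for a domain and fetch the page.
final class CrtShSketch {

    // Assumed URL format: https://crt.sh/?q=<domain>
    static URI queryUri(String domain) {
        String encoded = URLEncoder.encode(domain, StandardCharsets.UTF_8);
        return URI.create("https://crt.sh/?q=" + encoded);
    }

    // Downloads the result page as text; parsing the HTML table is left out.
    static String fetchPage(String domain) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(queryUri(domain)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

When looping over thousands of domains, it is prudent to pause between requests and to handle failed responses, since the number of successful requests can vary between runs.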

4. Result

Limitations

It does not say anywhere on the PhishTank website what their database contains, except that it is a collection of reported phishing domains. This means that when downloading their data, we do not know whether it contains only the phishing URLs that are still online or every URL that has ever been reported. With that said, we think the number of URLs would be larger if the latter were the case, suggesting that what is downloaded is all reported phishing URLs that were online at the time of download. We also assume that every domain in the dataset from PhishTank is a phishing domain, since we did not have time to look at every site and determine that ourselves. There is also a small margin of error in the presented numbers, because the script was run multiple times and the number of successful HTTPS requests made to crt.sh varied slightly, but this should not affect the general analysis.

4.1 Certificate, domains and CAs

We have seen, based on the related studies referenced in this report, that many websites that use HTTPS and have legitimate certificates are despite this phishing sites. To take a closer look at the HTTPS domains reported to PhishTank we removed all HTTP domains and all duplicate domains. This resulted in a reduction from 10,577 URLs to a much smaller dataset of 2,074 domains.

What we could see from this dataset was that many of the websites had lots of different certificates in total, the websites combined for 28817 certificates. How these certificates were distributed can be seen in Figure 4.1. Looking closer at the CAs that have issued these certificates we found that there were 127 different CAs represented. The CAs that have issued most of these certificates were COMODO ECC, Let’s Encrypt Authority and cPanel inc. Figure 4.1 shows the number of observed certificates that were issued by different CAs. By visiting the websites of these companies, we found that they offer free Domain Validation (DV) certificates. This result is expected because DV certificates are the easiest type of certificate to get [16].
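The per-CA tally behind a chart like Figure 4.1 can be sketched as follows. The sketch assumes that the issuer name of every scraped certificate has already been collected into a list; the class and method names are hypothetical and the code is not taken from our script.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: count how many certificates each issuer (CA) accounts for.
final class IssuerCountSketch {

    static Map<String, Integer> countByIssuer(List<String> issuerNames) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by issuer name
        for (String issuer : issuerNames) {
            counts.merge(issuer, 1, Integer::sum); // add one for this certificate
        }
        return counts;
    }
}
```

The size of the returned map is the number of distinct CAs, and the sum of its values is the total number of certificates.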

Figure 4.1: Number of certificates issued by the CAs found by our script

Figure 4.2 shows the number of domains that have a certain number of certificates. Most of the HTTPS domains in our dataset have very few certificates issued to them, typically between one and five. According to the literature, a small number of certificates is an indication that a domain might be a phishing domain: the life span of phishing domains is usually short, and a small number of certificates means that a domain has not existed for very long [12]. The reason is that certificates expire. According to Standard Certificate Authority Guidelines, an SSL certificate expires after at most 12 or 24 months, or after 30 to 90 days if it is a free certificate. When a certificate expires, the domain owner must get a new one to keep using HTTPS. Getting a new certificate does not remove the old one from the record, so a domain with many certificates most likely has many expired certificates and is therefore older [12]. The result in Figure 4.2 is thus consistent with most of the domains in our dataset being phishing domains, which is what we expected when downloading the data from PhishTank. Note that the 927 domains in our dataset with zero certificates are not included in the graph. A domain without any certificate cannot serve HTTPS and therefore does not belong in a graph showing how many certificates our HTTPS domains have.
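The expiry argument can be made concrete with a small calculation. The Python sketch below is illustrative only: it assumes a domain holds one certificate at a time with no gaps, so it gives a lower bound (real domains often hold several overlapping certificates). The function names and the sample counts are hypothetical.

```python
import math
from collections import Counter


def min_certificates(age_days, validity_days):
    """Lower bound on certificates needed to cover age_days without a gap."""
    if age_days <= 0:
        return 0
    return math.ceil(age_days / validity_days)


def domains_per_certificate_count(cert_counts):
    """Map 'number of certificates' to 'number of domains', skipping zero."""
    return Counter(n for n in cert_counts.values() if n > 0)


# A domain online for three years with 90-day certificates:
print(min_certificates(3 * 365, 90))   # 13
# A phishing domain online for two weeks:
print(min_certificates(14, 90))        # 1

# Hypothetical per-domain certificate counts.
counts = {"a.example": 1, "b.example": 1, "c.example": 3, "d.example": 0}
print(domains_per_certificate_count(counts))
```

The second function mirrors how a histogram such as Figure 4.2 could be tallied, with zero-certificate domains left out.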

Figure 4.2: The number of domains that have a certain number of certificates.

Looking at the 30 websites with the most certificates, we found that all of the domain names are safe and that some are well known, e.g., Twitter and Google. These domains can be seen in Figure 4.3. For example, Google.se has nearly 700 different certificates, while Google mail has over 1,000. Among websites with 1-10 certificates, we found a higher frequency of non-legitimate websites. This result, like the one presented above, suggests that websites with only around 1-10 certificates are more likely to be phishing sites.

Figure 4.3: The 30 domains with the most certificates, and the number of certificates they have.

Most CAs do not review who receives their X.509 certificates, at least not in depth. As a result, quite a few phishing pages today have legitimate certificates [9]. Many users have been trained to look for two things when browsing the internet: that the domain they are visiting uses HTTPS, and that the browser shows the padlock symbol [10]. This makes it much easier for phishing sites to trick users, since users take the padlock and the HTTPS mark as a guarantee that a site is safe. One may then ask why global companies like Comodo issue certificates to domain names that evidently imitate a real domain name [17]. The answer is that anyone who has bought a server and owns the domain has the right to buy certificates from these issuers. The issuers therefore do nothing wrong themselves; they are companies that sell certificates that legitimize domain names on the net [18]. In Figure 4.3, we can see that popular domains like Google and Twitter have over 100 certificates. One reason is that they have existed for many years and have therefore needed to renew their certificates as they expired. They also have more money and larger servers, which makes it more likely that they can afford to buy more certificates. Another reason may be that big brands, like Google, run many different types of websites (such as Gmail and Google Drive). Because of this they must invest in more certificates to maintain a high security standard, and they therefore have a higher number of certificates than the average domain.

A would-be fraudster who intends to set up a phishing site, and who has bought a server and a domain name, is free to buy an X.509 certificate [18]. That is why so many domains have obtained one or two certificates, as can be seen in Figure 4.2. The fault therefore does not lie with the CAs. It has more to do with how easy it is to buy a domain, run a server and use a similar name. Further, Google is developing its own certificates and thus only accepts domains with its own padlocks [7].

The CAs do not inspect the domain names, even when they look suspicious. Most phishing sites today both show a green padlock and are secured with HTTPS, often with more than one certificate.

The CAs are businesses that depend on revenue, which is most likely a contributing factor to why so many phishing domains have been issued certificates even though they should not have been. But the companies most affected by phishing, like Google and Apple, cannot think only about the money. Because they have many sub-pages and email services that are directly affected by phishing, they cannot issue certificates to any domain name that comes along, as Comodo does. A page that has a legitimate certificate and is secured does not necessarily work in Google Chrome even though it works in, for example, Internet Explorer [19].

Further, the data we gathered shows an increase in the number of both HTTP and HTTPS domains reported to PhishTank, as can be seen in Figure 4.4 and Figure 4.5. There can be many reasons for this result. One is that the number of phishing domains has increased in recent years. Another is that more people know about PhishTank now than before, and therefore more people report domains to it. We think that a combination of these and other possible reasons has led to the increase.

The more certificates a website has registered, the higher its ranking in search engines, i.e., it gets better SEO (Search Engine Optimization). The number of registered certificates also contributes to an increased level of trustworthiness, and Google has (at least in the past) prioritized sites with more registered certificates [20].

Figure 4.4: The increase in HTTP reports.

Figure 4.6: Average number of certificates issued to phishing domains over the years.

Moreover, a domain owner can be notified when a certificate is issued for a name similar to their own [21]. If a new domain that appears in the Certificate Transparency (CT) logs matches your own website in some way, you receive a notification. Many of today's phishing attacks target already established websites in order to fool and steal their customers; the targets can be anything from banks and clothing stores to news sites. Detection is possible because the CT logs provide a continuous flow of newly published entries in which every website that becomes certified appears. Every domain name that becomes certified must be visible to everyone on the Internet for some hours before it can vanish again [21]. We can also see that the average number of certificates held by the domains reported to PhishTank has increased since 2014, as shown in Figure 4.6. We interpret this as a sign that many companies are more aware of the problem and do not give out certificates as easily as in previous years.

An interesting application of this idea is the service developed by Hardenize. Hardenize, a research partner of DigiCert, detects phishing sites through Certificate Transparency monitoring. The company intends to gather all the information from the CT logs in a database and make all logs visible. This means that it can directly show whether a new domain linked to a log entry is similar, or has similar tendencies, to another domain. If the new domain name resembles an already established page, it is added to a list of new pages that are risky or may be used for phishing [22].
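To illustrate what "similar" can mean in practice, the Python sketch below flags a newly certified domain when it is within a small edit distance of a watched domain. This is only one possible similarity measure and is an assumption made for illustration; it is not a description of how Hardenize or any CT monitor actually works. The function names, the threshold of 2 and the watch list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lookalikes(new_domain, watched, max_distance=2):
    """Return watched domains that new_domain resembles but does not equal."""
    new_domain = new_domain.lower()
    return [w for w in watched
            if w != new_domain and edit_distance(new_domain, w) <= max_distance]


watched = ["paypal.com", "google.com"]      # hypothetical watch list
print(lookalikes("paypa1.com", watched))    # ['paypal.com']
print(lookalikes("example.org", watched))   # []
```

A plain edit distance misses other common tricks, such as adding a brand name as a subdomain, so a real monitor would likely combine several checks.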

4.2 A deeper look at the certificates

In Section 4.1 we presented general data about the certificates, domains and CAs that we gathered using our code, together with theory regarding phishing. In this section we use our code to take a closer look at the certificates. Specifically, we look at their signature algorithm and public key bit size based on different factors.

Looking at the 28,817 certificates issued to the domains in our dataset, we can see that most of them use sha256 with RSA encryption as their signature algorithm. Figure 4.7 shows the hashing algorithms used in the certificates; sha256 with RSA encryption accounts for 62.2 %. One reason why sha256 is so popular may be that it is very secure. The public key sizes observed are 256, 384, 1024, 2048, 3072 and 4096 bits, of which 2048 bits is the most common. Figure 4.8 shows the number of observed certificates that use a public key of each size. Depending on the public key algorithm, different key sizes are needed to reach the same level of security. For example, ECDSA needs smaller keys than RSA to reach the same security level. That the most common key size is 2048 bits is probably related to sha256 with RSA encryption being the most common algorithm, since 2048 bits is a common size for RSA public keys.

Figure 4.8: The public key bit size of the certificates that have been issued.

Figure 4.9 shows the public key bit sizes used by certificates with the different signature algorithms. Looking at the relationship between signature algorithm and public key bit size, we can see that 2048 bits is the most common size for both sha256 and sha1 with RSA encryption, while 256 bits is the most common size for ecdsa with sha256. This result suggests that the preferred public key bit size differs considerably between the signature algorithms. It also shows that the public key bit size varies within the same signature algorithm. For example, one cannot conclude that the public key is 2048 bits just because sha256 with RSA encryption is used, even though that is the most common size for this algorithm.
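A cross-tabulation like the one behind Figure 4.9 can be sketched in a few lines of Python. The sketch is illustrative and not the code used for this thesis: the function name `key_sizes_by_algorithm`, the algorithm labels and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def key_sizes_by_algorithm(certificates):
    """Group key-size counts under each signature algorithm."""
    table = defaultdict(Counter)
    for algorithm, key_bits in certificates:
        table[algorithm][key_bits] += 1
    return table


# Hypothetical (signature algorithm, public key bits) pairs.
certs = [
    ("sha256WithRSAEncryption", 2048),
    ("sha256WithRSAEncryption", 2048),
    ("sha256WithRSAEncryption", 4096),
    ("ecdsa-with-SHA256", 256),
    ("sha1WithRSAEncryption", 2048),
]
table = key_sizes_by_algorithm(certs)
for algorithm, sizes in table.items():
    most_common_size, count = sizes.most_common(1)[0]
    print(algorithm, dict(sizes), "most common:", most_common_size)
```

Each inner counter holds the full distribution, so both the most common size and the spread within one algorithm can be read from the same structure.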

Figure 4.9: Public key bit sizes used by certificates with the different signature algorithms.

Furthermore, it is interesting to investigate whether the signature algorithm used differs with the number of certificates a domain has. Figure 4.10 shows how frequently the different signature algorithms are used by certificates issued to domains with a certain number of certificates. From Figure 4.10 we note that sha256 with RSA encryption dominates on domains with a low number of certificates and that ECDSA becomes increasingly popular the more certificates a domain has. One reason might be that a 256-bit ECDSA key corresponds to an RSA key of several thousand bits, so the same security level can be obtained with much smaller keys. Smaller keys require less bandwidth to set up an SSL/TLS connection, which makes ECDSA certificates well suited to mobile applications. As we saw earlier, some of the domains with a very large number of certificates belong to companies such as Google and Twitter, which have many mobile applications. It therefore makes sense that the use of ECDSA increases with the number of certificates a domain has.
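The key-size comparison can be illustrated with the comparable-strength figures published in NIST SP 800-57 Part 1. The Python sketch below encodes that table as a lookup; the function names are hypothetical, and the result is an approximation of security strength, not a measurement made in this thesis.

```python
# (security strength in bits, minimum RSA modulus bits, minimum ECC key bits),
# following the comparable-strength table in NIST SP 800-57 Part 1.
STRENGTHS = [
    (80, 1024, 160),
    (112, 2048, 224),
    (128, 3072, 256),
    (192, 7680, 384),
    (256, 15360, 512),
]


def security_strength(algorithm, key_bits):
    """Return the approximate security strength in bits, or 0 if below 80."""
    column = {"rsa": 1, "ecdsa": 2}[algorithm.lower()]
    strength = 0
    for row in STRENGTHS:
        if key_bits >= row[column]:
            strength = row[0]
    return strength


def comparable_rsa_bits(ecdsa_bits):
    """Smallest tabulated RSA modulus matching the strength of an ECDSA key."""
    target = security_strength("ecdsa", ecdsa_bits)
    for strength, rsa_bits, _ in STRENGTHS:
        if strength == target:
            return rsa_bits
    return None


print(security_strength("rsa", 2048))    # 112
print(security_strength("ecdsa", 256))   # 128
print(comparable_rsa_bits(256))          # 3072
```

Under these figures, the 256-bit ECDSA keys observed in our data are comparable to 3072-bit RSA keys, which is stronger than the 2048-bit RSA keys that dominate the dataset.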

It is also interesting to see whether the signature algorithm used differs with how many times a domain has been reported to PhishTank. In this case the results suggest that there is no difference; see Figure 4.11, which shows how frequently the different signature algorithms are used by certificates issued to domains that have been reported different numbers of times to PhishTank. What is interesting in this graph is that sha256 has decreased in the last year while ecdsa has increased. We would argue, however, that our dataset is too small to answer this accurately: 1,761 domains were reported once, while only 313 domains were reported more than once. An optimal dataset for investigating this would contain equally many domains for every number of reports.

Figure 4.10: The frequency with which different signature algorithms are used by certificates issued to domains with a certain number of certificates.

Figure 4.11: The frequency with which different signature algorithms are used by certificates issued to domains that have been reported different numbers of times to PhishTank.

5. Discussion

5.1 Method

We think that the method we used suited our project well: we were able to get the information we needed, and it was not too complicated to do so. We had no prior experience of analyzing domains and their certificates, so using Certificate Search was helpful. We used JSOUP mainly because we are experienced in Java and JSOUP is a popular tool for scraping websites in Java. This method might not be optimal for bigger or more complex projects, since using JSOUP to scrape crt.sh is somewhat tedious: the script needs minor adjustments for every run and for every type of data you want to scrape, and it needs major changes when the pages differ from each other. For example, Figure 3.2 and Figure 3.3 show what a typical domain page and a typical certificate page on Certificate Search look like for facebook.com. Since these two pages are very different from each other, the script must be adjusted to fit the specific page being scraped, which is time consuming. But for projects like this one, where you only need to gather data for specific domains, it works fine.
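The kind of page-specific scraping discussed above can be illustrated with a small table extractor. The sketch below uses Python's standard `html.parser` module instead of the Java library JSOUP used in this thesis, and the HTML snippet is hypothetical: it does not reproduce the actual markup of crt.sh. The class name `TableRows` is likewise an illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical markup; the real crt.sh pages are laid out differently.
html = """
<table>
  <tr><th>ID</th><th>Issuer</th></tr>
  <tr><td>1001</td><td>Example CA</td></tr>
  <tr><td>1002</td><td>Another CA</td></tr>
</table>
"""
parser = TableRows()
parser.feed(html)
header, *records = parser.rows
print(len(records), "certificates listed")   # 2 certificates listed
print(records[0])                            # ['1001', 'Example CA']
```

The extractor depends on the page being a simple table, which shows why a scraper has to be adapted whenever the page layout changes.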

5.2 Result

As we saw in the results, the number of reported HTTP and HTTPS domains has increased over the last 10 years. The reason could be an increase in phishing domains, but it could also be that people have become more aware of the phishing problem over the years, leading more people to report phishing domains when they come across them. As for the increase in HTTPS domains being reported for phishing, there is probably a correlation between the number of domains using HTTPS in general and the number of HTTPS domains being reported. But it might also be that the people carrying out phishing attacks have started using HTTPS more than before.

Phishing attacks are becoming increasingly difficult to detect, and this may be because we, the general users of the internet, have not kept up with the developments in the phishing area. Most typical internet users are not very familiar with social engineering attacks such as phishing [6]. The general user would never imagine that a phishing website would use HTTPS. Yet the idea that a website is approved and secure just because it uses HTTPS is far from the truth. Those who carry out phishing attacks know this, and we think that is why attackers obtain HTTPS for their websites. This makes HTTPS lose part of its value, since you cannot be sure that you are safe when browsing HTTPS sites. What our results show is that HTTPS alone does not make a site trustworthy, because criminals too can obtain the stamp of security that HTTPS provides. However, it should be made clear that HTTPS is a step in the right direction for the security of the internet, because it has made phishing more difficult: it provides an encrypted connection to a server that is also secure. But it has turned out that it is easy to obtain a valid TLS certificate. One reason why HTTPS websites are so widely considered secure is perhaps all the marketing done by the big companies.

Our results show that the number of certificates per domain has increased continuously in recent years. This may be a response to, for example, Google raising its requirements so that a domain needs more than a regular SSL or TLS certificate to be considered safe. As the results show, well-established websites have many certificates. The work Google is doing is a good step towards securing the internet, but if all that phishing sites need to do to adapt is get more certificates, then we still have a problem.

But it is hopefully not that easy. Many phishing websites get their secured stamp from several different companies that do not care much about the website's actual purpose. For example, as we saw in the results, companies such as Comodo offer free domain validation certificates, which means that anyone can easily get HTTPS. Google's stricter requirements are good, but they only help if you use Google's web browser, Chrome, which for various reasons you might not want to do. Still, the fact that Chrome offers this level of security makes it highly competitive, which will hopefully lead the other browsers to implement similar strategies for being more secure. If this happens, we think it will result in a more secure internet.

However, the ease of obtaining HTTPS also has many benefits, as it allows individuals and smaller companies to move their domain to an encrypted and secure server faster and more easily. The disadvantage is that the thieves come along as well.

5.3 Wider context

If digitalization continues to grow and expand, it is inevitable that phishing attacks will also grow and appear in other, more sophisticated forms. The spread of social manipulation attacks may lead to greater demands on states and authorities to counteract it, for example by teaching it in school so that it becomes part of general education. Everyone knows about virus attacks today, because what they are and how they work is widely known. Social manipulation, however, is a modern problem that may require us to act and learn how to fight it.

6. Conclusion

In our thesis we have tried to answer broad questions. The main conclusion is that phishing on the net is growing, while the creators of phishing pages keep improving their websites so that they look increasingly authentic. We must respond to this trend, and we have shown that there are different ways to do so. One is to make the public more aware of how phishing works. We, the authors of this report, had only limited knowledge about phishing from a security course before starting the work on this thesis, even though we have studied IT for three years. This may indicate that people must learn more about how to handle fraudsters online. Just as we have learned how to deal with fraudsters physically, we must learn how to deal with them digitally.

Furthermore, we have learned that Google is aware of the phishing problem and is starting to respond to it. Hopefully the other big browser companies will do the same.

To answer our research questions:

Why and how can CAs issue approved certificates to fraudsters' websites?

From what we have learned while writing this report, part of the problem is that it can be hard for the CAs to know whether a domain belongs to a fraudster or not. There are also different levels of certificates, where different levels of checks are done on the domain and its owner before the certificate is issued. The lowest levels of certificates involve barely any checks at all and, although we cannot see this from the results presented in this report, our hypothesis is that it is mainly these types of certificates that the phishing domains get.

Why are increasingly many HTTPS domains not trustworthy?

This is something we did not really find an answer to, but we think it has to do with how thieves adapt. Since people generally trust HTTPS domains, thieves can exploit this, and until there is an effective way to deal with it, the number of HTTPS domains dedicated to scamming people will most likely continue to increase.

In future work we would like to expand our work into researching and contributing new ideas on how to avoid scams on the web. With more time and resources, it would be interesting to see what new methods we could come up with. Digitalization is expanding very fast, and with it phishing attack methods also become more sophisticated, which creates a need for new methods of protecting users from the attacks.

Lastly, we think that our code can be a valuable starting point for anyone wanting to make a deeper analysis on HTTPS-phishing domains.

7. Bibliography

[1] M. Nadhom and P. Loskot. “Survey of public data sources on the internet usage and other internet statistics”. In: Data in Brief, 2018, pp. 1914–1929.

[2] E. H. Chang, K. L. Chiew, S. N. Sze, and W. K. Tiong. “Phishing detection via identification of website identity”. In: Proceedings of the International Conference on IT Convergence and Security (ICITCS), 2013.

[3] C. Hassold. “A quarter of phishing attacks are now hosted on HTTPS domains: Why?”. In: The PhishLabs, 2017. URL: https://info.phishlabs.com/blog/quarter-phishing-attacks-hosted-https-domains

[4] R. Zhao, B. E. Gavett, S. E. John, C. A. Bussell, J. R. Roberts and C. Yue. “Phishing suspiciousness in older and younger adults: The role of executive functioning”. In: PLoS ONE 12(2): e0171620, 2017.

[5] J. Gustafsson, N. Carlsson, G. Overlier, and M. Arlitt. “A First Look at the CT Landscape: Certificate Transparency Logs in Practice”. In: Proceedings of the Passive and Active Measurement Conference (PAM), 2017.

[6] O. Gasser, B. Hof, M. Helm, M. Korczynski, R. Holz, and G. Carle. “In Log We Trust: Revealing Poor Security Practices with Certificate Transparency Logs and Internet Measurements”. In: Proceedings of the Passive and Active Measurement Conference (PAM), 2018.

[7] N. Williams and S. Li. “Simulating Human Detection of Phishing Websites: An Investigation into the Applicability of the ACT-R Cognitive Behavior Architecture Model”. In: Proceedings of the IEEE International Conference on Cybernetics (CRC), 2017.

[8] T. Fadai, S. Schrittwieser, P. Kieseberg, and M. Mulazzani. “Trust me, I’m a root CA! Analyzing SSL root CAs in modern browsers and operating systems”. In: Proceedings of the International Conference on Availability, Reliability and Security (ARES), 2015.

[9] T. Y. Pan and X. Ding. “Anomaly based web phishing page detection”. In: Proceedings of the Annual Computer Security Applications Conference (ACSAC), 2006.

[10] Z. Dong, A. Kapadia, J. Blythe and L. J. Camp. “Beyond the lock icon: real-time detection of phishing websites using public key certificates”. In: Proceedings of the APWG Symposium on Electronic Crime Research (eCrime), 2015.

[11] J. Hu, X. Zhang, Y. Ji, H. Yan, L. Ding, J. Li and H. Meng. “Detecting Phishing Websites Based on the Study of the Financial Industry Webserver Logs”. In: Proceedings of the International Conference on Information Science and Control Engineering (ICISCE), 2016.

[12] Q. Cui, G. Jourdan, G. V. Bochmann, R. Couturier and I. V. Onut. “Tracking Phishing attacks over time”. In: Proceedings of the International World Wide Web Conference (WWW), 2017.

[13] PhishTank: Sites from PhishTank. http://www.phishtank.com/index.php

[14] Certstream: “Real-time SSL certificate transparency log update stream. Issued in real time”. https://certstream.calidog.io/

[15] Jsoup: Java HTML Parser. https://jsoup.org/

[16] P. Gutmann. Engineering Security. [ebook] Auckland, University of Auckland, April 2014, pp. 72-100. Available at: https://www.cs.auckland.ac.nz/~pgut001/pubs/book.pdf

[17] R. Dhamija and J. D. Tygar. “The battle against phishing”. In: Proceedings of the Symposium on Usable Privacy and Security (SOUPS), 2005.

[18] S. Garera, N. Provos, M. Chew and A. D. Rubin. “A framework for detection and measurement of phishing attacks”. In: Proceedings of the ACM Workshop on Recurring Malcode, 2007.

[19] S. Afroz and R. Greenstadt. “PhishZoo: Detecting phishing websites by looking at them”. In: Proceedings of the Semantic Computing Conference (ICSC), 2011.

[20] Y. Zhang, J. Hong and L. Cranor. “A content-based approach to detect phishing web sites”. In: Proceedings of the World Wide Web Conference (WWW), 2007.

[21] R. Ledesma, C. Neil, D. Bon and Y. Teraguchi. “Client-side defense against web-based identity theft”. In: Proceedings of the Annual Network and Distributed System Security Symposium (NDSS), 2004.

[22] M. Cardwell and Hardenize Labs. “Detecting Phishing Sites Using Certificate Transparency Monitoring”. https://www.hardenize.com/blog/certificate-transparency-monitoring-phishing-detection. 2019.
