Comparison of Automated Password Guessing Strategies


Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2019


Tobias Lundberg


Tobias Lundberg
LiTH-ISY-EX--19/5213--SE

Supervisor: Niklas Johansson, ISY, Linköpings universitet

Examiner: Jan-Åke Larsson, ISY, Linköpings universitet

Information Coding

Department of Electrical Engineering Linköping University


Abstract

This thesis examines some of the currently available programs for password guessing, in terms of their designs and strengths. The programs Hashcat, OMEN, PassGAN, PCFG and PRINCE were tested for effectiveness in a series of experiments similar to real-world attack scenarios. The designs of those programs, as well as that of the program TarGuess, were also examined with respect to how extensively they use different important parameters. It was determined that most of the programs use different models to process password lists, in order to learn how new, similar passwords should be generated. Hashcat, PCFG and PRINCE were found to be the most effective programs in the experiments, in terms of the number of correct passwords guessed each second. Finally, a program for automated password guessing based on the results was built and implemented in the cyber range at the Swedish Defence Research Agency.


Acknowledgments

Thanks to Niklas Johansson and Hannes Holm for supervising the work and Jan-Åke Larsson for examining it.

Linköping, June 2019 Tobias Lundberg


Contents

1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research questions
1.4 Delimitations
2 Background
3 Theory and related work
3.1 Terminology
3.1.1 Hash function
3.1.2 Password cracker
3.1.3 Password guesser
3.1.4 Sister Password
3.1.5 Online/offline guessing
3.1.6 User and attacker
3.2 Password Content
3.2.1 Personal information
3.2.2 Password Policies
3.2.3 Password Re-use
3.2.4 Dictionary words
3.2.5 Native Language
3.3 Password guessing programs
3.3.1 Hashcat
3.3.2 OMEN and OMEN+
3.3.3 PassGAN
3.3.4 PCFG
3.3.5 PRINCE
3.3.6 TarGuess
3.4 Other programs and techniques
3.4.1 Rainbow Tables
3.4.2 Other password guessers
4 Method
4.1 How password guessing is performed
4.1.1 Interviews
4.2 Overall design of password guessing programs
4.2.1 Input parameters
4.3 Overall effectiveness of password guessing programs
4.3.1 Metrics for measuring effectiveness
4.3.2 Used sets of data
4.3.3 Studied scenarios and tests
4.4 The implementation of the system
4.4.1 Overview
4.4.2 The requirements
4.4.3 Expected number of guesses
4.4.4 Expected number of correct guesses
4.4.5 Tests to determine the parameters for the curves
4.4.6 Putting together the material into a program
5 Result
5.1 How password guessing is performed
5.2 Overall design of password guessing programs
5.2.1 Summary
5.2.2 Hashcat
5.2.3 OMEN and OMEN+
5.2.4 PassGAN
5.2.5 PCFG
5.2.6 PRINCE
5.2.7 TarGuess
5.3 Overall effectiveness of password guessing programs
5.3.1 Test to measure guessing speed
5.3.2 Test with password list
5.3.3 Test with dictionary words and password list
5.3.4 Test using dictionaries of different languages
5.3.5 Test against passwords with a minimum length policy
5.4 The implementation of the system
5.4.1 Expected number of correct guesses
6 Discussion
6.1 Result
6.1.2 Design
6.1.3 Test with password list
6.1.4 Test with dictionary words and password list
6.1.5 Test using dictionaries of different languages
6.1.6 Test against passwords with a minimum length policy
6.2.1 Criticism of sources
6.2.3 The design of the programs
6.2.4 The effectiveness of the programs
6.2.5 The development and implementation of the system
6.3 The work in a wider context
7 Conclusions
7.0.1 Answers to the research questions
7.0.2 Future work


1 Introduction

1.1 Motivation

Passwords are currently one of the most widely used authentication methods. A compromised password can enable an attacker to obtain sensitive information or abuse privileges, which makes passwords an important topic to study. In recent years, multiple websites have been breached and millions of passwords have been leaked [15]. Notable examples include RockYou (32 million passwords, 2009), LinkedIn (100 million passwords, 2012), and Google Plus (0.5 million, 2018). Although many websites take security measures such as hashing and salting the user passwords, a weak password can still be guessed by an attacker with access to the password hash [16]. This is known as an offline password guessing attack. When humans pick passwords they tend to come up with something that can be remembered and entered again in the future [10]. That implies that passwords selected by humans typically have some sort of structure instead of being entirely random, since random passwords would be much harder to remember [35]. If such structures can be discovered and modelled, passwords become easier to guess. That is a good reason for users to avoid those structures, and for those structures to be uncovered so that users can be told about them. Another reason to model how users select passwords is so that a potential attacker, who targets the passwords of users, can be modelled and simulated. This would enable a better understanding of how secure passwords truly are against real attacks, and help when designing protection mechanisms. Knowledge of how an attacker might target passwords would also help with defending against password guessing attacks.


1.2 Aim

The goal of this project is to examine how password guessing is typically performed by people with professional experience, and how users typically pick passwords. This knowledge will be used to examine and compare state-of-the-art automated password guessing programs in terms of effectiveness and design. This is done so that an attacker who guesses passwords can be modelled and implemented, which makes it cheaper and less time-consuming to provide attacks to defend against in a training environment. Such an automated system will be implemented in the Cyber Range at the Swedish Defence Research Agency (FOI). The implemented system is intended to take the role of an attacker, to be defended against in the security training that FOI carries out.

1.3 Research questions

The following research questions are to be answered in this report.

1. How are password guessing attacks performed by people with professional experience?

2. What are the overall designs of current state-of-the-art automated password guessing programs?

3. How do the password guessing programs compare in terms of effectiveness?

4. How can an automated password guessing system be implemented in the Cyber Range at FOI?

1.4 Delimitations

This report will not look into implementation-specific details of various hashing algorithms. Although cryptanalysis can be performed in order to analyse and find weaknesses in such algorithms, this paper makes the assumption that the hash algorithms cannot be reversed, and focuses only on the syntax and structure of passwords. This way, the results and conclusions reached can be applied to any system in which passwords are entered by users, regardless of which specific algorithm is used.


2 Background

The work was carried out at the Swedish Defence Research Agency (FOI). The main purpose was to implement an automated password cracker in the cyber range CRATE¹. A cyber range is a simulated network environment, used for training and research. CRATE has a system which emulates attacks against networks (an automated red team), and this system is to be expanded with password cracking capabilities.

¹ https://www.foi.se/en/foi/resources/crate---cyber-range-and-training-environment.html


3 Theory and related work

This chapter introduces some of the theory behind passwords, hashes, password guessers and password cracking programs.

3.1 Terminology

This section lists some of the terms used in the report.

3.1.1 Hash function

A hash function is a function which calculates a hash of a fixed length from an input of any length. Specifically, cryptographic hash functions are considered here, which are designed to be infeasible to invert. They are often used to store passwords in systems that require authentication, since they allow easy checking of passwords without revealing what the password is. An attack that tries to find an input with a specific hash value is known as a preimage attack. In this paper, a password being ‘cracked’ refers to a preimage attack which has successfully recovered the password for a specific hash value.

3.1.2 Password cracker

A password cracker is a program which attempts to find the password which was used to generate a given hash. The input to a password cracker is the hash, the hash algorithm used, as well as some mode of operation. Depending on the mode of operation, other inputs (such as a dictionary) might be required as well.

3.1.3 Password guesser

A password guesser is a program which comes up with guesses for passwords. Unlike a password cracker, it does not handle the generation of hashes or comparisons with the target hashes. Instead, a password guesser just generates password candidates, typically in order of decreasing probability. A password guesser can be used together with a password cracker to recover the passwords behind target hashes. The relationship between password guessers, password crackers and hash functions can be seen in figure 1. It is worth noting that password guessers are often included in password crackers (such is the case with John the Ripper and Hashcat).

Figure 1: The relationship between various parts used for password cracking
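The division of labour between guesser, hash function and cracker can be sketched in a few lines of Python. The sketch is illustrative only: the function names, the tiny word list and the choice of unsalted SHA-256 are assumptions made for the example, and none of it corresponds to the internals of the programs discussed in this report.

```python
import hashlib
from typing import Iterable, Iterator, Optional


def guesser(words: Iterable[str]) -> Iterator[str]:
    """Password guesser: only produces candidates, never touches hashes."""
    for word in words:
        yield word
        yield word.capitalize()
        yield word + "1"


def sha256_hex(candidate: str) -> str:
    """Hash function: maps input of any length to a fixed-length digest."""
    return hashlib.sha256(candidate.encode("utf-8")).hexdigest()


def cracker(target_hash: str, candidates: Iterable[str]) -> Optional[str]:
    """Password cracker: hashes each candidate and compares with the target."""
    for candidate in candidates:
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None


# Hypothetical target: the hash of a password the attacker does not know.
target = sha256_hex("Summer1")
found = cracker(target, guesser(["winter", "summer", "Summer"]))
```

Because the guesser is a generator, the cracker consumes candidates lazily and stops as soon as a match is found, which mirrors how a stand-alone guesser can be piped into a cracker.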

3.1.4 Sister Password

Two passwords are sister passwords if they belong to the same user and are used as authentication for two different systems. For example, a password on website A by user U is a sister password to a password on website B by user U.

3.1.5 Online/offline guessing

When password guesses are made against the live authentication system, it is known as an online attack. If the hash digest of the password has been leaked and guesses are made without the live system, it is known as an offline attack.


3.1.6 User and attacker

In this report, ‘user’ refers to the person who set the password which is being targeted by an attacker. ‘Attacker’ refers to the person who runs a password cracker or password guesser in order to find the password of the user.

3.2 Password Content

Some previous research has been conducted on what content users have in their passwords, and what influences how users pick their passwords. This section examines research related to password content, which is important to know when examining which factors a password guessing program should take into account. This section shows that personal information, password policies, previous passwords of the user, and dictionary words in the native language of the user should be used when performing targeted password guessing.

3.2.1 Personal information

Su et al. showed the extent of personal information usage in Chinese passwords [29]. When they analysed more than 200 million leaked passwords together with 20 million personal information records in China, they came to the conclusion that at least 37% of the passwords contain personal information. The personal information they looked for was birth dates (9.44% of the passwords contained some substring of the birth date of the user), cell-phone numbers (8.97%), name pinyin (12.56%), name acronyms (12.72%), and email addresses (0.035%). Wang et al. conducted similar research, using five datasets of leaked passwords from English websites and five datasets of leaked passwords from Chinese websites [32]. Their results show that between 0.75% and 1.87% of users use their full name as their password, depending on which password leak is examined. Between 1.00% and 5.15% of Chinese users use their birth dates. Between 0.54% and 2.34% of passwords contain the username, and between 0.77% and 5.07% contain some part of the user's email address. Furthermore, the paper by Wang et al. also shows the difference between English and Chinese users when it comes to picking passwords.

In 2017, Li et al. analysed the personal information content of 130,000 passwords leaked from the Chinese website 12306.cn, along with the corresponding personal information of the users [19]. Their research shows that 24.10% of the users use a subset of their birth date in their password, 23.60% use some part of their account name, 22.35% use their real name, 12.66% use part of their email address, 2.996% use their ID number, and 2.726% use their cell phone number. Furthermore, they conducted similar research on the leaked passwords from the English website rockyou.com. Since that dataset only contained passwords and no personal information of the users, the paper looked for any real-world name in all of the passwords. Their conclusion is that more than 24.7% of the passwords contain a name with 4 or more characters, which can reasonably be assumed to be a name with some relation to the user.

Castelluccia et al. [7] looked at leaked passwords from facebook.com, and concluded that about 35% of the passwords are ‘somewhat correlated’ with one of the user's attributes. The attributes they looked at were first name, last name, usernames, friends' names, education and work, relatives and birth dates. They conclude that first name, username, and birth date are the most commonly used personal attributes in passwords.

The mentioned papers used different metrics to measure the amount of personal information used in passwords, so their results cannot be compared by just looking at the numbers. However, one important conclusion that can be drawn is that personal information of users is often included in their passwords.

3.2.2 Password Policies

In an attempt to get users to pick stronger passwords for services, requirements (known as policies) are put on the passwords by system administrators. A paper by Komanduri et al. studies the effect of various password policies on the choices of passwords users typically make [17]. They found that requirements on length increase the security of the picked passwords, as measured by Shannon entropy, more than requirements on special characters or numbers.

3.2.3 Password Re-use

Users tend to re-use their passwords across multiple systems. Wash et al. studied the extent to which this happens [33], and concluded that users tend to re-use strong passwords, presumably since they are harder to remember. The median number of websites for which a password is used was found to be 3, while the most used password of each participant was found to be used on an average of 9 different websites. Even when a user does not re-use a password exactly as it is, the password is often a modified version of one used on a different service [31].

3.2.4 Dictionary words

In 2006, Cazier et al. showed that roughly 90% of passwords on an e-commerce site could be guessed by using dictionary words [8]. Hunt showed in 2011 that roughly 25% of the passwords used on gawker.com were present in a general English dictionary [14].


3.2.5 Native Language

Bonneau showed that using a dictionary in the language of the targeted user could improve password guessing by up to 25% [6], as opposed to using a dictionary of a different language. Maoneke et al. performed a survey in which 107 Namibian and South African students were asked to generate a password, which was then checked for content [22]. The results show that 47% of the passwords were generated using English words, and 30% of them were generated using words from the participants' native languages. The studies by Su et al. [29] and Wang et al. [32] also demonstrate the extent to which the culture and native language of a user influence password choice.

3.3 Password guessing programs

This section examines various specific programs which are related to password guessing and password cracking. Section 3.3.1 presents Hashcat, which is a general password cracking program. The following sections present PCFG, OMEN and PassGAN, which are programs that learn the structure of passwords in different ways and use this information to generate likely password candidates. PRINCE, introduced in section 3.3.5, is the only program intended to be entirely automated. TarGuess, introduced in section 3.3.6, also learns password structures, and in addition augments the guess generation with data related to the person behind the target hash. All the programs except for Hashcat are password guessers, and do not handle the hashing and comparison of the guesses.

3.3.1 Hashcat

Hashcat [2] is a popular password cracking program. It comes with the following default ‘attack modes’, which are different ways of generating password guesses. The attack modes can further be combined in various ways for additional flexibility.

1. Straight, which tests all the words in a dictionary. This mode also supports word mangling rules, which are ways to manipulate the dictionary words. Examples of mangling rules include appending a digit, reversing the word, or capitalising the first letter.

2. Combination, which combines words from multiple dictionaries. Like the straight mode, word mangling rules can be applied in this mode as well.

3. Brute-Force, which tests all of the key-space for a given password structure (known as a ‘mask’). The Brute-Force mode uses Markov chains, similar to those used by OMEN, which is described in section 3.3.2. The Markov model comes pre-trained with passwords leaked from the website RockYou, but can be trained by an attacker-provided list of words.

4. Hybrid Dictionary + Mask, which takes a dictionary and a password structure, and puts every possible character combination for the structure after each word in the dictionary. This can be seen as a combination of the straight mode and brute-force mode.

5. Hybrid Mask + Dictionary, which is the same as above, except that it puts every possible combination before each word in the dictionary.
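To make two of the modes above concrete, the following Python sketch imitates a straight attack with three mangling rules and a hybrid dictionary + mask attack. It is a simplified illustration and not Hashcat's implementation: the rule set, the function names and the representation of a mask as a list of character sets are all assumptions made for the example, and Hashcat's own rule and mask syntax is not reproduced here.

```python
import itertools
import string
from typing import Callable, Iterator, List, Sequence

# Three example mangling rules, mirroring the ones named in the text.
RULES: List[Callable[[str], str]] = [
    lambda w: w + "1",         # append a digit
    lambda w: w[::-1],         # reverse the word
    lambda w: w.capitalize(),  # capitalise first letter (also lowercases the rest)
]


def straight(dictionary: Sequence[str]) -> Iterator[str]:
    """Straight mode: every word, followed by each mangled variant."""
    for word in dictionary:
        yield word
        for rule in RULES:
            yield rule(word)


def hybrid_dict_mask(dictionary: Sequence[str], mask: Sequence[str]) -> Iterator[str]:
    """Hybrid Dictionary + Mask: append every combination the mask allows."""
    for word in dictionary:
        for suffix in itertools.product(*mask):
            yield word + "".join(suffix)


# Assumed mask: two digits, so each word gets 10 * 10 = 100 suffixes.
two_digits = [string.digits, string.digits]
straight_guesses = list(straight(["password"]))
hybrid_guesses = list(hybrid_dict_mask(["password", "dragon"], two_digits))
```

The hybrid mask + dictionary mode would follow from the same sketch by placing the generated combination before the word instead of after it.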

3.3.2 OMEN and OMEN+

A Markov chain is a model which describes a sequence of possible events, where each event depends on one or more of the most recent previous events. If the likelihood of an event occurring depends on the last n events, the Markov chain is called an n-gram Markov chain. The idea of using Markov chains to guess passwords was first proposed by Narayanan et al. in 2005 [24]. They used 0-gram and 1-gram Markov chains to generate passwords, where each letter appearing was the event. For example, a 1-gram Markov chain would look at the most recent letter generated, and output the next letter with the highest probability given that letter. This idea was later extended by Duermuth et al. [9] into a program known as Ordered Markov ENumerator (OMEN). OMEN is a password guesser which outputs the guesses in order of probability, given a model of a Markov chain [4]. It uses training data in the form of previously found passwords to determine the n-gram Markov model. The value of n is set by the attacker and can go arbitrarily high, but bigger numbers increase computation time dramatically. The attacker can also provide an alphabet of the characters OMEN should consider, as well as the ‘level’ of computation. A level is a notation for how inaccurate the guesses should be before the program quits, where a low level implies that only high probability guesses will be made. The original paper by Duermuth et al. [9], which contains more details, used 4-grams, a 72 character alphabet, and 10 levels, and found those numbers to be a good trade-off between computation time and password guess accuracy. OMEN was later extended into OMEN+, which takes the personal information of the target into consideration [7]. The authors tried various types of personal information, but decided that the four most worthwhile are first name, username, date of birth and email address. OMEN+ is implemented in the same binary as OMEN.
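The core idea of scoring passwords with a Markov chain can be sketched as follows, for a chain in which each character depends only on the previous one (a 1-gram chain in the terminology above). This is not OMEN's algorithm: OMEN enumerates candidates level by level, whereas the sketch merely scores and sorts a fixed list of candidates. The boundary symbols, the function names and the toy training data are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List

START, END = "^", "$"  # assumed boundary symbols, not part of the alphabet


def train(passwords: Iterable[str]) -> Dict[str, Counter]:
    """Count how often each character follows the previous one."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for pw in passwords:
        chars = [START] + list(pw) + [END]
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts


def probability(model: Dict[str, Counter], candidate: str) -> float:
    """Probability of a candidate as a product of transition probabilities."""
    prob = 1.0
    chars = [START] + list(candidate) + [END]
    for prev, nxt in zip(chars, chars[1:]):
        total = sum(model[prev].values()) if prev in model else 0
        if total == 0:
            return 0.0  # unseen context; a real model would smooth instead
        prob *= model[prev][nxt] / total
    return prob


def rank(model: Dict[str, Counter], candidates: List[str]) -> List[str]:
    """Order candidates by decreasing probability under the model."""
    return sorted(candidates, key=lambda c: probability(model, c), reverse=True)


model = train(["abba", "abab", "baba"])
ordered = rank(model, ["zzzz", "bbbb", "abab"])
```

Candidates that resemble the training passwords receive a higher probability and are therefore ranked first, which is the property a guesser relies on when it outputs guesses in order of decreasing probability.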

3.3.3 PassGAN

A Generative Adversarial Network (GAN) is made of two neural networks. One of the networks, the generative network, tries to learn the statistical structure of some data set in order to generate new, statistically similar, samples. The other network, known as the discriminating network, tries to detect which samples come from the generative network and which samples come from the original data. The training is complete when the discriminating network is unable to detect the source of the samples it receives. PassGAN is an implementation of a GAN which is used to generate new password guesses, from a large amount of other passwords [12].

3.3.4 PCFG

A context-free grammar is a set of rules on how a symbol can be transformed. The rule X → Y means that the abstract symbol ‘X’ can be transformed into the abstract symbol ‘Y’. There can be multiple such rules for a given symbol, in which case each of the transformations is possible. If each rule is assigned a probability of occurring, the grammar is known as a probabilistic context-free grammar.

The concept of using probabilistic context-free grammars to generate password guesses was first studied by Weir et al. in 2009 [34]. Since then, the concept has been studied and extended multiple times in the literature [13] [32]. The idea is to use training data of passwords to generate a probabilistic context-free grammar, which can then be used to generate password guesses. The PCFG password guesser looks at a large amount of passwords, in order to determine which transformations are the most likely for the general symbol ‘Password’. The possible abstract symbols are: Alpha Letters (A), Digits (D), Capital Letters (U), Small Letters (L), Keyboard Patterns (K), and Special Characters (O). In the original paper by Weir et al., a symbol was indexed by how many times it repeated. So U L_3 D_2 has the meaning ‘capital letter followed by 3 lowercase letters and two digits’, and would match ‘Pass23’, among others.
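How such structures can be extracted from training passwords and turned into probabilities is sketched below. The sketch is a simplification under stated assumptions: it only uses the symbols U, L, D and O (keyboard patterns and the combined alpha symbol are left out), it always writes the repeat count explicitly, and the function names and training list are made up for the example. It is not the implementation of the PCFG program.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, Iterable, Tuple

Structure = Tuple[Tuple[str, int], ...]


def symbol(ch: str) -> str:
    """Map a character to one of four simplified symbols."""
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    if ch.isdigit():
        return "D"
    return "O"  # everything else is treated as a special character


def structure(password: str) -> Structure:
    """Collapse runs of the same symbol, e.g. 'Pass23' -> U1 L3 D2."""
    return tuple((sym, len(list(run))) for sym, run in groupby(password, key=symbol))


def learn(passwords: Iterable[str]) -> Dict[Structure, float]:
    """Estimate the probability of each structure from training passwords."""
    counts = Counter(structure(pw) for pw in passwords)
    total = sum(counts.values())
    return {struct: n / total for struct, n in counts.items()}


grammar = learn(["Pass23", "Love99", "dragon", "abc123!", "Rock11"])
```

In this toy training set three of the five passwords share the structure of ‘Pass23’, so a guesser built on the grammar would try that structure first and fill it in with likely letters and digits.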

3.3.5 PRINCE

PRINCE stands for PRobability INfinite Chained Elements, and is a stand-alone password guesser by Jens Steube [5]. It uses a dictionary as a basis for the guesses, and combines the dictionary words in various ways. The ‘Infinite’ in its name originates from the fact that it will run until it exhausts the key space, which will take a very long time [27]. Internally, PRINCE reads each word in the dictionary into a table of ‘elements’ (the words), which is sorted by length. The elements will be combined to create ‘chains’ of a certain length. As an example, a chain of length four can be created by two elements of length two (2 + 2), or four elements of length one (1 + 1 + 1 + 1), among others. Depending on the length of the chain and the number of elements in the chain, the number of ways to materialise the chain can be different. This number is referred to as the ‘key space’ of the chain. When PRINCE guesses passwords of a length x, it sorts the different chains whose total length is x by their key space size, and exhausts them in order of increasing key space [28].


If PRINCE attempts to guess passwords of length 3, it would try the chains 1 + 1 + 1, 2 + 1, 1 + 2 and 3. Suppose the dictionary fed to PRINCE contains 20 words of length 3, 5 words of length 2, and 3 words of length 1. Then PRINCE would exhaust the chains in the order indicated by table 1, starting with the words of length 2 concatenated to the words of length 1, and finishing with all the combinations of 3 words of length 1.

Chain        Elements     Key space
1 + 2        3 × 5        15
2 + 1        5 × 3        15
3            20           20
1 + 1 + 1    3 × 3 × 3    27

Table 1: An example of chains, elements, and key spaces used by PRINCE
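The ordering in table 1 can be reproduced with a short Python sketch that enumerates every chain of a given total length and sorts the chains by key space. The function names are assumptions made for the example, and the sketch only covers the chain ordering, not the generation of the actual guesses or any other part of PRINCE.

```python
from math import prod
from typing import Dict, Iterator, List, Tuple


def chains(length: int) -> Iterator[Tuple[int, ...]]:
    """Yield every ordered way to split `length` into positive element lengths."""
    if length == 0:
        yield ()
        return
    for first in range(1, length + 1):
        for rest in chains(length - first):
            yield (first,) + rest


def key_space(chain: Tuple[int, ...], words_by_length: Dict[int, int]) -> int:
    """Number of candidates a chain produces: product of the element counts."""
    return prod(words_by_length.get(n, 0) for n in chain)


# Word counts from the example: 3 of length 1, 5 of length 2, 20 of length 3.
counts = {1: 3, 2: 5, 3: 20}
ordered: List[Tuple[Tuple[int, ...], int]] = sorted(
    ((c, key_space(c, counts)) for c in chains(3)),
    key=lambda item: item[1],
)
```

Sorting by key space means that small chains are exhausted first, so the guesser spends its early guesses on the combinations that are cheapest to cover completely.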

3.3.6 TarGuess

The previously mentioned password guessers are mainly concerned with offline password guessing, and are intended to use huge data sets of leaked passwords to make their guesses. To demonstrate the effectiveness of employing personal information in online password guessing, Wang et al. created TarGuess, a framework for guessing passwords of a user given some target-specific information [32]. They defined personal information as any information which is related to a user, and split it up into several categories: Type-1, which is names, birth dates, phone numbers, and national ID number, and Type-2, which is gender, age, and language. Furthermore, they also used user identification credentials, such as previous passwords, personal ID numbers, user names and email addresses. Type-2 information was assumed to have an implicit role in passwords, which means that it impacts how a user picks their passwords. The other categories were instead supposed to have an explicit role, which means that the information occurs directly in the picked password. All their versions of TarGuess were demonstrated to be better than previous attempts at using personal information, but unfortunately they did not release any source code or pre-trained models of their work.

They modelled four different versions of TarGuess, each for a different attack scenario. They are briefly described here, but for more details, see the original paper by Wang et al.

TarGuess-I

TarGuess-I deals with the scenario of an attacker getting access to the name and birth date of the user, and thus employs some Type-1 Personal Information. It is built on PCFG (described in section 3.3.4), and extends the model to use the symbol N for name and B for birth date. They trained their model on a set of passwords with corresponding personal information for each user, by looking for matches between the personal information and the substrings of the password for each user.
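The matching step described above can be illustrated with a small Python sketch that replaces substrings matching known personal information with the symbols N and B. Since no source code for TarGuess was released, this is purely a hypothetical illustration: the function name, the angle-bracket notation, the birth date format and the rule of tagging only the first match of each value are all assumptions.

```python
from typing import Dict, List, Tuple


def tag_personal_info(password: str, info: Dict[str, str]) -> str:
    """Replace the first match of each personal value with its symbol.

    `info` maps a symbol ('N' for name, 'B' for birth date) to the value
    known about the user. Matching is case-insensitive and assumes ASCII
    input, so that lowercasing does not change string lengths.
    """
    lowered = password.lower()
    spans: List[Tuple[int, int, str]] = []
    for sym, value in info.items():
        start = lowered.find(value.lower()) if value else -1
        if start != -1:
            spans.append((start, start + len(value), sym))
    spans.sort()

    parts: List[str] = []
    pos = 0
    for start, end, sym in spans:
        if start < pos:
            continue  # skip a match that overlaps an earlier one
        parts.append(password[pos:start])
        parts.append(f"<{sym}>")
        pos = end
    parts.append(password[pos:])
    return "".join(parts)


# Hypothetical user record; the date format (YYYYMMDD) is an assumption.
user = {"N": "alice", "B": "19900412"}
template = tag_personal_info("Alice19900412!", user)
```

The remaining, untagged segments of the template could then be handled by the ordinary PCFG symbols, so that the learned grammar contains structures mixing personal and non-personal parts.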

TarGuess-II

TarGuess-II deals with the scenario of an attacker getting access to one sister password of the targeted user. With training data of multiple different pairs of passwords, from the same user but for different sites, they tried to determine how the first password can be modified into the second. The modification rules they used were insertion, deletion, capitalisation, leet speak, substring movement and reversing. The result was a set of transformation sequences, each associated with a probability of occurring. This set was then used to make guesses, by applying each transformation sequence to the sister password of the targeted user in order of decreasing probability.
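The guessing step can be sketched as below: a list of transformation sequences, each with a probability, is applied to the sister password in order of decreasing probability. The sketch is hypothetical and not taken from TarGuess, for which no source code is available: the individual transformations, the leet speak substitutions, the decision to try direct re-use first, and in particular the probabilities are all made up for illustration.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Transform = Callable[[str], str]

# Assumed leet speak substitutions.
LEET = str.maketrans({"a": "@", "e": "3", "o": "0", "s": "$"})

# Named single-step transformations; a small illustrative subset.
TRANSFORMS: Dict[str, Transform] = {
    "capitalise": lambda pw: pw[:1].upper() + pw[1:],
    "append_1": lambda pw: pw + "1",
    "delete_last": lambda pw: pw[:-1],
    "leet": lambda pw: pw.translate(LEET),
    "reverse": lambda pw: pw[::-1],
}

# Made-up probabilities; a real model would estimate them from password pairs.
SEQUENCES: List[Tuple[Tuple[str, ...], float]] = [
    (("leet",), 0.15),
    (("append_1",), 0.30),
    (("capitalise", "append_1"), 0.20),
    (("reverse",), 0.03),
    (("capitalise",), 0.25),
    (("delete_last",), 0.07),
]


def guesses(sister: str) -> Iterator[str]:
    """Apply each transformation sequence in order of decreasing probability."""
    yield sister  # assumption: direct re-use is tried first
    for names, _prob in sorted(SEQUENCES, key=lambda s: s[1], reverse=True):
        candidate = sister
        for name in names:
            candidate = TRANSFORMS[name](candidate)
        yield candidate


candidates = list(guesses("summer19"))
```

Because a sequence may contain several steps, compound modifications such as capitalising and then appending a digit are covered by a single entry with its own probability.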

TarGuess-III

TarGuess-III deals with the scenario of an attacker getting access to the name, birth date (Type-1 information), and one sister password of the targeted user. It does so by combining the grammar of TarGuess-I and the grammar of TarGuess-II directly, and can thus be seen as a combination of them.

TarGuess-IV

TarGuess-IV deals with the scenario of an attacker getting access to the name, birth date (Type-1 information), one sister password, and either the gender, age, or language (Type-2 information) of the user. Combining the other versions of TarGuess, with some help of Bayesian theory, yielded a trained model which uses all the available information.

3.4 Other programs and techniques

This section lists some of the programs and techniques which are found in the literature but are not considered in this paper. The reason is either that more programs would dramatically increase the scope of the paper or that the programs are no longer relevant.


3.4.1 Rainbow Tables

Rainbow tables are large pre-computed tables of passwords and hashes [26], used to quickly look up the password which was used to calculate the hash. Rainbow tables have been studied for a long time in the literature, but will not be considered in this paper. This is because they only work against passwords which have been hashed without salt, which is no longer common. When a salt is used, each guess must be hashed with the salt before it is compared with the targeted hash, which can be done when a password cracker is used but not when rainbow tables are used. Modern hardware also guesses passwords quickly enough that storing a large table of hashes on disk will not save much time.
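The effect of a salt on pre-computed tables can be demonstrated with the following Python sketch. It uses a plain lookup table rather than a real rainbow table (which stores hash chains to save space), and the word list, the function name and the salted SHA-256 construction are assumptions chosen for brevity, not a recommended way to store passwords.

```python
import hashlib
import os
from typing import Dict


def digest(password: str, salt: bytes = b"") -> str:
    """SHA-256 of salt + password (illustrative; not a recommended scheme)."""
    return hashlib.sha256(salt + password.encode("utf-8")).hexdigest()


# A pre-computed table maps unsalted hashes back to passwords.
wordlist = ["123456", "password", "qwerty"]
table: Dict[str, str] = {digest(pw): pw for pw in wordlist}

unsalted_hash = digest("qwerty")
salt = os.urandom(16)
salted_hash = digest("qwerty", salt)

lookup_unsalted = table.get(unsalted_hash)  # found by a single lookup
lookup_salted = table.get(salted_hash)      # the salted hash is not in the table

# With the salt known, a cracker can still hash each guess together with it.
cracked = next((pw for pw in wordlist if digest(pw, salt) == salted_hash), None)
```

The table would have to be rebuilt for every distinct salt to remain useful, whereas the cracker simply includes the salt when hashing each guess.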

3.4.2 Other password guessers

There are multiple other password guessers which can be found both in the literature and on the web. One such program is GENPass, which is based on recurrent neural networks and PCFG [20]. GENPass was not included because there was no implementation available, and because programs based on PCFG and neural networks were already included. A program developed by Melicher et al., which is also based on recurrent neural networks, is another program which was not included [23] [3]. A search for ‘password cracker’ on the code hosting service GitHub reveals hundreds of password crackers [1]. Naturally, testing all of these programs would be out of scope for this paper. Instead, this paper describes tests of programs which were either popular or commonly cited in the literature, and which work differently from each other.


4 Method

This chapter walks through the methods which were used to answer the research questions.

4.1 How password guessing is performed

This section describes the method used to answer the first research question, ‘how are password guessing attacks performed by people with professional experience?’.

4.1.1 Interviews

Several interviews were carried out with people with professional experience in password guessing. The interviewees were selected to be people working with some sort of password guessing at either a company or an agency. The goal of the interviews was to get a clear idea of what programs are commonly used, how they are used, and what assumptions are usually made when guessing passwords. Interviews were chosen as the method, as they make it possible to get qualitative data from a relatively small sample size. Since the goal is to get an idea of how someone with professional experience would approach the problem, the interviews were intended to be semi-structured, with questions that allow for open and flexible answers. Since the goal is to get answers from well-informed people, the interviews are of the type informant interview. The interviews were transcribed and summarised, as described by Sarah Tracy among others [30].


The interview started with getting consent and stating the goals of the interview. The opening questions below were used to get to know the interviewee and their relation to password cracking.

Opening questions

1. What is your current profession?

2. How much of that is related to password cracking?

Generative questions were asked early in the interview, after the opening questions. Their purpose is to get the interviewee to talk freely about the topic and provide a framework for further questions [30].

Generative questions

1. Can you tell me about a job or assignment you had, which was related to password cracking?

Directive questions were asked, depending on the outcome of the generative questions. Their purpose was to look into some specific programs.

Directive questions

1. What type of hardware do you use?

2. Which programs do you typically use?

3. How are the programs used?

4. What information do you usually have access to?

5. How is that information used?

6. Which hash functions do you typically encounter?

In order to finish off the interview, some catch-all questions were asked. The purpose was to let the interviewee tie up any loose ends [30], and to let the interviewer know if they had missed anything.

Closing questions

1. Is there anything else you think is important for me to know about?

These questions were tested in a test session, where they were asked to someone with some experience of the topic. This was done in order to see if they worked as intended.


4.2 Overall design of password guessing programs

This section describes the method used to answer the second research question, ‘what are the overall designs of current state-of-the-art automated password guessing programs?’. As defined in chapter 3, there are multiple different considerations to take into account when implementing an automated password guessing system.

4.2.1 Input parameters

This section describes how the programs were compared with regard to the considerations they take into account. As described in section 3.2, there are some common patterns in user-picked passwords. It was therefore decided that the programs should be examined for how well they can consider the different patterns when they make guesses. The programs Hashcat, OMEN, PassGAN, PCFG, PRINCE, and TarGuess (listed in section 3.3) were checked. They were picked either because of their frequency in the published literature, or because the interviewees (see section 4.1.1) mentioned them. The following attributes were considered, since they are the ones considered to be the most important in the literature.

• Password structure - The structure of how a password is made. This refers to groups of characters and their position, but not the meaning the characters had for the person who came up with the password.

• Dictionary words - This refers to regular words being used as part of the guesses.

• Sister Passwords - One or more passwords which have been used elsewhere by the targeted user.

• Policies - The requirements the system puts on passwords. The two most common requirements on passwords are a minimum number of characters and an inclusion of characters from different groups (such as numbers and capital letters). For this reason, this attribute was split up into Length and Characters.

• Personal Information - Non-public data which is related to the real-life person behind the password. This attribute was split up into Names (which includes, but is not limited to, the real-life name of the user), Birthdays (both the user's own and others'), ID numbers, and Email addresses.

The programs can use the data of the above attributes in different ways. Because of this, it was decided that how a program uses the data of an attribute should be categorised. For each of the attributes considered, the following categorisation was used.


• Automatically - This means that the program makes some automatic deduction of how to use the attribute, and whether it will be included or not. Typically, it involves some sort of training phase.

• Directly - This means that the attribute is specified by the attacker as a setting to the program. If some data related to the attribute is fed to the program when it starts to generate guesses, and the program makes a distinction between this data and data of a different attribute, then the program is said to use the attribute directly.

• Indirectly - This means that the attacker can run the program in some way which makes it base the guesses on the data. If the attacker can feed data related to the attribute in some form to the program, and the program makes guesses which are based on the data in some form, then it is used indirectly. The main distinction between ‘directly’ and ‘indirectly’ is that attributes which are used indirectly were probably an unintended consequence of how the program was designed, and the program makes no distinction between data of the attribute and data of other attributes.

• None - This means that the program is entirely unable to use the attribute when making guesses. If you have access to some data of the attribute, it would not make any difference for the generated guesses or how you run the program.
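The categorisation above can be read as a small decision procedure. The following Python sketch is only an illustration of that reading: the three yes/no observations, the names `Usage` and `categorise`, and the order in which the checks are made are assumptions, not something defined by the examined programs or by this thesis.

```python
from enum import Enum


class Usage(Enum):
    """The four categories, using the same symbols as the summary table."""
    AUTOMATIC = "A"
    DIRECT = "D"
    INDIRECT = "I"
    NONE = "-"


def categorise(learns_from_training: bool,
               accepts_attribute_data: bool,
               distinguishes_attribute: bool) -> Usage:
    """Map three observations about a program to a usage category.

    Assumption: automatic use takes precedence over the other categories.
    """
    if learns_from_training:
        return Usage.AUTOMATIC
    if accepts_attribute_data and distinguishes_attribute:
        return Usage.DIRECT
    if accepts_attribute_data:
        return Usage.INDIRECT
    return Usage.NONE
```

For example, a program which accepts a word list but treats sister passwords like any other word would be categorised as `Usage.INDIRECT` for that attribute.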

4.3 Overall effectiveness of password guessing programs

This section describes the method used to answer the third research question, ‘how does the password guessing programs compare in terms of effectiveness?’. Section 4.3.1 defines the different metrics which are considered for the tests, which are described in section 4.3.3. Section 4.3.2 specifies which sets of data were used in the tests.

4.3.1 Metrics for measuring effectiveness

This section describes the metrics used to measure effectiveness. The different metrics are used in the various tests, described in section 4.3.3.

Number of passwords cracked

One metric which is often used to compare password guessers is the number of guesses required to reach a certain number of cracked passwords in a big data set [34] [9] [23] [29]. This metric was considered here as well, since it measures the quality of guesses, in a hardware-independent manner.
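As a minimal sketch of how such a measurement could be made, the function below counts how many distinct target passwords have been guessed after given numbers of guesses. The function name, its parameters, and the use of plain-text targets instead of hashes are assumptions made for illustration; checkpoints are assumed to be positive integers.

```python
from typing import Iterable, List, Set


def cracked_at_checkpoints(guesses: Iterable[str],
                           targets: Set[str],
                           checkpoints: List[int]) -> List[int]:
    """Return the number of distinct targets cracked after each checkpoint."""
    remaining = sorted(checkpoints)
    cracked: Set[str] = set()
    results: List[int] = []
    count = 0
    for guess in guesses:
        count += 1
        if guess in targets:
            cracked.add(guess)
        # Record the running total every time a checkpoint is reached.
        while remaining and count >= remaining[0]:
            results.append(len(cracked))
            remaining.pop(0)
    # Checkpoints beyond the end of the guess stream report the final total.
    results.extend(len(cracked) for _ in remaining)
    return results
```

Plotting the returned values against the checkpoints gives the kind of guess-number curve that is commonly used to compare guessers.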


Number of cracked passwords per second

This metric encapsulates what an attacker would be interested in. Password guessing attacks are typically made with run-time in mind, and not necessarily the number of guesses. This metric can thus be seen as a hardware-dependent version of ‘number of passwords cracked’, as described above.

Uniqueness of guesses

If a program makes duplicate guesses, the effective time to correctly guess a password goes up, which is why a high number of unique guesses is desirable. For that reason, uniqueness of guesses was a metric which was considered in this thesis.

Uniqueness can also be computed in pairs, by calculating the number of overlapping guesses between two programs. This metric is useful to see how similar two programs are in terms of guesses. If two programs generate very similar guesses, one of them could be replaced by the other with similar results, which is why it is worth considering.

Finally, uniqueness can also be computed among all the programs. This is done by calculating how many guesses a program made which were not made by any other program.

This thesis used all three measures of uniqueness.
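The three measures can be expressed with set operations. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the guesses of each program fit in memory, which may not hold for the volumes used in the tests.

```python
from typing import Dict, List, Set


def unique_ratio(guesses: List[str]) -> float:
    """Share of a program's guesses that are distinct."""
    return len(set(guesses)) / len(guesses) if guesses else 0.0


def pairwise_overlap(a: List[str], b: List[str]) -> int:
    """Number of distinct guesses made by both programs."""
    return len(set(a) & set(b))


def exclusive_guesses(all_guesses: Dict[str, List[str]]) -> Dict[str, int]:
    """Per program, the number of distinct guesses no other program made."""
    sets = {name: set(g) for name, g in all_guesses.items()}
    result: Dict[str, int] = {}
    for name, own in sets.items():
        others: Set[str] = set()
        for other_name, other in sets.items():
            if other_name != name:
                others |= other
        result[name] = len(own - others)
    return result
```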

Length distributions of guesses

Password guessers should make guesses which have some similarities to the training data. One metric for that is the length of the passwords each program generates. This metric tells us if the program considers the length of the training data when it makes its guesses.
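A length distribution of this kind can be computed as below. This is a minimal sketch with a hypothetical function name; comparing the result for a program's guesses with the result for its training data shows how closely the two match.

```python
from collections import Counter
from typing import Dict, Iterable


def length_distribution(words: Iterable[str]) -> Dict[int, float]:
    """Return the share of words for each length, ordered by length."""
    counts = Counter(len(word) for word in words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: counts[length] / total for length in sorted(counts)}
```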

Guessing speed

As described in chapter 3, a full password cracking solution must generate guesses which are then hashed and tested. If either the guessing speed or the cracking speed is slower than the other, the throughput will be limited by the slowest of them. Therefore, the guessing speed for each program is a useful metric to determine the effectiveness of a program.


4.3.2 Used sets of data

This section presents the sets of data used for the tests, as well as some motivation why each set was used.

LinkedIn

In 2012, social networking site LinkedIn was attacked and millions of password hashes were leaked. This thesis used the 60572669 unique passwords which have been cracked and can be found on hashes.org, as of February 2019. The LinkedIn data set has been used in multiple different research papers [12] [20]. The passwords can be assumed to have been picked to protect some mildly sensitive user information, given the purpose of the website. For these reasons, LinkedIn was considered to be suitable for this thesis.

RockYou

Gaming website RockYou was targeted in 2009 and 14341564 unique plain-text passwords were leaked. RockYou was used for a few of the tests in this thesis. This is because it is commonly used by researchers [9] [20], and easily accessible by real-world attackers.

LastFm

In 2016, music website LastFm was breached and several million password hashes were leaked. Roughly 20 million of them have been cracked and can be found on hashes.org. This dataset was used in a couple of the tests, and is suitable as it is only a few years old and sufficiently large.

Bloggtrafik

Swedish website Bloggtrafik was hacked and shut down in 2016. Roughly 144400 users had their credentials leaked in plain text, but only 85027 unique passwords were used between them. This thesis used the Bloggtrafik data, as it is of a reasonable size, and contains passwords set by Swedish-speaking users.

4.3.3 Studied scenarios and tests

This section presents the specific tests which were used to study the effectiveness. The tested programs were Hashcat, OMEN, PassGAN, PCFG, and PRINCE, as listed in section 3.3. TarGuess was excluded as it had no available implementation and because no dataset containing personal information could be used for this thesis. How the programs were executed in each test, as well as specifications for the used hardware, can be found in the appendix.

Unless otherwise noted, the set of word mangling rules used by Hashcat was OneRuleToRuleThemAll. OneRuleToRuleThemAll is a set of rules experimentally found by combining the top-performing rules from various other well-known sets of rules [25].

Because of computational limitations, PassGAN had to be limited to only consider passwords up to a certain length. This means that it did not have exactly the same input as the other programs. The number of passwords ignored by PassGAN, as well as how much of the total training data this represents, has been noted for each test.

Test to measure guessing speed

The purpose of this test was to study the guessing speed of each program. Each program was run for 10 minutes, and the number of guesses made during that time was measured. From this, the number of guesses made per second was calculated. If the program required any sort of training, the time taken for that was measured as well.

In addition to the guessing speed metric, the speed for some common hash functions was calculated as well. This was determined by running Hashcat in its benchmark mode. This does not test all of the hash functions available in Hashcat, but it provides a reasonable sample.

Test with password list

The purpose of this test was to study the quality of the guesses made by each password guesser, when run in similar circumstances. In particular, the effectiveness of each program when using a password list for training or as a dictionary was measured. The considered metrics were uniqueness of guesses, length distributions, and number of passwords cracked, as described in section 4.3.1. Since the metrics are hardware-independent, each program was set to generate the same number of guesses, in this case 100 million.

This test used passwords from LinkedIn as both training data and test data. First, the passwords were split into 10 different groups, each assigned a number between 0 and 9. Each group was defined to be test data for the corresponding test session, with the remaining 9 groups defined as training data for the same test session. This means that the training data for each of the 10 test sessions was 54515402 words large, and each set of test data was 6057267 words large. In general, this is known as k-fold cross-validation with k = 10. This approach is common in validation of machine learning models, and the purpose is to reduce bias from data selection. The choice k = 10 is considered to be standard in the field, but other numbers are common as well [18].
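The split described above can be sketched as follows. The thesis does not state how passwords were assigned to the 10 groups; assigning by position modulo k is an assumption made here for illustration, and `k_fold` is a hypothetical helper name.

```python
from typing import Iterator, List, Sequence, Tuple


def k_fold(passwords: Sequence[str],
           k: int = 10) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield one (training, test) pair per fold.

    Assumption: the password at index i belongs to group i % k.
    """
    for fold in range(k):
        test = [p for i, p in enumerate(passwords) if i % k == fold]
        train = [p for i, p in enumerate(passwords) if i % k != fold]
        yield train, test
```

Each password ends up in the test data of exactly one session and in the training data of the other k - 1 sessions.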

Hashcat used the set of word mangling rules known as ‘Best64’. This is because Best64 is shipped with Hashcat, and is commonly featured in usage examples in the documentation and on the web. Some of the literature in which Hashcat is used use the Best64 rules [13] [12]. For these reasons, the Best64 set of rules can reasonably be called the ‘default’ set of rules.

PassGAN was limited to only consider passwords of length 16 or less. This resulted in less than 1% of the passwords being removed.

Test with dictionary words and password list

The purpose of this test was to study to which extent regular dictionary words are useful as a basis for password guesses. Since the programs which use some training data (PassGAN, PCFG, OMEN) are assumed to be run on lists of passwords, it was decided to see how they perform when given a list of regular words. As described in section 3.2, regular words are often found in passwords, so it can be assumed that both regular dictionaries and password lists are useful. The metric used to study the effectiveness of each option was number of cracked passwords per second, as described in section 4.3.1.

This test used passwords from LastFm as test data, and passwords from RockYou as training data.

The traditional dictionary was constructed by fetching data from the articles on English-language Wikipedia. A copy of all current articles (as of January 2019) was downloaded and each string of alpha characters between two whitespace characters was extracted. This approach has been used by other researchers to construct a general dictionary [21]. English-language Wikipedia contains technical terms, names, and some common numbers, which makes it more suitable for password guessing than a general English language dictionary. Duplicated words were ignored. This resulted in a list of 14353606 words being used for this test.
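The extraction step can be sketched as below. This is not the script used for the thesis: the function name is hypothetical, and using `str.isalpha()` (which also accepts non-ASCII letters such as ‘ö’) as the definition of ‘alpha characters’ is an assumption.

```python
from typing import Iterable, List


def extract_words(lines: Iterable[str]) -> List[str]:
    """Collect whitespace-delimited, purely alphabetic tokens once each."""
    seen = set()
    words: List[str] = []
    for line in lines:
        for token in line.split():
            # Tokens with digits or punctuation attached are skipped.
            if token.isalpha() and token not in seen:
                seen.add(token)
                words.append(token)
    return words
```

Note that a token such as ‘people,’ is dropped rather than trimmed, since it is not made up of alpha characters only.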

Three different tests were run, using different dictionaries.

• One run with only the Wikipedia dictionary as input.

• One run with the RockYou dump as input.

• One run with the RockYou dump and Wikipedia dictionary merged together.

Each program ran for 10 minutes, for each test.

PassGAN was limited to only consider passwords of length 30 or less. This resulted in 0.36% of the words being removed from the Wikipedia dictionary, 0.03% from the RockYou set, and 0.20% from the combined set.


Test using dictionaries of different languages

The purpose of this test was to see to which extent the language of the dictionary influences the effectiveness of the password guessers. The research mentioned in section 3.2.5 showed that a dictionary in the native language of the targeted user could yield better results. To test this, each program targeted passwords from a Swedish website with a dictionary in Swedish and a dictionary in English. The metric used to study the effectiveness of each option was number of cracked passwords per second, as described in section 4.3.1.

The dictionaries picked for this test were ‘English (British)’ and ‘Swedish’, found in a repository on GitHub.¹ Including all forms, the Swedish dictionary included 400468 words and the English dictionary included 145390 words.

Each program ran for one hour, for each test.

PassGAN was limited to only consider passwords of length 30 or less. This resulted in 0.002% of the words in both the English and the Swedish dictionary being removed.

Test against passwords with a minimum length policy

The purpose of this test was to study the effectiveness of the programs when targeting passwords of a known minimum length. In this test, each program targeted passwords of a known minimum length, using either other passwords of the same minimum length or all the available passwords. The metric used to study the effectiveness of each option was number of cracked passwords per second, as described in section 4.3.1.

This test used passwords from LastFm as test data and training data. Each password in the set was put into one or more of the following groups:

• All of the passwords (10234857)

• Passwords of length 8 or more (7406354)

• Passwords of length 12 or more (1099628)

• Passwords of length 16 or more (124271)

• Passwords of length 20 or more (16652)

The current NIST recommendation for password policies is that they should require at least 8 characters [11], which is why it was picked as the smallest length. Passwords of length 20 or longer are fairly uncommon (around 0.16% of the LastFm passwords), which is why it was decided to not try to target any passwords longer than that. The steps between were picked to give some data for the policies in between the shortest and longest policy.


Each group was split into two random groups, one of which was training data (or dictionary) and one of which was test data (targeted passwords). Excluding the test set based on all of the passwords, each test set was targeted twice by each program: once with training data based on all of the passwords, and once with training data based on only the passwords of the corresponding minimum length.
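The grouping and the random split can be sketched as follows. Several details are assumptions made for illustration: that a group for minimum length m contains all passwords with at least m characters, that the two random groups are of equal size, and the helper names and the fixed seed.

```python
import random
from typing import Dict, List, Sequence, Tuple


def group_by_min_length(passwords: Sequence[str],
                        minimums: Sequence[int] = (8, 12, 16, 20)
                        ) -> Dict[int, List[str]]:
    """Put each password into every group whose minimum length it meets."""
    return {m: [p for p in passwords if len(p) >= m] for m in minimums}


def split_in_two(items: Sequence[str],
                 seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of the items and cut it into training and test halves."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```

A password of length 17 would, under these assumptions, appear in the groups for 8, 12, and 16, but not in the group for 20.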

Each program ran for 10 minutes, for each test.

PassGAN was limited to only consider passwords of length 30 or less. This resulted in 0.006% of the passwords of length 8 or longer being removed, 0.04% of the passwords of length 12 or longer being removed, 0.33% of the passwords of length 16 or longer being removed, and 2.5% of the passwords of length 20 or longer being removed.

4.4 The implementation of the system

This section describes how the system was implemented in the Cyber Range at FOI, and thus answers the fourth research question, ‘how can an automated password guessing system be implemented in the Cyber Range at FOI?’.

4.4.1 Overview

This section will refer to the system implemented in the FOI Cyber Range as Svedcrack. Svedcrack is a concatenation of ‘Sved’ (the sub-system in the Cyber Range which uses the program) and ‘crack’.

Section 4.4.2 describes the requirements on Svedcrack. Sections 4.4.3 through 4.4.5 describe the design of Svedcrack, and how the additional tests were performed.

4.4.2 The requirements

The requirements of the automated system were discussed, and it was decided that the following things should be considered by Svedcrack.

• The targeted hash (may be multiple)

• Hash function used (A)

• Time to run (t)

• Minimum password length policy (l)

• The native language of the user (s) (simplified to ‘International’ and ‘Swedish’)

Both the theory and the results of the other tests showed that these were reasonable things to consider.


4.4.3 Expected number of guesses

Using the results from the test described in section 4.3.1, it is possible to use A and t to calculate the expected number of guesses a program will be able to test in a given time.

$E_g(\mathit{program}, t)$ = the expected number of guesses a program will make during the given time.

$E_h(A, t)$ = the expected number of hash function calculations which can be made during the given time.

$E_t(\mathit{program}, A, t)$ = the expected number of tested guesses during the given time, calculated as

$$E_t(\mathit{program}, A, t) = \min\bigl(E_g(\mathit{program}, t),\ E_h(A, t)\bigr).$$

This is because either the guessing speed or the hash calculation speed will be a bottleneck for the throughput of actual tested guesses.

Calculating $E_t(\mathit{program}, A, t)$ for each $\mathit{program} \in \{\mathit{hashcat}, \mathit{pcfg}, \mathit{prince}\}$ yields the expected number of tested guesses for each program.
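As a worked illustration of the minimum above, the sketch below assumes that both the guessing speed and the hash speed are constant over the run, so that the expected counts are simply rate multiplied by time. That linearity, the function names, and the rates used in the example are assumptions, not measured values.

```python
def expected_guesses(guesses_per_second: float, seconds: float) -> float:
    """E_g: guesses a program is expected to produce in the given time."""
    return guesses_per_second * seconds


def expected_hashes(hashes_per_second: float, seconds: float) -> float:
    """E_h: hash calculations expected to be possible in the given time."""
    return hashes_per_second * seconds


def expected_tested(guesses_per_second: float,
                    hashes_per_second: float,
                    seconds: float) -> float:
    """E_t: the slower of the two stages limits the tested guesses."""
    return min(expected_guesses(guesses_per_second, seconds),
               expected_hashes(hashes_per_second, seconds))
```

With an invented guesser producing 1000000 guesses per second against a hash function that allows 50000 calculations per second, a 600 second run would test 30000000 guesses, since the hash function is the bottleneck.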

4.4.4 Expected number of correct guesses

From running tests, it is possible to determine how many passwords a program is able to guess after a certain number of guesses. When you divide this number by the total number of passwords in the set, you get the percentage of the passwords cracked. That percentage could also be seen as the probability of a program being able to crack a single password after the number of guesses, under the assumption that a targeted password follows the same statistical distribution as the passwords in the training data and the test data.

This reasoning was applied when designing Svedcrack. By determining a function which predicts the probability of a program cracking a password hash after a certain number of guesses, you can find out which program has the highest probability of successfully cracking a password hash. Let this function be called $E_c(\mathit{program}, \mathit{guesses}, l, s)$, where $l$ and $s$ are the password policy minimum length and the language of the user, respectively.

This function was determined by plotting the number of guesses against the percentage of passwords guessed so far, and then determining the exponential function $f(x) = a \cdot e^{-b \cdot x} + c$ which best fitted it in the least-squares sense. This function was used because of previous observations that the curve tended to increase rapidly in the beginning before it stopped increasing completely. $E_c(\mathit{program}, \mathit{guesses}, l, s)$ can now be described as $E_c(\mathit{program}, \mathit{guesses}, l, s) = f_{tls}(\mathit{guesses})$, where $f_{tls}(x)$ is $f(x)$ with different parameters $a$, $b$, and $c$, depending on the program, targeted language, and targeted length.
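A least-squares fit of this form can be sketched with the Python standard library only. The thesis does not state which fitting procedure was used; the approach below is an assumption chosen for illustration. It uses the fact that for a fixed b the model is linear in a and c, so those two have a closed-form solution, and then searches over b on a logarithmic grid that is refined a few times. The function names and the default search bounds are hypothetical.

```python
import math
from typing import Sequence, Tuple


def _fit_a_c(xs: Sequence[float], ys: Sequence[float],
             b: float) -> Tuple[float, float, float]:
    """For a fixed b, solve the linear least-squares problem for a and c."""
    us = [math.exp(-b * x) for x in xs]
    n = len(xs)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    var_u = sum((u - mean_u) ** 2 for u in us)
    if var_u == 0.0:
        # The exponential term is constant, so only c can be fitted.
        return 0.0, mean_y, sum((y - mean_y) ** 2 for y in ys)
    a = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys)) / var_u
    c = mean_y - a * mean_u
    sse = sum((a * u + c - y) ** 2 for u, y in zip(us, ys))
    return a, c, sse


def fit_saturating_curve(xs: Sequence[float], ys: Sequence[float],
                         b_low: float = 1e-12, b_high: float = 1.0,
                         rounds: int = 6,
                         steps: int = 60) -> Tuple[float, float, float]:
    """Fit f(x) = a * exp(-b * x) + c and return (a, b, c)."""
    lo, hi = math.log(b_low), math.log(b_high)
    best = (float("inf"), 0.0, b_low, 0.0)  # (sse, a, b, c)
    for _ in range(rounds):
        width = (hi - lo) / steps
        for i in range(steps + 1):
            b = math.exp(lo + i * width)
            a, c, sse = _fit_a_c(xs, ys, b)
            if sse < best[0]:
                best = (sse, a, b, c)
        # Zoom in around the best b found so far.
        centre = math.log(best[2])
        lo, hi = centre - width, centre + width
    _, a, b, c = best
    return a, b, c
```

For a curve that rises and then levels out, the fitted a is negative and c is the level at which the curve saturates.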

By first calculating $\mathit{guesses} = E_t(\mathit{program}, A, t)$ as described in section 4.4.3, it is


4.4.5 Tests to determine the parameters for the curves

The results from the other tests (section 4.3.3) were intended to be used to find the parameters to $f(x)$, as described in section 4.4.4. Unfortunately, because of how one of the programs generated data, the test results could not be used as intended and had to be redone. However, the test results showed that some of the programs performed better than others, so the new tests could focus only on the relevant programs.

Specifically, the programs Hashcat (with dictionary rules), PRINCE, and PCFG were considered. The programs were tested against every combination of language (‘International’ and ‘Swedish’) and minimum password length in the set {8, 12, 16, 20}. This means that 8 different targets were considered. Each program ran for one hour before the execution was terminated.

The Swedish data was combined from leaks from anstalten.nu, bloggtrafik.nu, gratisbio.se, and an unknown leak commonly known as ‘hoppstylta’, for a total of 1068211 passwords. The international data was combined from leaks from elitehacker, myspace, faithwriters, phpbb, hak5, honeynet, rockyou, muslimmatch, singles.org, and an unidentified porn site, for a total of 34424159 passwords. They were picked because they contained duplicate passwords, unlike the previously considered sets of data. Both the Swedish and the international data were put into groups containing only passwords of the minimum length and of the target language, similar to how the data was treated in the test described in section 4.3.3. Each of these sets was split into two, one for training and one for testing.

4.4.6 Putting together the material into a program

Once a function which describes the expected number of correct guesses when given the input listed in section 4.4.2 had been made, Svedcrack could be implemented. Svedcrack simply calculates the expected probability that a program will be able to guess the password(s) in the given time, and starts the one with the highest probability.
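The selection step can be sketched as below. This is not the Svedcrack source code: the function names are hypothetical and any curve parameters passed in are placeholders. It assumes that each candidate program has a fitted parameter triple (a, b, c) for the targeted language and minimum length, and a number of guesses it is expected to have tested in the given time.

```python
import math
from typing import Dict, Tuple

# (a, b, c) of the fitted curve f(x) = a * exp(-b * x) + c
Params = Tuple[float, float, float]


def success_probability(params: Params, guesses: float) -> float:
    """Evaluate the fitted curve at the expected number of tested guesses."""
    a, b, c = params
    return a * math.exp(-b * guesses) + c


def pick_program(curves: Dict[str, Params],
                 tested_guesses: Dict[str, float]) -> str:
    """Return the name of the program with the highest predicted probability."""
    return max(
        curves,
        key=lambda name: success_probability(curves[name], tested_guesses[name]),
    )
```

One consequence of this design is that the chosen program can change with the time budget: a program whose curve rises quickly may win for short runs, while one with a higher saturation level may win for long runs.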


5 Result

This chapter describes the results.

5.1 How password guessing is performed

This section describes the results for the first research question, ‘how are password guessing attacks performed by people with professional experience?’.

5.1.1 Interviews

Two interviews were carried out. The names and workplaces of the interviewees have been removed.

Interview One

The first interview was held at FOI and took about 45 minutes. The interviewee had been working with password cracking for about six years, and they mostly targeted hard drive encryption schemes. The questions listed in section 4.1.1 were not followed strictly, as the answer to the generative question also answered most of the directive questions.

Summary When asked to describe a previous job or assignment related to password cracking, the interviewee stated that they often started out by running programs with default and known ‘generic’ settings. This was done until you had gathered more information related to the user whose password you wish to crack. When asked about what kind of information that would be, names and numbers related to the user were mentioned. The names could be the names of the user, family members, places, addresses, and pets. The numbers could be dates which are important to the user.

When asked about programs, Hashcat, John the Ripper, PRINCE, and Passware Kit were mentioned. When asked to compare them, the interviewee said that Hashcat was generally the best. John the Ripper was described as having a good community with many plugins, so it was sometimes used to target systems which were not supported by Hashcat. Passware Kit was used by colleagues of the interviewee, and was good as it was heavily automated and simple to use. PRINCE was also used by the colleagues of the interviewee, and as far as he knew it was considered to be good. The interviewee said that he preferred more control over the programs he used. Rainbow tables were also mentioned, as something they had been using earlier but which they stopped using about a year ago. The reason was that many contemporary hashes they found used salts, which is a way to defeat rainbow tables. It was also mentioned that modern hardware checks hashes so quickly that having large pre-computed tables did not save much time. The interviewee also described a system in which two networks compete to create and detect password guesses, which he said was interesting and worth looking into. The interviewer assumed this was a description of a GAN, even though it was not mentioned by name.

When asked about how the programs were being used, the interviewee said that a cracking session sometimes started with plain brute-force to remove the shortest and simplest candidates. A cracking session often ended with brute-force as well, when all the easier options had been exhausted. In between, word lists with mangling rules were often used. The rules they used were something they had developed by themselves over the years. The word lists used were created for each specific targeted user, but also contained a generic Swedish dictionary, names, idioms and common misspellings. They also categorised the words into different classes, which they used differently, but the interviewee was not allowed to say exactly how they did that. The interviewee mentioned that you could be creative when you created the word list. The interviewee also pointed out multiple times in the interview that a feedback loop was important, and that new information discovered should be added to the dictionary. It was also mentioned that the speed of the hash function determined how they used their programs, and that a slower guesser could be used for slow hashes.

A topic which was brought up was training data. The interviewee mentioned that the model of passwords that many programs used was based on the RockYou data leak. He said that RockYou was not very good as it was old at this point, and that it was also too web-specific. Since they typically try to guess passwords for decrypting hard drives, their targets picked different passwords than they probably would for a website. He mentioned that the RockYou data set did not entirely correlate to how the users they targeted picked their passwords. The interviewee did mention that the good thing about using database leaks was that the passwords were picked by real humans, and thus could show some tendencies people have when they pick passwords.

When asked about hardware used, the interviewee said that most of the work is done on GPUs, specifically GTX 1080 Tis. They had previously used custom FPGAs, which they stopped using as they were too expensive for what they offered. The interviewee stated that building a computing cluster with CPUs might be relevant, since one of the new hashing functions they encountered required a lot of memory, which GPUs did not have enough of.

The topic of password content was brought up at multiple points in the interview. Personal information was something they often used when guessing passwords. Previous passwords were something they often looked at as well, if they could find them. The interviewee said that they typically try to target the weakest point of the targeted user, since it might reveal information or other passwords. It was mentioned that you could often see passwords of a target evolve over time, with new parts being added at the end. For that reason, the interviewee considered it to be important to label and classify the personal information they found, since the point in time it was used might be relevant. It was also mentioned that there might be cultural differences between how users of different groups select their passwords. The groups mentioned were age and type of work, but other profiling might be relevant as well. Keyboard walks were mentioned as something they had found in passwords, but it was not very common to encounter them.

Interview Two

The second interview was carried out at a company and took about 25 minutes. The interviewee had been working with password cracking for multiple years, although it was a fairly minor part of his work. Most of their work was related to examining IT-related crimes, so hard drive encryption which used some password was often targeted.

Summary When asked about the hardware they used, the interviewee stated that they had two GTX 1080 Tis.

When asked about programs, the interviewee mentioned that Hashcat was often used on the lowest level. However, they did not use Hashcat directly, but instead used a graphical wrapper known as Password Recovery Toolkit (PRTK). PRTK let the attackers build combinations of mangling rules with a graphical editor, and manage and merge different lists of words. The interviewee mentioned that it was not always entirely clear how PRTK called on Hashcat, but that it at least started with exhausting easy searches before it moved on to searches which would take a longer time.

When asked about how the programs were being used, dictionaries and mangling rules were mentioned. They used a regular dictionary which also included many combinations of digits as well as some personal data of the victim. The personal information could include things taken from the social media of the user, such as names (both of the target and people related to the target) and important dates. This also answered the question about what information they usually have access to. The interviewee mentioned that you could be creative when you came up with the word list.

When asked about which hash functions they usually encounter, the interviewee mentioned NTLM. But since his work was mostly related to showing weaknesses in the systems he tested, it was often enough to simply tell the operators that they had used an insecure hash function, and in such cases there was no need to try to guess the password.

The topic of password re-use was brought up. The interviewee stated that it was not something they used directly themselves, but his colleagues had used old passwords of the target sometimes. Typically, users appended stuff to either the beginning or the end of an old password they had used.

The interviewee mentioned that it was easier to target a group of people instead of an individual, since in a group it is likelier that one of them will use a bad password.

5.2 Overall design of password guessing programs

This section describes the results for the second research question, ‘what are the overall designs of current state-of-the-art automated password guessing programs?’.

5.2.1 Summary

Table 2 summarises the results. An ‘A’ in the table means that the attribute is automated, a ‘D’ means that it is used directly, an ‘I’ means that it is used indirectly, and a ‘-’ means that it is unused.

5.2.2 Hashcat

Hashcat is not intended to be an automated program, and none of the attributes are considered in an automated fashion. An attacker that uses Hashcat can, however, decide if and how to use most of the attributes.


Attribute         Hashcat  OMEN(+)  PassGAN  PCFG  PRINCE  TarGuess
Structure         D        A        A        A     -       A
Dictionary        D        I        I        I     D       -
Sister PW         I        D        I        D     I       A
Policy: Length    D        D        D        I     I       I
Policy: Chars     D        I        I        I     I       I
PII: Names        I        D        I        I     I       A
PII: Birthdays    I        D        I        I     I       A
PII: IDs          I        I        I        I     I       A
PII: Emails       I        D        I        I     I       A

Table 2: The extent to which different password guessers consider attributes commonly found in passwords

Password Structure

An attacker using Hashcat can feed it password structures, and Hashcat will proceed to make guesses which fulfil those structures. This is what the ‘brute force’ mode is for. A structure is specified in terms of a ‘mask’, which is a sequence of character groups that matches the passwords to be guessed. Since the mask fed to Hashcat is used directly and purposefully, this attribute is considered to be used directly. It is not possible to use Hashcat to come up with password structures automatically.
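To make the notion of a mask concrete, the following is a minimal Python sketch of how a mask can be expanded into guesses. It is not Hashcat's implementation: the function names are hypothetical, only the built-in placeholders ?l, ?u, ?d and ?s are covered, and the guesses are produced in plain lexicographic order, which is not necessarily the order Hashcat uses.

```python
import itertools
import string

# Subset of Hashcat's built-in character groups. The mapping below is an
# assumption made for this sketch; ?s is taken to be space plus punctuation.
CHARSETS = {
    "l": string.ascii_lowercase,
    "u": string.ascii_uppercase,
    "d": string.digits,
    "s": " " + string.punctuation,
}

def parse_mask(mask):
    """Turn a mask such as '?u?l?d' into a list of character groups."""
    groups = []
    i = 0
    while i < len(mask):
        if mask[i] == "?":
            if i + 1 >= len(mask):
                raise ValueError("mask ends with a dangling '?'")
            key = mask[i + 1]
            # '??' stands for a literal question mark.
            groups.append("?" if key == "?" else CHARSETS[key])
            i += 2
        else:
            # Any other character is a fixed, literal position.
            groups.append(mask[i])
            i += 1
    return groups

def expand_mask(mask):
    """Yield every guess that fits the mask, one position at a time."""
    for combo in itertools.product(*parse_mask(mask)):
        yield "".join(combo)

def keyspace(mask):
    """Number of guesses the mask describes."""
    size = 1
    for group in parse_mask(mask):
        size *= len(group)
    return size
```

For example, keyspace("?l?l?d") is 26 * 26 * 10 guesses, which illustrates why the attacker has to choose masks purposefully: every added position multiplies the search space.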

Dictionary Words

An attacker using Hashcat can feed it a dictionary which will be used as the basis for the guesses, which is what the ‘Straight’ mode is for (as described in section 3.3.1). It is therefore reasonable to claim that this attribute is used directly by the program.

Sister Passwords

It is possible that the list of words which Hashcat uses contains sister passwords of the user, which will then be used as a basis for new password guesses to be generated. Hashcat considers these to be regular words, and not sister passwords, so no distinction is made. For that reason, this attribute is used indirectly by Hashcat.

Policies

When it comes to the sub-attribute length, it is possible to set a maximum length in some of the modes. This is one reason to classify length as something used directly by Hashcat. When it comes to the sub-attribute special characters, it is not possible to specify a password policy to Hashcat and have it generate only guesses which are accepted by the policy. You can, however, specify different password structures (masks) which will only match passwords allowed by a given policy. This requires some additional work, but it can be said that Hashcat uses this sub-attribute directly.
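The additional work mentioned above amounts to selecting masks by hand. As an illustration, the Python sketch below checks whether every guess produced by a mask would be accepted by a simple policy. The policy format (a length range plus a set of required character groups) and the function names are assumptions made for this example, and only masks built purely from ?x placeholders are supported.

```python
def mask_classes(mask):
    """Return the character-group letter of each position in a mask.

    Only masks made up entirely of two-character placeholders such as
    '?u?l?d' are supported in this sketch.
    """
    if len(mask) % 2 != 0 or any(c != "?" for c in mask[0::2]):
        raise ValueError("sketch only supports masks made of ?x placeholders")
    return list(mask[1::2])

def mask_satisfies(mask, min_len, max_len, required):
    """True if every guess generated from the mask fulfils the policy.

    'required' is an iterable of group letters (for example 'uld') that
    must each occur in at least one position of the mask.
    """
    classes = mask_classes(mask)
    length_ok = min_len <= len(classes) <= max_len
    groups_ok = set(required) <= set(classes)
    return length_ok and groups_ok
```

With a policy of 8 to 16 characters requiring an upper-case letter, a lower-case letter and a digit, the mask ?u?l?l?l?l?l?d?d passes, while ?l?l?l?l?l?l?l?l does not, since none of its guesses can contain an upper-case letter or a digit.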

Personal Information

The dictionary which is used by Hashcat can contain personal information, which is then used to create password guesses. However, Hashcat does not distinguish between this attribute and, for example, sister passwords, which is why this attribute is classified as being used indirectly by Hashcat.

5.2.3 OMEN and OMEN+

OMEN comes up with password structures and corresponding candidates automatically, and allows length and some personal information to be specified in addition to that.

OMEN uses two phases, a training phase and a generation phase. The training phase takes a set of passwords and comes up with a Markov chain, which is used to generate password candidates in the generation phase. Password structure can therefore be considered to be handled automatically.
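The two phases can be illustrated with a small Python sketch. It is a deliberately simplified stand-in and not OMEN's actual algorithm: it uses a first-order (single character) Markov chain, the boundary markers and function names are hypothetical, and candidates are enumerated with a plain best-first search so that more probable guesses come out first.

```python
import heapq
import math
from collections import Counter, defaultdict

# Hypothetical boundary markers, assumed not to occur in any password.
START, END = "^", "$"

def train(passwords):
    """Training phase: estimate P(next symbol | previous symbol)."""
    counts = defaultdict(Counter)
    for pw in passwords:
        symbols = [START] + list(pw) + [END]
        for prev, nxt in zip(symbols, symbols[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {c: n / total for c, n in followers.items()}
    return model

def generate(model, max_len=12):
    """Generation phase: yield (candidate, probability), most probable first."""
    # Heap entries are (negative log-probability, prefix, last symbol).
    heap = [(0.0, "", START)]
    while heap:
        cost, prefix, last = heapq.heappop(heap)
        if last == END:
            yield prefix, math.exp(-cost)
            continue
        for nxt, p in model.get(last, {}).items():
            if nxt == END:
                heapq.heappush(heap, (cost - math.log(p), prefix, END))
            elif len(prefix) < max_len:
                heapq.heappush(heap, (cost - math.log(p), prefix + nxt, nxt))
```

Training on the list ab, ab, ac gives a model in which ‘a’ is followed by ‘b’ twice as often as by ‘c’, so the candidate ab is generated before ac. Since extending a candidate can only lower its probability, the best-first search never emits a less probable candidate before a more probable one.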

It is possible to include normal dictionary words in the data set which is used in the training phase. However, this is not intended, and the program will consider the words as passwords and generate new words which follow the same structure. Since including dictionary words in the data set would still have some effect on the guesses generated, dictionary words are considered to be used indirectly.

Sister Passwords

Since you feed passwords to the program during the training phase and the passwords will be used directly when generating the Markov model, this can be considered to be used directly. It is worth noting that the program does not make any distinction between the sister passwords of a targeted user and a password used by someone else.

Policies

The implementation used does not allow the attacker to specify requirements on special characters, but it is possible to specify the length of the password candidates to be guessed. Similar to PCFG, it is also possible to use only passwords which fulfil the policies for the training phase, which makes it more likely that password candidates allowed under the policy get generated. So the sub-attribute length is used directly by the program, while the sub-attribute special characters is used indirectly.
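Restricting the training data to passwords that fulfil a policy can be done as a pre-processing step before the training phase. The Python sketch below shows one way to do it; the policy parameters and function names are assumptions made for the example and are not part of OMEN or PCFG.

```python
import string

def complies(password, min_len=8, max_len=64,
             require_digit=True, require_special=True):
    """Check a single password against a simple, hypothetical policy."""
    if not min_len <= len(password) <= max_len:
        return False
    if require_digit and not any(c in string.digits for c in password):
        return False
    if require_special and not any(c in string.punctuation for c in password):
        return False
    return True

def filter_training_set(passwords, **policy):
    """Keep only the passwords that the policy would have accepted."""
    return [pw for pw in passwords if complies(pw, **policy)]
```

A model trained on the filtered list has only seen policy-compliant passwords, which biases it towards compliant candidates without guaranteeing them. This is why the special characters sub-attribute counts as used indirectly.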

Personal Information

OMEN+ (which is an extension of OMEN and is part of the implementation used) works with so-called ‘hint files’, which are to be used as hints for targeted password guessing. These files can contain the first name, username, date of birth, and email address of the user. These attributes can therefore be said to be used directly by the program. Other personal attributes can still be included in the normal set of passwords and are then used indirectly, similar to dictionary words.

5.2.4 PassGAN

PassGAN comes up with password structures and corresponding candidates automatically, and allows length to be specified as well.

PassGAN uses a training phase and a guessing phase. During the training phase, a list of passwords is used to train the model that is then used to generate passwords during the guessing phase. Password structure can therefore be considered to be used automatically by PassGAN.

It is possible to include normal dictionary words in the data set which is used in the training phase. However, this is not intended, and the program will consider the words as passwords and generate new words which follow the same structure. Since including dictionary words in the data set would still have some