EMSEC. Analyzing and improving OMEN, a Markov-Model based Password Guesser. Fabian Wilhelm Angelstorf


Analyzing and improving OMEN, a Markov-Model based Password Guesser

Fabian Wilhelm Angelstorf
Master's Thesis, May 22
Chair for Embedded Security (EMSEC)
Prof. Dr.-Ing. Christof Paar
Advisor: Dr. Markus Dürmuth


Abstract

Password-based authentication is, and will remain, the most widely used technique of user authentication for the foreseeable future. However, human-generated passwords are often weak and have a rich structure; therefore, they are vulnerable to guessing attacks. Recently, advanced password guessing algorithms have been proposed that increase the effectiveness of guessing attacks. In this thesis, we revise OMEN, an efficient password guessing algorithm based on Markov-models, and provide it with improved and new functionalities. In an extensive evaluation, we point out the optimal parameters and show that the improved version cracks up to 80% of passwords within 10 billion guesses, 11% more than the original version does.


Declaration

I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma of the university or other institute of higher learning, except where due acknowledgment has been made in the text.

Erklärung

Hiermit versichere ich, dass ich die vorliegende Arbeit selbstständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel benutzt habe, dass alle Stellen der Arbeit, die wörtlich oder sinngemäß aus anderen Quellen übernommen wurden, als solche kenntlich gemacht sind und dass die Arbeit in gleicher oder ähnlicher Form noch keiner Prüfungsbehörde vorgelegt wurde.

Fabian Wilhelm Angelstorf

6

Contents

1 Introduction
  1.1 Contribution
  1.2 Organization of this Thesis
2 Background and Related Work
  2.1 Password Basics
  2.2 Password Guessing
    2.2.1 Test Metric
    2.2.2 Brute-Force Attack
    2.2.3 Dictionary Attack
    2.2.4 Password Guessing with Markov-models
    2.2.5 Password Guessing with Probabilistic Grammars
  2.3 Common Password Guessers
    2.3.1 John the Ripper
    2.3.2 Hashcat
  2.4 Password Strength Estimation
3 Concept
  3.1 Training the Markov-Model
  3.2 Probability Discretization
  3.3 Smoothing
  3.4 Level Chains
  3.5 Length Scheduling
  3.6 Variable Parameters
  3.7 Password Enumeration Modes
  3.8 Comparison to old OMEN-Version
4 Implementation
  4.1 Structure and Modules
  4.2 createng
    4.2.1 Datatypes
    4.2.2 Procedure
  4.3 enumng
    4.3.1 Datatypes
    4.3.2 Procedure
    4.3.3 Password Enumeration
5 Test Setup
  5.1 Datasets
    5.1.1 Training & Testing Sets
    5.1.2 Ethical Consideration
  5.2 Parameters
6 Results
  6.1 Parameter Evaluation
    6.1.1 n-gram
    6.1.2 Smoothing
    6.1.3 Length Scheduling
    6.1.4 Alphabet
    6.1.5 End Probabilities
    6.1.6 Maximum Level
    6.1.7 Summary
  6.2 Datasets Results
  6.3 Result Comparison
7 Conclusion
  7.1 Summary
  7.2 Future Work
A Acronyms
B Appendix
  B.1 Users Guide
    B.1.1 Basic Usage
    B.1.2 Advanced Usage
    B.1.3 Smoothing Configuration
  B.2 Additional Program Modules
    B.2.1 evalpw
    B.2.2 alphabetcreator
  B.3 Used Libraries
  B.4 CD Content
List of Figures
List of Tables
List of Algorithms
List of Listings
Bibliography

1 Introduction

Password-based authentication is the most widely used technique for user authentication. Passwords are used constantly, both online and offline, for example to log in to accounts or operating systems, and to encrypt or decrypt local data. Recent studies show that passwords are hard to replace and will remain the primary authentication technique for the foreseeable future [3]. Passwords fulfill several essential criteria of user authentication: they are highly portable, easy to use and comprehensible for laypersons, and easy to implement for developers. However, users typically choose weak passwords, especially if they are allowed to choose freely [44]. The chosen passwords often have a rich structure, are composed of actual words [29], or are influenced by the user's native language [21]. Therefore, password-secured infrastructures are vulnerable to guessing attacks.

Theoretically, any password can be cracked by simply trying every possible combination. Such Brute-Force attacks are limited only by the time and resources an attacker can commit, but to guess long and complex passwords, a Brute-Force attack may need several years [34]. Brute-Force attacks can be improved by using dictionaries, which store actual words or leaked, real-world passwords. In 2005, Narayanan et al. [21] proposed a Markov-model based password guesser that increases the effectiveness of dictionary attacks. The underlying idea of this algorithm is that the letter distribution of human-generated passwords is not random and therefore predictable. For example, it is more likely that an English native speaker chooses the sequence "th" than "tq". The algorithm reads real-world leaked password lists and models the estimated probability of one letter following another for each combination. Based on these probabilities, one can create a list of passwords that exceed a certain threshold probability and use this list for a dictionary attack.
In this thesis, we try to learn more about the efficiency and behavior of Markov-model based password guessers, and how different parameters influence their performance. To this end, we revise an existing implementation developed by Dürmuth et al. [10] called Ordered Markov ENumerator (OMEN). The basic idea of OMEN, in contrast to the algorithm by Narayanan et al., is to enumerate passwords in (approximately) decreasing order of probability, instead of enumerating all passwords which exceed a certain threshold probability. OMEN is already more efficient than comparable advanced password guessing algorithms like the one by Narayanan et al. [21] or the password guessing modes provided by the common password cracker John the Ripper [23]. However, it does not cover all aspects of Markov-model based password guessing, and many essential parameters are fixed, even though they may not be optimal. To investigate Markov-model based password guessing further, we revise the concept of OMEN, provide it with new and improved functionalities, and present an extensive evaluation of the new implementation.

1.1 Contribution

The original Ordered Markov ENumerator (OMEN) by Dürmuth et al. [10] already produces good results, but it is rather static and several aspects of Markov-model based password guessing cannot be evaluated. For example, it only supports 3-grams¹, while larger n-grams increase the effectiveness even further, and smaller ones (i.e., 2-grams), even if less accurate, decrease the runtime. In addition, it only supports a fixed alphabet² with 72 characters, which is another critical parameter with a huge influence on the performance. In this thesis, the original OMEN is revised completely. We have implemented new and improved functionalities, allowing us to change any important parameter, including the n-gram size and the alphabet used. The overall performance has been increased as well by optimizing the n-gram creation and the password enumeration process. In addition, we have increased the usability by providing command line arguments to easily change any parameter, and by giving the user detailed feedback about the process. We provide an in-depth description of the underlying concept and describe the essential aspects of the implementation in detail. The improved OMEN has been evaluated in an extensive testing phase, using several representative values for the available parameters and various datasets. We visualize the results using various graphs and discuss the influence of the different parameters in detail. Based on the results and these discussions, we are able to point out the optimal settings, which increase the performance of OMEN significantly, and learn more about Markov-model based password guessing in general.

1.2 Organization of this Thesis

Chapter 2 provides basic knowledge on passwords in general and on password guessing in particular, and presents commonly used password guessers. This chapter also presents the ideas and results of several related works.
In Chapter 3, the concept of the improved Ordered Markov ENumerator (OMEN) is described, highlighting the new features. Subsequently, Chapter 4 introduces the implementation of the algorithm. In Chapter 5, we present the datasets and parameters used for the evaluation of the improved OMEN version. The actual tests are presented, discussed, and evaluated in Chapter 6. In addition, this chapter presents a comparison between the results of the old version by Dürmuth et al. [10] and the improved OMEN version. Finally, a conclusion is drawn in Chapter 7 and possible improvements are presented. In the Appendix, we provide a user's guide, which explains how to use OMEN.

¹ The n-gram size determines the probabilities modeled in the Markov-model. For instance, a 2-gram Markov-model models the probabilities of one character following another character, and a 3-gram Markov-model the probabilities of one character following a sequence of two characters (see Section 2.2.4).
² Only characters that occur in the alphabet are modeled in the Markov-model (see Section 2.2.4).

2 Background and Related Work

In this chapter, we present background information and work related to passwords in general and password guessing in particular. First, we provide basic knowledge about passwords, pointing out their strengths and weaknesses. Subsequently, the concepts of password guessers and techniques to improve their performance are presented. In this context, we explain Markov-models in detail and how they can be used to optimize password guessing. In addition, we present the test metric used to evaluate the performance of password guessers and introduce two commonly used password guessers.

2.1 Password Basics

Passwords provide several benefits and are therefore the most widely used technique for user authentication: they are easy to use and comprehensible for laypersons, highly portable, and easy to implement for developers. Bonneau et al. [3] present several criteria of user authentication and evaluate several authentication mechanisms. Their work shows that passwords are difficult to replace and will still be in use for some time, despite their weaknesses. In general, human-generated passwords are rather weak and have a rich structure. The chosen passwords are often composed only of lower case letters and numbers, or contain actual words [9], [29]. Even a seemingly random password is often influenced by the user's native language [21]. Therefore, most human-generated passwords are vulnerable to guessing attacks (see Section 2.2). Even though the studies by Adams et al. [1] and Yan et al. [45] show that user education can improve the strength of the chosen passwords while keeping them memorable, users still choose weak passwords if they are allowed to choose freely, as shown by Yampolskiy [44].

It is best practice to store the hash of a password, not its plain text, in a database. This prevents the plain-text passwords from leaking in case a database storing users' passwords gets compromised.
Formula 2.1 shows how to generate the hash h of a password pwd using the hash function H. To complicate precomputation attacks, a salt s is used as additional input to the hash function; s is a bitstring, randomly chosen for each password. The tuple (h, s) is then stored in the password database.

    h = H(pwd || s)    (2.1)

To authenticate a user, the hash of the entered password is computed using the same hash function H and the same salt s, and compared with the hash h stored in the database [10], [20].
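A minimal sketch of this scheme in Python, using SHA-256 as the hash function H and a 16-byte random salt (both illustrative choices; a real system would use a deliberately slow hash such as bcrypt, as discussed in Section 2.2.1):

```python
import hashlib
import os

def store_password(pwd):
    """Create the tuple (h, s) for the password database: h = H(pwd || s)."""
    s = os.urandom(16)                             # random per-password salt
    h = hashlib.sha256(pwd.encode() + s).digest()  # hash of pwd concatenated with s
    return h, s

def verify_password(pwd, h, s):
    """Recompute H(pwd || s) with the stored salt and compare with h."""
    return hashlib.sha256(pwd.encode() + s).digest() == h

h, s = store_password("omen123")
assert verify_password("omen123", h, s)
assert not verify_password("omen124", h, s)
```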

2.2 Password Guessing

In this section, password cracking approaches and techniques to improve their performance using Markov-models or probabilistic grammars are presented. In addition, we present the test metric used to compare the effectiveness of different password guessers. We do not cover attacks based on social engineering (for instance, phishing).

2.2.1 Test Metric

There are two types of guessing attacks on a password-secured infrastructure: online and offline guessing attacks. In an online attack, an attacker only has access to a login prompt provided by a server. The server can implement several techniques to slow down the attack process and limit the number of guesses an attacker can make [10]. During an offline attack, an attacker has unlimited access to the password-secured infrastructure or has access to the hash and the used salt (i.e., the tuple (h, s), see Section 2.1), and tries to recover the password (i.e., pwd). In this case, the number of guesses is restricted only by the time and resources an attacker is willing to commit [9], [10]. As in previous work by Dürmuth et al. [10], we consider offline attacks, where an attacker has access to the tuple (h, s) and tries to recover the password pwd. Hash functions are frequently used to slow down guessing attempts by increasing the computational effort needed to determine the hash value [9], [27]. Therefore, we do not evaluate a password guesser by the time taken to guess a correct password, but rather by the number of guesses required.

2.2.2 Brute-Force Attack

The simplest approach to crack a password is a Brute-Force attack, also known as Exhaustive Search. In this attack, one tries every potentially correct password by creating every possible character combination based on a given character set [25], [26].
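The exhaustive enumeration can be sketched in a few lines (the function name is illustrative, not taken from any real cracking tool):

```python
from itertools import product

def brute_force(charset, max_len):
    """Yield every character combination up to max_len, i.e. an
    exhaustive search over the given character set."""
    for length in range(1, max_len + 1):
        for combo in product(charset, repeat=length):
            yield "".join(combo)

print(list(brute_force("ab", 2)))
# ['a', 'b', 'aa', 'ab', 'ba', 'bb']
```

The candidate count grows as the sum of |charset|^length over all lengths, which is the exponential growth discussed next.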
Even though any password is theoretically crackable with a Brute-Force attack, the attack is primarily suitable for short passwords, because the number of possible combinations grows exponentially with increasing password length: using the 26 lower case characters of the Latin alphabet as character set and a password length of seven, there exist about 8 billion possible combinations (26^7). With a password length of eight characters, the number of possible combinations increases to over 208 billion (26^8). However, modern personal computers can generate several billion passwords per second [34], and special hardware can improve the performance even further [16]. Therefore, to withstand a Brute-Force attack for a time unacceptable to any attacker, a password should cover a larger character set, ideally with lower and upper case characters, numbers, and special characters, as well as a decent length [34].

2.2.3 Dictionary Attack

A variant of the Brute-Force attack is the dictionary attack. This technique tries multiple passwords stored in a precomputed wordlist in order to crack a password [26]. A basic

wordlist contains words from a real dictionary or real-world passwords which have leaked and are available to the public [8]. The dictionary can also be optimized based on information about the target [25]. Dictionary attacks are quite efficient in general, due to the poor password choices of many people. However, there is no guarantee that a dictionary attack succeeds: if the correct password is not part of the used dictionary, it will never be guessed. To cover a larger number of potential passwords even with small dictionaries, mangling rules can be used to create different variants of each word, for instance by appending digits, setting all characters to lower case, or replacing letters with similar-looking symbols (e.g., a with @). This variant of the dictionary attack is called rule-based dictionary attack [25], [37].

Offline dictionary attacks can be executed much faster if the hash value of each password is precomputed instead of computing the hash values during the attack. Such precomputed dictionaries are called rainbow tables. The concept is based on the time-memory trade-off attacks proposed by Hellman [14] in 1980, which were later improved by Rivest [30]. More recently (2003), Oechslin proposed rainbow tables [22]. However, attacks using rainbow tables can be prevented by salting the password hash (see Section 2.1).

2.2.4 Password Guessing with Markov-models

Markov-models are widely used in computer linguistics, for instance to improve speech recognition. Narayanan and Shmatikov propose a concept that uses Markov-models to improve dictionary attacks [21]. The basic idea of the algorithm is that the letter distribution in human-generated passwords is influenced by the user's native language, even if the password is not composed of dictionary words. Therefore, adjacent characters are not independent of each other. For example, the two-character sequence "th" is much more likely than "tg", and the letter "e" is very likely to follow "th".
The information to estimate the probability of a character following a prefix character sequence is read from a real-world password list and modeled in the Markov-model. This estimation is influenced by several parameters. The most important ones are the alphabet Σ, containing the supported characters, and the length of the prefix sequence. An n-gram Markov-model stores the transition probability of each character and prefix sequence combination, with a prefix sequence length of n-1 (i.e., P(c_n | c_1, ..., c_{n-1}) with c_i ∈ Σ). In addition, the initial probabilities are stored (the probabilities for the first n-1 characters, i.e., P(c_1, ..., c_{n-1}) with c_i ∈ Σ). Formula 2.2 shows the general approach to estimate the probability of a password (c_1, ..., c_m) of length m, based on the precomputed initial and transition probabilities using n-grams. The accuracy of the Markov-model, and therefore the accuracy of the probability estimation, is highly dependent on the size of the password list used for training. If the password list is too small, the transition and initial probabilities cannot be estimated accurately.

    P(c_1, ..., c_m) = P(c_1, ..., c_{n-1}) · ∏_{i=n}^{m} P(c_i | c_{i-n+1}, ..., c_{i-1})    (2.2)
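The chain rule of Formula 2.2 can be sketched with a toy 3-gram model; all probability values and dictionary names below are invented for illustration, since a real model is trained on leaked password lists:

```python
# Toy 3-gram model: initial probabilities for the first two characters,
# and conditional probabilities P(c3 | c1 c2). Values are made up.
initial = {"om": 0.02}
conditional = {("om", "e"): 0.4, ("me", "n"): 0.3}

def password_probability(pwd, n=3):
    """Chain-rule estimate of Formula 2.2 for an n-gram Markov model."""
    p = initial[pwd[: n - 1]]               # P(c1, ..., c_{n-1})
    for i in range(n - 1, len(pwd)):
        prefix = pwd[i - n + 1 : i]         # the previous n-1 characters
        p *= conditional[(prefix, pwd[i])]  # P(c_i | prefix)
    return p

# P(omen) = P(om) * P(e|om) * P(n|me) = 0.02 * 0.4 * 0.3 = 0.0024
print(password_probability("omen"))
```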

Formulas 2.3 and 2.4 show, as an example, how the probability of the password "omen" is estimated using 2-grams and 3-grams, respectively.

    2-gram: P(omen) = P(o) · P(m|o) · P(e|m) · P(n|e)    (2.3)
    3-gram: P(omen) = P(om) · P(e|om) · P(n|me)          (2.4)

The algorithm of Narayanan et al. uses 2-grams and, based on Formula 2.2, enumerates all possible passwords that exceed a certain threshold probability λ to create a dictionary. This dictionary can then be used in a dictionary attack (see Section 2.2.3). An extension to the password cracker John the Ripper [23] is based on this concept and can execute the dictionary attack (see Section 2.3.1). However, the created dictionary only contains passwords with an (estimated) probability larger than λ, and they are not necessarily in descending order. The guessing time can be drastically reduced if the passwords are guessed in the correct order, as shown by Dürmuth et al. [10]. Dürmuth et al. propose the Ordered Markov ENumerator (OMEN). This algorithm is based on the idea of Narayanan et al. [21], but implements new functionality to enumerate the passwords in order of their probability. In addition, OMEN uses 3-grams instead of 2-grams, which also improves the performance significantly. Besides outperforming the basic concept of Narayanan et al., OMEN also performs better than the context-free grammar based password guessing by Weir et al. [42] (see Section 2.2.5), as the results presented by Dürmuth et al. [10] indicate.

2.2.5 Password Guessing with Probabilistic Grammars

Weir et al. [42] propose an approach similar to rule-based attacks. The scheme is based on context-free grammars and builds on the assumption that different password structures occur with different probabilities. In a preprocessing phase, a probabilistic context-free grammar is created based on lists of real-world passwords.
During this phase, the frequency of certain patterns is measured, for example the occurrence of passwords composed of eight letters followed by two digits (the password "password12" would match this pattern). The scheme distinguishes sequences of alphabet symbols, digits, and special characters. In addition to the password structure, the probabilities of the values for the digits and the special characters are obtained from the password lists. To generate a list of password candidates for the actual attack, the scheme uses the password structures, replacing the digit and special character sequences based on the determined probabilities, and the alphabet sequences with words from dictionaries. The results presented by Weir et al. [42] show that an attack using dictionaries created with the context-free grammars outperforms the default rule-based dictionary attack implemented by the common password cracker John the Ripper (see Section 2.3.1), both using the same password list as input.
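The pattern extraction described above can be sketched as follows, using L, D, and S for runs of letters, digits, and special characters (the function name and the exact pattern notation are illustrative, not Weir et al.'s implementation):

```python
import re

def structure(pwd):
    """Return the structural pattern of a password as runs of letters (L),
    digits (D), and special characters (S), with the run lengths."""
    runs = re.findall(r"[a-zA-Z]+|[0-9]+|[^a-zA-Z0-9]+", pwd)
    out = []
    for run in runs:
        kind = "L" if run[0].isalpha() else "D" if run[0].isdigit() else "S"
        out.append(f"{kind}{len(run)}")
    return "".join(out)

print(structure("password12"))  # L8D2: eight letters followed by two digits
print(structure("p@ss1"))       # L1S1L2D1
```

In the preprocessing phase, such patterns would be counted over a training list to obtain their probabilities.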

2.3 Common Password Guessers

In this section, the two common password guessers John the Ripper and Hashcat are presented. Even though several other widely used password guessers are available, the two presented here implement a Markov-model based attack mode and are therefore related to this thesis.

2.3.1 John the Ripper

John the Ripper (JtR) is one of the most popular password crackers. It was developed by Alexander Peslyak and the JtR community, is available for several operating systems (e.g., UNIX, Win32, and DOS), and is released under the GNU General Public License. In addition, a commercial version called John the Ripper Pro has been released, which is optimized for the different operating systems. The password cracker supports several hash functions and different cracking modes, for instance the wordlist mode and the incremental mode. The wordlist mode performs a rule-based dictionary attack (see Section 2.2.3); the incremental mode performs an improved brute-force attack (see Section 2.2.2) based on a statistical method. It considers trigraph frequencies, separately for each position and each password length, to narrow the search space to the most probable passwords. However, the exact details of this algorithm remain unclear. In addition, the external mode can be used to apply a customized cracking mode, defined in a subset of the C programming language [23]. The JtR website also provides several extensions, including a Markov-model based attack mode. This implementation is based on the algorithm presented by Narayanan et al. [21] (see Section 2.2.4) and uses 2-grams [24]. The JtR incremental mode is outperformed by the Markov mode, as several tests show, for example by Weir [39] and Dürmuth et al. [10].

2.3.2 Hashcat

Hashcat is a password recovery tool which supports multi-core CPUs. The developers also released two GPGPU-based¹ password crackers called oclhashcat-plus and oclhashcat-lite.
All hashcat modules support several attack modes, while oclhashcat-lite is optimized for performance and therefore has limited functionality. In addition, multiple hash algorithms like MD5, SHA1, and SHA256 are supported. It is the fastest password recovery tool currently available, as the developers claim [37], [28]. The current license agreement of hashcat can be found in the hashcat forum [36]. Hashcat provides its own implementation of a Markov-model based word generator called statsprocessor. The statsprocessor uses 3-grams, separately for each position [37]. A comparison [32] between the current version of the statsprocessor (version 0.08) and the JtR Markov mode, both trained on the same password list, shows that the statsprocessor

¹ GPGPU (General-Purpose computation on Graphics Processing Units) utilizes the graphics processing unit (GPU) instead of the central processing unit (CPU) to perform computations, because the GPUs in personal computers often exceed the capabilities of the CPU [11].

is outperformed by the JtR Markov mode. However, a yet unreleased, improved version of the statsprocessor performs better than the JtR Markov mode, as an unofficial comparison [31] indicates. The exact algorithm of the statsprocessor and the improvements of the new version remain unclear.

2.4 Password Strength Estimation

Another password-related area where much research is being conducted is password strength estimation. Several websites implement a password strength meter to estimate the strength of a password in order to ensure a certain level of security, for example during a registration process. Modern password strength meters, so-called pro-active password checkers, are used to exclude weak passwords [2], [5], [18], [35]. These password checkers apply different rules to measure the strength of a password, for instance by checking its length or the character types used (i.e., upper and lower case letters, numbers, or special characters). However, studies by Komanduri et al. [19], Weir et al. [40], and Castelluccia et al. [6] have shown that most pro-active password checkers are a rather bad indicator of a password's strength, because in general they only implement a simple rule set. Often, weak passwords are classified as strong, while strong ones are classified as weak. Kelley et al. [17] studied the influence of password creation policies on password strength by analyzing passwords collected via an online study. More recently, advanced algorithms for password strength estimation have been proposed, for example by Schlechter et al. [33]. The idea of this algorithm is to compare a password to the passwords already stored in the database and to determine the password's strength based on the count of that particular password. In addition, Markov-models have proven useful for password strength estimation, as shown by Castelluccia et al. [6].

3 Concept

In this chapter, the concept and functionality of the improved version of the Ordered Markov ENumerator (OMEN) are presented, which are based on the original version by Dürmuth et al. [10]. Besides new functionality, like a variable n-gram size, the new version provides improved usability and better performance. The basics of Markov-model based password guessers and the original OMEN are introduced in Section 2.2.4. Here, we explain the procedure for the computation of the n-gram probabilities and the enumeration of ordered password lists in detail, and introduce the terms level, level chain, and length scheduling algorithm. In addition, the supported smoothing functions are presented, which should ensure good results during n-gram creation even with limited input. To increase readability, the basic concept is explained using simplified settings: we use 3-grams and small, exemplary alphabets, as well as a fixed password length l = 4. How the actual lengths are chosen, which parameters may be changed, and how they influence the password enumeration is shown in separate subsections.

3.1 Training the Markov-Model

OMEN distinguishes three different types of n-gram probabilities:

- Conditional Probabilities (CP)
- Initial Probabilities (IP)
- End Probabilities (EP)

As in the original OMEN version [10], the CP are the conditional probabilities, which store the transition probability of a single character following a prefix of length 2 (i.e., they store 3-grams), and the IP are the initial probabilities, storing the probabilities for the first two characters of each password (i.e., they store 2-grams). Later on, during the password enumeration, the IP are used to determine the first two characters of each new password, and thus the first 3-grams. Both the conditional and the initial probabilities are described in Section 2.2.4. The end probabilities (EP) are entirely optional, in contrast to the required CP and IP.
They store the probabilities for the last two characters of each password. These 2-grams can be used to validate the end of a new password during the password enumeration. The original OMEN version [10] follows the common procedure of using special end symbols to indicate the end of a word (i.e., password). To increase performance at the expense of memory usage, and to be able to easily ignore those end probabilities (by simply not applying them during the

enumeration process), this solution is used in the new version. The underlying idea of the EP is that the endings of human-generated passwords can also be estimated due to their typically rich structure (for example, many passwords end with digits [9]). Therefore, validating the end of each password could increase the performance.

[Figure 3.1: Counting the occurring 3- and 2-grams in the word "omen": IP['om']++, CP['ome']++, CP['men']++, EP['en']++]

To be able to calculate the probabilities, OMEN first counts the occurrences of the different n-grams in a given password list, i.e., each occurring initial and end 2-gram, as well as each 3-gram, is counted. Figure 3.1 illustrates the counting process using the word "omen". Based on these counts, the probabilities are determined. The IP for each initial 2-gram is calculated as shown in Formula 3.1; the computation of the EP proceeds analogously.

    prob_xy = count_xy / (Σ_{i,j=1}^{|Σ|} count_ji)    (3.1)

with x, y ∈ Σ and |Σ| being the size of the alphabet. The CP are calculated based on the counted 3-grams. As explained in Section 2.2.4, these are transition probabilities: the count of each 3-gram is divided by the summed counts of the 3-grams with the same prefix, according to Formula 3.2. For instance, using the alphabet Σ = {a, b, c}, the conditional probability of the 3-gram "aab" is the count of "aab" divided by the summed counts of "aaa", "aab", and "aac".

    prob_xyz = count_xyz / (Σ_{i=1}^{|Σ|} count_xyi)    (3.2)

with x, y, z ∈ Σ and |Σ| being the size of the alphabet.

3.2 Probability Discretization

To simplify the processing, decrease the memory requirements, and ease the probability estimation of a password, the n-gram probabilities are discretized into levels on a scale from 0 to k, with 0 being most likely and k least likely. Therefore, k + 1 different levels are available. To achieve this, we use the natural logarithm, according to Formula 3.3, based on the algorithm by Narayanan et al. [21].
In contrast to the original algorithm, we use positive levels instead of negative levels by multiplying the rounded results of the logarithm by -1. The default value for k is 10 (i.e., eleven different levels are available, 0 to 10). k is also referred to as the maximum level.

    lvl_i = round(log(c_1 · prob_i + c_2))    (3.3)

The constants c_1 and c_2 are chosen such that the most frequent n-grams get a level of 0 and a good overall distribution is achieved. c_1 is referred to as the level adjustment factor; we vary its value in order to achieve good results with different input lists. The hand-crafted default values for the level adjustment factor are shown in Table 3.1. The factors for IP and EP are larger than the one for CP, because the denominators used during the IP and EP calculation are sums over all according counts and are therefore much larger than the ones used for CP (see Section 3.1). The value of c_2, however, is fixed at e^(-k), to avoid calculating the invalid logarithm of 0 and to cap any level that would be larger than k at k. In the following, the term level is used to refer to these discretized probabilities.

Table 3.1: Default level adjustment factors for the different n-grams

    Probability | Factor
    CP          |   2
    IP          | 100
    EP          | 100

The level graduation grants multiple benefits. The most important one is that it eases the computation of the overall level (i.e., the overall probability) of a generated password: instead of calculating the probability by multiplying the probabilities of each occurring initial, conditional, and end gram, we can simply add the levels, because they are based on the natural logarithm, as shown in Formula 3.4. This property of the natural logarithm enables the usage of level chains, a key element of the OMEN concept (see Section 3.4).

    log(prob_abcd) = log(prob_IP[ab] · prob_CP[abc] · prob_CP[bcd] · prob_EP[cd])
                   = log(prob_IP[ab]) + log(prob_CP[abc]) + log(prob_CP[bcd]) + log(prob_EP[cd])    (3.4)
                   = lvl_IP[ab] + lvl_CP[abc] + lvl_CP[bcd] + lvl_EP[cd]
                   = lvl_abcd

In addition, the level graduation simplifies the processing. For instance, integer arrays can be used to store the different n-grams.
The performance of comparison operations increases as well, because of the smaller number range. It also improves the readability and the comparability of the output files.

3.3 Smoothing

Smoothing functions are used to improve the results even with a small input password list. If a certain 3-gram does not occur in the input password list, its calculated probability equals 0. Therefore, it would never be selected during the password enumeration. Smoothing functions adjust the probability of unseen grams, so that they will not be ignored. In addition, all other grams are adjusted accordingly. This section presents the supported smoothing functions and explains their functionality.

Default Smoothing

OMEN always provides a simple smoothing: during the level graduation, all grams whose level would be larger than k are set to k (with k being the maximum level). Therefore, there are no n-grams with a probability of 0. However, this smoothing does not adjust the levels of all other grams accordingly. The default smoothing is always applied, even if another smoothing is used.

Additive Smoothing

During the probability computation, and before mapping the probabilities to levels, the additive smoothing adds a variable value δ to each gram count, adjusting the overall count (i. e. the denominator) accordingly. The computations for IP and CP are adjusted as shown in Formula 3.5 (the computation of the End Probabilities/Counts (EP) proceeds analogously to the IP computation).

IP: prob_xy  = (count_xy  + δ) / (Σ_{i,j ∈ Σ} count_ij + δ * |Σ|^2)
CP: prob_xyz = (count_xyz + δ) / (Σ_{i ∈ Σ} count_xyi + δ * |Σ|)    (3.5)

with i, j, x, y, z ∈ Σ [7].

3.4 Level Chains

The level chains are one of the essential elements of OMEN to enumerate an ordered password list and were already used by [10]. Basically, they prescribe the level structure of the created passwords, i. e. which level the n-grams at each position in a new password must have. A level chain stores integers between 0 and k, representing levels. The number of levels stored in a chain is derived from the length of the passwords that should be created and is influenced by the size of n.
The first integer in a level chain dictates which level the initial n-1 characters of a new password must have, i. e. which 2-gram stored in the initial probabilities may be used. Based on the selected first two characters and the next level in the chain, we can determine which 3-grams can be appended. All levels but the last dictate the level of the n-gram appended to the password at the corresponding position. The last level indicates the level for the end probability, used to validate the last two characters. Since the end probabilities are optional, they can be

ignored. If so, the last level of the level chain is not used and need not be computed. In the following explanations and examples, the end probabilities are applied. The level chains are created based on an overall level, which is initialized to 0 and increased once no more chains can be created for it. The sum of the levels in a level chain must not exceed the current overall level. The password enumeration for a fixed length of four, using the level chains and the overall level, works as follows:

1. The overall level is initialized to 0 and the only possible level chain 0000 is created.
2. Every possible password that fits the level chain is created.
3. Since there is no other possible level chain for this overall level, the overall level is increased by 1.
4. Based on the current overall level (= 1) the first level chain 1000 is created and, based on this chain, all possible passwords.
5. The next level chain 0100 and the corresponding passwords are created.
6. Once the last possible level chain for the current overall level has been processed, the process restarts at step 3 (setting the overall level to 2 and creating the first level chain 2000 in step 4).

The level chains up to an overall level of 2 for length four can be seen in Table 3.2. Figure 3.2 shows a simple example of the first three passwords created using a small alphabet Σ = {a, b} and the fixed length l = 4, using the CP, IP and EP levels shown in Table 3.3. As mentioned above, the level chain length depends on the password length l. Since the first level in a level chain is used for the first n-1 characters and an additional last one for the validation of the last n-1 characters, the total length of a level chain equals

|level chain| = l - (n - 2) + E    (3.6)

with E = 1 if the end probabilities are applied and E = 0 if not.

Table 3.2: Level chains up to an overall level of 2 for passwords of length 4 (if the end probabilities are applied)

Level 0: 0000
Level 1: 1000, 0100, 0010, 0001
Level 2: 2000, 0200, 0020, 0002, 1100, 1010, 1001, 0110, 0101, 0011
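The chain generation of Table 3.2 amounts to enumerating all ways of splitting the overall level into as many non-negative summands as the chain has positions. A minimal counting sketch (names are ours, not OMEN's):

```c
/* Count all level chains of the given length whose entries sum to exactly
 * `overall`, each entry capped at the maximum level k. Replacing the
 * counting by an emit callback would enumerate the chains of Table 3.2. */
static int count_level_chains(int overall, int length, int k)
{
    if (length == 0)
        return overall == 0 ? 1 : 0;   /* chain complete iff budget is used up */
    int total = 0;
    int max = overall < k ? overall : k;
    for (int lvl = 0; lvl <= max; lvl++)   /* level of the current position */
        total += count_level_chains(overall - lvl, length - 1, k);
    return total;
}
```

For chains of length 4 this yields 1 chain at overall level 0, 4 at level 1, and 10 at level 2, matching Table 3.2.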

Table 3.3: Exemplary IP, CP and EP levels for the alphabet Σ = {a, b}

IP (2-gram: level): aa: 0, ab: 1, ba: 1, bb: 3
CP (3-gram: level): aaa: 1, aab: 0, aba: 0, abb: 2, baa: 0, bab: 0, bba: 1, bbb: 3
EP (2-gram: level): aa: 1, ab: 0, ba: 0, bb: 2

Level chain 0000:
1. Select the first 2-gram of IP with level 0 ("aa"): "aa--"
2. Select the first 3-gram from CP with level 0 and prefix "aa" ("aab"): "aab-"
3. Select the first 3-gram from CP with level 0 and prefix "ab" ("aba"): "aaba"
4. The last 2-gram "ba" is validated using EP (level 0): accept password "aaba"
5. Since there is no other combination for the level chain 0000 (there is no other level-0 3-gram matching the prefix "ab" of step 3 nor the prefix "aa" of step 2), the next level chain 1000 is created.

Level chain 1000:
6. Select the first 2-gram of IP with level 1 ("ab"): "ab--"
7. Select the first 3-gram from CP with level 0 and prefix "ab" ("aba"): "aba-"
8. Select the first 3-gram from CP with level 0 and prefix "ba" ("baa"): "abaa"
9. The last 2-gram "aa" is not validated using EP (its level 1 exceeds the required level 0): reject password "abaa"
10. Select the second 3-gram from CP with level 0 and prefix "ba" ("bab"): "abab"
11. The last 2-gram "ab" is validated using EP (level 0): accept password "abab"
12. The next passwords created using the level chain 1000 are "baab" and "baba". Then the next level chain 0100 is created and the process continues.

Figure 3.2: Exemplary password creation using level chains with a length of 4 and the alphabet Σ = {a, b}

3.5 Length Scheduling

Normally, the length of passwords is not fixed. The selection of the length of the passwords to be created is a critical factor during the password enumeration. This process is referred to as length scheduling. To achieve a good length scheduling, OMEN counts the lengths (LN) of the occurring passwords, from a minimum length of 3 (the size of the used n-grams) up to a length of 19, and computes the probability of each length: the count of each length is simply divided by the count of all passwords evaluated, according to Formula 3.7.

prob_i = count_i / Σ_{j=n}^{19} count_j    (3.7)

These probabilities, also mapped to levels in a range from 0 to k using the natural logarithm, are referred to as initial length levels (lvl_init). The actual length selection uses the overall level known from the level chain creation and proceeds as follows:

1. Initialize the overall level to 0 (this is the same overall level used for the level chains; the value is not re-initialized!).
2. Check the initial length levels and select all lengths with a level smaller than or equal to the current overall level.
3. For each selected length:
   (a) Calculate the current length level by subtracting the initial length level from the overall level:
       lvl_cur,l = lvl_overall - lvl_init,l, with l being the current length.    (3.8)
   (b) Enumerate all possible passwords using the level chains for the current length and the current length level (see Section 3.4).
4. Increase the overall level by one and restart at step 2.

The default length scheduling simply selects the lengths in the order given by their length level. But this neglects an important factor regarding the length of passwords: there are many more possible character combinations for longer passwords than for shorter ones. Using the default alphabet with 72 characters, there are fewer than 2.7 x 10^7 (27 million)¹ different password combinations of length four, and over 7.2 x 10^14 (720 trillion)² different combinations of length eight.
Therefore, even if a password of length eight may be more likely, it is much more efficient to first create the comparatively small set of candidate passwords of length four, before creating any password from the huge set of likely ones of length eight.

¹ Possible combinations of length four: 72^4 = 26,873,856
² Possible combinations of length eight: 72^8 = 722,204,136,308,736

Figure 3.3: Histograms exemplarily showing the influence of the length level factor on the computed length level: (a) without length level factor, (b) with length level factor 1

To handle this problem, OMEN provides three different advanced length scheduling algorithms. The length level factor and the length level mask manipulate the level of each length and therefore the selection order, but are still based on the length selection presented above. Using these algorithms, a length's level may be set to a value larger than 10. Both algorithms may be used simultaneously. The third length scheduling algorithm is called adaptive length scheduling and chooses the lengths based on the success rate of previous tries, ignoring the computed length levels. In addition, OMEN may also enumerate passwords of a fixed length only. In this case, there is obviously no need for any length scheduling algorithm.

Length Level Factor

This algorithm adds the actual length, multiplied by a variable factor, to each computed length level. The factor is called length level factor and is a float with a maximum value of 10.0 and a minimum value of 0.0. Setting the factor to 0.0 equals using no length level factor. The effect on the length levels can be seen exemplarily in Figure 3.3.

Length Level Mask

The length level mask adds an individual value to the level of each length. This algorithm provides a more precise manipulation of the length levels in comparison to the length level factor. It can also be used to completely overwrite the computed length levels, setting them to the given values. This may be useful in certain scenarios where background information about the password lengths is known (e. g. only passwords of length 8 to 12 are allowed).
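A minimal sketch combining both modifiers; the function name, signature and evaluation order are our assumptions, not OMEN's code:

```c
/* Apply the two level modifiers described above to a computed length level:
 * the length level factor adds length * factor (factor in 0.0 .. 10.0),
 * the length level mask adds a per-length offset. */
static int adjust_length_level(int lvl, int length, double factor,
                               const int mask[20])
{
    lvl += (int)(length * factor);  /* length level factor */
    lvl += mask[length];            /* length level mask, one value per length */
    return lvl;                     /* may now exceed the maximum level k */
}
```

With factor 0.5, a length-8 level of 2 becomes 6; adding a mask entry of 3 for length 8 yields 9.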

Adaptive Length Scheduling

The adaptive length scheduling was introduced by [10] and is the only length scheduling algorithm supported by the old OMEN version. It ignores the length levels and does not presort the lengths. Instead, it stores an independent overall level for each length, initialized to 0. First, all possible password combinations for each length based on the overall level of 0 are created, beginning with the smallest possible length. Based on the results of this first run-through, the success rate for each length is calculated, i. e. the number of passwords successfully guessed divided by the number of attempts. The algorithm then proceeds as follows:

1. Select the length with the best success rate.
2. Increase the overall level for the selected length by one.
3. Compute all possible level chains and the corresponding passwords based on the current overall level for the selected length.
4. Re-calculate the success rate based on the last run-through and restart at step 1.

Lengths with a success rate of 0 % are used again once all lengths have a success rate of 0 %. This concept may cause problems: if the number of n-grams with level 0 is too small, only a few passwords can be created for each length within the initial run-through, and the computed initial success rates may not be representative.

3.6 Variable Parameters

Besides the smoothing function and the length scheduling, OMEN provides a number of parameters. The following parameters can be changed:

- n-gram size
- Alphabet
- Maximum level (k)
- Number of guesses

The n-gram size and the used alphabet have the greatest influence on the n-gram creation and therefore on the performance of the password enumeration. n can be set to 2, 3, 4 or 5. The size of the IP and EP n-grams is adjusted accordingly (i. e. n - 1). A larger n-gram size increases the accuracy of the Markov model, but drastically increases the run-time as well.
The alphabet is variable³, but selecting the right alphabet is tricky: if the alphabet is too small, many password combinations will not be created, since the required characters are not part of the alphabet. On the other hand, a large alphabet increases the run-time and may not even increase the overall effectiveness, because many characters may not be used at all. A larger maximum level increases the level range and therefore the accuracy of the Markov model, but influences the run-time as well. In addition, the end probabilities are optional. Ignoring the end probabilities decreases the run-time, but also influences the success rate.

³ Based on the 8-bit ASCII table according to ISO, not allowing \n, \r, \t and (space)

3.7 Password Enumeration Modes

OMEN provides three different password enumeration modes: default, pipe, and simulated attack. The default mode simply stores the generated passwords in a text file. The pipe mode can be used to pipe the produced passwords to another application. Using the simulated attack mode, the generated passwords are directly tested against a given password list. OMEN then produces a file showing the cracking progress and the used lengths. The adaptive length scheduling (see Section 3.5) only works in the simulated attack mode, since feedback about the cracking progress is necessary to compute the success rates. The simulated attack mode is used to evaluate the effectiveness of OMEN.

3.8 Comparison to the old OMEN Version

Besides some minor bugs, the old OMEN version is too static to produce meaningful results about the effectiveness of Markov models for password enumeration: essential parameters regarding the n-gram level computation, for instance the alphabet and the level adjustment factor, can only be changed directly in the source code (which means recompiling the whole project for the changes to take effect). More importantly, it only supports 3-grams and is therefore not able to investigate the effectiveness of larger or smaller n-grams. In addition, the maximum level cannot be changed. The password enumeration only supports the simulated attack mode (see Section 3.7) and only implements the adaptive length scheduling. Therefore, it is not able to create passwords of a fixed length. In the new version, nearly all parameters of interest can be set via the command line, including the n-gram size and the maximum level. The parameters are introduced in Section 3.6. Besides the increased usability and functionality, the new version also improves the performance of OMEN and provides better expandability. In addition, several enumeration modes and length scheduling algorithms are provided.

4 Implementation

This chapter presents the implementation of the improved OMEN based on the concept described in Chapter 3. OMEN is realized in C and is mainly based on the ANSI C standard libraries. In addition, we use the getopt library to parse command line arguments and the hashmap provided by uthash (see Appendix, Section B.3). First, an overview introduces the basic structure of the implementation, points out the main tasks and shows how those tasks are divided into two modules. Subsequently, the two main modules createng and enumng are discussed in detail, describing the data types used and the procedure of each module. We use pseudo code and several code listings to explain the implementation. In addition, we use Nassi-Shneiderman diagrams to visualize important parts of the source code. Although the implementation is not object oriented, we use Unified Modeling Language (UML) class diagrams to model the different logical function units (see the following section) and their interfaces.

4.1 Structure and Modules

In this section, we present the basic structure of OMEN. The program parts (i. e. functions and global variables) that are used to fulfill one of the different tasks are merged into logical function units. Those function units are modeled using UML class diagrams to visualize the structure. The task of implementing a Markov-model based password generator is split into two main tasks: in the training phase, a password list is read and the n-grams with their according levels are computed and stored in new lists. These lists are then used in a second step to actually generate the passwords, ordered by level. The module responsible for the creation of the n-grams and the computation of their levels is called createng, the password enumerator enumng. The Input-Processing-Output scheme for the two autonomous modules can be seen in Table 4.1. Both modules are used separately via the command line and provide no graphical frontend.
Figure 4.1 visualizes the two main modules and their logical function units. Notice that the figure only shows the name of each unit. A complete diagram with all public functions, as well as the public and private global variables, can be seen in the Appendix, Figure B.1. In the following, each logical function unit is shortly introduced. In Sections 4.2 and 4.3, the function units createng and enumng are described in detail.

Table 4.1: Input-Processing-Output scheme for OMEN

Module   | Input         | Processing                                      | Output
createng | Password list | Count n-grams and compute the according levels  | Level lists
enumng   | Level lists   | Enumerate passwords based on the n-gram levels  | Ordered password list

Figure 4.1: Overview of the logical function units

- createng reads the input file and counts the occurring n-grams. The n-gram levels, computed by smoothing, are stored in files, one for each n-gram type and one for the lengths. createng also generates a configuration file, referred to as createconfig, storing the used settings.
- smoothing is used by createng to apply a smoothing to the counted n-grams and to compute the levels.
- common provides basic functions which are frequently used by all other function units. Besides functions to manipulate C strings (e. g. replace characters or append another C string), common provides functions to obtain the position of a character in a given alphabet or to compute an n-gram from a given position (see Section 4.2.1).
- arginterpreter is used by createng and enumng to interpret command line arguments. For instance, it converts C strings into integers and changes filenames and alphabets.
- commonstructs encapsulates the different structs used by createng and enumng. It provides functions for memory management and fills the structs based on the input. OMEN uses multiple structs to encapsulate filenames and n-gram arrays (see Sections 4.2.1 and 4.3.1).
- errorhandler stores errors and warnings that occur. The errors and warnings can be printed to the command line or stored in a file. Especially during the n-gram creation a lot of warnings may occur, since every unknown symbol produces a warning. The errorhandler is not introduced in detail.

- enumng reads the createconfig generated by createng as well as the level files, using the ngramreader. Based on the settings and the levels read, enumng performs the ordered password enumeration. enumng supports several modes, which are explained in Section 4.3.
- attacksimulator is used by enumng to perform a simulated attack. It reads a list of plain-text passwords, referred to as the testing set, and provides a function to check if a certain password created by enumng is part of that set.
- ngramreader reads the given level files and fills the according arrays.

4.2 createng

In this section, we describe the implementation of createng, the module responsible for reading the n-grams from a password list and computing the levels. The functions of other logical function units that are used by createng are also discussed. A class diagram showing the relevant functions and variables of createng and the used function units can be seen in Figure 4.2.

4.2.1 Datatypes

The n-gram counts are stored in integer arrays, separate ones for IP, CP, and EP. The size of each array is |Σ|^n, for example 72^2 for IP and EP and 72^3 for CP when using the default alphabet with 72 characters and 3-grams. To map an n-gram to its position in the according array, the function get_positionfromngram provided by common is used. This function uses the position of each single character in the alphabet, determined by the function get_positioninalphabet, also provided by common. The position computation proceeds as shown in Algorithm 4.2.1. The exact calculation for 2- and 3-grams can be seen in Formula 4.1. To increase readability, we use the short notation posNG for get_positionfromngram and posΣ for get_positioninalphabet in the upcoming formulas.
Algorithm 4.2.1: Position computation for an n-gram (get_positionfromngram)
Data: n-gram ng, size n, alphabet size |Σ|
Result: position pos of ng
1 begin
2   pos ← 0
3   for i ← 0 to n - 1 do
4     charPos ← get_positioninalphabet(ng[i])
5     pos ← pos + charPos * |Σ|^(n-1-i)

posNG(xy)  = posΣ(x) * |Σ| + posΣ(y),
posNG(xyz) = posΣ(x) * |Σ|^2 + posΣ(y) * |Σ| + posΣ(z),    (4.1)
with x, y, z ∈ Σ.

The length counts (LN) are stored in an integer array of size 19 and the actual length is used as index. Since the minimum length equals the n of the used n-grams, the array fields with an index smaller than n are not used. The different arrays are encapsulated in the ngramscount struct, storing the actual arrays, the corresponding sizes, and the size of the n-grams. The alphabet and the size of the alphabet are encapsulated in the alphabet struct, the filenames for the input password list and the output level files in the filenames struct. commonstructs provides the functions to initialize, allocate and free the memory needed by those structs.

Figure 4.2: Logical function units with functions and variables relevant for createng
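The position computation of Algorithm 4.2.1 and Formula 4.1 treats an n-gram as an n-digit number in base |Σ|. A simplified sketch (our signatures, not the ones from Listing 4.1):

```c
#include <string.h>

/* Position of a character in the alphabet, -1 if absent
 * (a simplified stand-in for get_positioninalphabet). */
static int pos_in_alphabet(char c, const char *alphabet)
{
    const char *p = strchr(alphabet, c);
    return p ? (int)(p - alphabet) : -1;
}

/* Array index of an n-gram, as in Algorithm 4.2.1 / Formula 4.1:
 * pos = posΣ(ng[0]) * |Σ|^(n-1) + ... + posΣ(ng[n-1]). */
static int pos_from_ngram(const char *ng, int n, const char *alphabet)
{
    int size = (int)strlen(alphabet);
    int pos = 0;
    for (int i = 0; i < n; i++)             /* Horner scheme in base |Σ| */
        pos = pos * size + pos_in_alphabet(ng[i], alphabet);
    return pos;
}
```

With the alphabet Σ = {a, b} from Table 3.3, the 2-gram "ba" maps to position 1 * 2 + 0 = 2 and the 3-gram "bbb" to position 7.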

4.2.2 Procedure

The access point of createng is the main function. main calls the following functions to initialize and execute the n-gram level creation process based on the given password list:

1. initialize initializes all variables and assigns the default values. To initialize the structs, the functions provided by commonstructs are used (see Section 4.2.1).
2. evaluate_arguments evaluates the given command line arguments and sets the affected parameters accordingly, using the functions provided by arginterpreter. The input password list is the only command line argument that must be given. If the size of the n-grams or the alphabet changes, the size of the n-gram arrays is adjusted accordingly. The smoothing function is selected and configured using an additional configuration file. The file is forwarded to and evaluated by the smoo_readinput function of the smoothing function unit. The settings are stored in the private struct smoo_additive_vars; the actual smoothing functions are assigned as function pointers in the public struct smoo_selection. Each array type (i. e. IP, CP, EP and LN) may have its own settings and function.
3. run_creation executes the actual level creation process:
   I. The password input file is evaluated by the evaluate_inputfile function. The file is read line by line, using the get_nextline function. If a line is too short (i. e. shorter than n) or too long (i. e. longer than 19), it is ignored. After stripping the newline token (if any), the counters for IP, CP and EP as well as the LN counter are adjusted based on the occurring n-grams and the line length, as shown in Listing 4.1.
   II. The selected parameters (i. e. the used alphabet, the n-gram size, the filenames, etc.) are stored in the configuration file by the write_config function. If the filename has not been changed, the default name createconfig is used.
   III. write_array is called for each array type.
The levels are computed using the corresponding smoothing function stored as function pointer in the public struct smoo_selection. Once the levels are computed, they are stored in a text file named after the array type with the extension .level. Listing 4.2 exemplarily shows the smoothing and level computation for CP, which is called for each position (i. e. for each n-gram).
4. If the process ends as intended or aborts due to an error, the exit_routine function frees any allocated memory.

BOOL evaluate_inputfile(char *filename) {
    char *line = NULL;
    int i, len, pos;                 // line length and position in array
    int n = ngramscount->sizeof_n;   // size of n

    /* open file "filename", return FALSE if the file cannot be opened */

    while (get_nextline(&line)) {
        /* strip newline token from line (if any) */
        len = strlen(line);
        // check line length
        if (len < 20 && len >= n) {
            // adjust length counter
            ngramscount->ln[len]++;

            // get position of the first n-1 characters and adjust IP counter
            get_positionfromngram(&pos, line, n - 1, alphabet->size, alphabet->alphabet);
            ngramscount->ip[pos]++;

            // get position of each possible n-gram from line[i] to line[i+n-1] and adjust CP counter
            for (i = 0; i < len - (n - 1); i++) {
                get_positionfromngram(&pos, line + i, n, alphabet->size, alphabet->alphabet);
                ngramscount->cp[pos]++;
            }

            // get position of the last n-1 characters and adjust EP counter
            i = len - (n - 1);
            get_positionfromngram(&pos, line + i, n - 1, alphabet->size, alphabet->alphabet);
            ngramscount->ep[pos]++;
        }
    }
    return TRUE;
}

Listing 4.1: Simplified function showing the process of evaluate_inputfile

void smoo_additive_funct_conditional(int *level, int pos) { // level to be set and current position
    long sum = 0;       // total sum
    double prob = 0.0;  // current probability
    int i;
    int count = ngramscount->cp[pos] + delta;       // count for the current position adjusted by delta
    int pos_prefix = pos - (pos % alphabet->size);  // position of the prefix of the current n-gram

    // calculate the sum for the conditional probability
    for (i = 0; i < alphabet->size; i++)
        sum += ngramscount->cp[pos_prefix + i];
    // apply delta (from smoo_additive_vars) and avoid division by 0
    sum += (alphabet->size * delta);
    if (sum == 0)
        sum = 1;

    // compute the probability and apply the level adjustment factor (from smoo_additive_vars)
    prob = (double)(count) / (double)(sum);
    prob *= (double)leveladjustfactor;

    // use log (defined in math.h) and invert the sign
    *level = (int)(log(prob));
    *level *= -1;
    if (*level >= maxlevel) // if the maximum level is exceeded, set to the maximum level
        *level = maxlevel - 1;
}

Listing 4.2: Simplified function showing the process of the additive smoothing function and the level computation for CP (smoo_additive_funct_conditional)

4.3 enumng

The module enumng enumerates an ordered password list based on the levels computed by createng. This section presents the implementation of enumng and the functions of the other logical function units called by enumng. Figure 4.3 shows a class diagram with the used logical function units and the relevant functions and variables.

Figure 4.3: Logical function units with functions and variables relevant for enumng

4.3.1 Datatypes

enumng reads the n-gram size and the alphabet from the configuration file, and the n-gram and length levels from the level files created by createng. The data is stored in the structs provided by commonstructs, the same ones used by createng (see Section 4.2.1). Since the struct for the n-grams and lengths encapsulates integer arrays, it can store the counts as well as the levels. It is named ngramlevel.

To generate passwords, enumng must convert a position into the corresponding n-gram, i. e. reverse the computation of Algorithm 4.2.1. The function get_ngramfromposition performs this conversion and is provided by common. Algorithm 4.3.1 shows the process, which uses get_charatposition to determine the character at a given alphabet position.

To increase the overall performance of the enumeration process, the n-gram arrays IP and CP are sorted by their level. In addition, the conditional probabilities are also sorted by prefix. The EP are not sorted, because they are only used for the final validation of a new password and no loop iterates over the whole array. The sorted IP is named sortedip, the sorted CP sortedlastgram. Algorithm 4.3.2 shows the sort algorithm for the CP array. Since the number of n-grams for the different levels and prefixes is unknown at the beginning of the process, the sortedlastgram arrays grow dynamically. Based on the example shown in Table 3.3 in the concept chapter, the sorted arrays would store the data as shown in Listing 4.3.

The lengths are sorted by their level (after applying any length level modifier), beginning with the smallest level, and stored in sortedlength. If two or more lengths have the same level, they are sorted by the actual length value, beginning with the smallest length. The sorted n-gram arrays are also encapsulated in structs provided by commonstructs. This function unit also provides functions to manage the memory needed by those structs and performs the actual sorting.
With the functions provided by the function unit attacksimulator, the simulated attack can be executed. To be able to compare the created passwords to the ones in the password list read by the attacksimulator within an acceptable run-time, the list is stored in a hashmap. The functions add_testsetpassword and find_testsetpassword are used to add a password to the hashmap and to find a created password in the hashmap, respectively.

Algorithm 4.3.1: n-gram computation from a position (get_ngramfromposition)
Data: position pos, size n, alphabet size |Σ|
Result: n-gram ng
1 begin
2   charPos ← pos mod |Σ|
3   ng[n - 1] ← get_charatposition(charPos)
4   for i ← n - 2 downto 0 do
5     pos ← pos / |Σ|
6     charPos ← pos mod |Σ|
7     ng[i] ← get_charatposition(charPos)
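Algorithm 4.3.1 can be sketched in C as follows (simplified signature, ours rather than OMEN's exact one):

```c
/* Inverse of the position computation (Algorithm 4.3.1): recover the
 * n-gram from its array index by repeated modulo and division, filling
 * the characters from the last position to the first. */
static void ngram_from_pos(int pos, int n, const char *alphabet,
                           int alphabet_size, char *ng)
{
    for (int i = n - 1; i >= 0; i--) {
        ng[i] = alphabet[pos % alphabet_size];  /* get_charatposition */
        pos /= alphabet_size;
    }
    ng[n] = '\0';
}
```

For Σ = {a, b}, position 2 with n = 3 yields "aba", the inverse of the mapping in Formula 4.1.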

Algorithm 4.3.2: Sorting the CP array (sortedlastgram_fill)
Data: n-gram array CP, size of n-gram array |CP|, size n, alphabet size |Σ|
Result: sorted n-gram array sortedlastgram
1 begin
2   for i ← 0 to |CP| do
3     level ← CP[i]
      // Get last gram position:
4     ng ← get_ngramfromposition(i, n)
5     lastgram_pos ← get_positioninalphabet(ng[n - 1])
      // Get prefix position using get_positionfromngram with n - 1:
6     prefix_pos ← get_positionfromngram(ng, n - 1)
      // Get current index for level and prefix_pos and add the last gram:
7     index ← sortedlastgram[level].index[prefix_pos]
8     sortedlastgram[level].lastgrams[prefix_pos][index + 1] ← lastgram_pos
      /* If necessary, reallocate sortedlastgram[level].lastgrams[prefix_pos] */

// Example how to access the sorted IP and CP structs

/* sorted IP
 * basic usage to get the i-th IP 2-gram for the given level:
 * sortedip[level].ip[i]
 */
// level 0:
sortedip[0].ip[0]; // = 0 ("aa")
// level 1:
sortedip[1].ip[0]; // = 1 ("ab")
sortedip[1].ip[1]; // = 2 ("ba")
// level 3:
sortedip[3].ip[0]; // = 3 ("bb")

/* sorted last gram (CP)
 * basic usage to get the i-th last gram for the given level and prefix:
 * sortedlastgram[level].lastgrams[prefix][i]
 * Notice that the prefix is an integer (representing the position of the 2-gram), but to
 * increase the readability the actual 2-grams are used!
 */
// level 0:
sortedlastgram[0].lastgrams["aa"][0]; // = 1 ("b")
sortedlastgram[0].lastgrams["ab"][0]; // = 0 ("a")
sortedlastgram[0].lastgrams["ba"][0]; // = 0 ("a")
sortedlastgram[0].lastgrams["ba"][1]; // = 1 ("b")
// level 1:
sortedlastgram[1].lastgrams["aa"][0]; // = 0 ("a")
sortedlastgram[1].lastgrams["bb"][0]; // = 0 ("a")
// level 2:
sortedlastgram[2].lastgrams["ab"][0]; // = 1 ("b")
// level 3:
sortedlastgram[3].lastgrams["bb"][0]; // = 1 ("b")

Listing 4.3: Example of sorted IP and CP based on the example in Table 3.3 (3-grams, Σ = {a, b})

4.3.2 Procedure

The access point of enumng is the main function. In the following, we present the basic process and list the functions called, beginning with the main function. The actual password enumeration, including the level chain creation, is discussed in detail afterwards.

1. initialize only initializes the variables needed to store the data read from the configuration file and the level files (i. e. the structs storing the sorted n-grams and lengths are not initialized at this point). It also assigns default values to all those parameters.
2. evaluate_arguments evaluates the given command line arguments and sets the affected parameters accordingly, using the functions provided by arginterpreter.
3. apply_settings reads the input files using the function read_inputfiles provided by ngramreader. First, the configuration file is evaluated. The file stores the filenames of the level files, the used alphabet, the n-gram size, etc. If the size of the alphabet or the n-gram size changes, the size of the n-gram arrays is adjusted accordingly. Subsequently, the levels are read from the level files and stored in the corresponding arrays. Once the levels are read, the sorted arrays are initialized and the n-grams are sorted using the functions provided by commonstructs (see Section 4.3.1). If the simulated attack mode has been selected, the given password list is stored in a hashmap by the simatt_genertatetestingset function (provided by attacksimulator).
4. Based on the selected length scheduling (see Section 3.5), the enumeration process is executed by one of the following functions. The actual password enumeration is introduced in Section 4.3.3.
   a) run_enumeration_fixedlenghts is called if only passwords of a fixed length should be created. The enumeration algorithm described in Section 4.3.3 is executed for that length and for each overall level (beginning with 1) until the maximum number of passwords has been created or no more level chains can be created.
b) run_enumeration is the default enumeration function. It selects the lengths based on their level using the function sortedlength_getmaxindexforlevel (provided by commonstructs). The function returns the index of the first length with a level larger than the given overall level. Based on this index, run_enumeration loops over all lengths with a level smaller than the current overall level, computes the current length level (by subtracting the initial length level from the overall level), and executes the password enumeration for each of these lengths.

c) run_enumeration_optimizedlengths executes the enumeration using the adaptive length scheduling. This length scheduling only works if the simulated attack mode is selected. The function uses two arrays: an integer array storing the level of each length (initialized to 0) and a float array storing the success rate of each length (initialized to 1). The smallest length with the highest success rate is selected and the enumeration process for this length is executed. Subsequently, the success rate of the last run-through is computed. If the success rate is 1, it is lowered slightly, so that lengths not yet tested are still selected at least once; if it is 0, it is set to a small positive value, to ensure the length is selected again once all success rates have become that small. The success rate of a length is set to 0 once all possible level chains for this length have been created.

5. If the process ends as intended or aborts due to an error, the exit_routine function frees any allocated memory. In addition, a log file is created using the print_log function, storing information about the enumeration process, for example the number of passwords created and their lengths, the number of n-gram and length levels, and occurring errors and warnings.

Password Enumeration

The password enumeration is executed using two essential functions: getnext_levelchain generates a new level chain, or the next possible level chain, for a given length and level; the recursive function getnext_levelchain_recursive is used to determine the level chains. enumerate_password generates all possible passwords based on the created level chain, using the recursive function enumerate_password_recursive for the actual password enumeration. Both functions return a boolean: getnext_levelchain returns false if no more level chains for the given length and level can be created; enumerate_password returns false to indicate that the maximum number of attempts has been reached.
The enumeration process for a given length and level can be seen in Algorithm 4.3.3. The functions called, including the recursive functions used, are shown in Listing 4.4 (password enumeration) and Listing 4.5 (level chain creation). Notice that the password is stored as an integer array during the enumeration to decrease the run-time. Therefore, the functions get_positionfromngramasint and get_ngramasintfromposition are used, working in analogy to get_positionfromngram and get_ngramfromposition presented earlier. In addition, Nassi-Shneiderman diagrams for the two recursive functions can be seen in Figures 4.4 and 4.5. Once a password has been created, the function handle_createdpassword handles it according to the selected mode (see Section 3.7). For example, if the simulated attack mode has been selected, the function simatt_checkcandidate checks whether the password occurs in the hashmap (see Section 4.3.1). If the pipe mode is selected, the created password is simply printed to the command line.

Algorithm 4.3.3: Basic password enumeration process
  Data: overall level, password length, size of n
  1 begin
  2   newchain ← TRUE
  3   lengthlc ← length − (n − 2) + 1   // compute length of the level chain
  4   while getnext_levelchain(levelchain, lengthlc, level, newchain) do
  5     newchain ← FALSE
  6     if enumerate_password(levelchain, length) == false then
          /* Maximum attempts reached, end enumeration */

    BOOL enumerate_password(int levelchain[], int lengthmax){
        int n = ngramlevel->sizeof_n;
        int password[20]; // to decrease the run-time, the password is stored as int[]
        int ip_level = levelchain[0]; // the first level of the levelchain is for the initialprob
        int length = (n - 1); // the length equals the size of the IP (n - 1)
        // for each initialprob with the given ip_level
        for(int i = 0; i < sortedip[ip_level].index; i++){
            // set the first (n - 1) ints according to the position stored in sortedip
            get_ngramasintfromposition(password, sortedip[ip_level].ip[i], n-1, alphabet->size);
            // call the recursive function
            if(!enumerate_password_recursive(password, levelchain, length, lengthmax))
                return FALSE; // return FALSE, if max attempts have been reached
        }
        return TRUE;
    } // end enumerate_password

    BOOL enumerate_password_recursive(int password[], int levelchain[], int length, int lengthmax){
        int n = ngramlevel->sizeof_n;
        int position = 0; // prefix position
        /* get current level from the levelchain (needs to be adjusted by n, since the IP uses
           1 level of the levelchain, but n-1 characters of the password) */
        int level = levelchain[length - (n-2)];
        // get position of the prefix (the last added (n-1) characters)
        get_positionfromngramasint(&position, password + (length - (n-1)), n-1, alphabet->size);
        // if the new password has the maximum length, check the endprob level
        if(length == lengthmax && level == ngramlevel->ep[position]){
            // handle created password (returns FALSE, if max attempts have been reached)
            return handle_createdpassword(password, levelchain, length);
        }
        else if(length < lengthmax){
            // for each lastgram with the current level and position
            for(int i = 0; i < (sortedlastgram[level].index)[position]; i++){
                // append lastgram to the password
                password[length] = (sortedlastgram[level].lastgrams)[position][i];
                // call the recursive function with length + 1
                if(!enumerate_password_recursive(password, levelchain, length + 1, lengthmax))
                    return FALSE; // return FALSE, if max attempts have been reached
            }
        }
        return TRUE;
    } // end enumerate_password_recursive

Listing 4.4: Simplified function showing the process of enumerate_password and enumerate_password_recursive

    BOOL getnext_levelchain(int levelchain[], int length, int levelmax, BOOL newchain){
        // set level to levelmax...
        int level = levelmax;
        // ...but if the level would exceed the global maximum level...
        if(level > (glbl_maxlevel - 1))
            level = glbl_maxlevel - 1; // ...set level to the global maximum level
        // check if a chain is possible at all
        if(levelmax > (glbl_maxlevel - 1) * length)
            return FALSE; // no levelchain can be created
        // if no new chain should be created...
        if(!newchain)
            levelchain[length - 2]++; // ...increase the second to last entry to avoid doublets

        // for each level <= levelmax (or glbl_maxlevel), set levelchain[0]
        for(; levelchain[0] <= level; levelchain[0]++){
            // call the recursive function to determine the 2nd to last entry of the levelchain
            if(generate_levelchain_recursive(levelchain, 1, length, levelchain[0], levelmax))
                return TRUE; // levelchain set and accepted
            // if the current levelchain[0] is increased, reset the following entry (levelchain[1])
            levelchain[1] = 0;
        }
        return FALSE; // no more levelchains can be created
    } // end getnext_levelchain

    BOOL generate_levelchain_recursive(int levelchain[], int depth, int length,
                                       int levelcur, int levelmax){
        // if the last entry of the levelchain is reached...
        if(depth == (length - 1)){
            // ...it must be the rest of levelmax...
            levelchain[length - 1] = levelmax - levelcur;
            // ...and should not be larger than the global maximum level
            if(levelchain[length - 1] > (glbl_maxlevel - 1))
                return FALSE; // reject the generated levelchain
            else
                return TRUE; // accept the generated levelchain
        }
        else{ // depth < (length - 1)
            // get the maximum possible level, but do not exceed the global maximum level
            int level = levelmax - levelcur;
            if(level > (glbl_maxlevel - 1))
                level = (glbl_maxlevel - 1);
            // proceed from the level set previously
            for(; levelchain[depth] <= level; levelchain[depth]++){
                // compute the new current level
                int levelnew = levelcur + levelchain[depth];
                // call the recursive function to determine levelchain[depth + 1] with levelnew
                if(generate_levelchain_recursive(levelchain, depth + 1, length, levelnew, levelmax))
                    return TRUE; // levelchain set and accepted
                // if the current levelchain[depth] is increased, reset the following entry
                levelchain[depth + 1] = 0;
            }
        }
        return FALSE; // no more levelchains can be created
    } // end generate_levelchain_recursive

Listing 4.5: Simplified function showing the process of getnext_levelchain and generate_levelchain_recursive

Figure 4.4: Nassi-Shneiderman diagram showing the process of the recursive function enumerate_password_recursive

Figure 4.5: Nassi-Shneiderman diagram showing the process of the recursive function generate_levelchain_recursive (with LC = levelchain)

5 Test Setup

To evaluate the performance of the improved OMEN version, different settings and input password lists are used. The tests are executed using the simulated attack mode (see Section 3.7). We evaluate OMEN based on the number of attempts required to correctly guess passwords of a set of test data (see Section 2.2.1). The result of a test run-through is the percentage of passwords of the test data guessed correctly within a fixed number of guesses (i. e. the number of passwords created by OMEN that also occur in the test data, divided by the count of all passwords in this test data). This chapter presents which password lists are used to train the Markov-model and which ones are used to test the created passwords. In addition, we present the variable parameters and the values used to vary the n-gram creation and the password enumeration process during the tests. The results, the discussion of the parameters, and all inferences are presented in Chapter 6.

5.1 Datasets

To produce meaningful results and fully evaluate the effectiveness of OMEN, we use multiple password lists, each split into two subsets: a training set to train the implemented Markov-model and a testing set to test the created passwords during a simulated attack. In addition to several training sets based on leaked password lists, we use special dictionaries containing frequently used terms and passwords to train OMEN. Table 5.1 provides an overview of the selected password lists, the acronyms used, and the sizes of their training and testing sets. The leaked password lists used (RockYou and phpbb) are available, for instance, at [4], the dictionaries at [15].

5.1.1 Training & Testing Sets

RockYou list

The largest leaked password list, the RockYou list (RY), contains over 32.6 million passwords. It was obtained by an SQL injection attack in 2009 and only contains passwords in plain text, without any further information [10].
Thanks to the large size of the RockYou list, we can produce a well-trained Markov-model. 30 million randomly chosen passwords of the RockYou list are used as training set (RY-t), the remaining 2.6 million passwords as testing set (RY-e). In addition, a unique RockYou list with over 14 million passwords, each occurring only once in the list, is used as training set (RY-u). This list is used to gain a good comparison to the InsidePro dictionaries (see below), which also contain unique entries.

phpbb list

The leaked phpbb list contains passwords hashed with MD5, which were cracked by Brandon Enright in 2009 [4]. The list contains non-unique passwords, meaning that some identical passwords occur multiple times in the list. We use a part of these passwords as training set (BB-t) and the remaining ones as testing set (BB-e).

InsidePro dictionaries

InsidePro is a software developer that provides different password recovery tools for specific cases, for instance to crack lost user passwords for different Windows versions [15], [43]. They also provide multiple dictionaries for different languages. We use the English dictionary (ED-t) as a training set. The dictionary contains common English terms without any special characters or numbers. In addition, InsidePro hosts a password list, called PasswordPro, with about 3 million common passwords and no further information. Every password in these lists is unique. We use the whole list as training set (PP-t), testing it against the other training sets.

Table 5.1: Training and testing sets.

    Name                    Acronym   Training Set   Testing Set
    RockYou-list            RY-t/-e   30 Million     2.6 Million
    RockYou-list (unique)   RY-u      14 Million     -
    phpbb-list              BB-t/-e
    PasswordPro-list        PP-t      3 Million      -
    English dictionary      ED-t

5.1.2 Ethical Consideration

Leaked password databases have been used in a number of studies on passwords ([6], [10], [41], [42]). Studying databases of leaked passwords has arguably helped the understanding of users' real-world password practices. The databases we use in our study were already available to the public and only contain the passwords and no further personal information.

5.2 Parameters

The n-gram creation and the password enumeration processes can be varied with several parameters.
To learn more about OMEN and Markov-models for password guessing in general, we vary seven parameters in total: the n-gram size, the alphabet, the smoothing function, the level range, the length scheduling algorithm, the number of guesses, and the end probabilities (EP). In this section the parameters and their possible values are presented. The values chosen cover a solid cross section of all possible values.

n-gram Sizes

To evaluate the benefit of larger n-grams, we test all settings and password lists with 2-, 3-, 4-, and 5-grams.

Alphabets

We test OMEN using four different alphabet types, referred to as small, default, large, and frequency. The small alphabet contains the upper and lower case Latin alphabet and the numbers from zero to nine. The default alphabet equals the one used to evaluate the original OMEN version [10]: it contains the same characters as the small one plus ten special characters. The large alphabet equals the small alphabet with thirty additional special characters. The special characters are determined by their occurrence in the non-unique RockYou list¹. In addition, we use three different frequency alphabets of various lengths, containing the 20, 50, and 72 most frequent characters of the RockYou list. Table 5.2 lists the characters of each alphabet.

Table 5.2: Alphabets used in the test cases.

    Acronym   Characters                        Size   Description
    small     [a-z][A-Z][0-9]                   62     upper and lower Latin alphabet, numbers
    default   small plus 10 special characters  72     upper and lower Latin alphabet, numbers, 10 special characters
    large     small plus 30 special characters  92     upper and lower Latin alphabet, numbers, 30 special characters
    freq20    ae1ionrls02tm3c98dy5              20     20 most frequent characters in non-unique RY
    freq50    ae1ionrls02tm3c98dy54hu6b7kgpj... 50     50 most frequent characters in non-unique RY
    freq72    ae1ionrls02tm3c98dy54hu6b7kgpj... 72     72 most frequent characters in non-unique RY

Smoothing

Besides testing OMEN without any smoothing, we test the supported additive smoothing presented in Section 3.3. The level adjustment factor is closely linked to the smoothing, since it influences the level computation drastically, so we compare them simultaneously. While the factors for CP and LN remain unchanged (see Table 3.1), we vary the ones for IP and EP, with both always using the same factor. We also evaluate a smoothing setting which applies neither the additive smoothing function nor the level adjustment factor, to show that the level adjustment factor is necessary.
The smoothing setups used can be seen in Table 5.3.

¹ Based on the 8-bit ASCII table according to ISO, not allowing \n, \r, \t and space

Table 5.3: Smoothing settings used in the test cases.

    Acronym     Description                                    Additive Smoothing   Level Adjustment Factor
    add0(1)     No smoothing, no level adjustment factor       no                   1
    add0(100)   No smoothing, level adjustment factor of 100   no                   100
    add0(250)   No smoothing, level adjustment factor of 250   no                   250
    add1(100)   Additive smoothing, adjustment factor of 100   yes (δ = 1)          100
    add1(250)   Additive smoothing, adjustment factor of 250   yes (δ = 1)          250

Length Scheduling Settings

The length scheduling algorithms defined in Section 3.5 and the length levels computed during the n-gram creation are combined to test six different length settings. First, we test OMEN using just the computed levels, applying no scheduling algorithm. Also based on the computed levels, a length level factor of 1.0 is applied. In addition, length level factors of 1.0 and 2.0 are used while ignoring the computed length levels. Furthermore, the adaptive length scheduling is used. The selected length scheduling settings provide a good cross section of all possible combinations, allowing us to evaluate and compare the effectiveness of the computed length levels, the length level factor, and the adaptive length scheduling. An overview and the acronyms of the different combinations can be seen in Table 5.4.

Table 5.4: Length scheduling settings used in the test cases.

    Acronym     Description                                  Computed Level   Length Level Factor   Adaptive Scheduling
    w/prop(0)   with probabilities, length level factor 0    yes              0                     no
    w/prop(1)   with probabilities, length level factor 1    yes              1                     no
    noprob(1)   no probabilities, length level factor 1      no               1                     no
    noprob(2)   no probabilities, length level factor 2      no               2                     no
    adaptive    adaptive length scheduling                   -                -                     yes

Maximum Level

The maximum levels used for the test cases are 5 (level range 0 to 5), 10 (level range 0 to 10, the default setting), and 20 (level range 0 to 20); the maximum level equals the factor k in Section 3.2.

Maximum Guesses

To compare the results of the different parameters with each other, we use a rather small amount of 10 million (10^7) attempts. This amount already produces meaningful and comparable results. In addition, we test some promising combinations with 1 billion (10^9) attempts to fully investigate the performance. To be able to compare the results to the ones produced by Dürmuth et al. [10], we test the improved OMEN with 10 billion (10^10) guesses as well.

End Probabilities

Since the end probabilities (EP) are optional (see Section 3.1), we test OMEN with and without applying them.


6 Results

This chapter presents the results for the different training and testing sets, showing the percentage of passwords guessed correctly under the influence of the different parameters. The datasets and parameters used are introduced in Chapter 5. First, multiple relevant combinations of the parameters are tested and the corresponding results are presented using graphs and tables. Since the number of possible combinations is enormous, we focus on the RockYou list as training set, which provides the best-trained Markov-model of all available training sets and therefore produces meaningful results. Using the RockYou list, we determine which settings are the most efficient and relevant for our case and provide a discussion of our choice. Based on the most relevant parameters, we then show the results using the other training and testing sets. Subsequently, the new results are compared with the ones presented in previous work.

6.1 Parameter Evaluation

In this section we determine the most relevant combination of parameters, using the RockYou list as training and testing set. First, we show which n-gram size is the optimum for our case, comparing the results under different parameters. Subsequently, we evaluate the smoothing functions, focusing on the created levels and their distribution. To determine the best length scheduling algorithm, we present and compare the crack process of the different length scheduling algorithms using the most promising smoothing settings. After that, we present the results with different alphabets and investigate the effect of the end probabilities and of a larger maximum level. All results are presented using tables and visualized using various graphs. Even though we could draw conclusions about multiple parameters at once, we only focus on one parameter at a time to increase readability.

6.1.1 n-gram

First, we determine an optimized n-gram size, comparing the produced results and the provided usability.
Even though the overall performance of the different n-gram sizes is independent of the other parameters used, we vary the length scheduling settings to produce meaningful results. However, we only use a representative selection: w/prop(0) (representing the influence of the actual length level) and noprob(1) (representing the influence of the length level factor); see Table 5.4. We use 10 million guesses, which already show the trend of the crack process, and a maximum level of 10. For all other parameters the following settings are used: default alphabet, add0(100) smoothing, apply EP.

Results

Table 6.1 provides an overview of the results produced with the different n-gram sizes. Figure 6.1 shows the crack process with 10 million guesses using the length scheduling settings w/prop(0) (a) and noprob(1) (b).

Table 6.1: Results for the different n-gram sizes, using different length scheduling settings (RY-t & RY-e, add0(100), default alphabet, 10^7 guesses, EP, maximum level 10).

    Length Scheduling   w/prop(0)   noprob(1)
    2-gram                      %           %
    3-gram                      %           %
    4-gram                      %           %
    5-gram                      %           %

Figure 6.1: Graphs showing the crack process (passwords cracked of 2.6 million against guesses) for the different n-gram sizes, using the length scheduling settings w/prop(0) (a) and noprob(1) (b) (RY-t & RY-e, add0(100), default alphabet, 10^7 guesses, EP, maximum level 10).

Discussion & Conclusion

The overall results using the different n-gram sizes differ widely, but the graphs show that OMEN has a great cracking speed, guessing most of the passwords within the first million

guesses. Especially the curves in Figure 6.1b indicate that the passwords are enumerated with an approximately decreasing probability. The graph sections without any pitch, which occur especially in Figure 6.1a, are investigated later in this chapter. 3-grams perform as expected (based on the results by Dürmuth et al. [10]) and outperform 2-grams by nearly 10 %. Using 4-grams, OMEN is even more efficient: after 10 million guesses, 4-grams outperform 3-grams by more than 8 %. When using a better length scheduling algorithm (e. g. noprob(1)), nearly 10 % more passwords are guessed correctly. In addition, the chosen level adjustment factor is optimized for 3-grams; using a level adjustment factor optimized for 4-grams, the results can be improved even further (see Sections 6.1.2 and 6.1.3). 5-grams produce even better results (4.5 % more than 4-grams on average), but the trend already shows that increasing the n-gram size does not increase the crack rate proportionally. However, due to the drastically increased runtime and the huge memory requirements, 5-grams are not suitable for the upcoming evaluations. Therefore, we use 4-grams for the upcoming parameter evaluations and for the dataset evaluation in Section 6.2. For the comparison with the original OMEN version in Section 6.3 we use both 4- and 5-grams.

6.1.2 Smoothing

Next, we investigate the effect of the additive smoothing function and the level adjustment factor, focusing on the produced levels and their distribution. We present and compare the level distributions for IP, CP, and EP using the different smoothing settings. The only additional parameters relevant for the level creation are the n-gram size, the alphabet, and the maximum level. We use 4-grams, the default alphabet, and a maximum level of 10, which produce reliable results as shown above. The other parameters, namely the length scheduling and the number of guesses, do not influence the produced levels.
The fixed parameters are: 4-grams, default alphabet, maximum level 10. To produce the crack results for the smoothing settings, we use 10 million guesses, the same length scheduling settings as for the n-gram evaluation (w/prop(0) and noprob(1)), and apply the end probabilities.

Results

The crack results for the different smoothing functions using the selected length scheduling settings w/prop(0) and noprob(1) can be seen in Table 6.2.

Discussion

The results in Table 6.2 show that the influence of the smoothing settings on the performance is minor: considering the different smoothing settings for one length scheduling setting, the crack rate varies by less than 1 %. The only exception is add0(1), which applies no level adjustment factor and produces the worst results.

Table 6.2: Smoothing settings result comparison using different length scheduling settings (RY-t & RY-e, 4-grams, default alphabet, 10^7 guesses, EP, maximum level 10).

    Smoothing   w/prop(0)   noprob(1)
    add0(1)             %           %
    add0(100)           %           %
    add0(250)           %           %
    add1(100)           %           %
    add1(250)           %           %

We present and discuss two comparisons: First, we compare add0(1), add0(100), and add0(250) with each other, to point out the influence of the level adjustment factor on the level distribution of IP and EP (for CP we always use a factor of 2) and the consequential results. Subsequently, we compare the levels created using add0(250) and add1(250) to show the effect of the additive smoothing function, focusing on CP.

Level Adjustment Factor

The histograms in Figures 6.2 and 6.3 visualize the level distribution of the different smoothing settings for IP and EP, respectively; the associated tables in the figures show the exact values. Note that the visualized counts in the histograms are cut off to increase the comparability (considering level 10). The count of level 10 is so huge because many 3-grams do not occur in the training set at all. The problems occurring in the level creation of IP and EP are analogous, therefore we only elucidate the IP case. The level distribution with no level adjustment factor is not useful: no 3-grams with a level of 0 to 2 are created and only a few for levels 3 to 5. As can be seen in Table 6.2, this level distribution is hard to evaluate and results in a bad performance (30.72 % on average). Using a level adjustment factor improves the distribution and the produced results, because more 3-grams with lower levels are created: with add0(100) (level adjustment factor 100), twelve 3-grams with level 0 are created and the average crack rate improves. With add0(250) (level adjustment factor 250), nearly eighty 3-grams with level 0 are created and the average crack rate improves further.
A higher level adjustment factor would improve the distribution even further, but it would not necessarily result in an improved performance: the level adjustment factor distorts the computed probabilities, and with a high factor, less probable passwords are created within the first guesses. Especially with other datasets, a higher level adjustment factor may distort the actual probabilities.

Figure 6.2: Histogram and table showing the amount of IP 3-grams per level created with add0(1), add0(100), and add0(250) (RY-t, 4-gram, default alphabet, maximum level 10).

Figure 6.3: Histogram and table showing the amount of EP 3-grams per level created with add0(1), add0(100), and add0(250) (RY-t, 4-gram, default alphabet, maximum level 10).

Additive Smoothing Function

Figure 6.4 shows a histogram visualizing the level distribution of the CP 4-grams created with and without applying the additive smoothing function; the associated table shows the exact values. Note that the visualized counts in the histogram are cut off to increase the comparability (considering levels 3 and 10). Using the additive smoothing function, the distribution is shifted towards level 3. The reason for this accumulation is shown in Formula 6.1: if a 4-gram has count = 0 and all 4-grams with the same prefix have count = 0 as well, all 4-grams with this prefix are assigned level 3 (if the default alphabet is used). Without the additive smoothing, those 4-grams are assigned level 10.

    prob_xyzi = (count_xyzi + δ) / (Σ_j count_xyzj + δ·|Σ|) = 1/72,
        with δ = 1, |Σ| = 72, count_xyz* = 0
    −log(c1 · 1/72) ≈ 3.585  ⟹  level_xyzi = 3,
        with c1 = 2 (level adjustment factor of CP)                    (6.1)

Even though the additive smoothing assigns 4-grams with count = 0 to level 3, the adjustment has only a slight influence on the performance, because the probabilities are conditional: 4-grams with a small count, where other 4-grams with the same prefix have a high count, still get a high level. The results in Table 6.2 indicate that applying the additive smoothing has only a slight influence on the crack rate. But the results are not clear-cut: with a level adjustment factor of 100, the crack rate increases using w/prop(0), but decreases using noprob(1). The effectiveness of the additive smoothing function, and of the smoothing settings in general, depends heavily on the length scheduling setting used. Therefore, no final conclusion about the effectiveness of the additive smoothing can be drawn at this point. In Section 6.1.3, which evaluates the length scheduling settings, the influence of the additive smoothing function is investigated further. Figure 6.5 shows the influence of the additive smoothing on the IP level distribution.
Since the influence on EP is equivalent, its distribution is not shown. The probabilities of IP are not conditional, therefore the additive smoothing with δ = 1 has only a slight effect, shifting the level distribution towards level 10. Applying the additive smoothing to a smaller training set may even deteriorate the level distribution: if only a few n-grams are counted, adding 1 to each n-gram count has a huge influence on the probability computation and may disturb it.

Figure 6.4: Histogram and table showing the amount of CP 4-grams per level created with add0(250) and add1(250) (RY-t, default alphabet, maximum level 10).

Figure 6.5: Histogram and table showing the amount of IP 3-grams per level created with add0(250) and add1(250) (RY-t, 4-grams, default alphabet, maximum level 10).

Conclusion

Table 6.2 already shows that the level adjustment factor is necessary to achieve a good performance. With no level adjustment factor, the lower IP and EP levels are not computed at all, because the count of every single occurring n-gram, divided by the count of all occurring n-grams, results in a small probability and therefore a high level. With a level adjustment factor of 100 or 250 the levels are adjusted and a better distribution can be achieved. Choosing an efficient level adjustment factor is not a simple task, because it is influenced by several parameters, like the n-gram size, and especially by the training set used. A good solution would be to generate the level adjustment factor dynamically at runtime, based on the computed level distribution. This additional computation may increase the runtime, but runtime is not a critical factor of the n-gram creation. The level distribution comparisons with and without applying the additive smoothing show that the additive smoothing is not that effective and may even deteriorate the level distribution when a smaller training set is used. More advanced smoothing functions could improve the level distribution more consistently. Besides add0(1), all smoothing settings produce decent results. However, the results with and without applying the additive smoothing are inconclusive. Therefore, we use both the additive smoothing function and no smoothing function, each in combination with the level adjustment factors 100 and 250, in the following section to evaluate the length scheduling settings (i. e. we use add0(100), add0(250), add1(100), and add1(250)). Those results show that add1(250) performs best and is the optimal smoothing setting of the ones available for the RockYou testing set.

6.1.3 Length Scheduling

In this section we evaluate the performance of the length scheduling algorithms.
The earlier results already show that the length scheduling setting has a huge impact on the overall performance and the cracking process, while being independent of some of the other parameters: the length of a password does not change when a different alphabet or n-gram size is used. We therefore use the default alphabet and 4-grams, because both produce reliable results as shown above. However, the smoothing setting used has a noticeable influence, especially on the adaptive length scheduling, as shown below. We use the most promising smoothing settings, based on the conclusion of the previous section. The maximum level is set to 10 for the upcoming tests. The fixed parameters are therefore: 4-grams, default alphabet, maximum level 10, apply EP. We present the results for the different length scheduling settings and smoothing functions using tables and graphs. An interesting factor regarding the length scheduling algorithms is the number of guesses: we focus on the results using 10 million guesses, but since the behavior of the length scheduling algorithms with more guesses is an important factor, we also test the most promising combinations with 1 billion guesses.

Results

Table 6.3 shows the results for the different length scheduling settings, using different smoothing functions and 10 million guesses.

Table 6.3: Length scheduling results (w/prop(0), w/prop(1), noprob(1), noprob(2), adaptive) compared across the smoothing settings add0(100), add0(250), add1(100), and add1(250) (RY-t & RY-e, 4-grams, default alphabet, 10^7 guesses, EP, maximum level 10).

Discussion

We consider each setting individually and evaluate its performance. In addition, the differences to each previously presented setting are pointed out. To ensure comparability between the results, we use the smoothing setting add0(100), even though it is not the best choice in general. The results are presented using graphs that show the cracking process and the lengths used. Histograms show the number of created and correctly guessed passwords for each length. Since the adaptive length scheduling implements a completely different approach, it is considered from a different angle: to explain its varying performance, we present its results under different smoothing settings. Subsequently we draw an overall conclusion and point out the best length scheduling setting for our case.

w/prop(0)

The setting using the computed length levels and no length level factor produces the worst results (32.70 % on average). The length levels used by w/prop(0) to determine the length order can be seen in Figure 6.6 (the scale of the histogram is chosen to ease comparison with the other length level histograms, which visualize larger length levels). The cracking process of w/prop(0) using the add0(100) smoothing setting is visualized in Figure 6.7. These are the worst results of all w/prop(0) test cases, but they clearly show the general problem with this length scheduling setting. The most interesting section is highlighted by the oval.
It covers the section of the graph from approximately 6 to 9 million guesses. In this section the curve has no slope at all: OMEN has enumerated nearly 3 million passwords without guessing a single one that occurs in the testing set. Figure 6.7 also shows the lengths used alongside the cracking process, which reveals the problem: in the highlighted section OMEN enumerates passwords of length 11 and larger. Once the length scheduling chooses a smaller length, the

crack rate increases drastically. As mentioned in Section 3.6, there are far more possible combinations for large lengths than for smaller ones. Therefore, the probability of guessing a long password that also occurs in the testing set is quite small. Figure 6.8 shows a histogram of the number of passwords created as well as the number of passwords guessed correctly for each length; the figure also provides a table containing the exact values. The ratio between created and cracked passwords confirms the problem: OMEN creates over 2 million passwords of length 10, but cracks less than 20 thousand, because there are so many possible combinations. With a length of 6, OMEN cracks nearly 390 thousand passwords with approximately 410 thousand guesses. Note that the testing set is not unique; therefore, the number of passwords cracked may be larger than the number of passwords created (see the results for length 5 in Figure 6.8). This result shows that a length selection based solely on the computed levels is not efficient, because the passwords guessed within the first 10 million guesses are too long.

Figure 6.6: Histogram showing the length levels used by w/prop(0).
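The note that the testing set is not unique, so a single created password can crack several testing-set entries, can be made concrete with a small sketch (the function name and structure are ours, not OMEN's):

```python
from collections import Counter

def count_cracked(guesses, testing_set):
    # the testing set is a multiset: one guess cracks every
    # duplicate occurrence of that password
    freq = Counter(testing_set)
    return sum(freq[g] for g in set(guesses))

# one guess, three cracked passwords: cracked > created is possible
print(count_cracked(["123456"], ["123456", "123456", "123456", "qwerty"]))
# → 3
```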

Figure 6.7: Graph showing the results using w/prop(0) and add0(100) and the lengths used, highlighting a section with nearly no slope (RY-t & RY-e, 4-gram, default alphabet, 10^7 guesses, EP, maximum level 10).

Figure 6.8: Histogram and table showing the number of created and cracked passwords for each length using w/prop(0) and add0(100) (RY-t & RY-e, 4-gram, default alphabet, 10^7 guesses, EP, maximum level 10).

Figure 6.9: Histogram showing the length levels (computed length level plus length level factor) used by w/prop(1).

w/prop(1)

Unlike w/prop(0), the setting w/prop(1) applies a length level factor of 1 to the computed length levels. The adjusted length levels can be seen in Figure 6.9. On average, w/prop(1) achieves a crack rate of 40.07 % and outperforms w/prop(0) by over 7 %, independently of the smoothing function used. Figure 6.10 shows the cracking process and the associated lengths, again using the add0(100) smoothing setting. The graph visualizes the effect of the length level factor: since the actual length value is added to the corresponding length level, the larger lengths are not used to create any passwords within the first 10 million guesses. With the length scheduling setting w/prop(1), OMEN achieves a steady crack rate and good overall results. Figure 6.11 lists the exact number of created and cracked passwords for each length and visualizes the values in a histogram. It shows that nearly 500 thousand passwords of length 6 are guessed correctly, but over 2 million passwords in total are created to achieve that. Compared to Figure 6.8, where 390 thousand passwords are guessed with approximately 410 thousand guesses, the crack rate has decreased drastically. The reason the crack rate decreases with an increasing number of guesses is that the most probable passwords, which occur more than once in the testing set, are guessed first. For example, within the first thousand guesses OMEN creates a password that occurs 39 thousand times in the testing set. Therefore, OMEN achieves a high crack rate at the beginning, but once the most frequent passwords are guessed, the crack rate decreases. The results show that using the length level factor increases the overall performance drastically, because the smaller lengths are guessed first.

Figure 6.10: Graph showing the results using w/prop(1) and add0(100) and the lengths used (RY-t & RY-e, 4-gram, default alphabet, 10^7 guesses, EP, maximum level 10).

Figure 6.11: Histogram and table showing the number of created and cracked passwords for each length using w/prop(1) and add0(100) (RY-t & RY-e, 4-gram, default alphabet, 10^7 guesses, EP, maximum level 10).

Figure 6.12: Histogram showing the length levels (length level factor only) used by noprob(1).

noprob(1)

The noprob(1) length scheduling setting ignores the computed length levels and determines the length order by the length level factor alone. The length levels used can be seen in Figure 6.12. The setting noprob(1) produces good overall results, independent of the smoothing setting used. The performance of noprob(1) and w/prop(1) is nearly equal: on average, OMEN achieves a crack rate with noprob(1) only slightly below that of w/prop(1) (40.07 %). The only noticeable difference occurs with the add1(250) smoothing, where w/prop(1) performs 0.7 % better. However, a different training and testing set may reverse this advantage. Figure 6.13 shows the cracking process and the lengths used for noprob(1). The cracking process is steady, but the slope slightly decreases at each length peak. In these sections of the curve, passwords of length 11 and larger are created; after that, length 4 is selected again. The problem with the larger lengths is the same as with w/prop(0): the huge number of possible combinations at larger lengths decreases the crack rate. In addition, many passwords of lengths 4 and 5 are created overall, due to the small length levels (see Figure 6.12). The histogram and the table in Figure 6.14 show the number of created and cracked passwords. Compared to Figure 6.11 (showing the same information for w/prop(1)), noprob(1) creates nearly 600 thousand passwords of length 5 while cracking only 90 thousand, whereas w/prop(1) cracks nearly 80 thousand such passwords with a third of the guesses (less than 200 thousand). However, by guessing more passwords of lengths 9 and 10 and fewer of length 8, noprob(1) compensates in the overall crack rate. It is noticeable that noprob(1) and w/prop(1) create the same number of passwords of lengths 6 and 7, due to their nearly equal length levels (see Figures 6.12 and 6.9).
The results show that using only the length level factor and ignoring the computed length levels produces good overall performance and achieves a good length variance. However, the length levels of the smaller lengths are slightly too small.
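The length levels behind the settings compared so far can be sketched as a single scheduling key. This is our reading of the descriptions above, not OMEN's code: w/prop(f) adds f times the length to the computed length level, while noprob(f) ignores the computed level and uses f times the length alone.

```python
def length_level(length, computed_level, mode="w/prop", factor=1):
    """Scheduling key for a password length; smaller keys are tried first."""
    if mode == "w/prop":
        # computed length level plus length level factor times the length
        return computed_level + factor * length
    if mode == "noprob":
        # ignore the computed length level entirely
        return factor * length
    raise ValueError(f"unknown mode: {mode}")

# w/prop(0) reduces to the computed level alone; noprob(2) doubles the
# bias toward short passwords compared to noprob(1)
print(length_level(8, computed_level=3, mode="w/prop", factor=0))  # → 3
print(length_level(8, computed_level=3, mode="noprob", factor=2))  # → 16
```

Under this reading, larger factors push long lengths ever further back in the enumeration order, which matches the observation that noprob(2) over-prefers lengths 4 to 6.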

Figure 6.13: Graph showing the results using noprob(1) and add0(100) and the lengths used (RY-t & RY-e, 4-gram, default alphabet, 10^7 guesses, EP, maximum level 10).

Figure 6.14: Histogram and table showing the number of created and cracked passwords for each length using noprob(1) and add0(100) (RY-t & RY-e, 4-gram, default alphabet, 10^7 guesses, EP, maximum level 10).

Figure 6.15: Histogram showing the length levels (length level factor only) used by noprob(2).

noprob(2)

Like noprob(1), the noprob(2) length scheduling setting ignores the computed length levels and determines the length order by the length level factor alone. The length levels used can be seen in Figure 6.15. noprob(2) still outperforms w/prop(0) by nearly 7 %, but compared to w/prop(1) and noprob(1) the results are inferior: on average, noprob(2) guesses nearly 0.5 % fewer passwords than w/prop(1) and noprob(1). With the add0(100) smoothing setting, noprob(2) cracks about 1 % fewer passwords than noprob(1) and nearly 1 % fewer than w/prop(1). w/prop(1) also outperforms noprob(2) by 1 % with the add1(250) smoothing setting. The results are worse because the selected lengths are too small. Figure 6.16 shows that after each length peak no more passwords are cracked for several thousand guesses, for example in the highlighted section between 5 and 6 million guesses. Since a huge number of passwords of lengths 4 to 6 have already been created and cracked within the first 4 million guesses, the probability of guessing more passwords of these lengths is small. The table and the histogram in Figure 6.17 confirm this problem: noprob(2) creates over 3.8 million passwords of length 6 while cracking only 538 thousand, whereas noprob(1) cracks 495 thousand passwords of length 6 with less than 2.1 million guesses. Even though noprob(2) still produces good results, the trend shows that a length level factor larger than 2 decreases the effectiveness of OMEN.

Figure 6.16: Graph showing the results using noprob(2) and add0(100) and the lengths used (RY-t & RY-e, 4-gram, default alphabet, 10^7 guesses, EP, maximum level 10).

Figure 6.17: Histogram and table showing the number of created and cracked passwords for each length using noprob(2) and add0(100) (RY-t & RY-e, 4-gram, default alphabet, 10^7 guesses, EP, maximum level 10).

adaptive

The adaptive length scheduling tests each length once with overall level 0 and then chooses the lengths based on their success rates. It ignores the computed length levels and does not use the length level factor. This setting produces the best results of all length scheduling settings with the add0(250) and add1(250) smoothing settings (40.95 % on average), but the worst results of all with add0(100) and add1(100) (24.52 % on average). To investigate this huge difference, we compare the results with add0(100) and add0(250). The cracking process and the lengths used for add0(100) can be seen in Figure 6.18. The figure shows a small length variance and that several million passwords of length 6 are created. The exact numbers of passwords created and cracked can be seen in the table in Figure 6.19, visualized in a histogram. Over 7 million passwords of length 6 are created, yet less than 70 thousand more passwords are cracked than with the 2 million length-6 passwords created by w/prop(1) under the same smoothing setting. In addition, no passwords of the quite frequent lengths 4, 5, 7, and 8 have been created by adaptive. The underlying problem is the small number of IP and EP 3-grams with level 0: with the add0(100) smoothing setting, only 12 IP and 5 EP 3-grams are created (see Section 6.1.2). Therefore, the calculated success rates for the different lengths are not meaningful. Figure 6.22 visualizes the success rates in a histogram and shows the exact values, as well as the number of created and cracked passwords in the first run-through for each length; it covers the success rates for both considered smoothing settings. As shown on the left of the table (the columns for add0(100)), adaptive determines a success rate of 0 % for several lengths, because no password of the corresponding length could be created.
Since the lengths with a rate of 0 % are not chosen again until all success rates reach 0 %, adaptive retains the poor length choice and therefore the bad results. Note that even if more passwords have been cracked than created for a length, the success rate does not exceed 100 %. This problem does not occur with add0(250): Figures 6.20 and 6.21 show that a better length variance is achieved. The graph in Figure 6.20, showing the cracking process, is steady. Both figures, and especially Figure 6.21, show that passwords of various lengths are created, but since a different smoothing setting has been used, the results cannot be compared to those of the other length scheduling settings. The calculated success rates can be seen on the right-hand side of Figure 6.22. Based on the add0(250) smoothing setting, 79 IP and 46 EP 3-grams are created, and a useful success rate can be calculated within the first run-through. With a good smoothing setting, adaptive produces the best length variance and therefore the best results. However, if the initial success rate calculation does not produce decent results, due to a poor level distribution, the results deteriorate.
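The length selection of the adaptive setting can be sketched as follows. This is our illustration of the idea only; OMEN's actual bookkeeping may differ. As in the text, the success rate is capped at 100 % even when duplicates in the testing set make the cracked count exceed the created count.

```python
def next_length(stats):
    """Pick the password length with the best success rate so far.

    stats maps length -> (created, cracked) from the run-throughs so far.
    Lengths for which nothing could be created get a rate of 0 % and are
    not chosen again until every rate has dropped to 0 %.
    """
    def success_rate(length):
        created, cracked = stats[length]
        return min(1.0, cracked / created) if created else 0.0
    return max(stats, key=success_rate)

# length 6 has the best rate after the first run-through, so it is chosen
stats = {5: (200, 30), 6: (100, 45), 10: (500, 5), 12: (0, 0)}
print(next_length(stats))  # → 6
```

This also shows why a poor level distribution hurts: if the first run-through creates no passwords for most lengths, their rates are pinned at 0 % and the scheduler keeps revisiting the few lengths that produced anything at all.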

Figure 6.18: Graph showing the results using adaptive and add0(100) and the lengths used (RY-t & RY-e, 4-gram, default alphabet, 10^7 guesses, EP, maximum level 10).

Figure 6.19: Histogram and table showing the number of created and cracked passwords for each length using adaptive and add0(100) (RY-t & RY-e, 4-gram, default alphabet, 10^7 guesses, EP, maximum level 10).

Figure 6.20: Graph showing the results using adaptive and add0(250) and the lengths used (RY-t & RY-e, 4-gram, default alphabet, 10^7 guesses, EP, maximum level 10).

Figure 6.21: Histogram and table showing the number of created and cracked passwords for each length using adaptive and add0(250) (RY-t & RY-e, 4-gram, default alphabet, 10^7 guesses, EP, maximum level 10).

Figure 6.22: Histogram and table showing the success rate with overall level 0 for each length, using adaptive with add0(100) and add0(250) (RY-t & RY-e, 4-gram, default alphabet, 10^7 guesses, EP, maximum level 10).

Figure 6.23: Results for w/prop(1), noprob(1), and adaptive, using add0(100) (a) and add1(250) (b) with 10^9 guesses (RY-t & RY-e, 4-gram, default alphabet, EP, maximum level 10).

Conclusion

The results of the length scheduling settings are only slightly influenced by the smoothing setting used, except for the adaptive length scheduling, which only produces good results with a level adjustment factor of 250. The length scheduling settings that produce the best results are w/prop(1), noprob(1), and adaptive. Figure 6.23 shows a table with the results for these three settings with 1 billion guesses; the graphs in the figure visualize the cracking process based on add0(100) (a) and add1(250) (b). In the previous tests with 10 million guesses, the smoothing setting add0(100) produced the best results for noprob(1), and add1(250) the best for w/prop(1) and adaptive. Figure 6.23 shows that adaptive does not catch up using add0(100), but still produces the best results of all using add1(250). The adaptive length scheduling algorithm could produce more consistent results and be independent of the level distribution if the initial success rate were determined from a fixed number of passwords created for each length, instead of all passwords created with an overall level of 0. w/prop(1) and noprob(1) produce good results independent of the smoothing setting used, but the crack rate of these length scheduling settings, and especially of noprob(1), is not as steady as that of adaptive, and some sections have a

decreased slope or no slope at all (see Discussion). As expected, the crack rate for all curves steadily decreases over time. Even though the adaptive setting performs slightly better than w/prop(1), we use w/prop(1) for the upcoming parameter evaluations, because it produces the most consistent results, independent of the smoothing function used. However, for the dataset tests in Section 6.2 we use the adaptive setting to allow a fair comparison to the results of [10].

Alphabet

In this section we compare and evaluate the results with the different alphabets. Since the cracking processes are not of interest here, we only present the percentage of passwords guessed correctly and no graphs showing the process. We use a maximum level of 10 and the optimal settings determined in the previous evaluations: 4-grams, add1(250) smoothing, w/prop(1) length scheduling, apply EP. We test the different alphabets with 10 million and 1 billion guesses to fully evaluate their effectiveness. In addition, we present an overview of the characters and their occurrences in the non-unique RockYou list to draw conclusions about the alphabets used.

Results

Table 6.4 shows the results using the different alphabets and 10 million respectively 1 billion guesses. The frequencies of the 99 characters that occur at least once in the non-unique RockYou list can be seen in Table 6.5, ordered by frequency. Figure 6.24 shows the basic composition of the passwords in the non-unique RockYou list.

Table 6.4: Alphabet result comparison for the alphabets small, default, large, freq20, freq50, and freq72, using 10^7 and 10^9 guesses (RY-t & RY-e, 4-gram, add1(250), w/prop(1), EP).

Discussion & Conclusion

The results in Table 6.4 show that the chosen alphabet has a noticeable influence on the performance. Based on the results of the freq20 alphabet with 1 billion guesses, we can conclude that an alphabet should not be too small.
If an alphabet is too small, many password combinations cannot be created. For instance, using the freq20 alphabet, OMEN will never generate any password containing the quite frequent term password, because the letter p (character 29 in Table 6.5) is not part of the alphabet. Even if an alphabet covers more of the most frequent passwords, like the freq50 alphabet (64.38 %), the performance still suffers. The alphabets default (66.36 %) and freq72 (66.34 %) perform nearly equally well, and even the small alphabet (65.89 %) is not outperformed by much, because the passwords in the RockYou list, used as training and testing set, are rather simple. Most of them consist only of lower-case letters (41.7 %) or of letters in combination with numbers (36.9 %), as Figure 6.24 shows. Only 3.8 % of the passwords in the RockYou list contain a special character, and the special character with the highest frequency, the dot (., character 48 in Table 6.5), has only a marginal frequency.

Table 6.5: Frequency of the 99 characters that occur in the non-unique RockYou list, ordered by frequency. (The list includes the control characters Backspace, End of Text, End of Transmission, Substitute, and Delete.)
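A freq-n alphabet like those evaluated above can be derived from a character-frequency analysis of the training set. The following is our reconstruction of that idea (the function name is ours); it simply keeps the n most frequent characters, so with n too small, frequent characters such as p are dropped and passwords containing them can never be generated.

```python
from collections import Counter

def freq_alphabet(passwords, n):
    """Build a freq-n alphabet: the n most frequent characters
    across all passwords in the training set."""
    counts = Counter(ch for pw in passwords for ch in pw)
    return "".join(ch for ch, _ in counts.most_common(n))

training = ["password", "pass", "love", "12345"]
print(freq_alphabet(training, 3))
```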

Figure 6.24: Pie chart showing the basic composition of the passwords in the RockYou list (values taken from [38]).

Therefore, even the small alphabet covers the majority of the most frequent symbols. With another password list, containing more complex passwords with multiple special characters, the alphabets based on a frequency analysis would perform better than the default or large alphabet. The large alphabet (66.44 %) produces nearly the same results as the default alphabet. With the large alphabet, more 4-grams are created, but the actual transition probabilities remain approximately equal, and therefore the same passwords are enumerated. Even though the large alphabet produces the best overall results, it only slightly outperforms the default alphabet, and the increased computational effort does not justify its use. Therefore, we declare the default alphabet the optimum of the presented alphabets and use it for the upcoming tests.

End Probabilities

In this section we test OMEN with and without the end probabilities, that is, the validation of the last n-gram of each password. To evaluate the influence we use a maximum level of 10 with 10 million and 1 billion guesses. The other parameters are set to the optimal settings determined in the previous evaluations: 4-grams, default alphabet, add1(250) smoothing, w/prop(1) length scheduling.
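The end-probability check, validating the last n-gram of a candidate, might look like the following sketch. This is an assumption about the mechanism, not OMEN's code: we assume the EP table maps a password's final (n-1)-gram to a level, and that endings with no level within the maximum are rejected; all names are ours.

```python
def passes_ep(password, ep_levels, n=4, max_level=10):
    """Accept a candidate only if its ending has a known EP level."""
    tail = password[-(n - 1):]           # final (n-1)-gram of the candidate
    # endings absent from the table are treated as beyond the maximum level
    return ep_levels.get(tail, max_level + 1) <= max_level

# 'ord' is a plausible password ending in this toy table, 'q7#' is not
ep = {"ord": 2, "123": 0}
print(passes_ep("password", ep))   # → True
print(passes_ep("passq7#", ep))    # → False
```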


More information

Privacy and Security in library RFID Issues, Practices and Architecture

Privacy and Security in library RFID Issues, Practices and Architecture Privacy and Security in library RFID Issues, Practices and Architecture David Molnar and David Wagner University of California, Berkeley CCS '04 October 2004 Overview Motivation RFID Background Library

More information

Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

Formal Languages and Automata Theory - Regular Expressions and Finite Automata - Formal Languages and Automata Theory - Regular Expressions and Finite Automata - Samarjit Chakraborty Computer Engineering and Networks Laboratory Swiss Federal Institute of Technology (ETH) Zürich March

More information

1.2 Using the GPG Gen key Command

1.2 Using the GPG Gen key Command Creating Your Personal Key Pair GPG uses public key cryptography for encrypting and signing messages. Public key cryptography involves your public key which is distributed to the public and is used to

More information

NESCO/NESCOR Common TFE Analysis: CIP-007 R5.3 Password Complexity

NESCO/NESCOR Common TFE Analysis: CIP-007 R5.3 Password Complexity NESCO/NESCOR Common TFE Analysis: CIP-007 R5.3 Password Complexity National Electric Sector Cybersecurity Organization (NESCO)/NESCO Resource (NESCOR) DISCLAIMER OF WARRANTIES AND LIMITATION OF LIABILITIES

More information

Better PHP Security Learning from Adobe. Bill Condo @mavrck PHP Security: Adobe Hack

Better PHP Security Learning from Adobe. Bill Condo @mavrck PHP Security: Adobe Hack Better PHP Security Learning from Adobe Quickly, about me Consultant! Senior Engineer! Developer! Senior Developer! Director of Tech! Hosting Manager! Support Tech 2014: Digital Director Lunne Marketing

More information

Intrusion Detection Systems

Intrusion Detection Systems Intrusion Detection Systems (IDS) Presented by Erland Jonsson Department of Computer Science and Engineering Contents Motivation and basics (Why and what?) IDS types and detection principles Key Data Problems

More information

What is Web Security? Motivation

What is Web Security? Motivation brucker@inf.ethz.ch http://www.brucker.ch/ Information Security ETH Zürich Zürich, Switzerland Information Security Fundamentals March 23, 2004 The End Users View The Server Providers View What is Web

More information

Information and Communications Technology Courses at a Glance

Information and Communications Technology Courses at a Glance Information and Communications Technology Courses at a Glance Level 1 Courses ICT121 Introduction to Computer Systems Architecture This is an introductory course on the architecture of modern computer

More information

Chapter 11 Security+ Guide to Network Security Fundamentals, Third Edition Basic Cryptography

Chapter 11 Security+ Guide to Network Security Fundamentals, Third Edition Basic Cryptography Chapter 11 Security+ Guide to Network Security Fundamentals, Third Edition Basic Cryptography What Is Steganography? Steganography Process of hiding the existence of the data within another file Example:

More information

Building Better Passwords using Probabilistic Techniques

Building Better Passwords using Probabilistic Techniques Building Better using Probabilistic Techniques Shiva Houshmand Florida State University Department of Computer Science Tallahassee, FL 32306-4530 1 (850) 645-7339 sh09r@my.fsu.edu Sudhir Aggarwal Florida

More information

- Table of Contents -

- Table of Contents - - Table of Contents - 1 INTRODUCTION... 1 1.1 TARGET READERS OF THIS DOCUMENT... 1 1.2 ORGANIZATION OF THIS DOCUMENT... 2 1.3 COMMON CRITERIA STANDARDS DOCUMENTS... 3 1.4 TERMS AND DEFINITIONS... 4 2 OVERVIEW

More information

SkyRecon Cryptographic Module (SCM)

SkyRecon Cryptographic Module (SCM) SkyRecon Cryptographic Module (SCM) FIPS 140-2 Documentation: Security Policy Abstract This document specifies the security policy for the SkyRecon Cryptographic Module (SCM) as described in FIPS PUB 140-2.

More information

1 Step 1: Select... Files to Encrypt 2 Step 2: Confirm... Name of Archive 3 Step 3: Define... Pass Phrase

1 Step 1: Select... Files to Encrypt 2 Step 2: Confirm... Name of Archive 3 Step 3: Define... Pass Phrase Contents I Table of Contents Foreword 0 Part I Introduction 2 1 What is?... 2 Part II Encrypting Files 1,2,3 2 1 Step 1: Select... Files to Encrypt 2 2 Step 2: Confirm... Name of Archive 3 3 Step 3: Define...

More information

What Are the Topics of Economic Policy Research and Review?

What Are the Topics of Economic Policy Research and Review? Institute for Economics (ECON) Chair in Economic Policy Prof. Dr. Ingrid Ott E-Mail: ingrid.ott@kit.edu URL: wipo.econ.kit.edu Karlsruhe, June 26th 2014 Seminar in Economic Policy Winter 2014/15 ICT, Big

More information

Planning the Installation and Installing SQL Server

Planning the Installation and Installing SQL Server Chapter 2 Planning the Installation and Installing SQL Server In This Chapter c SQL Server Editions c Planning Phase c Installing SQL Server 22 Microsoft SQL Server 2012: A Beginner s Guide This chapter

More information

DATA QUALITY DATA BASE QUALITY INFORMATION SYSTEM QUALITY

DATA QUALITY DATA BASE QUALITY INFORMATION SYSTEM QUALITY DATA QUALITY DATA BASE QUALITY INFORMATION SYSTEM QUALITY The content of those documents are the exclusive property of REVER. The aim of those documents is to provide information and should, in no case,

More information

Northumberland Knowledge

Northumberland Knowledge Northumberland Knowledge Know Guide How to Analyse Data - November 2012 - This page has been left blank 2 About this guide The Know Guides are a suite of documents that provide useful information about

More information

Contents. Intrusion Detection Systems (IDS) Intrusion Detection. Why Intrusion Detection? What is Intrusion Detection?

Contents. Intrusion Detection Systems (IDS) Intrusion Detection. Why Intrusion Detection? What is Intrusion Detection? Contents Intrusion Detection Systems (IDS) Presented by Erland Jonsson Department of Computer Science and Engineering Motivation and basics (Why and what?) IDS types and principles Key Data Problems with

More information

Cracking Passwords With Time-memory Trade-offs. Gildas Avoine Université catholique de Louvain, Belgium

Cracking Passwords With Time-memory Trade-offs. Gildas Avoine Université catholique de Louvain, Belgium Cracking Passwords With Time-memory Trade-offs Gildas Avoine Université catholique de Louvain, Belgium SUMMARY Motivations Hellman Tables Oechslin Tables Real Life Examples Fingerprint Tables Conclusion

More information

E-Book Security Assessment: NuvoMedia Rocket ebook TM

E-Book Security Assessment: NuvoMedia Rocket ebook TM E-Book Security Assessment: NuvoMedia Rocket ebook TM July 1999 Prepared For: The Association of American Publishers Prepared By: Global Integrity Corporation 4180 La Jolla Village Drive, Suite 450 La

More information

CIS 6930 Emerging Topics in Network Security. Topic 2. Network Security Primitives

CIS 6930 Emerging Topics in Network Security. Topic 2. Network Security Primitives CIS 6930 Emerging Topics in Network Security Topic 2. Network Security Primitives 1 Outline Absolute basics Encryption/Decryption; Digital signatures; D-H key exchange; Hash functions; Application of hash

More information

YubiKey Integration for Full Disk Encryption

YubiKey Integration for Full Disk Encryption YubiKey Integration for Full Disk Encryption Pre-Boot Authentication Version 1.2 May 7, 2012 Introduction Disclaimer yubico Yubico is the leading provider of simple, open online identity protection. The

More information

Client Server Registration Protocol

Client Server Registration Protocol Client Server Registration Protocol The Client-Server protocol involves these following steps: 1. Login 2. Discovery phase User (Alice or Bob) has K s Server (S) has hash[pw A ].The passwords hashes are

More information

Security in Android apps

Security in Android apps Security in Android apps Falco Peijnenburg (3749002) August 16, 2013 Abstract Apps can be released on the Google Play store through the Google Developer Console. The Google Play store only allows apps

More information

Vulnerability scanning

Vulnerability scanning Mag. iur. Dr. techn. Michael Sonntag Vulnerability scanning Security and Privacy VSE Prag, 7-11.6.2010 E-Mail: sonntag@fim.uni-linz.ac.at http://www.fim.uni-linz.ac.at/staff/sonntag.htm Institute for Information

More information

EBERSPÄCHER ELECTRONICS automotive bus systems. solutions for network analysis

EBERSPÄCHER ELECTRONICS automotive bus systems. solutions for network analysis EBERSPÄCHER ELECTRONICS automotive bus systems solutions for network analysis DRIVING THE MOBILITY OF TOMORROW 2 AUTOmotive bus systems System Overview Analyzing Networks in all Development Phases Control

More information

Fundamentele Informatica II

Fundamentele Informatica II Fundamentele Informatica II Answer to selected exercises 1 John C Martin: Introduction to Languages and the Theory of Computation M.M. Bonsangue (and J. Kleijn) Fall 2011 Let L be a language. It is clear

More information

Dashlane Security Whitepaper

Dashlane Security Whitepaper Dashlane Security Whitepaper November 2014 Protection of User Data in Dashlane Protection of User Data in Dashlane relies on 3 separate secrets: The User Master Password Never stored locally nor remotely.

More information

Implementation and Comparison of Various Digital Signature Algorithms. -Nazia Sarang Boise State University

Implementation and Comparison of Various Digital Signature Algorithms. -Nazia Sarang Boise State University Implementation and Comparison of Various Digital Signature Algorithms -Nazia Sarang Boise State University What is a Digital Signature? A digital signature is used as a tool to authenticate the information

More information

Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System

Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Oana NICOLAE Faculty of Mathematics and Computer Science, Department of Computer Science, University of Craiova, Romania oananicolae1981@yahoo.com

More information

SENSE Security overview 2014

SENSE Security overview 2014 SENSE Security overview 2014 Abstract... 3 Overview... 4 Installation... 6 Device Control... 7 Enrolment Process... 8 Authentication... 9 Network Protection... 12 Local Storage... 13 Conclusion... 15 2

More information

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Karagpur

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Karagpur Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Karagpur Lecture No. #06 Cryptanalysis of Classical Ciphers (Refer

More information

2014 HSC Software Design and Development Marking Guidelines

2014 HSC Software Design and Development Marking Guidelines 014 HSC Software Design and Development Marking Guidelines Section I Multiple-choice Answer Key Question Answer 1 B A A 4 D 5 A 6 D 7 A 8 B 9 C 10 D 11 B 1 B 1 A 14 A 15 B 16 D 17 C 18 C 19 D 0 D 1 Section

More information

Secure Authentication and Session. State Management for Web Services

Secure Authentication and Session. State Management for Web Services Lehman 0 Secure Authentication and Session State Management for Web Services Clay Lehman CSC 499: Honors Thesis Supervised by: Dr. R. Michael Young Lehman 1 1. Introduction Web services are a relatively

More information

Finding Anomalies in Windows Event Logs Using Standard Deviation

Finding Anomalies in Windows Event Logs Using Standard Deviation Finding Anomalies in Windows Event Logs Using Standard Deviation John Dwyer Department of Computer Science Northern Kentucky University Highland Heights, KY 41099, USA dwyerj1@nku.edu Traian Marius Truta

More information

Recent Advances in Applied & Biomedical Informatics and Computational Engineering in Systems Applications

Recent Advances in Applied & Biomedical Informatics and Computational Engineering in Systems Applications Comparison of Technologies for Software ization PETR SUBA, JOSEF HORALEK, MARTIN HATAS Faculty of Informatics and Management, University of Hradec Králové, Rokitanského 62, 500 03 Hradec Kralove Czech

More information

Plain English Guide To Common Criteria Requirements In The. Field Device Protection Profile Version 0.75

Plain English Guide To Common Criteria Requirements In The. Field Device Protection Profile Version 0.75 Plain English Guide To Common Criteria Requirements In The Field Device Protection Profile Version 0.75 Prepared For: Process Control Security Requirements Forum (PCSRF) Prepared By: Digital Bond, Inc.

More information

HP ProtectTools Windows Mobile

HP ProtectTools Windows Mobile HP ProtectTools Windows Mobile White Paper Introduction... 2 Risks... 2 Features... 3 Password Hashing... 4 Password Generation... 5 Password Types... 5 Strong Alphanumeric Passwords... 5 Password Lifetime...5

More information

Chapter 1. Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705. CS-4337 Organization of Programming Languages

Chapter 1. Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705. CS-4337 Organization of Programming Languages Chapter 1 CS-4337 Organization of Programming Languages Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705 Chapter 1 Topics Reasons for Studying Concepts of Programming

More information

Project: Simulated Encrypted File System (SEFS)

Project: Simulated Encrypted File System (SEFS) Project: Simulated Encrypted File System (SEFS) Omar Chowdhury Fall 2015 CS526: Information Security 1 Motivation Traditionally files are stored in the disk in plaintext. If the disk gets stolen by a perpetrator,

More information

Chapter 23. Database Security. Security Issues. Database Security

Chapter 23. Database Security. Security Issues. Database Security Chapter 23 Database Security Security Issues Legal and ethical issues Policy issues System-related issues The need to identify multiple security levels 2 Database Security A DBMS typically includes a database

More information

USB Portable Storage Device: Security Problem Definition Summary

USB Portable Storage Device: Security Problem Definition Summary USB Portable Storage Device: Security Problem Definition Summary Introduction The USB Portable Storage Device (hereafter referred to as the device or the TOE ) is a portable storage device that provides

More information

Key Hopping A Security Enhancement Scheme for IEEE 802.11 WEP Standards

Key Hopping A Security Enhancement Scheme for IEEE 802.11 WEP Standards White Paper Key Hopping A Security Enhancement Scheme for IEEE 802.11 WEP Standards By Dr. Wen-Ping Ying, Director of Software Development, February 2002 Introduction Wireless LAN networking allows the

More information

Hash Functions. Integrity checks

Hash Functions. Integrity checks Hash Functions EJ Jung slide 1 Integrity checks Integrity vs. Confidentiality! Integrity: attacker cannot tamper with message! Encryption may not guarantee integrity! Intuition: attacker may able to modify

More information

Firmware security features in HP Compaq business notebooks

Firmware security features in HP Compaq business notebooks HP ProtectTools Firmware security features in HP Compaq business notebooks Embedded security overview... 2 Basics of protection... 2 Protecting against unauthorized access user authentication... 3 Pre-boot

More information

Cryptography Lecture 8. Digital signatures, hash functions

Cryptography Lecture 8. Digital signatures, hash functions Cryptography Lecture 8 Digital signatures, hash functions A Message Authentication Code is what you get from symmetric cryptography A MAC is used to prevent Eve from creating a new message and inserting

More information

FIPS 140-2 Non- Proprietary Security Policy. McAfee SIEM Cryptographic Module, Version 1.0

FIPS 140-2 Non- Proprietary Security Policy. McAfee SIEM Cryptographic Module, Version 1.0 FIPS 40-2 Non- Proprietary Security Policy McAfee SIEM Cryptographic Module, Version.0 Document Version.4 December 2, 203 Document Version.4 McAfee Page of 6 Prepared For: Prepared By: McAfee, Inc. 282

More information

Common Pitfalls in Cryptography for Software Developers. OWASP AppSec Israel July 2006. The OWASP Foundation http://www.owasp.org/

Common Pitfalls in Cryptography for Software Developers. OWASP AppSec Israel July 2006. The OWASP Foundation http://www.owasp.org/ Common Pitfalls in Cryptography for Software Developers OWASP AppSec Israel July 2006 Shay Zalalichin, CISSP AppSec Division Manager, Comsec Consulting shayz@comsecglobal.com Copyright 2006 - The OWASP

More information

EnerVista TM Viewpoint Monitoring v7.10

EnerVista TM Viewpoint Monitoring v7.10 EnerVista TM Viewpoint Monitoring v7.10 Guideform Specifications July 7, 2014 Page 1 of 14 1 - Product Overview 1.1 Viewpoint Monitoring Scope EnerVista Viewpoint Monitoring is an easy to setup, powerful

More information

How To Test Your Web Site On Wapt On A Pc Or Mac Or Mac (Or Mac) On A Mac Or Ipad Or Ipa (Or Ipa) On Pc Or Ipam (Or Pc Or Pc) On An Ip

How To Test Your Web Site On Wapt On A Pc Or Mac Or Mac (Or Mac) On A Mac Or Ipad Or Ipa (Or Ipa) On Pc Or Ipam (Or Pc Or Pc) On An Ip Load testing with WAPT: Quick Start Guide This document describes step by step how to create a simple typical test for a web application, execute it and interpret the results. A brief insight is provided

More information

(C) Global Journal of Engineering Science and Research Management

(C) Global Journal of Engineering Science and Research Management DEPENDABLE STORAGE FOR VEHICLE INSURANCE MANAGEMENT THROUGH SECURED ENCRYPTION IN CLOUD COMPUTING Prof.Abhijeet A.Chincholkar *1, Ms.Najuka Todekar 2 *1 M.E. Digital Electronics, JCOET Yavatmal, India.

More information

ERNW Newsletter 29 / November 2009

ERNW Newsletter 29 / November 2009 ERNW Newsletter 29 / November 2009 Dear Partners and Colleagues, Welcome to the ERNW Newsletter no. 29 covering the topic: Data Leakage Prevention A Practical Evaluation Version 1.0 from 19th of november

More information

Data Deduplication in Slovak Corpora

Data Deduplication in Slovak Corpora Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain

More information

Your Password Complexity Requirements are Worthless. Rick Redman KoreLogic www.korelogic.com

Your Password Complexity Requirements are Worthless. Rick Redman KoreLogic www.korelogic.com Your Password Complexity Requirements are Worthless Rick Redman KoreLogic www.korelogic.com Introduction Rick Redman < rredman@korelogic.com > 88FB D23C 5AC1 8756 5690 6661 A933 6E99 4E2C EF75 Penetration

More information

The string of digits 101101 in the binary number system represents the quantity

The string of digits 101101 in the binary number system represents the quantity Data Representation Section 3.1 Data Types Registers contain either data or control information Control information is a bit or group of bits used to specify the sequence of command signals needed for

More information

The Trivial Cisco IP Phones Compromise

The Trivial Cisco IP Phones Compromise Security analysis of the implications of deploying Cisco Systems SIP-based IP Phones model 7960 Ofir Arkin Founder The Sys-Security Group ofir@sys-security.com http://www.sys-security.com September 2002

More information

Internet Banking Two-Factor Authentication using Smartphones

Internet Banking Two-Factor Authentication using Smartphones Internet Banking Two-Factor Authentication using Smartphones Costin Andrei SOARE IT&C Security Master Department of Economic Informatics and Cybernetics Bucharest University of Economic Studies, Romania

More information

Project Planning and Project Estimation Techniques. Naveen Aggarwal

Project Planning and Project Estimation Techniques. Naveen Aggarwal Project Planning and Project Estimation Techniques Naveen Aggarwal Responsibilities of a software project manager The job responsibility of a project manager ranges from invisible activities like building

More information

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching COSC 348: Computing for Bioinformatics Definitions A pattern (keyword) is an ordered sequence of symbols. Lecture 4: Exact string searching algorithms Lubica Benuskova http://www.cs.otago.ac.nz/cosc348/

More information

C H A P T E R Regular Expressions regular expression

C H A P T E R Regular Expressions regular expression 7 CHAPTER Regular Expressions Most programmers and other power-users of computer systems have used tools that match text patterns. You may have used a Web search engine with a pattern like travel cancun

More information

The City of New York

The City of New York The Policy All passwords and personal identification numbers (PINs) used to protect City of New York systems shall be appropriately configured, periodically changed, and issued for individual use. Scope

More information

Detecting Credit Card Fraud

Detecting Credit Card Fraud Case Study Detecting Credit Card Fraud Analysis of Behaviometrics in an online Payment environment Introduction BehavioSec have been conducting tests on Behaviometrics stemming from card payments within

More information

SAS Data Set Encryption Options

SAS Data Set Encryption Options Technical Paper SAS Data Set Encryption Options SAS product interaction with encrypted data storage Table of Contents Introduction: What Is Encryption?... 1 Test Configuration... 1 Data... 1 Code... 2

More information

Enhancing Cloud Security By: Gotcha (Generating Panoptic Turing Tests to Tell Computers and Human Aparts)

Enhancing Cloud Security By: Gotcha (Generating Panoptic Turing Tests to Tell Computers and Human Aparts) International Journal of Electronic and Electrical Engineering. ISSN 0974-2174 Volume 7, Number 8 (2014), pp. 837-841 International Research Publication House http://www.irphouse.com Enhancing Cloud Security

More information

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) 305 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference

More information

Pulse Secure, LLC. January 9, 2015

Pulse Secure, LLC. January 9, 2015 Pulse Secure Network Connect Cryptographic Module Version 2.0 Non-Proprietary Security Policy Document Version 1.1 Pulse Secure, LLC. January 9, 2015 2015 by Pulse Secure, LLC. All rights reserved. May

More information

Configuration Notes 283

Configuration Notes 283 Mediatrix 4400 Digital Gateway VoIP Trunking with a Legacy PBX June 21, 2011 Proprietary 2011 Media5 Corporation Table of Contents Table of Contents... 2 Introduction... 3 Mediatrix 4400 Digital Gateway

More information

Five Steps to Improve Internal Network Security. Chattanooga ISSA

Five Steps to Improve Internal Network Security. Chattanooga ISSA Five Steps to Improve Internal Network Security Chattanooga ISSA 1 Find Me AverageSecurityGuy.info @averagesecguy stephen@averagesecurityguy.info github.com/averagesecurityguy ChattSec.org 2 Why? The methodical

More information

Appendix A. X-Bone Surety Assessment Report. Developer Architectures and Application Screenshots ISI X-Bone Software Architecture Diagram.

Appendix A. X-Bone Surety Assessment Report. Developer Architectures and Application Screenshots ISI X-Bone Software Architecture Diagram. Appendix A Developer Architectures and Application Screenshots ISI Software Architecture Diagram Figure 6 April, 2003 25 ISI Communications Architecture Appendix A con t Figure 7 ISI GUI Control Page Figure

More information

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,

More information