Cyberspace Security Use Keystroke Dynamics. Alaa Darabseh, B.S. and M.S. A Doctoral Dissertation In Computer Science


Cyberspace Security Use Keystroke Dynamics

by

Alaa Darabseh, B.S. and M.S.

A Doctoral Dissertation
in
Computer Science

Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Approved

Dr. Akbar Siami Namin, Committee Chair
Dr. Rattikorn Hewett
Dr. Donald Jones
Dr. Susan Mengel

Dr. Mark Sheridan, Dean of the Graduate School

August, 2015

Copyright 2015, Alaa Darabseh

ACKNOWLEDGMENTS

First, I would like to thank my advisor, Dr. Akbar Siami Namin, for his great support and the long hours he spent with me. The research presented in this dissertation would not have been possible without his help. I would also like to express my appreciation to my committee members (Drs. Rattikorn Hewett, Donald Jones, and Susan Mengel) for their endless support in making this work possible. I feel privileged and honored to have them as members of my dissertation committee. None of this work could have been done without the support of my father, who always encouraged me to be successful. Without his prayers and love I would never have reached this moment. Finally, I would like to dedicate this achievement to my mother's soul.

TABLE OF CONTENTS

Acknowledgments
Abstract
List of Tables
List of Figures
1. Introduction
   1.1 Biometrics: The Focus of This Dissertation
   1.2 Motivation
   1.3 Research Questions
   1.4 Contributions
2. Background
   2.1 User Authentication in Computer Security
       2.1.1 Something You Know
       2.1.2 Something You Have
       2.1.3 Something You Are
       2.1.4 Combinations of Authentication Factors
   2.2 Biometric Systems
       2.2.1 Performance Measures
   2.3 Keystroke Dynamics
       2.3.1 Static and Dynamic Authentication Systems
       2.3.2 Keystroke Features
   2.4 Literature Review
       2.4.1 Static Verification
       2.4.2 Continuous Verification
3. Motivational Challenges and Contributions of the Thesis
   3.1 Research Motivations
   3.2 Research Contributions
4. Experimental Setup
   4.1 Data Collection
   4.2 Features and their Extractions
5. Keystroke Feature Selection Using Statistical Methods in Hypotheses Testing
   Motivations
   Research Questions
   Data Analysis Methods
       Mann-Whitney Statistical Test
       Effect Sizes
       R's Statistics Package
   Analyses and Results
       Mann-Whitney Test Analyses
       Mann-Whitney Test Results
       Effect-Size Analyses
       Effect-Size Results
       Subset Feature Performance
   Conclusion
6. Keystroke Feature Selection Using Machine Learning-Based Classifications
   Research Questions
   Machine Learning Methods
       Support Vector Machine
       Linear Discriminant Classifier
       K-Nearest Neighbors
       Naïve Bayes
       One-Class SVM
       R Packages
   Machine Learning Analyses and Results
       Feature Items Performance
       Features Performance
   Conclusion
7. Keystroke Feature Selection Using Wrapper-Based Feature Selections
   Filter Approach
   Wrapper Approach
   7.2 Wrapper-Based Feature Subset Selection Techniques
       Greedy Algorithms
       Best-First Search Algorithm
       Genetic Algorithm
       Particle Swarm Optimization
   Weka Analysis Tools
   Results
8. Continuous User Authentication Based on Sequential Change-Point Analysis
   Motivations
   Research Questions
   Intrusion Detection
   Change-Point Detection
       Batch Change-Point Detection
       Sequential Change-Point Detection
   CMP R Package
   Analyses
       Data Set
       Analyses Methods
       Analyses Results
   Conclusion
9. Conclusion and Future Work
Bibliography

ABSTRACT

Most current computer systems authenticate the user's identity only at the point of entry to the system (i.e., at login). However, an effective authentication system includes continuous or frequent monitoring of the identity of the user to ensure that the user's identity remains valid throughout a session. Such a system is called a continuous authentication system. An authentication system with such a security scheme protects against certain attacks, such as session hijacking, that can be performed by a malicious user. Recently, keystroke analysis has acquired popularity as one of the main behavioral biometrics approaches that can be used for continuously authenticating users. There are several advantages to applying keystroke analysis: First, keystroke dynamics are practical, since every user of a computer types on a keyboard. Second, keystroke analysis is inexpensive because it does not require any additional components (such as special video cameras) to sample the corresponding biometric feature. Third, and most importantly, typing rhythms remain available even after the authentication stage has been passed. A major challenge in keystroke analysis is the identification of the major factors that influence the accuracy of the keystroke authentication detector. Two of the most influential factors are the classifier employed and the choice of features. Currently, there is insufficient research that addresses the impact of these factors in continuous authentication analysis; the majority of existing studies in keystroke analysis focus primarily on their impact in static authentication analysis. Understanding the impact of these factors will contribute to improving the performance of keystroke-based continuous authentication systems. Furthermore, most of the existing keystroke analysis schemes require predefined typing models, either for legitimate users or for impostors. However, it is difficult or even impossible in some situations to have typing data of the users (legitimate or impostors) in advance.

For instance, consider a personal computer that a user carries to a college or to a cafe. In this case, only the computer owner (the legitimate user) is known in advance. As another instance, consider a computer that has a guest account in a public library; in this case, none of the system users are known in advance. Thus, a new automated and flexible technique that can authenticate the user without the need for any prior user typing model is needed. This dissertation focuses on improving continuous user authentication systems (based on keystroke dynamics) designed to detect malicious activity caused by another person (an impostor) whose goal is to take over the active session of a valid user. The research is carried out by: 1) studying the impact of the selected features on the performance of keystroke continuous authentication systems; 2) proposing new timing features, based on the utilization of the most frequently used English words (e.g., "the", "and", "for"), that can be useful in distinguishing between users in continuous authentication systems; 3) comparing the performance of keystroke continuous authentication systems under different algorithms; 4) investigating the possibility of improving the accuracy of continuous user authentication systems by combining more than one feature; and 5) proposing a new detector that does not require predefined typing models, either from legitimate users or from impostors.

LIST OF TABLES

4.1: Number of occurrences of the feature items in the English passages
Average of effect sizes for all feature items using 28 users' data, sorted in ascending order
Item performance using SVM
Item performance using OC-SVM
Item performance using KNN
Item performance using NB
Performance comparison of four different keystroke features on SVM, LDC, NB, and KNN
Comparison of features selected
Performance of detection system using different ARL of letter E
Letter performance using sequential change-point detection when employing Lepage statistical tests, the change occurring after 50 observations
Letter performance using sequential change-point detection when employing Lepage statistical tests, the change occurring after 100 observations
Letter performance using sequential change-point detection when employing different statistical test techniques

LIST OF FIGURES

2.1: Traditional biometrics system
2.2: Keystroke dynamics enrollment and authentication schemes
2.3: Security cycles of static and continuous keystroke authentication systems
2.4: Keystroke timing information
4.1: Main window of the application used to collect the data from users
4.2: An example of four keystroke features extracted for the word "the"
A comparison between the performances of four different keystroke features: (a) results obtained using the Mann-Whitney statistical test; (b) results obtained using the Mann-Whitney statistical test and a voting schema
A comparison between the performance of four different keystroke feature items using 28 participants' data with the Mann-Whitney statistical test
The performance of different combinations of different keystroke features using 8 participants' data
FAR values for the 5, 10, and 15 most discriminative items based upon the two approaches applied (p-values and effect sizes) using 28 users' data
FAR values for the 5, 10, and 15 most frequent items of each feature using 28 users' data
Performance comparison of four keystroke features used independently against the ten different combinations, performed on SVM
Performance comparison of four keystroke features used independently against the ten different combinations, performed on KNN
Three main categories of feature selection
Feature subset selection using the wrapper approach
Forward and Backward Selection Wrapper Approach Algorithm
Best-First Search Wrapper Approach Algorithm
Genetic Search Wrapper Approach Algorithm
Particle Swarm Optimization Wrapper Approach Algorithm
8.1: An illustration of change-point detection
8.2: An illustration of sequential change-point detection; the detection delay is measured in the amount of data a test needs to signal a change point in the sequence
8.3: Performance of detection system using different ARL of letter E

CHAPTER 1
INTRODUCTION

In modern society, the combination of a username and password is the most widely used authentication method for controlling access to sensitive and important resources, particularly in computer security and related fields. The user claims an identity by providing her username and then proves ownership of the claimed identity by providing a password. However, traditional protection mechanisms that rely on passwords are less than satisfactory and vulnerable to theft. Moreover, such mechanisms verify the user's identity only at the point of entry to the system (i.e., at login). An effective authentication system, by contrast, includes continuous or frequent monitoring of the identity of the user to ensure that the user's identity remains valid throughout a session. Such a system is called a continuous authentication system. An authentication system with such a security scheme protects against certain attacks, such as session hijacking, that can be performed by a malicious user. There are numerous conceivable applications and scenarios that require a continuous user authentication approach. For instance, consider a student who takes online quizzes/tests. This is an important application in light of the fact that the number of students taking online classes is increasing, and teachers are growing more concerned about true assessment and academic integrity. The threat in this case includes substitution of the valid student who was authenticated at the start of the exam. As another example, consider an employee who works for an organization; here, threats include an insider intruder who takes over an active session. Recently, keystroke analysis has acquired popularity as one of the main behavioral biometrics approaches that can be used for continuously authenticating users. There are several advantages to applying keystroke analysis: First, keystroke dynamics are practical, since every user of a computer types on a keyboard. Second, keystroke analysis is inexpensive because it does not require any additional components (such as special video cameras) to sample the corresponding biometric feature. Third, and most importantly, typing rhythms remain available even after the authentication stage has been passed.

The intention that forms the basis of this dissertation is to improve continuous keystroke biometrics user authentication systems designed to detect malicious activity caused by another person (an impostor) who intends to take over the active session of a valid user. Section 1.1 provides an overview of biometrics. The motivation of this dissertation is provided in Section 1.2. The main questions that will be addressed in this dissertation are provided in Section 1.3. Dissertation contributions are provided in Section 1.4.

1.1 Biometrics: The Focus of This Dissertation

Authentication and identification of individuals are among the most pressing problems that have arisen in daily operations. Since the beginning of human interactions, humans have used facial features and voices to recognize others. In this context, biometrics refers to the human characteristics and traits that make each individual unique. For a long time, humans have used fingerprints as a biometric feature to authenticate the identity of others. Handwriting, signature, voice, and hand geometry have also been used since the 1980s. During the 1990s, commercial software capable of facial recognition and iris scanning was added to the market. In a modern society that heavily depends on computers, biometrics plays a critical role in nearly every aspect of our lives. Government agencies, transportation, healthcare, financial institutions, education, and security are increasingly reliant on biometrics to ensure the quality and, more importantly, the integrity and security of daily activities. Biometrics can be used to prevent unauthorized access to ATMs, smart cards, cellular phones, desktop PCs, and computer networks. Unlike traditional authentication mechanisms, such as keys or passwords, biometrics cannot be lost, stolen, or forgotten. Biometrics is inherently secure by nature and cannot be socially engineered or shared with others, unless it is forged.

In terms of computer security, biometrics refers to authentication techniques that depend on human characteristics that make each individual unique, and it often involves personal characteristics that can be used to uniquely verify a person's identity. Biometrics-based techniques are mainly classified as physiological biometric features, such as fingerprint, face, or iris, or behavioral biometric features, such as gait, handwritten signatures, and keystroke dynamics. Biometric techniques based on physiological features are considered more successful than those based on behavioral characteristics (Ashbourn 2014). This relative success is probably due to the fact that physiological features are more stable and generally do not change over time, whereas behavioral features can be affected by an impermanent state, such as stress or illness. However, biometric techniques based on physiological features usually require an additional special sampling tool (such as a video camera). Therefore, in the case of access to computers, the need for an additional special tool leads to increased cost. In contrast, biometric techniques based on behavioral features, such as keystroke dynamics, can be a natural choice for increasing computer access security when used in conjunction with traditional authentication methods. The aim of this dissertation is to advance active user authentication using keystroke dynamics. Throughout this dissertation, we assess the performance and influence of various keystroke features on keystroke dynamics authentication systems. Furthermore, this dissertation proposes a new detector that does not require predefined typing models, either from legitimate users or from impostors.

1.2 Motivation

Keystroke dynamics are defined as a behavioral biometric characteristic which involves analyzing a computer user's habitual typing pattern when interacting with a computer keyboard (Ashbourn 2014). In other words, keystroke biometric systems assume that the typing characteristics of an individual are unique and difficult to reduplicate.

A biometric keystroke authentication system consists of two phases: the enrollment phase, which includes capturing typing data, filtering, feature extraction, and pattern learning; and the verification phase, which includes capturing typing data, filtering, feature extraction, and performing the comparison with the biometric pattern. The amount of typing data required to build a user pattern in the enrollment phase, and the amount of typing data that must be collected before the comparison occurs in the verification phase, are important issues in a keystroke authentication system. An efficient keystroke authentication system should be able to build the user pattern in minimal time, and it should achieve the quickest possible detection while maintaining good accuracy. However, maintaining high detection accuracy and a short detection time is difficult; these two are somewhat conflicting requirements that must be balanced for optimum detection. To tackle this problem, we can reduce the amount of data that needs to be collected without losing the accuracy of the prediction model. This can be achieved by reducing the number of features that need to be learned by the classifier. As a matter of fact, the proper selection of features plays a key role in enhancing the accuracy of keystroke authentication detectors (Killourhy and Maxion 2010). Throughout this dissertation, we assess the performance and influence of various keystroke features in keystroke dynamics authentication systems. In particular, this dissertation investigates the performance of keystroke features on various subsets of the 20 most frequently used English alphabet letters, the 20 most frequently appearing pairs of English alphabet letters, and the 20 most frequently appearing English words. The performance of four features, including key duration, flight time latency, digraph time latency, and word total time duration, is analyzed. Experiments are conducted to measure the performance of each feature individually and of different subset combinations of these features. This dissertation focuses on improving continuous user authentication systems (based on keystroke dynamics) designed to detect malicious activity caused by another person (an impostor) whose goal is to take over the active session of a valid user.

The research is carried out by: 1) studying the impact of the selected features on the performance of keystroke continuous authentication systems; 2) proposing new timing features, based on the utilization of the most frequently used English words (e.g., "the", "and", "for"), that can be useful in distinguishing between users in continuous authentication systems; 3) comparing the performance of keystroke continuous authentication systems under different algorithms; 4) investigating the possibility of improving the accuracy of continuous user authentication systems by combining more than one feature; and 5) proposing a new detector that does not require predefined typing models, either from legitimate users or from impostors.

1.3 Research Questions

The main research questions addressed in this dissertation are as follows:

1. Which keystroke timing feature(s) among key duration, flight time latency, digraph time latency, and word total time duration performs better in keystroke dynamics? This question will be discussed in Chapters 5 and 6.
2. Does any combination of timing features improve the accuracy of an authentication scheme? This question will be discussed in Chapters 5 and 6.
3. Which feature items contribute most to the accuracy of the model, where feature items are defined as the exact instances of letters in each feature, e.g., "a", "Th"? This question will be discussed in Chapters 5 and 6.
4. How does the word total time duration feature perform in comparison with key duration, flight time latency, and digraph time latency? This question will be discussed in Chapters 5 and 6.
5. Which authentication algorithm(s) among SVM, LDC, NB, and KNN performs better in keystroke dynamics? This question will be discussed in Chapter 6.

6. Can we detect an impostor who takes over from the legitimate user during a computer session when no predefined typing models are available in advance, either for legitimate users or for impostors? This question will be discussed in Chapter 8.
7. Is it possible to detect an impostor in the early stages of an attack? If so, what amount of typing data does a system need in order to successfully detect the impostor? This question will be discussed in Chapter 8.

1.4 Contributions

The aim of this dissertation is to advance active user authentication technology that utilizes keystroke dynamics. This dissertation focuses on various keystroke features as they affect the performance of keystroke dynamics authentication systems. The present study also considers the possibility of improving performance by combining four keystroke features: key duration, flight time latency, digraph time latency, and word total time duration. The major contribution of this dissertation is the utilization of the most frequently used English words in determining the identity of users typing on the keyboard. A comparison is conducted between a less common timing feature for keywords, namely the total time to type a whole word, and other more common features, such as digraph and flight times of letters. Another important contribution of this dissertation is the proposal of a novel approach based on sequential change-point methods for early detection of an impostor in computer authentication. There are two main advantages of sequential change-point methods. First, they can be implemented online, and hence enable the building of continuous user authentication systems without the need for any user model in advance. Second, they minimize the average delay of attack detection while maintaining an acceptable detection accuracy rate.

The key contributions of this dissertation are as follows:

- Introduction of new features that can distinguish between users in continuous authentication systems based on keystroke dynamics. The new features are based on the utilization of the most frequently used English words in deciding the identity of users typing on the keyboard.
- Extraction of various types of keystroke features and investigation of which is the most efficient in keystroke dynamics.
- Adaptation of a new anomaly detection algorithm for analyzing typist data. The new algorithm is based on sequential change-point analysis for early detection of an impostor in computer authentication.
- Performance comparison of several anomaly detection techniques in terms of their ability to distinguish between users in continuous authentication systems.
- Application of machine learning-based feature selection techniques useful in selecting the most influential features in continuous authentication systems based on keystroke dynamics, which maintain sufficient classification accuracy of the system and decrease the training and testing time required.

CHAPTER 2
BACKGROUND

This chapter provides an overview of authentication concepts. It also provides an overview of different kinds of biometric authentication methods, focusing on keystroke authentication. The chapter is organized as follows. Section 2.1 provides an overview of authentication methods. Section 2.2 discusses biometric authentication systems. Section 2.3 discusses keystroke dynamics systems in detail. Section 2.4 provides a brief summary of extant studies of keystroke analysis.

2.1 User Authentication in Computer Security

Authentication can be defined as the process that verifies whether someone is, in fact, who he or she claims to be. In other words, it proves ownership of the claimed identity. For instance, passwords are usually required to gain access to computers, and PIN codes are required to get money from ATMs. There are several methods that provide the ability to authenticate a user. Traditionally, these methods are categorized into three main types:

- Something you know and do not share, e.g., passwords, PINs, or pass-phrases.
- Something you have and keep safe, e.g., smart cards.
- Something you are and cannot share, e.g., biometric features such as fingerprints.

In the following subsections, we provide a brief description of each type.

2.1.1 Something You Know

In this type, the user who needs to gain access to the system must simply provide knowledge of a secret. For example, PIN codes are required to get money from ATMs, and passwords are usually required to gain access to computers.

The main advantages of this method are that it is a fast authentication mechanism with lower cost compared to other methods, and it is easy to implement in the real world. However, this method has some limitations: namely, secrets are easy to forget, or they can be stolen when the user writes them down.

2.1.2 Something You Have

In this type, the user who wants access to the system is required to provide a unique piece of hardware (a key, a smart card, a SIM card) that can be matched to a user identity. The main advantage of this method is that the user does not need to remember any secrets to gain access to the system, as in the previous class. However, this method is more expensive, since it requires special pieces of hardware for the equipment as well as for the user. Moreover, this method requires taking action when the hardware is lost or stolen.

2.1.3 Something You Are

An authentication process of this type is based on the utilization of biometric features, such as fingerprints or an iris pattern. It is based on the assumption that biometric features are unique from one person to another. For instance, fingerprints are unique even in the case of identical twins. The main advantage of this method is that biometric features cannot be lost, stolen, or forgotten. Biometrics is inherently secure by nature and cannot be socially engineered or shared with others, unless it is forged.

2.1.4 Combinations of Authentication Factors

In this type, to access a digital system, the user must provide at least two authentication factors from the above three types. One example of such a combination is withdrawing money from an ATM, where two things must be provided: the bank card and the PIN code. In this case, the user provides two factors: something he knows (a PIN code) and something he has (a bank card) to access his bank account. It is apparent that the purpose of this sort of combination is to increase system security.

2.2 Biometric Systems

According to Ashbourn (2014), a biometric system is "the automated verification of human identity through repeatable measurement of physiological and/or behavioral characteristics". The operation of a biometric authentication system consists of two phases (Ashbourn 2014):

I. The enrollment phase: this phase includes data capture, data filtering, feature extraction, and pattern learning.
II. The verification phase: this phase includes data capture, feature extraction, and comparison with the biometric pattern.

The main scheme for a biometric authentication system, as outlined by Ratha et al. (2001), is depicted in Figure 2.1.

Figure 2.1: Traditional biometrics system.

A biometric authentication system is based on the utilization of human characteristics (features) that make each person unique. However, in order for those features to be used in building practical and effective biometric systems, they need to satisfy, to some degree, certain essential properties. The properties are:

1. Universality - The characteristic (feature) should exist for each person.
2. Distinctiveness - No two people share the same characteristic.

3. Permanence - The characteristic should be constant over time.
4. Collectability - Data of the characteristic must be quantitatively measurable.
5. Performance - The characteristic needs to provide good accuracy.
6. Acceptability - The characteristic must be accepted by most people.
7. Circumvention - The characteristic cannot be easily imitated or replicated.

Generally, biometric characteristics should meet all of these properties in order for the biometric system to be feasible and effective. However, it is clear that no characteristic can possibly satisfy all of these properties to the highest degree. The degree to which a candidate characteristic satisfies these properties usually determines the type or level of biometric security application it can be used for.

2.2.1 Performance Measures

In the previous section, we saw that the operation of a biometric authentication system consists of two phases, the enrollment phase and the verification phase. In the enrollment phase, the system tries to learn user characteristics and build a pattern. In the verification phase, the system compares the new sample with the stored pattern and generates a similarity or dissimilarity score. The generated score is then matched against some predetermined threshold. The threshold is the line that separates the legitimate user from the impostor. All scores that fall below the threshold line are considered to characterize the legitimate user, and all scores that fall above it are considered to characterize an impostor. A biometric authentication system requires the person being identified to make a claim to an identity. Given the fact that the claimant may or may not be a genuine user, there are four possible outcomes, two of which describe false claims:

1. An impostor tries to authenticate and is denied (true negative).
2. A legitimate user tries to authenticate and is accepted (true positive).
3. An impostor tries to authenticate and is accepted (false positive).

4. A legitimate user tries to authenticate and is denied (false negative).

Typically, the performance of biometric systems is measured in terms of various error rates. The most commonly used error rates are the False Acceptance Rate (FAR) and the False Rejection Rate (FRR). FAR refers to the percentage of impostors who are mistakenly accepted by the system. In statistics, this error is referred to as a Type II error, defined as:

FAR = Number of Accepted Impostor Attempts / Total Number of Impostor Match Attempts

FRR refers to the percentage of authorized users whose identities are mistakenly rejected by the system. In statistics, this error is referred to as a Type I error, defined as:

FRR = Number of Rejected Legitimate User Attempts / Total Number of Legitimate Match Attempts

Ideally, both of these error rates should be equal to 0%. However, there is a trade-off between the two metrics: decreasing one is generally not possible without increasing the other. The choice of threshold is therefore a critical issue in biometric authentication systems and may depend heavily on the security level of the application. For instance, in domains where high security is required, FAR must be as low as possible so as to detect as many impostors as possible. Another important error metric that is often used to compare different biometric systems is the Equal Error Rate (EER). The EER is the point where the FAR value is equal to the FRR value. In the next section, we describe keystroke dynamics authentication in detail as one of the behavioral biometric types.
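As a concrete illustration of the error rates defined above, the following minimal Python sketch computes FAR and FRR for a given threshold over dissimilarity scores and sweeps observed scores as candidate thresholds to approximate the EER. It is illustrative only: the score arrays are synthetic, the convention that scores below the threshold are accepted follows the description above, and the dissertation's own analyses were carried out with different tooling (R and Weka).

```python
import numpy as np

def far_frr(genuine, impostor, threshold):
    """Dissimilarity scores: values below the threshold are accepted."""
    far = np.mean(impostor < threshold)   # impostors mistakenly accepted (Type II)
    frr = np.mean(genuine >= threshold)   # legitimate users mistakenly rejected (Type I)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Try every observed score as a threshold; return the FAR/FRR closest to equality."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best = min(thresholds,
               key=lambda t: abs(np.subtract(*far_frr(genuine, impostor, t))))
    return far_frr(genuine, impostor, best), best

# Synthetic scores for illustration only (lower score = closer match to the pattern).
rng = np.random.default_rng(0)
genuine = rng.normal(0.3, 0.10, 200)
impostor = rng.normal(0.7, 0.15, 200)
(far, frr), thr = equal_error_rate(genuine, impostor)
print(f"EER region: FAR={far:.3f}, FRR={frr:.3f} at threshold={thr:.3f}")
```

Sweeping the threshold in this way also makes the trade-off visible: pushing the threshold down lowers FAR at the cost of a higher FRR, and vice versa.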

2.3 Keystroke Dynamics

Keystroke dynamics are defined as a behavioral biometric characteristic which involves analyzing a computer user's habitual typing pattern when interacting with a computer keyboard (Ashbourn 2014). In other words, keystroke biometric systems assume that the typing characteristics of an individual are unique and difficult to reduplicate. There are several advantages to keystroke analysis: First, keystroke dynamics are practical, since every computer user types on a keyboard. Second, it is an inexpensive method because it does not require any additional components. Third, and most important, typing rhythms remain available even after the authentication phase has passed. The concept behind keystroke dynamics is inspired by much older work that differentiated telegraph operators by their tapping rhythm over telegraph lines. This capability was improved and used during World War II by the US military to distinguish allies from enemies. In the mid-1970s, Spillane (1975), in an IBM technical report, was the first to suggest using a computer keyboard's typing rhythms to identify users. In the early 1980s, Gaines et al. (1980) conducted a feasibility study on the timing of keystroke patterns as an authentication method. Since that time, keystroke dynamics has been an active research area. Keystroke dynamics research is also known by other names, such as keystroke analysis, typing rhythms, and typing biometrics. Several studies (Joyce and Gupta 1990, Bleha et al. 1990, Obaidat and Sadoun 1997, Sang et al. 2005) conclude that users tend to have constant behavior patterns when they type on a keyboard; thus, keystroke dynamics can be used for authentication. As with other biometric authentication systems, a keystroke authentication system consists of two phases: the enrollment phase, which includes capturing typing data, filtering, feature extraction, and pattern learning; and the verification phase, which includes capturing typing data, filtering, feature extraction, and performing the comparison with the biometric pattern. The main scheme for the keystroke dynamics system, as outlined by Romain et al. (2011), is pictured in Figure 2.2. A user types on the keyboard, and the timing of typing features is extracted and compared with the user's stored typing pattern by the matcher.

The matcher decides whether the current user is a legitimate user.

Figure 2.2: Keystroke dynamics enrollment and authentication schemes.

2.3.1 Static and Dynamic Authentication Systems

Keystroke analysis techniques can be classified into two main categories: static and dynamic (or continuous) analysis. Static analysis means that the analysis is executed at certain points in the system (i.e., at log-in time). This type of analysis ordinarily involves short typing samples such as those seen at log-in time; for example, user IDs, passwords, names, and/or pass-phrases. This method is often used to add additional security to the system and to address some limitations inherent in traditional authentication techniques. With such a security scheme, during an authentication process (at log-in time), the verification system attempts to verify two things: first, are the user's credentials correct? Second, is the manner of typing the password similar to the user's profile? Therefore, if an attacker is able to steal the user's credentials, he/she will be rejected by the verification system, because he/she will not type in the same pattern as the legitimate user.
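The enrollment/matcher loop just described can be illustrated with a minimal sketch. This is one common instantiation (a per-feature mean/standard-deviation template with a z-distance acceptance test, in the spirit of the 1.5-standard-deviation rule of Joyce and Gupta (1990) reviewed in Section 2.4), not necessarily the matcher used in this dissertation; all sample values are hypothetical.

```python
import numpy as np

def enroll(training_samples):
    """Enrollment: build a typing pattern as per-feature mean and standard deviation."""
    samples = np.asarray(training_samples, dtype=float)
    return samples.mean(axis=0), samples.std(axis=0) + 1e-9  # avoid division by zero

def verify(pattern, new_sample, threshold=1.5):
    """Verification: accept if the mean absolute z-distance to the pattern is small."""
    mean, std = pattern
    distance = np.mean(np.abs((np.asarray(new_sample, dtype=float) - mean) / std))
    return distance <= threshold, distance

# Hypothetical key-hold times (ms) from three enrollment samples of one user.
pattern = enroll([[95, 110, 80], [100, 105, 85], [92, 112, 78]])
accepted, dist = verify(pattern, [97, 108, 82])
print(accepted, round(dist, 2))  # True for a sample close to the stored pattern
```

Lowering the threshold makes the matcher stricter (fewer false acceptances, more false rejections), which is exactly the FAR/FRR trade-off discussed in Section 2.2.1.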

With a static security scheme, the authentication process is performed only at the point of entry to the system (i.e., at log-in). However, an effective authentication system continuously verifies the identity of the user by gathering typing data throughout the user's session. An authentication system with such a security scheme can protect against certain attacks, such as session hijacking performed later by a malicious user. In contrast to static analysis, dynamic analysis includes continuous or frequent monitoring of one's keystroke behavior. It is first checked during the log-in session and continues after the initial authentication. In this case, larger typing samples are usually necessary in order to build an individual model. Typing samples can be collected directly, by requiring individuals to type some predefined long text several times, or indirectly, by monitoring their keystroke activities (e.g., while they are writing emails or using word processing). Figure 2.3 shows the security life cycles of static and continuous keystroke authentication systems. It is obvious that continuous keystroke authentication systems provide greater security.

Figure 2.3: Security cycles of static and continuous keystroke authentication systems.

Additionally, static authentication analysis can be utilized only in systems where there is no need for additional text entry (e.g., checking bank accounts online). By contrast, there are numerous conceivable applications of keystroke biometrics for dynamic authentication analysis.

One such application is the verification of the identity of students taking online quizzes/tests. This is an important application in light of the fact that the number of students taking online classes is increasing, leading to growing concern among education professionals about true assessment and academic integrity. Another good application is the use of dynamic biometrics to prevent insider attacks by installing key loggers on each employee's computer to monitor his/her keystroke activities.

2.3.2 Keystroke Features

When a person types on a keyboard, two main events occur: 1) the key-down event, when a person presses a key, and 2) the key-up event, when a person releases a key. Timestamps of each event are usually recorded to keep track of when a key is pressed or released. A variety of timing features can then be extracted from this timing information. Two of the most used features are 1) key duration, which is the time a key is held down, and 2) keystroke latency, which is the time between two successive keystrokes. Latency can be calculated by several different methods. The most commonly used are:

- Press-to-press (PP) latency, the time interval between consecutive key presses; PP is also called digraph time.
- Release-to-press (RP) latency, the time interval between releasing a key and pressing the next one; RP is also called flight time.
- Release-to-release (RR) latency, the time interval between the releases of two consecutive keys.

It is also possible to capture other keystroke dynamics information, such as the time it takes to type a word, two letters (a digraph), or three letters (a tri-graph). Figure 2.4 presents the most popular features that can be extracted from keystroke timing information. Other features, such as the difficulty of the typed phrase, keystroke pressure, and the frequency of word errors, can also be used in keystroke analysis. However, not all features are favorable, since some of them require extra tools, as in the case of keystroke pressure. Therefore, this dissertation focuses on the keystroke timing information of the user when typing on the keyboard.

Figure 2.4: Keystroke timing information.
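These timing definitions reduce to simple arithmetic over per-key press/release timestamps. The sketch below assumes keystrokes are represented as (key, press_ms, release_ms) tuples with hypothetical values; the tuple layout is an illustration, not the format used by the dissertation's collection tool.

```python
# Hypothetical timestamps (ms) for typing "the": (key, press_ms, release_ms).
events = [("t", 0, 95), ("h", 130, 210), ("e", 260, 340)]

durations = [rel - prs for _, prs, rel in events]                       # key hold times
pp = [events[i + 1][1] - events[i][1] for i in range(len(events) - 1)]  # press-to-press (digraph)
rp = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]  # release-to-press (flight)
rr = [events[i + 1][2] - events[i][2] for i in range(len(events) - 1)]  # release-to-release

print(durations)   # [95, 80, 80]
print(pp, rp, rr)  # [130, 130] [35, 50] [115, 130]
```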

2.4 Literature Review

Many reports can be found on keystroke analysis based on static authentication; fewer works exist on keystroke analysis with continuous authentication. In this section, we present a brief summary of extant studies on both types of keystroke analysis.

2.4.1 Static Verification

Joyce and Gupta performed a study on 33 users, of whom six represented legitimate users and 27 represented impostors (Joyce and Gupta 1990). All users were asked to type their log-in name, password, first name, and last name eight times in one session. The mean reference signature was computed by calculating the mean of the eight values. Digraph latencies that fall within 1.5 standard deviations of the reference signature's mean are considered to belong to a valid user. Using absolute distance, the researchers achieved a FAR of 0.25% and a FRR of 16.67%. Bleha et al. (1990) conducted a study similar to that of Joyce and Gupta (1990), using digraphs to distinguish between legitimate users and impostors.

They performed experiments on 39 users, of whom 14 represented legitimate users and 25 represented impostors. Using a minimum distance classifier, Bleha et al. achieved a FAR of 8.1% and a FRR of 2.8%. Brown and Rogers (1993) used the search strategies introduced in Artificial Neural Networks (ANNs) as a pattern classifier for keystroke dynamics data. They were the first researchers to use keystroke duration as a feature to differentiate between legitimate users and impostors. They performed experiments on 61 users, of whom 46 represented legitimate users and 15 represented impostors. Each user provided 41 and 30 samples of eight characters in length. This study reported a FAR of 0.0 percent and a FRR of percent. Obaidat and Sadoun (1997) used keystroke duration and latency together as features to differentiate between legitimate users and impostors. Their experiment involved 15 participants who were asked to type their names 225 times each day over a period of eight weeks. Using a neural network as the classifier, Obaidat and Sadoun were able to achieve a FAR of 0.0 percent and a FRR of percent. Cho et al. (2000) collected the required data for their experiments online. In their study, 21 participants represented legitimate users and 15 represented impostors, providing 275 and 75 samples, respectively, of 8 characters in length. Cho et al. used the multi-layer perceptron (MLP) ANN for classification, reporting a FAR of 0.0 and a FRR of. Sang et al. (2005) performed an experiment similar to that of Obaidat and Sadoun, using duration and latency together as features to differentiate between legitimate users and impostors. A support vector machine (SVM) was used to classify ten user profiles. This study achieved highly accurate results of 0.02% FAR and 0.1% FRR. Araújo et al. (2005) conducted seven experiments on 30 users. They attempted to improve regular log-in/password authentication performance by combining more than one feature.

Four features (key code, two keystroke latencies, and key duration) were analyzed. A statistical distance classifier was used for classification. The best results were achieved with the use of all features, which yielded a FAR of 1.89% and a FRR of 1.45%. Revett et al. (2007) performed an experiment on 50 participants using a probabilistic neural network (PNN). In this study, 20 participants represented legitimate users and 30 represented impostors, who provided 13 and 30 samples, respectively, of text ranging from 6 to 15 characters in length. Keystroke event times were recorded in milliseconds with an accuracy of ±1 ms. They reported an average FAR/FRR of.

2.4.2 Continuous Verification

As previously mentioned, fewer studies exist on the subject of continuous verification. Some papers tested the possibility of continuously verifying the user during a complete typing session. Gaines et al. (1980) conducted a feasibility study on the use of timing patterns of keystrokes as an authentication method. The experiment involved six professional typists who were required to type three passages of 300 to 400 words, twice each, over a period of four months. A statistical t-test was applied under the hypothesis that the means of the digraph times were the same in both settings, and with the assumption that the two variances were equivalent. It was demonstrated that the proportion of digraph values that passed the test was typically between 80 and 95 percent. The five most frequent digraphs that appeared as distinguishing features were "in", "io", "no", "on", and "ul". Umphress and Williams (1985) conducted an experiment in which 17 participants were asked to provide two typing samples. The first sample, used for training, included about 1400 characters; the second, used for testing, included about 300 characters. Digraph latencies that fall within 0.5 standard deviations of their mean are considered to belong to a valid user.

They achieved a FAR of 6% and a FRR of 12%. John and Williams (1988) performed experiments on 36 participants who were asked to type the same text of 537 characters twice, in two separate sessions over a month. The first sample was used for training and building an authentication model of the users, and the second sample was used for testing. A test digraph latency was counted as valid if it fell within 0.5 standard deviations of the mean reference digraph latency, and the sample was accepted if the ratio of valid digraph latencies to total latencies was more than 60 percent. A False Acceptance Rate (FAR) of 5.5% and a False Rejection Rate (FRR) of 5% were achieved. Monrose and Rubin (2000) performed a study on both static and dynamic keystroke analysis. Overall, 31 users were asked to type a few sentences from a list of available phrases and/or enter a few free sentences. Three different methods were used to measure similarities and differences between typing samples: normalized Euclidean distance, weighted maximum probability, and non-weighted maximum probability measures. About 90% correct classification was achieved when fixed text was used, later enhanced to 92.14% with expansion of the analyses to 63 users (Monrose and Rubin 1997). However, it was reported that when different texts were used, accuracy collapsed to 23% correct classification in the best case. Dowland et al. (2001) monitored the normal activities of four users for some weeks on computers running Windows NT, which means there were no constraints on the users. Different statistical techniques were applied. Only digraphs that occurred less frequently in the users' typing were used to build the user profiles. A new sample was compared to the user profiles in two steps: first, each digraph in the new sample was compared to the mean and standard deviation of its corresponding digraph in the user profiles and marked as accepted if it fell within some defined interval. Second, the new sample was assigned to the user whose profile provided the largest number of accepted digraphs.

A 50% correct classification rate was achieved. Bergadano et al. (2002) used typing errors and the intrinsic variability of typing as features to differentiate between legitimate users and impostors. Their experiment involved 154 participants: 44 legitimate users were asked to type a fixed text of 683 characters five times over a period of one month, while 110 users provided only one sample each, to be used as impostor samples. The degree of disorder within tri-graph latencies was used as a dissimilarity measure, with a statistical classification method computing the average difference between the units in the array. This approach was able to achieve a FAR of 0.0% and a FRR of 2.3%. Curtin et al. (2006) conducted three identification experiments in which subjects were asked to type three texts 10 times. The first two texts were 600 characters in length and the third was 300 characters. The first text was used for training. The nearest-neighbor classification technique employing Euclidean distance was used in these experiments. A 100% identification accuracy was achieved for eight users typing the same text. However, this accuracy diminished with 30 subjects, different texts, and gradually decreasing text length. It was concluded that the best performance could be achieved under these conditions: sufficient training and testing text length, a sufficient number of enrollment samples, and the same keyboard type used for enrollment and testing. Gunetti and Picardi (2005) conducted an experiment on 205 participants. Their work focuses on long free-text passages. They used 40 participants to represent legitimate users, who provided 15 samples each, and 165 participants to represent impostors, who provided only one sample each. They developed a method for comparing two typing samples based on the distance between typing times. They reported a FAR below 5% and a FRR below 0.005%.

Hu et al. (2008) used 19 participants to represent legitimate users, who provided 5 typing samples each, and 17 participants to represent impostors, who provided 27 samples. Typing environment conditions were not controlled in the experiment. The typing samples were used to build user profiles by creating averaged vectors from all training samples. A k-nearest-neighbor technique was employed to cluster the user profiles based on the distance measure. A 66.7% correct classification rate was achieved.

CHAPTER 3
MOTIVATIONAL CHALLENGES AND CONTRIBUTIONS OF THE THESIS

A major challenge in keystroke analysis is the identification of the major factors that influence the accuracy of the keystroke authentication detector. Two of the most influential factors are the classifier employed and the choice of features. Currently, there is insufficient research that addresses the impact of these factors in continuous authentication analysis; the majority of existing studies in keystroke dynamics focus primarily on their impact in static authentication analysis. Understanding the impact of these factors will contribute to the improvement of continuous authentication system performance. While Chapter 1 provides an overview of the motivations and contributions of this dissertation, this chapter provides a detailed account of the same. The motivation of this dissertation is provided in Section 3.1, and the dissertation contributions are provided in Section 3.2.

3.1 Research Motivations

The extant literature on keystroke dynamics demonstrates conflicting results regarding which timing feature is most effective for distinguishing between users in the keystroke dynamics domain. Experiments show that hold times are much more important than latency times (Pin et al. 2012, Robinson et al. 1998). It has also been observed that using tri-graph times offered better classification results than using digraphs or higher-order n-graphs (Killourhy and Maxion 2010). Revett et al. (2007), in turn, reported that digraph and tri-graph times were more effective than hold time and flight time. Accordingly, recent studies combine more than one of these features (Pin et al. 2012, Araújo et al. 2005). Research has found that use of all three types of features (i.e., hold times, digraph times, and flight times) produces better results (Araújo et al. 2005).

In contrast, it has also been reported that, when hold times and either digraph times or flight times are included, the particular combination has a trivial effect (Pin et al. 2012). Another observation from the literature on keystroke dynamics is that existing schemes reuse features (i.e., hold times, digraph times, and flight times) from static authentication keyboard-based systems to represent user typing behavior in continuous authentication keyboard-based systems. These features can be used in static authentication keyboard-based systems for successful user authentication. However, they do not guarantee strong statistical significance in continuous authentication keyboard-based systems. Hence, more informative timing features must be added to the extant literature to guarantee successful discrimination between users in continuous authentication keyboard-based systems. On the other hand, other research demonstrates that the employed algorithm (classifier) also plays an important role in enhancing the performance of keystroke dynamics systems (Killourhy et al. 2009, Zhao 2006). Killourhy et al. (2009) benchmarked and compared the performance of 14 algorithms on a single data set. The techniques varied from simple statistical classifiers, such as the mean and standard deviation of typing times, to more complex pattern-recognition classifiers, such as neural networks and support vector machines. The study reported that different algorithms yield different error rates. Zhao (2006) used seven machine learning methods to build models to differentiate user keystroke patterns and found that certain classifiers are more accurate than others. However, even though these existing works are helpful in deciding which classifiers provide more accurate rates, their focus is a comparison of the algorithms' performance on static authentication keyboard-based systems. Hence, more work is required to compare the algorithms' performance in continuous authentication keyboard-based systems. From a different angle, another important conclusion that can be drawn from the existing literature on keystroke dynamics is that most of the existing schemes require predefined typing models, either for legitimate users or for impostors.

It is difficult or even impossible in some situations to have typing data of the users (legitimate or impostors) in advance. For instance, consider a personal computer that a user carries to a college or to a cafe. In this case, only the computer owner (the legitimate user) is known in advance. As another instance, consider a computer that has a guest account in a public library; in this case, none of the system users are known in advance. Thus, a new automated and flexible technique that can authenticate the user without the need for any prior user typing model is needed.

3.2 Research Contributions

This dissertation focuses on the impact of the selected features on the performance of keystroke dynamics authentication systems. The major contribution of this dissertation is the utilization of the most frequently used English words in ascertaining the identity of users typing on the keyboard. Another important contribution of this dissertation is the proposal of a novel approach based on sequential change-point methods for early detection of an impostor in computer authentication. There are two main advantages of sequential change-point methods: First, they can be implemented online, and hence enable the creation of continuous user authentication systems without the need for any user typing model in advance. Second, they minimize the average delay of attack detection while maintaining an acceptable detection accuracy rate (an illustrative sketch of this idea appears at the end of this chapter).

The key contributions of this dissertation are as follows:

- Identification of two main factors that may influence the error rates of a continuous keystroke authentication system. In particular, this dissertation focuses on the impact of the choice of features and of the algorithm employed on the error rates of a continuous keystroke authentication system.
- Introduction of a new anomaly detection approach based on sequential change-point methods for early detection of attacks in computer authentication. The developed algorithms are self-learning, which allows the construction of continuous keystroke authentication systems without the need for any user typing model in advance.

- Introduction of new features useful in distinguishing between users in continuous authentication systems based on keystroke dynamics. The new features are based on the utilization of the most frequently used English words to ascertain the identity of the users typing on the keyboard.
- Introduction of feature selection techniques useful in selecting the most influential features in continuous authentication systems based on keystroke dynamics, which maintain sufficient classification accuracy of the system and decrease the training and testing times required. The proposed techniques use a wrapper approach based on different machine learning classifiers, such as SVM, for feature evaluation. The process of feature subset selection is performed by various search algorithms, such as the genetic algorithm and greedy algorithms.
- Investigation of the possibility of improving the performance of the keystroke continuous authentication system by performing various combinations of the extracted keystroke features. In particular, this dissertation compares the independent performances of four keystroke features, namely i) key duration, ii) flight time latency, iii) digraph time latency, and iv) word total time duration, against ten different combinations of these keystroke features.
- Extraction of various types of keystroke features and determination of the most efficient in keystroke dynamics. In particular, four features, i) key duration, ii) flight time latency, iii) digraph time latency, and iv) word total time duration, are extracted and analyzed in order to find which one performs better in keystroke dynamics.

- Introduction and comparison of the performance of several anomaly detection techniques in terms of their ability to distinguish between users in continuous authentication systems. In particular, four machine learning techniques are adapted for keystroke authentication in this dissertation: Support Vector Machine (SVM), Linear Discriminant Classifier (LDC), K-Nearest Neighbors (K-NN), and Naïve Bayes (NB). These techniques were selected because of their prevalence in the literature. Furthermore, they are accompanied by a good and comprehensive set of methods and implementations, enabling various comparisons of less common timing features for keywords, such as the total time to type a whole word, with other more common features, such as digraph and flight times of letters.
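To illustrate the flavor of the online change-point detection referenced above, the following is a minimal CUSUM-style sketch, not the dissertation's detector: Chapter 8 builds on sequential change-point analysis with Lepage-type statistical tests, and all timing values, parameter names, and tuning constants here are hypothetical. The statistic accumulates evidence that typing times have drifted away from the session owner's baseline and signals as soon as it crosses a control limit, which is what permits detection without a pre-enrolled impostor model.

```python
def cusum_detect(latencies, baseline_mean, slack=5.0, limit=50.0):
    """One-sided CUSUM: signal when typing times drift above the session baseline.

    slack and limit are illustrative tuning constants; they trade detection
    delay against false alarms, playing the role the ARL parameter plays in
    the detector evaluated in Chapter 8.
    """
    s = 0.0
    for i, x in enumerate(latencies):
        s = max(0.0, s + (x - baseline_mean - slack))  # accumulate upward drift only
        if s > limit:
            return i  # index at which a change is signalled
    return None  # no change detected

# Hypothetical digraph latencies (ms): the owner types first, then an impostor.
owner = [110, 120, 115, 118, 112] * 10      # 50 observations near the baseline
impostor = [160, 170, 165, 158, 172] * 10   # a slower typist takes over
print(cusum_detect(owner + impostor, baseline_mean=115.0))  # signals at index 51
```

In this toy run, the change is signalled after only two impostor observations, illustrating the short detection delay that motivates the sequential approach.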

CHAPTER 4
EXPERIMENTAL SETUP

This chapter describes the experimental procedures, data collection, and keystroke features used in this dissertation. A set of experiments was executed based on the collected data. Those experiments were based on fixed rather than free text, and were designed to collect keystroke timing data from users with different typing experience levels. This chapter is organized as follows. Section 4.1 describes the data collection procedures. Section 4.2 discusses feature extraction.

4.1. Data Collection

A VB.NET Windows Forms application was developed to collect raw keystroke data samples. For each keystroke, the time (in milliseconds) a key was pressed and the time a key was released were recorded. Participants had the ability to use all keys on the keyboard, including special keys such as the Shift and Caps Lock keys. More importantly, they were also able to use the Backspace key to correct their typing errors. Participants had the choice of either using our laptop computer or downloading the application for collecting data on their own machines.

The main window of the application was split into two sections. The top section displayed the text that participants were required to type, i.e., fixed text. The bottom section provided a space to allow the participant to enter the text. The texts used in our experiment were English sentences selected at random from a likewise randomly selected pool of English passages. Figure 4.1 shows a screen shot of the main window of the application used to collect the data from users.

Once the participant finished typing the text, the participant was asked to hit the collect data button. The collected raw timing data was sent to a SQL database that was developed to save participant timing data. The raw timing data file that was recorded by

the application for each participant contains the following information for each keystroke entry:

- The key character that was pressed.
- The time the key was pressed, in milliseconds.
- The time the key was released, in milliseconds.

The keystroke timing experiment involved 28 participants. The process of collecting data from participants involved two stages: in the first stage, data were collected from only 8 participants; in the second stage, data were collected from another 20 participants. The participants were undergraduate and graduate students with different majors. All participants were asked to type the same prepared text (5000 characters) one time. Furthermore, 6 of the 28 participants were asked to provide two additional samples of the same text at different times. We were thus able to have two data sets: the first data set contained one sample for each of the 28 users; the second data set contained two additional samples for only the 6 users who played the role of the owner of the device and whose active authentications were of interest.

Figure 4.1: Main window of the application used to collect the data from users.

4.2. Features and their Extractions

The collected raw data from both data sets were used to extract the following features for our experiment:

1. Duration (F1) of the key presses for the 20 most frequently appearing English alphabet letters (e, a, r, i, o, t, n, s, h, d, l, c, u, m, w, f, g, y, p, b) (Gaines 1956).

2. Flight Time Latency (F2) for the 20 most frequently appearing pairs of English alphabet letters (in, th, ti, on, an, he, at, er, re, nd, ha, en, to, it, ou, ea, hi, is, or, te) (Gaines 1956).

3. Digraph Time Latency (F3) for the 20 most frequently appearing pairs of English alphabet letters (in, th, ti, on, an, he, at, er, re, nd, ha, en, to, it, ou, ea, hi, is, or, te) (Gaines 1956).

4. Word Total Duration (F4) for the 20 most frequently appearing English words (for, and, the, is, it, you, have, of, be, to, that, he, she, this, they, will, I, all, a, him) (Fry and Kress 2012).

Every feature (F1, F2, F3, and F4) consisted of 20 data items. For instance, F1 contained 20 alphabet letters, where each letter represented an item. Figure 4.2 illustrates an example of the four keystroke features extracted for the word "the"; a small computational sketch follows the figure.

Figure 4.2: An example of four keystroke features extracted for the word "the".
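For concreteness, the following R sketch computes the four features for one typed word from raw press/release timestamps. It is a minimal illustration only: the event layout and the exact flight/digraph conventions used here (release-to-press and press-to-press, respectively) are assumptions for this sketch and may differ in detail from the definitions illustrated in Figure 4.2.

    # Raw events: one row per keystroke, with the typed key and its
    # press/release timestamps in milliseconds (hypothetical values).
    events <- data.frame(
      key     = c("t", "h", "e"),
      press   = c(1000, 1180, 1345),
      release = c(1075, 1252, 1410)
    )
    n <- nrow(events)

    # F1 -- key duration: press-to-release time of the same key.
    duration <- events$release - events$press

    # F2 -- flight time latency: release of one key to press of the next.
    flight <- events$press[-1] - events$release[-n]

    # F3 -- digraph time latency: press of one key to press of the next.
    digraph <- events$press[-1] - events$press[-n]

    # F4 -- word total duration: first press to last release of the word.
    word_total <- events$release[n] - events$press[1]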

Table 4.1 reports the number of instances a feature item occurs in the English passages. The items are grouped into one letter, two letters, and words with more than two letters. These items are reported to be the most frequent English letters and words (Gaines 1956; Fry and Kress 2012).

Table 4.1: Number of occurrences of the feature items in the English passages

Item | Ave Frequency | Item | Ave Frequency | Item | Ave Frequency
A | 388 | AN | 62 | AND | 27
E | 568 | AT | 61 | FOR | 27
I | 418 | ED | 30 | HAVE | 27
N | 222 | ER | 53 | IS | 49
O | 412 | HE | 136 | IT | 37
R | 236 | IN | 37 | OF | 28
S | 320 | ON | 22 | THE | 29
T | 472 | TH | 125 | YOU | 28
H | 350 | EA | 28 | ALL | 32
D | 159 | EN | 29 | BE | 28
L | 249 | HA | 81 | HE | 32
C | 68 | HI | 71 | I | 50
U | 124 | IS | 107 | SHE | 28
M | 138 | IT | 64 | THAT | 27
W | 129 | ND | 45 | THEY | 28
F | 103 | OR | 51 | THIS | 34
G | 86 | OU | 49 | TO | 34
Y | 120 | RE | 55 | WILL | 29
P | 58 | TE | 27 | A | 59
B | 59 | TO | 54 | HIM | 28

CHAPTER 5
KEYSTROKE FEATURE SELECTION USING STATISTICAL METHODS IN HYPOTHESES TESTING

This chapter focuses on the comparison of features to determine which features are most effective in distinguishing users. The individual performance of four features is analyzed, including key duration, flight time latency, digraph time latency, and word total time duration. Experiments are conducted to measure the performance of each feature individually and the results from different subsets of these features. In addition, this chapter reports how the word total time duration feature performs compared to key duration, flight time latency, and digraph time latency in determining the identity of users when typing.

This chapter is organized as follows. Section 5.1 provides the chapter motivation. Section 5.2 presents the research questions to be addressed in this chapter. Section 5.3 describes and explains the methods of analysis to be used. Section 5.4 explains the analyses conducted and the results obtained. Section 5.5 provides a conclusion and answers to the questions posed.

5.1. Motivations

Section 3.1 showed that the literature of keystroke dynamics demonstrates conflicting results regarding which feature is the most effective timing feature in the keystroke dynamics domain. Section 3.1 also showed that existing schemes in the continuous keyboard authentication area use some features from static keyboard-based authentication systems to represent user typing behavior. However, these features do not guarantee strong statistical significance in continuous keyboard-based authentication systems. Hence, new informative timing features need to be added to the existing literature to guarantee successful performance of continuous keyboard-based authentication systems.

5.2. Research Questions

The main research questions that will be addressed in this chapter are:

Q1. Which keystroke timing feature(s) among key duration, flight time latency, digraph time latency, and word total time duration performs better in keystroke dynamics?

Q2. Does any combination of timing features improve the accuracy of the authentication scheme?

Q3. Which feature items contribute more to the accuracy of the model, where feature items are defined as the exact instances of letters in each feature, e.g., "a", "th", etc.?

Q4. How does the word total time duration feature perform compared with key duration, flight time latency, and digraph time latency?

5.3. Data Analysis Methods

In order to answer the research questions posed in the previous section, two statistical methodologies are applied. These methodologies are based on finding differences in statistical parameters (mean and standard deviation) between the typing samples of the users. In particular, the Mann-Whitney statistics test and effect sizes are applied in this chapter. The following subsections describe these methodologies and their application to the collected data.

5.3.1. Mann-Whitney Statistics Test

This statistical test is one of the most powerful and common non-parametric tests for comparing the mean values of two populations. The test evaluates the differences in mean values between the two underlying populations, i.e., whether one of the two random variables tends to be greater than the other.

The Mann-Whitney test is used to test a null hypothesis H0: μ1 = μ2, where μ1 and μ2 are the mean values of the two populations; that is, H0 assumes that the two samples have been generated from the same population. It is tested against the alternative hypothesis H1: μ1 ≠ μ2, which assumes that there is a statistically significant difference between the mean values of the two samples compared. The decision criterion is then derived from the estimated p-value with 95% confidence, formulated as follows:

    If p-value >= 0.05: do not reject H0
    Otherwise:          reject H0

Thus a p-value is simply a measure of the strength of evidence against H0.
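As a concrete illustration, the following R sketch applies the wilcox.test function from R's base stats package (the same function used in the analyses below) to two hypothetical key-duration samples and applies the decision rule above:

    # Hypothetical key-duration samples (in ms) for the letter "e"
    # typed by two different users.
    user1_e <- c(82, 95, 88, 102, 79, 91, 85, 97)
    user2_e <- c(121, 110, 135, 118, 126, 130, 115, 124)

    # Two-sided Mann-Whitney (Wilcoxon rank-sum) test.
    result <- wilcox.test(user1_e, user2_e)

    # Decision rule at the 0.05 significance level.
    decision <- if (result$p.value >= 0.05) "do not reject H0" else "reject H0"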

5.3.2. Effect-Sizes

Statistical significance testing normally relies on the null hypothesis and p-values to interpret results. That is, a null hypothesis test assesses the likelihood that any effect, such as a correlation or a difference in means between two groups, may have occurred by chance. As indicated earlier, we calculate a p-value, the probability of the observed data under the null hypothesis, and thus provide evidence against the null hypothesis. However, sole reliance on statistical significance testing raises the following issues. First, the statistical significance test does not tell the size of a difference between two measures; it only provides evidence against the null hypothesis. In particular, with this process we spend all our time testing whether or not the p-values reach the 0.05 alpha value, such that a p-value just below 0.05 is often reported as statistically significant while one just above it is not, even though the two cases are practically indistinguishable. Thus, even very small effects may reach statistical significance. Second, a statistical significance test depends mainly on the sample size. Therefore, even the most trivial effect will become statistically significant if we test a large enough sample.

To account for this, we also report the effect size, i.e., the magnitude of the difference between two sample mean values. In general, effect size measures either the size of a correlation or the size of the difference between two groups. The most common effect-size measures are the Pearson r (correlation coefficient) and Cohen's d (also known as the standardized mean difference). Pearson r measures the strength of the relationship or association between two variables. On the other hand, Cohen's d can be used when comparing two groups' means, as with the Mann-Whitney statistic test described in the previous section, which is also used to compare two means. This dissertation focuses on and reports the Cohen's d effect-size values. Cohen's d is calculated by taking the difference of the two groups' means divided by their pooled standard deviation, as shown in the following equations:

    Cohen's d = (μ1 - μ2) / s_pooled,    s_pooled = sqrt((s1^2 + s2^2) / 2)

where μ1 is the mean for one sample, μ2 is the mean for the second sample, and s1 and s2 are the standard deviations for the first and second samples, respectively.

The interpretation of Cohen's d depends on the research question. It covers the whole range from no relationship (0) to a perfect relationship (3 or -3), with -3 indicating a perfect negative relation and 3 indicating a perfect positive relation. Thus, effect-size meaning varies by situation.

Cohen suggests that d = 0.2 be considered a 'small' effect size, 0.5 a 'medium' effect size, and 0.8 a 'large' effect size (Cohen 2013). This means that a d of 0.2 shows that the two groups' means differ by 1/5 of a standard deviation, a d of 0.5 shows that the two groups' means differ by 1/2 of a standard deviation, a d of 0.8 shows that the groups' means differ by 8/10 of a standard deviation, and so on.

5.3.3. R's Statistics Package

The Mann-Whitney statistic test and the effect-size analyses were conducted using R software packages. In particular, the wilcox.test function implemented in R's base stats package was used to carry out the Mann-Whitney test analyses. The cohensD function from the lsr R package was used to carry out the effect-size analyses (Navarro 2014).
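A minimal R sketch of the effect-size computation (via cohensD from the lsr package, reusing the hypothetical samples shown earlier) follows; the same value can also be computed directly from the formula in Section 5.3.2:

    library(lsr)  # provides cohensD (Navarro 2014)

    user1_e <- c(82, 95, 88, 102, 79, 91, 85, 97)
    user2_e <- c(121, 110, 135, 118, 126, 130, 115, 124)

    # Effect size via the package function.
    d_pkg <- cohensD(user1_e, user2_e)

    # Equivalent manual computation (equal sample sizes assumed here,
    # so the pooled standard deviation reduces to the formula above).
    s_pooled <- sqrt((sd(user1_e)^2 + sd(user2_e)^2) / 2)
    d_manual <- abs(mean(user1_e) - mean(user2_e)) / s_pooled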

5.4. Analyses And Results

This section describes the analyses that were conducted based on the Mann-Whitney statistic test and the effect-size test, and reports the results obtained from both approaches.

5.4.1. Mann-Whitney Test Analyses

Analysis 1: The aim of this analysis is to compare features in terms of their relative effectiveness at distinguishing users. In this analysis, the Mann-Whitney statistic test is used to compare the mean values for each pair of user feature items (i.e., letters). Each feature item of the first user is compared to the corresponding feature item of the second user, and the p-values measured from the statistical tests are recorded. This analysis was conducted on 8 users (whose data were collected in the first stage of the data collection process) and 8 items for each feature, namely the 8 most frequently appearing English alphabet items of each feature. In this case, since there were 8 users, there were 28 unique statistic tests and thus 28 p-values for each feature item. Since there were 8 items for each feature, there were 224 unique statistic tests and thus 224 p-values for each feature.

In this analysis, the accuracy of each feature was calculated as the total number of p-values greater than 0.05 divided by the total number of statistical tests for the underlying feature. A p-value >= 0.05 indicates that there is no evidence to reject the underlying null hypothesis, and thus an indication of no statistically significant differences between two samples, i.e., users. Therefore, features that have greater numbers of p-values less than 0.05 are supposed to be more discriminative.

Analysis 2: The aim of this analysis is to compare features' effectiveness in distinguishing users with additional feature items and users involved. This analysis was conducted on 28 users (the full set available after the second stage of the data collection process) and 20 items for each feature, namely the 20 most frequently appearing English alphabet items of each feature. In order to compare the data sets collected for each pair of users, the Mann-Whitney statistic test is used to compare the mean values of each feature item (i.e., letters) for the two users. Each feature item of the first user was compared to the related feature item of the second user, and the p-values measured by the statistical tests were recorded.

Based on the generated p-values, each item will have one decision value, either "reject" or "do not reject". Since there were 20 items for each feature, there were 20 comparisons for each feature and thus 20 p-values for each feature, with one p-value for each item comparison. Next, vote counting schemes were applied on these p-values to get a consensus and a finalized decision as to whether the users were the same person or not; that is, either "reject" or "do not reject". In particular, two pairs of samples were accepted as belonging to the

same person if the ratio of accepted feature items to total feature items accounted for more than 50 percent.

For this analysis, we were interested in small p-values, i.e., the significance level of 0.05. A p-value <= 0.05 indicates that the null hypothesis is rejected, and thus an indication of statistically significant differences between two populations, i.e., users. Since there were 28 users, there were 378 impostor match attempts and thus 378 finalized decisions for each feature. Thus, the performance of each feature was measured as the percentage of impostor attempts mistakenly accepted by the system.

This analysis had two main benefits. First, the ability to compare the performance of individual features by measuring the accuracy of each feature individually. Second, because it measured the accuracy of each feature item individually, we could determine which items weighed and contributed more to the accuracy of the feature. In particular, the accuracy of each feature item was calculated as the total number of p-values greater than 0.05 divided by the total number of statistical tests for the underlying item. A p-value >= 0.05 indicates that there is no evidence to reject the underlying null hypothesis and thus an indication of no statistically significant differences between two samples, i.e., users. Therefore, items that have greater numbers of p-values less than 0.05 are supposed to be more discriminative.

Analysis 3: The purpose of this analysis is to evaluate the performance of keystroke dynamics on the selected features against same-user authentication, i.e., the legitimate users of a computer system. For this analysis, only the data samples collected from the 6 users who provided three typing samples of the same text were studied. The three samples gathered from each

user were pooled and treated as one large sample in order to obtain sufficient frequencies for the feature items concerned.

To evaluate the performance of the proposed method on the same user, some unseen data were introduced. For this purpose, the processed data of each user from the second data set were partitioned into two sets: training and test data sets. The method of generating these two data sets could be any partitioning mechanism often employed in a typical cross-validation assessment, e.g., leave-one-out cross validation, random sampling, or K-fold cross validation. We used K-fold cross validation in our experiments, where a data set is split into K partitions. The major advantage of K-fold cross validation is that all the instances in the data set are ultimately used for both training and testing purposes. K-fold cross validation creates K partitions of the data set and uses K-1 partitions to form the training data; the remaining one is used for testing and evaluating the developed model. Popular choices for K-fold cross validation are K=5 and K=10.

For each item of the selected features of the target user, we ran K-fold cross validation choosing K=5; then for each i=1 to K we ran the Mann-Whitney statistic test on the randomly generated partitions. In other words, each feature item had five unique statistical tests producing five p-values. In the end, each feature item was tested five times against the same user, which produced five legitimate user match attempts for each feature of the same user. Since there were six legitimate users, there were 30 legitimate match attempts and thus 30 finalized decisions for each feature. Thus, the performance of each feature was measured as the percentage of legitimate user attempts mistakenly rejected by the system, i.e., false negatives.
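A minimal R sketch of this per-item procedure (with hypothetical timing data and random fold assignment) is:

    # Hypothetical pooled key-duration sample (ms) for one feature item
    # of one legitimate user.
    item_times <- c(85, 90, 88, 92, 87, 91, 86, 89, 93, 84,
                    88, 90, 85, 92, 89, 87, 91, 86, 90, 88)

    K <- 5
    folds <- sample(rep(1:K, length.out = length(item_times)))

    # For each fold, test the held-out partition against the rest;
    # for a legitimate user we expect p >= 0.05 (do not reject H0).
    p_values <- sapply(1:K, function(i) {
      wilcox.test(item_times[folds == i], item_times[folds != i])$p.value
    })

    # Count of legitimate attempts mistakenly rejected (false negatives).
    false_rejections <- sum(p_values < 0.05)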

Analysis 4: The purpose of this analysis is to evaluate and study the influence of various keystroke feature combinations on the performance accuracy of the keystroke dynamics system. This analysis was conducted on 8 users (whose data were collected in the first stage of the data collection process) and 8 items for each feature, namely the 8 most frequently appearing English alphabet items of each feature. In this case, since there were 8 users, there were 28 unique statistic tests and thus 28 p-values for each feature item. Since there were 8 items for each feature, there were 224 unique statistic tests and thus 224 p-values for each feature.

In this analysis, the accuracy of combined features was calculated as the total number of p-values greater than 0.05 divided by the total number of statistical tests for the underlying features. A p-value >= 0.05 indicates that there is no evidence to reject the underlying null hypothesis, that is, no mean difference between the two samples, and thus an indication of no statistically significant differences between two samples, i.e., users. Therefore, combined features that have greater numbers of p-values less than 0.05 are supposed to be more discriminative.

5.4.2. Mann-Whitney Test Results

Results of Analyses 1 and 2: This section reports the results for two cases. First, a comparison between the performances of the four different keystroke features using data of 8 users and 8 items for each feature. In this case, the FAR value of each feature was calculated as the total number of p-values greater than 0.05 divided by the total number of statistical tests for the underlying feature. Second, a comparison between the performances of the four different keystroke features with data of 28 users and 20 items for each feature. In this case, the FAR value of each feature was measured as the percentage of impostor attempts mistakenly accepted by the system, based on the generated p-values and a voting schema. Figure 5.1 shows the results obtained for the two cases, respectively.

For the first case, observing Figure 5.1 (a), we notice that by using F1 we are able to obtain a better result compared to the other keystroke features (F2, F3, and F4). The FAR value for F1 is slightly above 6%, followed by F4 with a FAR value of 10%. This result is consistent with those of other published studies (Pin et al. 2012, Robinson et al. 1998). Of particular interest in Figure 5.1 (a) is that F4 is shown to perform better than F2 and F3. The FAR values measured for F1, F2, F3, and F4 were 6.57%, 17.1%, 18%, and 10.28%, respectively. The FAR value of F4 was 10.28%, which shows that the total typing time for the most frequently used English words might contain valuable timing information that could be utilized in order to differentiate users.

For the second case, in Figure 5.1 (b), we notice that by using the F3 feature we are able to obtain a better result compared to the other keystroke features (F1, F2, and F4). The FAR value for F3 is slightly above 3%, followed by F2 with a FAR value of 6%. Thus, based on this experiment, we can conclude that latency times are a more effective timing feature than hold times.

Figure 5.1: A comparison between the performances of four different keystroke features. (a) Results obtained using the Mann-Whitney statistic test with 8 users' data and 8 items for each feature. (b) Results obtained using the Mann-Whitney statistic test and voting schema with 28 users' data and 20 items for each feature.

Figure 5.2 illustrates the FAR values for all feature items individually based on 28 users' data. The FAR value of each feature item was calculated as the total number of p-values greater than 0.05 divided by the total number of statistical tests for the underlying item. We can observe that some items performed better than others. For instance, the FAR value measured for E was 5%, while the FAR value observed for M was 28%. Moreover, as we can observe from Figure 5.2 (a), letters with higher frequencies performed better than letters with lower frequencies.

Figure 5.2: A comparison between the performance of the four different keystroke features' items using 28 participants' data with the Mann-Whitney statistic test. (a) F1 items. (b) F2 items. (c) F3 items. (d) F4 items.

Analysis 3 Results: For the third analysis, designed to evaluate the performance of keystroke dynamics on the selected features against same-user authentication, i.e., the legitimate users of a computer system, all four features achieved 0% FRR for all

legitimate users, indicating that the underlying features and the use of the most frequently used letters and words accurately distinguished the 6 users.

Analysis 4 Results: We have already reported that F1 showed the best result among the four keystroke features. Our experiment results show that the combinations of features did not improve the performance of keystroke authentication. The FAR of combined features was calculated as the total number of p-values greater than 0.05 divided by the total number of statistical tests for the underlying combined features. As shown in Figure 5.3, the FAR values increase for combinations of all of the features. The most interesting finding we can note from Figure 5.3 is that combining F1 with F4 leads to the best result of all the combinations. Figure 5.3 reports the performance of different combinations of keystroke features performed on eight users' data. While the combination of F1 and F4 produces the best performance among the keystroke feature combinations, F1 offers the optimal result if used independently.

Figure 5.3: The performance of different combinations of different keystroke features performed using 8 participants' data.

5.4.3. Effect-Size Analyses

Analysis: The aim of this analysis is to measure the accuracy of each feature item (i.e., letter) individually, based on its average effect sizes.

In order to apply this approach to our data sets, we performed the following. First, for each single item (i.e., letter), we calculated the effect size for each pair of users with the formula in Section 5.3.2. Since there were 28 users, there were 378 unique effect-size values for each item. Second, we calculated the average effect size of each item individually by using these 378 values. These averages provide us a summary estimate of multi-dimensional outcomes for features among all 28 users. Therefore, items that have larger averages of effect sizes were supposed to be more discriminative than those with smaller averages.

5.4.4. Effect-Size Results

This subsection reports the effect sizes for feature items based on 28 users, so we can determine which feature items have a greater effect in terms of the difference between two mean measures. Table 5.1 shows the averages of effect sizes for all feature items based on data from 28 users. We notice that the feature F3 items have larger effect sizes compared to the other keystroke features, i.e., F1, F2, and F4. The item effect-size values for feature F3 range from 0.54 to 1.39, followed by feature F2, with item effect-size values ranging from 0.14 to 1. In particular, the averages of effect sizes for the feature items are as follows:

- F3 item effect sizes range from a medium effect size to a large effect size, with the majority of these items having large effect sizes.
- F2 item effect sizes range from a small effect size to a large effect size, with the majority of these items having medium effect sizes.
- The majority of F1 items have medium effect sizes.
- F4 item effect sizes range from a small effect size to a medium effect size, with the majority of these items having medium effect sizes.

Table 5.1: Average of effect sizes for all feature items using 28 users' data, sorted in descending order (columns correspond to F1, F2, F3, and F4 items)

Item | ES | Item | ES | Item | ES | Item | ES
W | 0.79 | ON | 1.02 | OU | 1.39 | HIM | 0.9
A | 0.70 | HI | 1.01 | ON | 0.98 | FOR | 0.78
C | 0.67 | IN | 0.92 | HI | 0.97 | HAVE | 0.69
P | 0.67 | TO | 0.9 | EA | 0.9 | OF | 0.68
B | 0.66 | EA | 0.88 | IN | 0.85 | SHE | 0.67
S | 0.66 | AT | 0.8 | TO | 0.84 | AND | 0.64
H | 0.65 | OR | 0.8 | OR | 0.74 | TO | 0.6
U | 0.65 | EN | 0.74 | IT | 0.73 | THAT | 0.6
E | 0.63 | RE | 0.74 | AT | 0.72 | THEY | 0.55
F | 0.61 | IT | ... | EN | 0.71 | THE | 0.53
N | 0.61 | ER | 0.7 | RE | 0.66 | THIS | 0.53
D | 0.59 | IS | 0.68 | IS | 0.65 | BE | 0.52
I | 0.59 | HA | 0.67 | TH | 0.64 | ALL | 0.52
O | 0.59 | AN | 0.66 | HA | 0.63 | YOU | 0.48
Y | 0.59 | ND | ... | ND | 0.62 | IT | 0.45
M | 0.54 | TH | 0.62 | AN | 0.6 | WILL | 0.41
R | 0.52 | HE | 0.56 | ED | 0.59 | IS | 0.39
T | 0.52 | TE | 0.4 | ER | 0.57 | A | 0.34
L | 0.50 | ED | 0.36 | HE | 0.54 | HE | 0.31
G | 0.49 | OU | ... | TE | 0.36 | I | 0.26
AVG | ... | | ... | | ... | | ...

5.4.5. Subset Feature Performance

This section reports each feature's FAR values for the 5, 10, and 15 most discriminative items based upon the two approaches applied (p-values and effect sizes). Moreover, this section reports the FAR values of the 5, 10, and 15 most frequent items of each feature. This comparison is important because it indicates that some feature items may be more dependable than others because they originated from a larger sample set or had a generally higher recurrence in the written language. For instance, 'a', 'e', and 't' ought to carry more measurement weight than 'q' or 'z'.

Figure 5.4 illustrates the trade-off of the FAR values for various subsets of feature items based on the applied approaches (p-values and effect sizes). The performance of all

features improved using both approaches. For instance, the FAR value of feature F4 was reduced from 12% to 6% when using the 15 most discriminative items based on the effect-size approach, and to 5% based on the p-values approach.

Figure 5.4: FAR values for the 5, 10, and 15 most discriminative items based upon the two approaches applied using 28 users' data. (a) Based on p-values. (b) Based on effect-size values.

Figure 5.5 shows the trade-off of the FAR values for the 5, 10, and 15 most frequent items of each feature. As shown, the performance of features F1, F2, and F4 somewhat improved, but that of feature F3 did not.

Figure 5.5: FAR values for the 5, 10, and 15 most frequent items of each feature using 28 users' data.

5.5. Conclusion

We return to the questions posed in Section 5.2 and address them:

Q1. Which keystroke timing feature(s) among key duration, flight time latency, digraph time latency, and word total time duration performs better in keystroke dynamics?

The reported results varied based on the method employed and the number of participants involved in the analysis. When using the Mann-Whitney statistics test applied to 8 users and 8 items of each feature, key duration (F1) obtained the best result among the four features, with a FAR value equal to 6.5%, followed by word total time duration (F4) with a FAR value of 10%. On the other hand, when applying a voting schema on p-values generated using Mann-Whitney with data of 28 users and 20 items of each feature, the digraph time latency (F3) feature obtained the best result among the four features, with a FAR value equal to 3%, followed by F2, with a FAR value of 6%. Both approaches achieved 0% FRR for all features. Therefore, we can conclude that when applying a voting schema on p-values generated using Mann-Whitney, the digraph time latency (F3) feature obtains the best result among the four features, with a FAR value equal to 3% and an FRR value equal to 0%.

Q2. Does any combination of timing features improve the accuracy of the authentication scheme?

The results showed that, using the Mann-Whitney statistics test applied on 8 users and 8 items of each feature, the combination of features does not improve the accuracy of authentication.

Q3. Which feature items contribute more to the accuracy of the model, where feature items are defined as the exact instances of letters in each feature, e.g., "a", "th", etc.?

The results reported in Section 5.4.2 show that some items perform better than others. For the F1 feature, the items E, A, I, T, and O performed the best among its 20 items. For the F2 feature, the items OF, OU, IN, TH, and TO performed the best among its 20 items. For the F3 feature, OF, OU, IT, IN, and TH performed the best among its 20 items. For the F4 feature, the items OF, IS, AND, ALL, and WILL performed the best among its 20 items.

Q4. How does the word total time duration feature perform compared with key duration, flight time latency, and digraph time latency?

The results reported in Section 5.4.2 show that F4 performed about average relative to the other features. When the Mann-Whitney statistics test was applied on 8 users and 8 items of each feature, word total time duration (F4) ranked second after key duration (F1), with a FAR value of 10% and an FRR value of 0%.

CHAPTER 6
KEYSTROKE FEATURE SELECTION USING MACHINE LEARNING-BASED CLASSIFICATIONS

While Chapter 5 focused on the application of statistical methods based on finding differences in statistical parameters (mean and standard deviation) between two samples in order to find out which features are most effective in distinguishing users, this chapter focuses on the application of machine learning-based classification methods to compare feature performance in terms of distinguishing users. More specifically, five machine learning techniques are adapted for keystroke authentication in this chapter. The selected classification methods are Support Vector Machine (SVM), One-Class Support Vector Machine (OC-SVM), Linear Discriminate Classifier (LDC), K-Nearest Neighbors (K-NN), and Naive Bayesian (NB). The adapted techniques are selected because of their prevalence in the literature. Furthermore, these techniques are accompanied by a good and comprehensive set of methods and implementations that enables the conducting of various forms of classification analysis.

This chapter is organized as follows. Section 6.1 presents the research questions to be addressed in this chapter. Section 6.2 describes and explains the analysis methods used. Section 6.3 explains the conducted experiments and their results. Section 6.4 provides a conclusion and answers to the questions posed.

6.1. Research Questions

In addition to the research questions addressed in the previous chapter (outlined in Section 5.2), this chapter will address the following research question:

Q1. Which authentication algorithm(s) among SVM, LDC, NB, and KNN performs better in keystroke dynamics? This is a critical question to address, since the algorithm employed plays an important role in influencing the classifier error rates of the continuous authentication detector [5].

6.2. Machine Learning Methods

Five machine learning techniques are adapted for keystroke authentication in this chapter. The selected classification methods are Support Vector Machine (SVM), One-Class Support Vector Machine (OC-SVM), Linear Discriminate Classifier (LDC), K-Nearest Neighbors (K-NN), and Naive Bayesian (NB). The following subsections describe these methodologies and how they were applied to the collected data.

6.2.1. Support Vector Machine (SVM)

Support vector machines (SVMs) are an excellent technique for data classification. SVM is a discriminative classifier based on the concept of a separating hyperplane. In other words, given a set of labeled training data, the SVM objective is to find an optimal separating hyperplane that separates the training data set by a maximal margin. Given training data labeled as belonging to one of two classes, (+1) or (-1), the SVM outputs an optimal hyperplane that categorizes new observations.

Generally, SVM classification is a two-phase process. In the first phase, the training phase, the SVM classifier builds a model using the training vectors. The algorithm projects the two classes of training data into a higher dimensional space, then finds a linear separator between the two classes. Thus, all vectors located on one side of the hyperplane are labeled as 1, and all vectors located on the other side are labeled as -1. In the second phase, the testing phase, the testing vectors are mapped into the same high-dimensional space and their category is predicted according to which side of the hyperplane they fall on.

In addition to the traditional linear classification, SVMs can effectively execute a non-linear classification using what are called kernel functions. The two most used kernel functions with SVM are the polynomial kernel and the radial basis function.

While the polynomial kernel produces polynomial boundaries of degree d in the original space, the radial basis function (RBF) produces boundaries by placing weighted Gaussians upon key training instances. The RBF kernel nonlinearly projects training data samples into a higher dimensional space and can therefore deal with cases in which the relation between class labels and attributes is nonlinear. Thus, this kernel is distinct from the linear kernel.

6.2.2. Linear Discriminate Classifier (LDC)

The Linear Discriminate Classifier (LDC) is a simple and mathematically strong classification method that is based upon the concept of searching for a linear combination of variables which can best separate two or more classes. LDC tries to find a linear combination of variables (predictors) that maximizes the difference between the means of the data while at the same time minimizing the variation within each group of the data. Therefore, LDC usually looks for a projection where samples from the same class are projected close to each other and, at the same time, the projected means are as separate as possible.

Assume that we have C classes with a mean vector μi for each class. We are then interested in finding a scalar y by projecting the samples onto a line:

    y = W^T x + W0

where x is a sample vector, W is a weight vector that represents the coefficients, and W0 is the bias, also called the threshold, which determines the decision surface. In order to find the optimum W, first the within-class scatter matrix SW is computed by the following equation:

    SW = sum_{i=1}^{C} Si

where Si = sum_{x in class i} (x - μi)(x - μi)^T represents the scatter matrix for class i, and μi is the class mean vector. Similarly, the between-class scatter matrix is computed by the following equation:

    SB = sum_{i=1}^{C} Ni (Mi - m)(Mi - m)^T

where m is the overall mean, and Mi and Ni are the sample mean and size of the respective class. Finally, the linear discriminant can be expressed as the projection W that maximizes the ratio of between-class to within-class scatter:

    W* = argmax_W |W^T SB W| / |W^T SW W|

6.2.3. K-Nearest Neighbor

The K-nearest neighbor classifier is one of the simplest and most well-known classifiers in pattern recognition. The main principle behind it is to calculate the distance from a query point to all of its neighbors in the training set and to select the k closest ones. Specifically, in the training phase, the KNN classifier simply stores the training vectors. In the testing phase, the distances between the query point and all its neighbors are calculated. A majority vote is then used to classify a new point to the most prevalent class among its K nearest neighbors. In other words, KNN classifies new points based on a similarity measure (e.g., distance functions). In general, the distance can be any metric measure; however, the Euclidean distance is the most frequent choice. The Euclidean distance is given by the following equation:

    d(x, y) = sqrt( sum_{i=1}^{n} (x_i - y_i)^2 )

6.2.4. Naïve Bayes

The Naive Bayes classifier technique is a well-known machine learning method based on Bayes's rule of conditional probability with a naive independence assumption between every pair of variables. Specifically, the Naive Bayes classifier provides a method of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c), as seen in the following equation:

    P(c|x) = P(x|c) P(c) / P(x)

where:
- P(c|x) is the posterior probability of the target class given the predictor (attribute).
- P(c) is the prior probability of the class.
- P(x|c) is the likelihood, which is the probability of the predictor given the class.
- P(x) is the prior probability of the predictor.

6.2.5. One-Class SVM (OC-SVM)

One-Class SVM (OC-SVM) is an extension of SVM that can train a classification model from only one class, without the need for negative samples. Basically, this SVM variant was developed for the purpose of anomaly detection. It maps the data from one class into a high-dimensional space and finds a linear separator between the mapped data and the origin. In the training phase, the detector builds only a one-class model using the training samples.
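As an illustration of this one-class setup, the following R sketch uses the svm function from the e1071 package with type = "one-classification" (a minimal example on hypothetical timing vectors, not the exact configuration used in this dissertation):

    library(e1071)

    # Hypothetical training matrix: 20 repetitions x 20 feature items
    # (e.g., key durations in ms) for the legitimate user only.
    set.seed(1)
    train <- matrix(rnorm(20 * 20, mean = 90, sd = 10), nrow = 20)

    # Train a one-class SVM with an RBF kernel on the single class.
    oc_model <- svm(train, y = NULL, type = "one-classification",
                    kernel = "radial", nu = 0.1)

    # Test vectors: 5 genuine and 5 impostor repetitions.
    genuine  <- matrix(rnorm(5 * 20, mean = 90,  sd = 10), nrow = 5)
    impostor <- matrix(rnorm(5 * 20, mean = 130, sd = 15), nrow = 5)

    # predict() returns TRUE for vectors classified as the trained class.
    pred_genuine  <- predict(oc_model, genuine)
    pred_impostor <- predict(oc_model, impostor)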

In the testing phase, the testing vector is mapped into the same high-dimensional space. The anomaly score is then calculated as a sign-inverted distance between the testing vector and the linear separator.

6.2.6. R Packages

Analyses of the machine learning-based classification methods were conducted using R scripts. In particular, the knn function from the class R package and the lda function from the MASS R package (Venables and Ripley 2013) were used to carry out the K-Nearest Neighbors (K-NN) and Linear Discriminate Classifier (LDC) analyses, respectively. The svm and naiveBayes functions from the e1071 R package were used to carry out the Support Vector Machine (SVM) and Naive Bayesian (NB) analyses, respectively.

6.3. Machine Learning Analyses And Results

This section explains the analyses and results that were obtained based on the machine learning methods. Two types of analyses were conducted in this setting. In the first type, the accuracy of each feature item is measured as the percentage of positive and negative testing cases that were correctly classified by the classifier. In the second type, the accuracy of each feature is measured in terms of the false positive and false negative error rates. In the first type of analysis, the selected classification methods are the support vector machine (SVM), one-class support vector machine (OC-SVM), k-nearest neighbors classifier (K-NN), and Naive Bayesian (NB). In the second type of analysis, the selected classification methods are the support vector machine (SVM), linear discriminate classifier (LDC), k-nearest neighbors classifier (K-NN), and Naive Bayesian (NB). A minimal sketch of how such a train/test comparison can be set up with these R packages is given below.

6.3.1. Feature's Items Performance

Analysis: The aim of this analysis is to measure the accuracy of each feature item individually, so we can determine which items contribute more to the precision of the features. For this analysis, all feature items from the data set were compared using the four proposed methods: SVM, OC-SVM, K-NN, and NB.
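Before turning to the results, the following R sketch illustrates the general train/test protocol used in these analyses for one pair of users (hypothetical data; 20 repetitions per user for training and 5 for testing, as described below):

    library(e1071)

    # Hypothetical feature vectors: 25 repetitions x 20 items per user.
    set.seed(2)
    user_a <- matrix(rnorm(25 * 20, mean = 90,  sd = 10), nrow = 25)
    user_b <- matrix(rnorm(25 * 20, mean = 110, sd = 12), nrow = 25)

    # First 20 repetitions for training, remaining 5 for testing.
    x_train <- rbind(user_a[1:20, ],  user_b[1:20, ])
    y_train <- factor(rep(c("A", "B"), each = 20))
    x_test  <- rbind(user_a[21:25, ], user_b[21:25, ])
    y_test  <- factor(rep(c("A", "B"), each = 5))

    # Train a two-class SVM and classify the held-out repetitions.
    model <- svm(x_train, y_train, kernel = "radial")
    pred  <- predict(model, x_test)

    # Accuracy = correctly classified test cases / total test cases.
    accuracy <- mean(pred == y_test)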

For each user, the first 25 typed timing repetitions of all feature items were selected to build the user file. The first 20 repetitions of each feature item (of the 25) for each user were chosen for the training phase, and the remaining 5 repetitions of each feature item were used for the testing phase. For each pair of users, classifiers were trained and tested to measure the accuracy of each feature item individually according to its ability to classify users. The accuracy of each feature item is measured as the percentage of positive and negative testing cases correctly classified by the classifier. In other words, the accuracy is measured as the percentage of patterns that were correctly mapped to their true users, and is evaluated by the following formula:

    Accuracy = (T / N) x 100%

where T is the number of sample cases correctly classified and N is the total number of sample cases.

Tables 6.1 to 6.4 illustrate the accuracy values for all feature items based on SVM, OC-SVM, K-NN, and NB, respectively. We can observe that some items perform better than others. For instance, the accuracy value measured using SVM for B was 90%, while the accuracy value measured for R was 73%. The high accuracy values for feature items suggest that these items might be good indicators for distinguishing users. Moreover, in Tables 6.1 to 6.4 it is observed that some feature items appear more than once in the topmost positions of all classifiers' results. For instance, for the F1 feature, the items B, H, T, I, and Y appear among the 10 topmost F1 items. For the F2 feature, the items IN, OU, ED, EA, and TH appear among the 10 topmost F2 items. For the F3 feature, the items IN, OU, ED, HI, and EA appear among the 10 topmost F3 items. For the F4 feature, the items WILL, ALL, A, AND, and HAVE appear among the 10 topmost F4 items. The feature items that appear in the topmost positions are suggested to be good indicators for distinguishing users.

A more interesting finding demonstrated in the tables is that the F4 items perform better than the F2 and F3 items, which shows that the total time of typing the most frequently used English words might contain valuable timing information that could be utilized to differentiate users.

Apart from the accuracy values of the feature items, Tables 6.1 to 6.4 report the overall accuracy of each feature individually by calculating the average of its items' accuracy values. We use these averages to compare the four different types of keystroke features. Tables 6.1 to 6.4 show that we were able to obtain a better result using F1 compared to the other keystroke features (F2, F3, and F4). Three classifiers (SVM, OC-SVM, and KNN) out of the four used classifiers show that F1 performed better than the other features, followed by F4. For instance, the average accuracy value measured using SVM for F1 was 84%, followed by F4 with an accuracy value of 81%. Thus, we can conclude that hold times are a more effective timing feature than latency times. This result is consistent with existing work [20].

Table 6.1: Item performance using SVM (columns correspond to F1, F2, F3, and F4 items)

Item | Accuracy | Item | Accuracy | Item | Accuracy | Item | Accuracy
B | 0.90 | IN | 0.85 | IN | 0.86 | WILL | 0.87
H | 0.88 | OU | 0.80 | HI | 0.82 | ALL | 0.86
P | 0.88 | ED | 0.79 | ED | 0.81 | A | 0.85
T | 0.88 | HI | 0.79 | OU | 0.81 | AND | 0.84
Y | 0.88 | EA | 0.78 | EA | 0.8 | HAVE | 0.84
I | 0.86 | TH | 0.77 | HE | 0.8 | THE | 0.84
N | 0.86 | ER | 0.76 | TH | 0.8 | TO | 0.84
U | 0.86 | RE | 0.73 | TO | 0.8 | IS | 0.83
W | 0.86 | TO | 0.73 | EN | 0.79 | OF | 0.83
G | 0.85 | AT | 0.70 | IS | 0.79 | BE | 0.83
L | 0.85 | HE | 0.70 | ER | 0.78 | FOR | 0.81
M | 0.84 | AN | 0.68 | HA | 0.78 | IT | 0.81
O | 0.82 | EN | 0.68 | OR | 0.76 | SHE | 0.81
S | 0.82 | IS | 0.67 | RE | 0.76 | THIS | 0.80
C | 0.81 | IT | 0.67 | AN | 0.75 | HE | 0.79
D | 0.81 | ND | 0.67 | IT | 0.75 | I | 0.79
F | 0.80 | HA | 0.66 | AT | 0.74 | YOU | ...

A | 0.79 | OR | 0.66 | ND | 0.72 | THEY | 0.77
E | 0.76 | ON | 0.64 | ON | 0.7 | THAT | 0.74
R | 0.73 | TE | 0.62 | TE | 0.7 | HIM | 0.73
AVG | ... | | ... | | ... | | ...

Table 6.2: Item performance using OC-SVM (columns correspond to F1, F2, F3, and F4 items)

Item | Accuracy | Item | Accuracy | Item | Accuracy | Item | Accuracy
Y | 0.80 | OU | 0.73 | ED | 0.72 | TO | 0.79
L | 0.79 | EA | 0.71 | IN | 0.71 | HAVE | 0.77
M | 0.79 | IN | 0.70 | EA | 0.69 | OF | 0.77
T | 0.79 | HI | 0.68 | OU | 0.69 | THE | 0.76
G | 0.78 | TH | 0.68 | HA | 0.65 | A | 0.76
H | 0.77 | RE | 0.67 | OR | 0.65 | ALL | 0.75
I | 0.77 | ER | 0.65 | TH | 0.65 | AND | 0.74
N | 0.77 | ED | 0.63 | TO | 0.65 | YOU | 0.74
O | 0.77 | AN | 0.60 | HE | 0.64 | WILL | 0.73
U | 0.76 | HA | 0.60 | HI | 0.64 | FOR | 0.72
W | 0.75 | ND | 0.60 | IS | 0.64 | IS | 0.72
S | 0.74 | TO | 0.60 | IT | 0.64 | IT | 0.72
B | 0.73 | IS | 0.59 | ND | 0.64 | BE | 0.72
C | 0.73 | IT | 0.59 | TE | 0.62 | HE | 0.72
F | 0.73 | AT | 0.58 | EN | 0.61 | THIS | 0.72
D | 0.72 | EN | 0.57 | AN | 0.60 | I | 0.72
A | 0.71 | TE | 0.57 | RE | 0.59 | SHE | 0.71
R | 0.71 | HE | 0.54 | AT | 0.58 | THEY | 0.69
P | 0.70 | ON | 0.53 | ER | 0.58 | HIM | 0.69
E | 0.69 | OR | 0.53 | ON | 0.54 | THAT | 0.64
AVG | ... | | ... | | ... | | ...

Table 6.3: Item performance using KNN (columns correspond to F1, F2, F3, and F4 items)

Item | Accuracy | Item | Accuracy | Item | Accuracy | Item | Accuracy
B | 0.86 | IN | 0.86 | IN | 0.86 | A | 0.91
M | 0.86 | OU | 0.85 | IS | 0.84 | WILL | 0.9
H | 0.85 | EA | 0.81 | TH | 0.83 | ALL | 0.88
L | 0.85 | ED | 0.81 | ED | 0.82 | BE | 0.87
O | 0.85 | TH | 0.81 | HE | 0.82 | TO | 0.87
A | 0.84 | HI | 0.8 | HI | 0.82 | I | 0.87
I | 0.84 | RE | 0.8 | OU | 0.82 | AND | 0.86
S | 0.84 | ER | 0.79 | EN | 0.81 | OF | 0.86
T | 0.84 | TO | 0.74 | TO | 0.81 | THE | 0.86
Y | 0.84 | AT | 0.73 | EA | 0.80 | HAVE | 0.84
C | 0.83 | AN | 0.72 | AN | 0.78 | IS | 0.83
D | 0.83 | ND | 0.71 | ER | 0.78 | IT | 0.83
F | 0.83 | EN | 0.7 | RE | 0.78 | FOR | 0.82
G | 0.83 | HE | 0.7 | AT | 0.77 | YOU | 0.82
N | 0.83 | IT | 0.7 | HA | 0.77 | HE | 0.82
P | 0.83 | HA | 0.69 | IT | 0.77 | SHE | 0.82
R | 0.83 | IS | 0.69 | OR | 0.76 | THIS | 0.82
E | 0.82 | OR | 0.69 | ND | 0.74 | THEY | 0.82
U | 0.82 | ON | 0.65 | TE | 0.71 | THAT | 0.74
W | 0.82 | TE | 0.65 | ON | 0.70 | HIM | 0.74
AVG | ... | | ... | | ... | | ...

Table 6.4: Item performance using NB (columns correspond to F1, F2, F3, and F4 items)

Item | Accuracy | Item | Accuracy | Item | Accuracy | Item | Accuracy
F | 0.71 | IN | 0.82 | IN | 0.83 | BE | 0.79
B | 0.7 | TH | 0.74 | ED | 0.78 | TO | 0.79
C | 0.7 | ED | 0.73 | TH | 0.78 | WILL | 0.79
P | 0.7 | ER | 0.73 | HE | 0.77 | ALL | 0.79
H | 0.69 | HI | 0.73 | HI | 0.77 | HAVE | 0.78
O | 0.69 | EA | 0.7 | TO | 0.76 | THE | 0.78
T | 0.69 | HE | 0.7 | EA | 0.75 | OF | 0.77
I | 0.68 | OU | 0.7 | OU | 0.75 | A | 0.77
G | 0.67 | RE | 0.68 | HA | 0.74 | AND | 0.76
R | 0.67 | TO | 0.67 | EN | 0.73 | IS | 0.76
S | 0.67 | HA | 0.65 | ER | 0.73 | THIS | 0.75
U | 0.66 | IS | 0.65 | IS | 0.73 | HIM | 0.74
L | 0.65 | AN | 0.63 | AN | 0.71 | FOR | 0.73
W | 0.65 | IT | 0.63 | IT | 0.71 | IT | ...

Y | 0.65 | OR | 0.63 | OR | 0.71 | YOU | 0.73
D | 0.64 | AT | 0.62 | AT | 0.69 | SHE | 0.73
M | 0.64 | EN | 0.61 | RE | 0.69 | I | 0.71
A | 0.63 | ND | 0.61 | ND | 0.67 | THAT | 0.7
E | 0.63 | ON | 0.59 | ON | 0.66 | HE | 0.7
N | 0.61 | TE | 0.59 | TE | 0.65 | THEY | 0.66
AVG | ... | | ... | | ... | | ...

6.3.2. Features Performance

Analyses: The aim of this analysis is to compare features in terms of their relative effectiveness at distinguishing users. In this analysis, authentication system performance was evaluated by assigning each of the 28 users in the training data set the role of a genuine user. Then, each of the remaining users played the role of an impostor. Classifiers were then trained and tested to measure their ability to classify the genuine user and the impostors. On each repetition, we kept track of the number of false positives and false negatives produced by the classifier. The false acceptance rate (FAR) and false rejection rate (FRR) were then computed as follows:

    FAR = number of impostor attempts accepted / total number of impostor attempts
    FRR = number of genuine attempts rejected / total number of genuine attempts

We ran two analyses in this context. The aim of the first analysis was to compare the features individually in order to find out which features are most effective in distinguishing users. The purpose of the second analysis was to evaluate and study the influence of various keystroke feature combinations on the performance accuracy of the keystroke dynamics authentication system. Specifically, we compared the performance of the four features used individually against ten different feature combinations using the proposed classifiers.
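A minimal R sketch of the FAR/FRR computation above from classifier decisions (hypothetical labels) is:

    # Hypothetical classifier output: actual and predicted roles for a
    # set of test attempts ("genuine" or "impostor").
    actual    <- c("genuine", "genuine", "impostor", "impostor",
                   "impostor", "genuine", "impostor", "genuine")
    predicted <- c("genuine", "impostor", "impostor", "genuine",
                   "impostor", "genuine", "impostor", "genuine")

    # FAR: fraction of impostor attempts mistakenly accepted as genuine.
    far <- sum(actual == "impostor" & predicted == "genuine") /
           sum(actual == "impostor")

    # FRR: fraction of genuine attempts mistakenly rejected as impostors.
    frr <- sum(actual == "genuine" & predicted == "impostor") /
           sum(actual == "genuine")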

For both analyses, all features from the data set were compared using the four proposed methods: SVM, LDC, K-NN, and NB. The first 25 typed timing repetitions of all feature items (20 items of each feature) were selected to create a user profile. Then, the first 20 repetitions of each feature item (of the 25) for each user were chosen for the training phase, and the remaining 5 repetitions of each feature item were used for the testing phase.

We compared the four different keystroke features (F1, F2, F3, and F4) based on the four classification methods. Table 6.5 illustrates the FAR and FRR values for all features based on the use of SVM, LDC, K-NN, and NB. We observe that by using F4 we are able to obtain a better result compared to the other keystroke features (F1, F2, and F3). In particular, the results are noticeably better with the use of the KNN classifier. The FAR and FRR values when F4 was used were 3% and 2%, followed by F3 with 5% and 4% FAR and FRR, respectively. The low FAR and FRR values for F4 and F3 suggest that these two timing features might be good indicators for distinguishing users.

Table 6.5: Performance comparison of four different keystroke features on SVM, LDC, NB, and KNN

Feature | TC-SVM (FAR, FRR) | LDC (FAR, FRR) | NB (FAR, FRR) | KNN (FAR, FRR)
F1 | 0.08, ... | ... | ... | ..., 0.08
F2 | 0.14, ... | ... | ... | ..., 0.10
F3 | 0.09, ... | ... | ... | ..., 0.04
F4 | 0.14, ... | ... | ... | ..., 0.02

Since KNN and SVM outperformed the other classification methods, we used only KNN and SVM in the second analysis, which aims to evaluate and study the influence of various keystroke feature combinations on the performance accuracy of the keystroke dynamics authentication system. It can be seen from Figures 6.1 and 6.2 that for both methods the FAR and FRR error rates were reduced by combining more than one feature. In particular, using the KNN classifier, we were able to achieve 2% FAR and 1% FRR when combining the four features. More notably, while using the KNN classifier,

the results show that the combination of F4 with any of the other three features produced a better result as compared to combinations without F4. Furthermore, the worst results were obtained from combining F2 and F3 for both classifiers. It is possible that, because the latency times are less stable than duration times, combining the two latency times might prove insufficient for distinguishing users.

Figure 6.1: Performance comparison (FAR and FRR) of the four keystroke features used independently against the ten different combinations, performed on SVM.

Figure 6.2: Performance comparison (FAR and FRR) of the four keystroke features used independently against the ten different combinations, performed on KNN.

6.4. Conclusion

We return to the questions posed in Section 6.1 and address them; questions 1-4 are taken from Section 5.2 of the previous chapter:

Q1. Which keystroke timing feature(s) among key duration, flight time latency, digraph time latency, and word total time duration performs better in keystroke dynamics?

The results showed that when applying the four classification methods (SVM, LDC, NB, and KNN) on 28 users and 20 items of each feature, word total time duration (F4) obtains the best result among the four features. In particular, the results are noticeably better when using the KNN classifier. The FAR and FRR values achieved when using F4 were 3% and 2%, followed by F3 with 5% and 4% FAR and FRR, respectively.

Q2. Does any combination of timing features improve the accuracy of the authentication scheme?

The experiment results show that, when using the KNN and SVM classification methods with 28 users and 20 items of each feature, the combination of features enhances the performance accuracy. While the best FAR and FRR values obtained by F4 among the four features used individually were 3% and 2%, respectively, the combination of the four features achieved 2% FAR and 1% FRR when using the KNN classifier.

Q3. Which feature items contribute more to the accuracy of the model, where feature items are defined as the exact instances of letters in each feature, e.g., "a", "th", etc.?

The results reported in Section 6.3.1 show that some items perform better than others. In Tables 6.1 to 6.4 it is observed that some feature items appear more than once in the topmost positions of each classifier's results. To answer this question, the 5 feature items that appear among the 10 topmost items obtained for each feature across all classifiers are selected. Based on this approach, we conclude that, for the F1 feature, the items H, T, I, B, and Y performed the best among its 20 items. For the F2 feature, IN, OU, ED,

HI, and EA performed the best among its 20 items. For the F3 feature, IN, OU, ED, HI, and EA performed the best among its 20 items. For the F4 feature, the items WILL, ALL, A, AND, and HAVE performed the best among its 20 items.

Q4. How does the word total time duration feature perform compared to key duration, flight time latency, and digraph time latency?

The results reported in Section 6.3.2 show that F4 obtains the best result among the four features. In particular, the results are noticeably better using the KNN classifier. The FAR and FRR values achieved when using F4 were 3% and 2%, respectively.

Q5. Which authentication algorithm(s) among SVM, LDC, NB, and KNN performs better in keystroke dynamics?

For the comparison of the classification algorithms mentioned in Section 6.3.2, each of these learning algorithms was applied on the processed user patterns. The digraph latency (F3) was used to compare the classification algorithms' performance, since it is the feature most utilized by researchers in the literature. By observing Table 6.5, we notice that KNN and SVM perform the best among the four proposed classifiers. Moreover, using the KNN classifier, we are able to obtain a better result compared to the other learning classifiers (SVM, LDC, and NB) in terms of the FAR and FRR authentication error rates. The FAR and FRR values achieved were 5% and 4%, respectively.

CHAPTER 7
KEYSTROKE FEATURE SELECTION USING WRAPPER-BASED APPROACHES

While the main focus of the last two chapters (Chapters 5 and 6) was on the comparison of features in terms of their effectiveness at distinguishing users, this chapter focuses on reducing the dimensionality of features by identifying a small subset of all possible features that represents the most discriminating features, thereby decreasing the processing time. In particular, a wrapper-based feature subset selection approach is used in this chapter to reduce the dimensionality by identifying a small subset of features that represents the most discriminating features in the keystroke dynamics environment. The main advantage of wrapper approaches is that they take into account the interaction between the feature search and model selection, which can yield better classification performance because the feature selection procedure is integrated with the classification algorithm. Experimental results have shown that wrapper methods can obtain good performance, albeit with the disadvantage of high computational cost; nevertheless, a wrapper-based feature subset selection approach is used in this dissertation.

This chapter is organized as follows. Section 7.1 provides an overview of feature selection methods. Section 7.2 describes and explains the analysis methods to be used. Section 7.3 explains the obtained results.

7.1. Feature Selections

Feature selection is a major problem that has been addressed in many areas, including statistics, pattern recognition, and data mining. There are two main purposes for feature selection: reducing the dimensionality by identifying a small subset of all possible features that represents the most discriminating features, i.e., subset selection, or ranking the features according to their importance, i.e., feature ranking.

Ultimately, the main objective of feature selection is to choose a small subset of features from a large pool of features to reduce the cost and execution time of a given system, as well as to improve its accuracy. There are three main categories or approaches of feature selection, as shown in Figure 7.1 (Guyon and Elisseeff 2003):

1. Filter-based feature selection: this category ranks the features according to their importance and eliminates all features that do not achieve a sufficient score.

2. Wrapper-based feature selection: in this category, the process of the feature selection search is integrated with the classification algorithm. Thus, the importance of features depends on the classifier model used.

3. Embedded-based feature selection: this category performs feature selection as part of the training process.

Figure 7.1: Three main categories of feature selection.

7.1.1. Filter Approach

In the filter approach, the feature subset selection is performed independently of the classification algorithm. A computed weight score is assigned to each feature, and then those features with better weight scores are selected (Khushaba et al. 2008). The selected features are then presented as input to the classifier algorithm. In this approach, the process of feature selection is performed only one time, and the selected features can then be evaluated using different classifiers. This selection approach may be regarded as a preprocessing step. Thus, this approach requires no feedback from classifiers.

The main advantage of filter methods is that they are computationally simple and fast, because only a small number of features end up being used in classification. Their main limitation, however, is that they ignore the interaction between the feature subset search and the classification algorithm: all features are scored separately from the target classifier, which may lead to worse classification performance.

7.1.2. Wrapper Approach

In the wrapper approach, the feature selection search is integrated with the classification algorithm, so the importance of features depends on the classifier model used (Kohavi and John 1997). Wrapper-based feature selection has two main steps. First, it searches the space of possible features and generates candidate feature subsets by adding and removing features. Several search methods can be used to guide the search for an optimal subset; the most popular is a greedy strategy that adds or removes the best feature one at a time, using approaches such as forward selection and backward elimination. In the second step, each candidate subset is handed to a classifier, which evaluates the resulting feature set; the classifier itself determines the quality of the generated subsets.

The main advantage of wrapper approaches is that they take into account the interaction between feature search and model selection, which can yield better classification performance than filter methods because the selection procedure is driven by the classification algorithm. Experimental results show that wrapper methods can obtain better performance, albeit with the disadvantage of high computational cost. Therefore, the wrapper-based feature subset selection approach is used in this dissertation.
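To make the contrast concrete, the following is a minimal R sketch of a filter-style ranking: each timing feature is scored independently of any classifier, here with a Mann-Whitney test between samples from two users, while a wrapper (sketched in the next section) would instead re-evaluate candidate subsets with the classifier itself. The data frames and column names are hypothetical placeholders, not part of the experimental setup described in this dissertation.

```r
# Filter-style ranking: score each timing feature independently of any
# classifier, then keep the top-ranked ones. 'user_a' and 'user_b' are
# hypothetical data frames of typing samples (rows) by features (columns).
rank_features_filter <- function(user_a, user_b, keep = 10) {
  scores <- sapply(colnames(user_a), function(f) {
    # Smaller p-value => the feature separates the two users better.
    wilcox.test(user_a[[f]], user_b[[f]])$p.value
  })
  names(sort(scores))[seq_len(keep)]  # the 'keep' most discriminating features
}
```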

7.2. Wrapper-Based Feature Subset Selection Techniques

Different feature selection techniques, such as the Genetic Algorithm and greedy algorithms, are used to search for the best feature subsets. These search techniques are integrated with machine learning classifiers such as the Support Vector Machine (SVM), yielding a feature subset selection procedure that automatically selects the most appropriate subset of features and discards the rest, resulting in a more efficient model and a decrease in the required training and testing time. Figure 7.2 shows the wrapper-based approach used in this dissertation: the Genetic Algorithm, greedy algorithms, the best-first search algorithm, and Particle Swarm Optimization are integrated (wrapped) with different machine learning classifiers, namely Support Vector Machine (SVM), Naive Bayes, and K-Nearest Neighbors. As explained in the previous subsection, wrapper-based feature selection has two main steps: first, a search method such as the Genetic Algorithm explores the space of possible features and generates candidate subsets by adding and removing features; second, each candidate subset is passed to a classifier such as SVM, which evaluates the resulting feature set.

Figure 7.2: Feature subset selection using the wrapper approach.

The next subsections discuss the wrapper-based feature subset selection algorithms used in this dissertation.

7.2.1. Greedy Selection Algorithm

Greedy search strategies are the simplest among the wrapper selection algorithms. They can proceed in two different directions: Forward Selection (FS) and Backward Selection (BS). Forward Selection starts with an empty set of features and greedily adds one feature per iteration, until a predefined number of features has been selected or until no single feature addition would improve the criterion function, usually classification accuracy. Backward Selection, on the other hand, starts with the full feature set and deletes one feature per iteration, until a predefined number of features remains or until no single feature deletion would improve the criterion function. The flow chart of the proposed Forward and Backward Selection wrapper approach for feature subset selection is given in Figure 7.3.

Figure 7.3: Forward and Backward Selection wrapper approach: (a) Forward Selection; (b) Backward Selection.

The general algorithms of Forward and Backward Selection are given in Algorithms 1 and 2, respectively. In the proposed wrapper approach, key duration, flight time latency, digraph time latency, and the proposed word total time duration are given as input features.

Algorithm 1: Forward Selection Algorithm.
Input: Key duration, flight time latency, digraph time latency, word total time duration.
Output: The best feature subset.
Step 1: Start with the empty set.
Step 2: Compute the weights of all features using some criterion function.
Step 3: Select the best single feature and add it to the generated pool of candidate features.
Step 4: Repeat Steps 2 and 3 until a predefined number of features has been selected, or until no single feature addition would improve the criterion function.
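As an illustration, the following is a minimal R sketch of Algorithm 1 wrapped around a KNN classifier (Algorithm 2 below, backward elimination, is symmetric: start from the full set and drop the worst feature each round). It assumes the class package for KNN; the training and validation matrices and the hold-out accuracy criterion are hypothetical stand-ins for the Weka setup actually used in this dissertation.

```r
library(class)  # provides knn()

# Criterion function: hold-out accuracy of KNN restricted to a feature subset.
subset_accuracy <- function(feats, train_x, train_y, valid_x, valid_y, k = 3) {
  pred <- knn(train_x[, feats, drop = FALSE],
              valid_x[, feats, drop = FALSE], cl = train_y, k = k)
  mean(pred == valid_y)
}

# Greedy forward selection (Algorithm 1): grow the subset one feature at a
# time and stop when no single addition improves the criterion.
forward_select <- function(train_x, train_y, valid_x, valid_y) {
  selected <- character(0); best_acc <- 0
  repeat {
    remaining <- setdiff(colnames(train_x), selected)
    if (length(remaining) == 0) break
    accs <- sapply(remaining, function(f)
      subset_accuracy(c(selected, f), train_x, train_y, valid_x, valid_y))
    if (max(accs) <= best_acc) break  # no addition helps: stop
    best_acc <- max(accs)
    selected <- c(selected, remaining[which.max(accs)])
  }
  list(features = selected, accuracy = best_acc)
}
```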

Algorithm 2: Backward Selection Algorithm.
Input: Key duration, flight time latency, digraph time latency, word total time duration.
Output: The best feature subset.
Step 1: Start with the full set.
Step 2: Compute the weights of all features using some criterion function.
Step 3: Delete the worst feature from the generated pool of candidate features.
Step 4: Repeat Steps 2 and 3 until a predefined number of features remains, or until no single feature deletion would improve the criterion function.

7.2.2. Best-First Search Algorithm

Best-first search is a general heuristic graph-based search technique that allows backtracking along the search path. In best-first search, the search space of the given problem is viewed as a set of nodes connected by paths, and the algorithm explores the graph by expanding the most promising nodes first. It uses an evaluation function to assign a score to each candidate node and to decide which node to visit next. The algorithm maintains two lists: one containing all nodes visited so far (CLOSED) and one containing all unvisited candidate nodes (OPEN). Best-first search may run forward, starting with the empty set of features, or backward, starting with the full set; it can also search in both directions from any starting point. The flow chart of the proposed best-first search wrapper approach for feature subset selection is given in Figure 7.4.

Figure 7.4: Best-First Search wrapper approach.

The general algorithm of best-first search selection is given in Algorithm 3. In the proposed wrapper approach, key duration, flight time latency, digraph time latency, and the proposed word total time duration are given as input features.

Algorithm 3: Best-First Selection Algorithm.
Input: Key duration, flight time latency, digraph time latency, word total time duration.
Output: The best feature subset.
Step 1: Define the OPEN list, consisting of a single start node (s), and an empty CLOSED list.
Step 2: Remove n, the node in the OPEN list with the highest evaluation, from OPEN and add it to CLOSED.
Step 3: Expand node n.
Step 4: If any child of n is the goal node, return success and the path to n.
Step 5: For each child node: a) apply the evaluation function to the node; b) if the node is in neither the CLOSED nor the OPEN list, add it to OPEN.
Step 6: Go back to Step 2.

The algorithm terminates when a goal node is chosen for expansion, or when no nodes remain in the OPEN list.
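A minimal R sketch of best-first search over feature subsets follows, reusing the hypothetical subset_accuracy criterion from the previous listing as score_fn. Nodes are feature subsets, and since there is no explicit goal node in subset selection, the search stops after a fixed number of expansions fails to improve on the best subset seen so far, a common stopping rule (Weka's BestFirst search uses a similar one).

```r
best_first_select <- function(all_feats, score_fn, patience = 5) {
  key <- function(s) paste(sort(s), collapse = "+")   # canonical subset label
  open <- list(list(feats = character(0), score = 0)) # start node: empty set
  seen <- key(character(0))                           # keys of OPEN + CLOSED
  best <- open[[1]]; stale <- 0
  while (length(open) > 0 && stale < patience) {
    # Pop the highest-scoring node from OPEN (its key stays in 'seen').
    i <- which.max(vapply(open, `[[`, numeric(1), "score"))
    node <- open[[i]]; open[[i]] <- NULL
    if (node$score > best$score) { best <- node; stale <- 0 } else stale <- stale + 1
    # Expand: each child adds one feature not yet in this subset.
    for (f in setdiff(all_feats, node$feats)) {
      child <- c(node$feats, f)
      if (!(key(child) %in% seen)) {
        seen <- c(seen, key(child))
        open[[length(open) + 1]] <- list(feats = child, score = score_fn(child))
      }
    }
  }
  best  # best-scoring feature subset found
}
```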

7.2.3. Genetic Algorithm

The Genetic Algorithm (GA) is a general probabilistic optimization technique based on the ideas of natural selection and genetics. GA consists of three main operations: selection, crossover, and mutation. It works on populations of individuals rather than on single solutions, so many regions of the search space are explored at once. To find the best solution to a given optimization problem, GA first randomly produces a set of individuals, called the population, where each individual represents a potential solution of the given problem. In each generation (iteration), GA calculates the fitness values of all individuals in the population, selects several individuals according to their fitness values, and then generates a new pool of candidates with better characteristics. The selected individuals become the parents of the next generation through the two genetic operators, crossover and mutation. A new individual, called an offspring, is produced by combining two parents; this is done by the crossover operator. The mutation operator randomly alters some genes of individual parents, creating new gene values. With these newly generated gene values, the Genetic Algorithm may gain the ability to reach a better solution. The flow chart of the proposed genetic search wrapper approach for feature subset selection is given in Figure 7.5.

Figure 7.5: Genetic search wrapper approach.

The general algorithm of genetic search selection is stated in Algorithm 4. In the proposed wrapper approach, key duration, flight time latency, digraph time latency, and the proposed word total time duration are given as input features.

Algorithm 4: Genetic Algorithm Selection.
Input: Key duration, flight time latency, digraph time latency, word total time duration.
Output: The best feature subset.
Step 1: Initialize the number of generations, the initial population, the crossover rate, and the mutation rate.
Step 2: Evaluate the features with the fitness function.
Step 3: Select a feature subset.
Step 4: Perform the crossover and mutation operations.
Step 5: Generate a new pool of candidate feature subsets.
Step 6: Repeat Steps 2 to 5 until the best solution is found or the maximum number of iterations is reached. The best solutions are obtained at the end.
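A minimal sketch of GA-based subset selection is shown below, assuming the R GA package (with binary chromosomes) rather than the Weka implementation actually used in this dissertation. Each individual is a bit string over the candidate timing features; all_feature_names and knn_score are hypothetical placeholders for the feature pool and the wrapped classifier criterion.

```r
library(GA)  # assumed: the GA package, which supports binary chromosomes

# Fitness: decode a bit string into a feature subset and score it with the
# wrapped classifier criterion (e.g., subset_accuracy from the earlier sketch).
ga_fitness <- function(bits, feat_names, score_fn) {
  feats <- feat_names[bits == 1]
  if (length(feats) == 0) return(0)  # empty subsets score zero
  score_fn(feats)
}

result <- ga(type = "binary", fitness = ga_fitness,
             feat_names = all_feature_names,  # hypothetical pool of 80 names
             score_fn = knn_score,            # hypothetical wrapper criterion
             nBits = 80, popSize = 50, maxiter = 100,
             pcrossover = 0.8, pmutation = 0.1)
selected <- all_feature_names[result@solution[1, ] == 1]
```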

7.2.4. Particle Swarm Optimization

Particle Swarm Optimization (PSO) is a simple and computationally efficient optimization technique used to search a problem space for the parameters or features that maximize a particular objective. The technique was inspired by the social behavior of bird flocking and fish schooling. The PSO algorithm first randomly generates candidate solutions (called particles) within the search space; the position of each particle represents a candidate solution to the given optimization problem. All particles then move through the search space looking for better positions. Each particle keeps track of two values: its pbest value, that is, its best (highest-fitness) position in the problem space, and the gbest value, that is, the overall best position obtained by any particle in the population. On each iteration, each particle adjusts its velocity based on its own best position as well as the best position of the entire population.

Recently, PSO has been successfully used in various research areas, and it has been demonstrated that PSO can obtain good results with faster and cheaper computation compared to other methods. The flow chart of the proposed Particle Swarm Optimization wrapper approach for feature subset selection is given in Figure 7.6.

Figure 7.6: Particle Swarm Optimization wrapper approach.

The general algorithm of PSO search selection is given in Algorithm 5. In the proposed wrapper approach, key duration, flight time latency, digraph time latency, and the proposed word total time duration are given as input features.

Algorithm 5: PSO Algorithm.
Input: Key duration, flight time latency, digraph time latency, word total time duration.
Output: The best feature subset.
Step 1: Initialize the particles.
Step 2: Calculate the fitness of each particle using the fitness function.
Step 3: This step consists of three sub-steps:
  1. If the current fitness value >= pbest, set pbest to the current fitness value; otherwise keep pbest unchanged.
  2. If pbest >= gbest, set gbest to pbest; otherwise keep gbest unchanged.
  3. Update the particles' velocities and positions.
Step 4: Repeat Steps 2 and 3 until the optimal gbest value is reached.
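A minimal hand-rolled R sketch of binary PSO for subset selection follows. The standard continuous velocity update has to be mapped to bit strings; a sigmoid transfer function is one common choice and is assumed here. score_fn is a hypothetical criterion that maps a 0/1 feature-mask vector to classifier accuracy.

```r
pso_select <- function(n_feats, score_fn, n_particles = 20, iters = 50,
                       w = 0.7, c1 = 1.5, c2 = 1.5) {
  sig <- function(v) 1 / (1 + exp(-v))  # maps velocity to a flip probability
  X <- matrix(rbinom(n_particles * n_feats, 1, 0.5), n_particles)  # positions
  V <- matrix(0, n_particles, n_feats)                             # velocities
  fit <- apply(X, 1, score_fn)
  P <- X; pbest <- fit                         # per-particle best positions
  g <- X[which.max(fit), ]; gbest <- max(fit)  # swarm-wide best
  for (t in seq_len(iters)) {
    r1 <- matrix(runif(n_particles * n_feats), n_particles)
    r2 <- matrix(runif(n_particles * n_feats), n_particles)
    G <- matrix(g, n_particles, n_feats, byrow = TRUE)
    # Velocity pulls each particle toward its pbest and the global gbest.
    V <- w * V + c1 * r1 * (P - X) + c2 * r2 * (G - X)
    X <- (matrix(runif(n_particles * n_feats), n_particles) < sig(V)) * 1
    fit <- apply(X, 1, score_fn)
    upd <- fit > pbest
    P[upd, ] <- X[upd, ]; pbest[upd] <- fit[upd]
    if (max(fit) > gbest) { gbest <- max(fit); g <- X[which.max(fit), ] }
  }
  list(bits = g, score = gbest)
}
```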

7.3. Weka Analyses Tools

The wrapper-based feature selection evaluation was conducted with the Weka machine learning software package (Hall et al. 2009). All algorithms were run using Weka's default parameters.

7.4. Results

The Genetic Algorithm, greedy algorithms, the best-first search algorithm, and Particle Swarm Optimization were integrated (wrapped) with different machine learning classifiers, namely Support Vector Machine (SVM), Naive Bayes, and K-Nearest Neighbors, for the feature subset selection procedure. Table 7.1 reports the number of selected features after executing the proposed methods. From the table, it is clear that Best Forward Selection (BFS) with K-Nearest Neighbors (KNN) reduces the total feature set (80 features) by 62.25% while improving the accuracy from 90.47% to 92.85% when using only 31 features. Therefore, using Best Forward Selection with K-Nearest Neighbors allows the quickest detection while maintaining good detection accuracy with a smaller number of appropriate features.

Table 7.1: Comparison of features selected. For each method (All features, GFS, GBS, BFS, BBS, GA, PSO), the table reports the number of selected features, the accuracy (%), and the feature reduction (%) under the SVM, NB, and KNN classifiers. [The numeric cell values were not preserved in this transcription.]

CHAPTER 8
CONTINUOUS USER AUTHENTICATION BASED ON SEQUENTIAL CHANGE-POINT ANALYSIS

Detecting the malicious activity caused by another person (impostor) who takes over an active session of an authorized user often requires predefined typing models for legitimate users and impostors. In most cases, however, it is difficult or even impossible to build prior typing models of the users (legitimate or impostors). This chapter introduces a new anomaly-detection approach based on sequential change-point methods for early detection of attacks in computer authentication. The proposed detection algorithms have several advantages. First, these techniques can be implemented online and hence enable the building of continuous user authentication systems without the need for any user model in advance. Second, they minimize the average delay of attack detection while maintaining an acceptable detection accuracy rate.

This chapter is organized as follows. Section 8.1 provides the motivation of the chapter. Section 8.2 states the research questions to be addressed. Section 8.3 provides an overview of intrusion detection. Section 8.4 explains the change-point detection methods to be used. Section 8.5 describes the conducted experiments and their results. Section 8.6 provides a conclusion.

8.1. Motivations

Existing schemes of keystroke dynamics require building predefined typing models for legitimate users, impostors, or both. However, it is difficult or even impossible in some situations to access or collect typing data of users (legitimate or impostors) in advance. For instance, consider a personal computer that a user carries to college or to a cafe: only the computer owner (the legitimate user) is known in advance. As another example, consider a computer at a public library that allows guest users to log on: in this case, none of the system users are known in advance.

To address these types of cases, a new automated and flexible technique should be built that is able to authenticate users without any prior user typing model. The main challenge in such scenarios is to build a profile of the valid user while simultaneously trying to determine whether or not the active session has been taken over by another person.

8.2. Research Questions

The main research questions that will be addressed in this chapter are:

Q1. Can we detect an impostor who takes over from the legitimate user during a computer session when no predefined typing models are available in advance, either for legitimate users or for impostors?

Q2. Is it possible to detect an impostor in the early stages of an attack? If so, how much typing data does a system need in order to successfully detect the impostor?

8.3. Intrusion Detection

A continuous authentication system can be considered a kind of intrusion-detection application that monitors a series of activities with the aim of detecting any malicious activity or abrupt change in system behavior. Existing intrusion detection systems fall into two categories: i) signature-detection systems, and ii) anomaly-detection systems. In signature-detection systems, the system attempts to detect attacks by comparing the observed patterns (the patterns under test) with large databases of known attack signatures. Such a system therefore only looks for specific kinds of attacks that have already been identified and listed in the database. In anomaly detection, the system defines a normal baseline for a series of activities and then compares the parameters of newly observed activities with the learned normal activities. An attack is announced once a certain deviation from the normal activities is observed.

The problem of detecting an impostor who takes over an active session of a valid user, when it is impossible to gain prior knowledge of impostor activities, can be formulated and solved using the anomaly-detection approach. In particular, the approach adopted in this chapter is based on change-point detection theory, which provides the ability to detect a change in the model distribution in the early stages of an attack by identifying the particular point where the regular behavior starts to deviate. The idea behind this approach is the observation that the takeover of a valid user's active session by another person (impostor) leads to relatively abrupt changes in statistical parameters. These changes occur at unknown points in time and should be detected as soon as possible.

8.4. Change-Point Detection

The main idea behind change-point detection is that the structure of time-series data can be described by a stochastic model, and that a modification or transformation of the data leads to abrupt changes in the data structure. Modifications in the data can occur for a number of reasons, such as a change in the mean, the variance, or the correlation function. Generally, such changes in distribution occur at unknown points in time, and the objective of change-point detection is to detect them as soon as possible. Traditionally, in change-point problems, one assumes that a sequence of $n$ time-series observations $x_1, x_2, \ldots, x_n$ is independent and identically distributed (iid) according to some distribution $F_0$. If a change occurs at some point in time $\tau$, the observations remain independent and identically distributed, but follow some other distribution $F_1$ after the change, where $F_0 \neq F_1$. This can be written as:

$$x_i \sim F_0 \;\; \text{for } i \le \tau, \qquad x_i \sim F_1 \;\; \text{for } i > \tau.$$

Figure 8.1 illustrates the general procedure of change-point detection as outlined in Hawkins et al. (2003). Generally, change-point detection algorithms identify where there is a change in the data sequence.

Figure 8.1: An illustration of change-point detection.

The change-point detection approach is typically characterized by three performance metrics: the false alarm rate, i.e., an alarm flagged before the change occurred (Type I error); the misdetection rate, i.e., a system failing to detect a change (Type II error); and the detection delay, i.e., the time period from the start of the change until the alarm is flagged. Detection delay can also be measured by the amount of data a system needs before signaling a change.

Generally, there are two main settings for detecting changes: batch detection and sequential detection (Basseville and Nikiforov 1993). In the former, there is a fixed-length sequence of time-series observations, and it is necessary to determine whether or not the observations are statistically homogeneous and, if the sequence is not homogeneous, to determine the particular point in time when the change occurred. This type of change-point detection is also called retrospective, because the decision as to whether a change occurred is based on all observations. In the latter setting, the sequence of time-series observations does not have a fixed length. Observations arrive and are tested sequentially over time: whenever a new observation is added, a test is made to determine whether a change has happened, using only the observations that have arrived so far. If the test decides no change has occurred, the next observation is added to the observation set and the test is repeated.

Traditionally, change-detection methods depend on the setting of the change-detection problem. For batch-detection problems, the most commonly used methods are likelihood ratio testing (Hinkley and Hinkley 1970) and Bayesian inference (Stephens 1994). On the other hand, control chart methods such as the Cumulative Sum (CUSUM) (Page 1954) and Exponentially Weighted Moving Averages (EWMA) (Roberts 1959), or sequential Bayesian methods (Fearnhead and Liu 2007), are used in sequential change-point problems.

Sequential change-detection problems vary based on the assumptions made about the pre-change and post-change distributions. The traditional control chart methodologies, such as CUSUM and EWMA, usually assume that the data follow some parametric distribution, mostly a normal distribution, and require complete knowledge of the pre-change and post-change distribution parameters (mean, variance, or both). In many practical cases, however, these parameters are unknown, and in many situations information about the data distribution may not be available at all.

Recently, advanced work in the change-point model framework has been developed to detect changes in cases where the available information about the pre-change and post-change distributions is very restricted. The developed methodologies employ several parametric and nonparametric statistical tests to enable the detection of changes in a diversity of sequences (Hawkins and Deng 2010, Hawkins et al. 2003, Hawkins and Zamba 2005, Ross and Adams 2012, Ross et al. 2011). The initial work was introduced by Hawkins et al. (2003) and concentrated on detecting changes in the mean or variance of normally distributed random variables, assuming that no prior information is available about the distribution parameters. Hawkins et al. (2003) developed an approach based on the logic of two-sample hypothesis testing, where the null hypothesis assumes that all observations in the sequence come from the same distribution and no change point occurs, against the alternative hypothesis, which assumes that there is a change point that splits the sequence of observations into two sections.

This work has been expanded by other researchers to detect more complicated changes, including cases where the distribution of the incoming observations is unknown (Ross and Adams 2012, Hawkins and Deng 2010, Zou and Tsung 2010). The change-point model techniques developed in Hawkins et al. (2003) can be used in both the batch-detection and the sequential-detection settings. The next two subsections explain in detail how the developed techniques work: first a review of the batch-detection scenario, and then its sequential extension.

8.4.1. Batch Change-Point Detection

In the batch change-point setting, there is a fixed-length sequence of time-series observations $x_1, \ldots, x_n$, and it is necessary to decide whether the observations contain a change point or not. If no change exists, the observations are independent and identically distributed according to some distribution function $F_0$. If a change occurs at some point in time, the observations after that point are independent and identically distributed according to some other distribution function $F_1$, where $F_0 \neq F_1$; the change point is located immediately after some specific observation. In particular, the batch change-detection problem takes the form of the hypothesis test

$$H_0: x_i \sim F_0, \; i = 1, \ldots, n \qquad \text{versus} \qquad H_1: x_i \sim F_0 \; (i \le k), \quad x_i \sim F_1 \; (i > k),$$

where $k$ represents the observation after which a change point occurs at an unknown point in time. This is a classical problem that can be solved using a two-sample hypothesis test, where the choice of statistical test depends on what is assumed to be known about the distribution of the observations. For instance, if the observations are assumed to follow a normal distribution, then a Student-t test is suitable for detecting a change in mean, and an F test for detecting a change in scale.

In other cases, when no assumption is made about the distribution of the observations, a nonparametric test such as the Mann-Whitney test can be used to detect location shifts, and the Mood test to detect scale shifts. Once the two-sample statistical test is chosen, its computed statistic can be compared to an appropriate threshold; if the computed value is greater than the threshold, the null hypothesis that the two samples come from the same distribution is rejected, and we say that a change has occurred.

Since the location of an abrupt change point in the statistical model is not known in advance, the two-sample statistic is computed for every possible split point $k$, and the maximum value is used. In particular, every potential way of dividing the sequence of observations into two non-overlapping sets is considered, and the two-sample statistical test is applied at each splitting point to test for a change. Writing $D_{k,n}$ for the two-sample statistic at split $k$, the test statistic is

$$D_n = \max_{k} \left| \tilde{D}_{k,n} \right|, \qquad \tilde{D}_{k,n} = \frac{D_{k,n} - \mu_{D_{k,n}}}{\sigma_{D_{k,n}}},$$

where $\mu_{D_{k,n}}$ and $\sigma_{D_{k,n}}$ are the mean and standard deviation of $D_{k,n}$ under the null hypothesis, so that $\tilde{D}_{k,n}$ has a standardized mean of 0 and a variance of 1. If $D_n > h_n$, the null hypothesis that no change has occurred is rejected, where $h_n$ is a suitably selected threshold that limits the Type I error rate to the given value $\alpha$. Finally, the maximum likelihood estimator of the change-point location is the observation that immediately follows the split maximizing the statistic, i.e.:

$$\hat{\tau} = \underset{k}{\arg\max} \left| \tilde{D}_{k,n} \right|.$$
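To make the maximization concrete, here is a minimal R sketch of the batch scan using the Mann-Whitney statistic. It is a sketch of the idea only: it computes the standardized statistic and the change-point estimate, while the cpm package introduced below also supplies the thresholds $h_n$.

```r
# Batch change-point scan: standardize a two-sample Mann-Whitney statistic
# at every split k and take the maximum over all splits.
batch_scan <- function(x) {
  n <- length(x)
  D_tilde <- sapply(2:(n - 2), function(k) {
    W <- sum(rank(x)[1:k]) - k * (k + 1) / 2      # Mann-Whitney U of first segment
    mu <- k * (n - k) / 2                         # null mean of U
    sigma <- sqrt(k * (n - k) * (n + 1) / 12)     # null standard deviation of U
    abs((W - mu) / sigma)
  })
  list(D = max(D_tilde),                          # compare against threshold h_n
       tau_hat = (2:(n - 2))[which.max(D_tilde)]) # estimated change-point location
}
```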

8.4.2. Sequential Change-Point Detection

The two-sample hypothesis-testing approach used in the batch-detection scenario can be extended to the sequential change-detection scenario, where new observations arrive and are tested sequentially over time (Ross 2013). The main idea is that whenever a new observation arrives, it is appended to the observation sequence, the sequence is treated as a fixed-length sample, and the change-point statistic is computed using the batch methodology above. To be specific, let $t$ denote the number of observations that have arrived so far, which increases over time. Whenever a new observation arrives, it is added to the sequence, the sequence is treated as a fixed-size sample, and $D_t$ is computed using the methodology from the previous section. If $D_t > h_t$ for an appropriately selected threshold $h_t$, a change is detected. If no change is detected, the next observation is added to the sequence, $D_{t+1}$ is computed and compared to $h_{t+1}$, and so on.

While it is possible to determine a threshold that limits the Type I error rate for a fixed-length sample, such a threshold is not appropriate for sequential change detection. In this setting, the Type I error rate is linked to the Average Run Length (ARL), that is, the expected number of observations checked before the first (false) signal of change under the null hypothesis, which is equal to $1/\alpha$. Thus, the sequence of thresholds $\{h_t\}$ is selected so that the probability of a Type I error is held constant over time under the null hypothesis that no change point occurs:

$$P(D_t > h_t \mid D_{t-1} \le h_{t-1}, \ldots, D_1 \le h_1) = \alpha.$$

However, the conditional distribution in the previous equation is computationally intractable, and a Monte Carlo simulation method is used to compute the sequence of $h_t$ values that corresponds to the selected $\alpha$. The choice of the Average Run Length (ARL) affects both the detection delay and the false alarm rate: a low ARL value results in quicker change detection at the cost of more false alarms (Ross et al. 2011). Therefore, selecting an appropriate ARL value depends on the security level of the considered application.

While the major work on this approach concentrated on detecting shifts in the mean or variance of normally distributed random variables, other researchers have extended it to detect more complicated changes, including nonparametric change detection for non-normal distributions when no prior information is available even about the sequence distribution. Hawkins et al. (2003) used the Mann-Whitney test, and Ross et al. (2011) the Mood test, to detect changes in location and scale parameters, respectively, when no assumptions about the sequence distribution are made. Ross and Adams (2012) used the Lepage and Cramer-von-Mises statistics to detect more general changes, again without distributional assumptions.

8.4.3. The cpm R Package

A recently published R package called cpm (Ross 2013) implements most of this research for detecting changes in the distribution of a sequence of time-series observations, in both the batch-detection and the sequential-detection settings. In this dissertation, we employ the cpm package to detect changes in sequences of time-series keystroke dynamics observations. In particular, we use the detectChangePoint function from the cpm package to detect changes in a stream of observations that arrive and are tested sequentially over time.
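The following is a minimal sketch of how a single genuine-then-impostor sequence can be tested with detectChangePoint, mirroring the experimental setup described in the next section. The function and argument names follow Ross (2013); the ARL0 value, the startup length, and the timing vectors are illustrative assumptions.

```r
library(cpm)

# One trial: 50 hold times of the genuine user followed by 50 of an
# impostor, so the true change point sits at observation 50.
seq_obs <- c(genuine_holds[1:50], impostor_holds[1:50])  # hypothetical vectors
res <- detectChangePoint(seq_obs, cpmType = "Lepage", ARL0 = 500, startup = 20)

if (!res$changeDetected) {
  outcome <- "impostor accepted (missed detection, counts toward FAR)"
} else if (res$detectionTime <= 50) {
  outcome <- "genuine user rejected (false alarm, counts toward FRR)"
} else {
  delay <- res$detectionTime - 50  # observations typed before the alarm
  outcome <- sprintf("impostor caught after %d observations", delay)
}
```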

8.5. Analyses

Several analyses were conducted to determine the effectiveness of the sequential change-detection methods in continuous keystroke-based systems, with the aim of detecting impostors automatically. Additionally, the analysis results report how quickly the impostor is detected. This section presents the data set used, the experimental method, and the results obtained.

8.5.1. Data Set

In the previous chapters, the experiments focused on evaluating the influence of various keystroke features on the performance accuracy of the keystroke dynamics system; therefore, typed timing repetitions of all feature items were used to build user profiles in order to find which of the four features (key duration, flight time latency, digraph time latency, and word total time duration) performs best. This chapter, however, focuses only on the extracted key press duration feature (F1) for the 11 most frequently appearing English letters (e, a, r, i, o, t, n, s, h, d, l) (Gaines 1956). Sequences of time-series data for each of the 28 users are created based on those 11 letters, which together are estimated to represent about 77% of any written English text.

8.5.2. Analyses Methods

To evaluate the effectiveness of the proposed sequential change-detection method for a keystroke-based continuous user authentication system, we use four performance metrics:

False Acceptance Rate (FAR): the percentage of times a test does not signal a change point in the sequence when one is there. This can also be called the misdetection rate.

False Rejection Rate (FRR): the percentage of times a test signals a change point in the sequence when none is there. This can also be called the false alarm rate.

Equal Error Rate (EER): the point where the FAR value is equal to the FRR value. It can be estimated from the Receiver Operating Characteristic (ROC), which plots FAR values against FRR values under different threshold values. Different thresholds correspond to different choices of the Average Run Length, which is equal to $1/\alpha$.

Average Detection Delay (ADD): the average amount of data a test needs to signal a change point in the sequence when one is there.

Authentication system performance was evaluated by assigning each of the 28 users in the data set the role of the genuine user in turn; each of the remaining users then plays the role of the impostor. On each repetition, the first 50 typed timing repetitions of the underlying letter, for both the genuine user and the impostor, were selected to create the sequence of observations. Thus, a sequence consists of 100 timing repetitions: the first 50 represent the genuine user, and the last 50 represent the impostor. The detectChangePoint function from the cpm package is then used to detect a single point of change in the sequence of observations and to estimate the location of the observation at which the change occurred, which in our case should be after 50 observations. The detectChangePoint function processes the entire sequence: if a change is detected, the observation at which the change was detected is reported; if no change is detected, the function returns 0. Therefore, authentication system performance was evaluated as follows:

If the test signals a change point before 50 observations, we say that an authorized user was rejected by the system.

If the test does not signal a change point in the sequence, we say that the test was not able to detect a change, and an impostor was accepted by the system.

If the test signals a change point after 50 observations, we record the location where it was detected and calculate the detection delay in terms of the amount of data the system needed to signal a change point in the sequence.

Figure 8.2 illustrates the change-point detection approach used in this dissertation. The first 50 typed timing repetitions belong to the genuine user, and the last 50 belong to the impostor; the attack thus begins at observation 51.

Figure 8.2: An illustration of sequential change-point detection. The detection delay is measured by the amount of data a test needed to signal a change point in the sequence.

This procedure was repeated for all 28 users, yielding 756 unique sequences of data for each of the 11 letters to be tested for a change. In the ideal case, both error rates (FAR and FRR) would equal 0%. However, there is a trade-off between these two metrics: decreasing one is generally not possible without increasing the other. Thus, choosing which threshold to use is a very important issue in biometric authentication systems and may depend heavily on the security level of the application. For instance, in domains where high security is required, FAR needs to be as low as possible so that as many impostors as possible are detected. Therefore, to compare the performance of our detection system under different thresholds, various ARL values are used. In particular, we test the performance of the detection system for ARL values ranging from 100 to 1000 in increments of 100.

In each round, we calculate the False Acceptance Rate (FAR), the False Rejection Rate (FRR), and the Average Detection Delay (ADD), measured by the amount of data the test needed to detect a change point. Calculating FAR and FRR under different thresholds allows us to estimate Equal Error Rate (EER) values with which to compare the letters' performance. For instance, Figure 8.3 shows the performance of the detection system using different ARL values for letter E. The point where the FAR value crosses the FRR value is used to estimate the EER and the ADD from Table 8.1, which reports the corresponding values of ARL, FAR, FRR, and ADD. In this case, as shown in Figure 8.3, the estimated EER is equal to 16%, and from Table 8.1 we can see that this value falls between 13% FAR and 21% FRR. The best FAR, FRR, and ADD are obtained when ARL = 200, and in this case the estimated EER is equal to 16%.

Figure 8.3: Performance of the detection system using different ARL values for letter E.

Table 8.1: Performance of the detection system using different ARL values for letter E (columns: ARL, FAR, FRR, ADD). [The numeric values were not preserved in this transcription.]

In this setting, we ran several different experiments using different statistical test techniques, both parametric and nonparametric, to detect different kinds of changes: in particular, the Mann-Whitney and Mood statistical tests to detect changes in location and scale parameters, respectively, and the Lepage and Cramer-von-Mises statistics to detect more general changes.

8.5.3. Analyses Results

This section reports the results obtained from the conducted experiments. We ran three experiments. The purpose of the first experiment was to determine which letter performs best among all eleven letters. The aim of the second experiment was to investigate how the number of observations available from the pre-change distribution affects the performance. The aim of the third experiment was to compare the performance of different statistical test techniques.

1. Letter performance

This section reports the accuracy measures of each letter to determine which letter performs best among all eleven letters. Table 8.2 reports the performance values of the 11 most frequent letters using sequential change-point detection with the Lepage statistical test. For each letter, the table shows the estimated EER, the [FAR, FRR] interval within which the EER falls, and the estimated Average Detection Delay (ADD), measured by the average number of letters a test needed to signal a change point in the sequence when one was there.

We can observe that some letters perform better than others. For instance, the estimated EER of letter D was 12%, while the estimated EER of letter N was 22%. A low estimated EER suggests that a letter might be a good indicator for distinguishing users among the eleven letters. Table 8.2 shows that letter D performs the best among all eleven letters, with an estimated EER of 12% and an estimated Average Detection Delay of 13 letters. Letter A comes in second place, with an estimated EER of 13% and an estimated Average Detection Delay of 14 letters.

Table 8.2: Letter performance using sequential change-point detection with the Lepage statistical test; the change occurs after 50 observations. [ADD values other than those for A and D, which are reported in the text, were not preserved in this transcription.]

Letter  Estimated EER  Interval [FAR, FRR]  Estimated ADD
E       0.16           [0.13, 0.21]
A       0.13           [0.17, 0.07]         14
R       0.15           [0.10, 0.21]
I       0.15           [0.16, 0.17]
O       0.18           [0.21, 0.14]
T       0.15           [0.13, 0.17]
N       0.22           [0.21, 0.25]
S       0.15           [0.14, 0.17]
H       0.15           [0.17, 0.14]
D       0.12           [0.13, 0.10]         13
L       0.25           [0.26, 0.25]

2. Pre-Change Number of Observations Impact

In the previous experiment, authentication system performance was evaluated by selecting the first 50 typed timing repetitions of the underlying letter, for both the genuine user and the impostor, to create the sequence of observations. In this section, we report the results when increasing that number from 50 to 100, in order to investigate how the number of observations available from the pre-change distribution affects performance. Table 8.3 reports the performance values for the 11 most frequent letters using sequential change-point detection with the Lepage statistical test when the number of pre-change observations is increased from 50 to 100. We can observe that changes occurring after the 100th observation are easier to detect than those occurring after the 50th. Thus, we can conclude that changes that occur later are easier to detect than those that occur earlier.

Table 8.3: Letter performance using sequential change-point detection with the Lepage statistical test; the change occurs after 100 observations. [ADD values were not preserved in this transcription.]

Letter  Estimated EER  Interval [FAR, FRR]  Estimated ADD
E       0.14           [0.10, 0.17]
A       0.13           [0.12, 0.14]
R       0.10           [0.12, 0.10]
I       0.11           [0.11, 0.14]
O       0.12           [0.11, 0.14]
T       0.10           [0.11, 0.10]
N       0.17           [0.18, 0.17]
S       0.10           [0.11, 0.10]
H       0.14           [0.13, 0.14]
D       0.09           [0.10, 0.07]
L       0.18           [0.19, 0.17]

3. Techniques Comparison

We ran several different experiments using different statistical test techniques, both parametric and nonparametric, to detect different kinds of changes: in particular, a Student-t test to detect changes in mean; the Mann-Whitney and Mood statistical tests to detect changes in location and scale parameters, respectively; and the Lepage and Cramer-von-Mises statistics to detect more general changes. Table 8.4 reports the performance values obtained with these techniques when the changes occur after the 50th and after the 100th observation. We can observe that some techniques perform better than others. Table 8.4 shows that the Lepage statistical test performs the best among all five techniques, with an average estimated EER of 16% when the changes occur after the 50th observation and 12% when they occur after the 100th observation.

Table 8.4: Letter performance using sequential change-point detection when employing different statistical test techniques (Student-t, Mann-Whitney, Cramer-von-Mises, Lepage, Mood), for changes occurring after the 50th (T = 50) and 100th (T = 100) observations, with per-letter rows (E, A, R, I, O, T, N, S, H, D, L) and an Average row for each setting. [The numeric values were not preserved in this transcription.]

8.6. Conclusion

This chapter presented in detail a new anomaly-detection approach based on sequential change-point theory, designed to detect an impostor who takes over from the legitimate user during a computer session when no predefined typing models are available in advance, either for legitimate users or for impostors. The new approach provides the ability to detect the impostor in the early stages of an attack. Letter D performs the best among all eleven letters, with an estimated EER of 12% and an estimated Average Detection Delay of 13 letters when the Lepage statistical test is employed. Further, the chapter shows that changes that occur later are easier to detect than those that occur earlier. Finally, the Lepage statistical test performs the best among all the techniques considered.

CHAPTER 9
CONCLUSION AND FUTURE WORK

A major challenge in keystroke analysis lies in identifying the major factors that influence the performance of the keystroke authentication detector. Two of the most influential factors are the classifier employed and the choice of features. This dissertation focuses on improving continuous user authentication systems based on keystroke dynamics, designed to detect the malicious activity caused by another person (impostor) who takes over the active session of a valid user, by 1) studying the impact of the selected features on the performance of the keystroke continuous authentication system; 2) proposing new timing features that can be useful in distinguishing between users in continuous authentication systems; 3) comparing the performance of the keystroke continuous authentication system under different algorithms; 4) investigating the possibility of improving the accuracy of continuous user authentication systems by combining more than one feature; and 5) proposing a new detector that does not require predefined typing models for either legitimate users or impostors.

The new keystroke timing features, based on the most frequently used English words, can be useful in distinguishing between users in continuous authentication systems based on keystroke dynamics. The results of the empirical studies show that word total time duration performs at about the average of the other features (key duration, digraph time, flight time) under the Mann-Whitney statistical test, and obtains the best result among the four features when applying the four classification methods (SVM, LDC, NB, and KNN). In particular, word total time duration provides better results with the KNN classifier.

This dissertation also investigates the possibility of improving the performance of the keystroke continuous authentication system by combining the extracted keystroke features in various ways. In particular, it compares the independent performance of four keystroke features, i) key duration, ii) flight time latency, iii) digraph time latency, and iv) word total time duration, against ten different combinations of these features.

The reported results vary depending on the method employed. The Mann-Whitney statistical test, applied to 8 users and 8 items of each feature, showed that combining features does not improve the accuracy of authentication, while experimental results using the KNN and SVM classification methods, with 28 users and 20 items of each feature, show that combining features enhances performance accuracy.

Moreover, this dissertation demonstrates that some feature items perform better than others at distinguishing between users. The empirical results show that the letters H, T, I, B, Y perform the best among the 20 letters of the key duration feature; IN, OU, ED, HI, EA perform the best among the 20 letter pairs of flight time latency; IN, OU, ED, HI, EA perform the best among the 20 letter pairs of digraph time latency; and the words WILL, ALL, A, AND, HAVE perform the best among the 20 words of word total time duration.

In this dissertation, we also introduce and compare the performance of several anomaly detection techniques in terms of their ability to distinguish between users in continuous authentication systems. In particular, four machine learning techniques are adapted for keystroke authentication: Support Vector Machine (SVM), Linear Discriminate Classifier (LDC), K-Nearest Neighbors (KNN), and Naive Bayes (NB). Experimental studies show that KNN and SVM perform the best among the four classifiers, with the KNN classifier obtaining a better result than the other learning classifiers (SVM, LDC, and NB).

This dissertation also introduces and applies feature selection techniques that can be useful in selecting the most influential features in continuous authentication systems based on keystroke dynamics, maintaining sufficient classification accuracy while decreasing the required training and testing times. The process of feature subset selection is performed by various search algorithms, namely the Genetic Algorithm, greedy algorithms, the Best-First Search Algorithm, and Particle Swarm Optimization, integrated (wrapped) with different machine learning classifiers: Support Vector Machine (SVM), Naive Bayes, and K-Nearest Neighbors.

The results show that Best Forward Selection (BFS) with K-Nearest Neighbors (KNN) reduces the total feature set (80 features) by 62.25% while improving the accuracy from 90.47% to 92.85% when using only 31 features. Therefore, Best Forward Selection with K-Nearest Neighbors allows us to achieve the quickest detection while maintaining good detection accuracy with a smaller number of appropriate features.

Finally, this dissertation presents a new anomaly-detection approach based on sequential change-point theory in order to detect an impostor who takes over from the legitimate user during a computer session when no predefined typing models are available in advance, either for legitimate users or for impostors. The new approach provides the ability to detect the impostor in the early stages of an attack. The empirical results show that letter D performs the best among all eleven letters, with an estimated EER of 12% and an estimated Average Detection Delay of 13 letters when the Lepage statistical test is employed. Additionally, the dissertation shows that changes that occur later are easier to detect than those that occur earlier.

Future work might involve introducing and comparing the performance of further anomaly detection techniques, such as neural networks and random forests, in terms of their ability to distinguish between users. We are also interested in training classifiers on smaller sample sizes and testing the impact on performance accuracy, in order to determine the minimum amount of typing data needed to build a continuous typist authentication model of a user while maintaining good system accuracy. In addition to applying sequential change-point analysis to a univariate sequence of data (letter by letter), we are also interested in applying this approach to multivariate observations.

Moreover, we are also interested in studying the performance of other features, such as flight time latency, digraph time latency, and word total time duration, using sequential change-point analysis. Further research may also focus on users' habitual behaviors on mobile devices. Mobile devices such as cellular phones have become common, exceeding 3 billion users. Nowadays, mobile devices are used to access the Internet as well as to meet important business and personal needs such as transferring money, accessing e-mail, and managing bank accounts. Therefore, continuous user authentication for mobile devices has become an important issue.

BIBLIOGRAPHY

Araújo, Lívia C. F., Luiz H. R. Sucupira, Miguel Gustavo Lizarraga, Lee Luan Ling, and João Baptista T. Yabu-Uti. "User authentication through typing biometrics features." IEEE Transactions on Signal Processing 53, no. 2 (2005).
Ashbourn, Julian. Biometrics: Advanced Identity Verification: The Complete Guide. London: Springer.
Basseville, Michèle, and Igor V. Nikiforov. Detection of Abrupt Changes: Theory and Application. Englewood Cliffs: Prentice Hall.
Bergadano, Francesco, Daniele Gunetti, and Claudia Picardi. "User authentication through keystroke dynamics." ACM Transactions on Information and System Security (TISSEC) 5, no. 4 (2002).
Bleha, Saleh, Charles Slivinsky, and Bassam Hussien. "Computer-access security systems using keystroke dynamics." IEEE Transactions on Pattern Analysis and Machine Intelligence 12, no. 12 (1990).
Brown, Marcus, and Samuel Joe Rogers. "User identification via keystroke characteristics of typed names using neural networks." International Journal of Man-Machine Studies 39, no. 6 (1993).
Cho, Sungzoon, Chigeun Han, Dae Hee Han, and Hyung-Il Kim. "Web-based keystroke dynamics identity verification using neural network." Journal of Organizational Computing and Electronic Commerce 10, no. 4 (2000).
Cohen, Jacob. Statistical Power Analysis for the Behavioral Sciences. Academic Press.
Curtin, Mary, Charles Tappert, Mary Villani, Giang Ngo, Justin Simone, Huguens St. Fort, and S. Cha. "Keystroke biometric recognition on long-text input: A feasibility study." Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS).
Dowland, P., H. Singh, and S. Furnell. "A preliminary investigation of user authentication using continuous keystroke analysis." In Proceedings of the International Conference on Information Security Management and Small Systems Security.
Fearnhead, Paul, and Zhen Liu. "On-line inference for multiple changepoint problems." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69, no. 4 (2007).
Fry, Edward B., and Jacqueline E. Kress. The Reading Teacher's Book of Lists. Vol. 55. John Wiley & Sons.
Gaines, Helen F. Cryptanalysis: A Study of Ciphers and Their Solution. Vol. 97. Courier Corporation.
Gaines, R. Stockton, William Lisowski, S. James Press, and Norman Shapiro. Authentication by Keystroke Timing: Some Preliminary Results. Santa Monica, CA: RAND Corp.
Giot, Romain, Mohamad El-Abed, and Christophe Rosenberger. "Keystroke dynamics authentication." Biometrics (2011): chapter 8.
Gunetti, Daniele, and Claudia Picardi. "Keystroke analysis of free text." ACM Transactions on Information and System Security (TISSEC) 8, no. 3 (2005).
Guyon, Isabelle, and André Elisseeff. "An introduction to variable and feature selection." The Journal of Machine Learning Research 3 (2003).
Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. "The WEKA data mining software: an update." ACM SIGKDD Explorations Newsletter 11, no. 1 (2009).
Hawkins, Douglas M., and K. D. Zamba. "A change-point model for a shift in variance." Journal of Quality Technology 37, no. 1 (2005).
Hawkins, Douglas M., and Qiqi Deng. "A nonparametric change-point control chart." Journal of Quality Technology 42, no. 2 (2010): 165.
Hawkins, Douglas M., Peihua Qiu, and Chang Wook Kang. "The changepoint model for statistical process control." Journal of Quality Technology 35, no. 4 (2003).
Hinkley, David V., and Elizabeth A. Hinkley. "Inference about the change-point in a sequence of binomial variables." Biometrika 57, no. 3 (1970).
Hu, Jiankun, Don Gingrich, and Andy Sentosa. "A k-nearest neighbor approach for user authentication through biometric keystroke dynamics." In Communications, ICC '08, IEEE International Conference on. IEEE.
Joyce, Rick, and Gopal Gupta. "Identity authentication based on keystroke latencies." Communications of the ACM 33, no. 2 (1990).
Khushaba, Rami N., Ahmed Al-Ani, and Adel Al-Jumaily. "Differential evolution based feature subset selection." In Pattern Recognition (ICPR), International Conference on. IEEE.
Killourhy, Kevin S., and Roy Maxion. "Comparing anomaly-detection algorithms for keystroke dynamics." In Dependable Systems & Networks, DSN '09, IEEE/IFIP International Conference on. IEEE.
Killourhy, Kevin, and Roy Maxion. "Why did my detector do that?! Predicting keystroke-dynamics error rates." In Recent Advances in Intrusion Detection, Lecture Notes in Computer Science. Springer (2010).
Kohavi, Ron, and George H. John. "Wrappers for feature subset selection." Artificial Intelligence 97, no. 1 (1997).
Leggett, John, and Glen Williams. "Verifying identity via keystroke characteristics." International Journal of Man-Machine Studies 28, no. 1 (1988).
Monrose, Fabian, and Aviel D. Rubin. "Keystroke dynamics as a biometric for authentication." Future Generation Computer Systems 16, no. 4 (2000).
Monrose, Fabian, and Aviel Rubin. "Authentication via keystroke dynamics." In Proceedings of the 4th ACM Conference on Computer and Communications Security. ACM.
Navarro, Daniel Joseph. Learning Statistics with R: A Tutorial for Psychology Students and Other Beginners. Lulu.com.
Obaidat, Mohammad S., and Balqies Sadoun. "Verification of computer users using keystroke dynamics." IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 27, no. 2 (1997).
Page, E. S. "Continuous inspection schemes." Biometrika (1954).
Teh, Pin Shen, Shigang Yue, and Andrew B. J. Teoh. "Feature fusion approach on keystroke dynamics efficiency enhancement." International Journal of Cyber-Security and Digital Forensics (IJCSDF) 1, no. 1 (2012).
Ratha, Nalini K., Jonathan H. Connell, and Ruud M. Bolle. "Enhancing security and privacy in biometrics-based authentication systems." IBM Systems Journal 40, no. 3 (2001).
Revett, Kenneth, Florin Gorunescu, Marina Gorunescu, Marius Ene, Sergio Magalhaes, and Henrique Santos. "A machine learning approach to keystroke dynamics based user authentication." International Journal of Electronic Security and Digital Forensics 1, no. 1 (2007).
Roberts, S. W. "Control chart tests based on geometric moving averages." Technometrics 1, no. 3 (1959).
Robinson, John, Vicky M. Liang, J. Chambers, and Christine L. MacKenzie. "Computer user verification using login string keystroke dynamics." IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 28, no. 2 (1998).
Ross, Gordon J. "Parametric and nonparametric sequential change detection in R: The cpm package." Journal of Statistical Software (2013).
Ross, Gordon J., and Niall M. Adams. "Two nonparametric control charts for detecting arbitrary distribution changes." Journal of Quality Technology 44, no. 2 (2012): 102.
Ross, Gordon J., Niall M. Adams, Dimitris K. Tasoulis, and David J. Hand. "A nonparametric change point model for streaming data." Technometrics 53, no. 4 (2011).
Sang, Yingpeng, Hong Shen, and Pingzhi Fan. "Novel impostors detection in keystroke dynamics by support vector machine." In Parallel and Distributed Computing: Applications and Technologies. Berlin/Heidelberg: Springer.
Spillane, R. "Keyboard apparatus for personal identification." IBM Technical Disclosure Bulletin 17 (1975).
Stephens, D. A. "Bayesian retrospective multiple-changepoint identification." Applied Statistics (1994).
Umphress, David, and Glen Williams. "Identity verification through keyboard characteristics." International Journal of Man-Machine Studies 23, no. 3 (1985).
Venables, William N., and Brian D. Ripley. Modern Applied Statistics with S-PLUS. Springer Science & Business Media.
Zhao, Ying. "Learning user keystroke patterns for authentication." In Proceedings of World Academy of Science, Engineering and Technology (2006).
Zou, Changliang, and Fugee Tsung. "Likelihood ratio-based distribution-free EWMA control charts." Journal of Quality Technology 42, no. 2 (2010).
