User Authentication/Identification From Web Browsing Behavior

US Naval Research Laboratory
PI: Myriam Abramson, Code 5584
Shantanu Gore, SEAP Student, Code 5584
David Aha, Code 5514
Steve Russell, Code 5584
first.last@nrl.navy.mil
DARPA AAUTH Meeting 09/19/13
The views, opinions, and/or findings contained in this article/presentation are those of the author/presenter and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.

Outline
  Objective
  Human Subject Research
  Web behavior features
  User Authentication with Ensemble of One-class SVMs
  Associative Patterns of Web Browsing Behavior
  Future work

Active Authentication Performer Overview and Status
Naval Research Laboratory (funded by NRL)
PROGRAM OVERVIEW AND STATUS
BIOMETRIC: Identification of users through Web browsing behavior.
Develop the theoretical foundations and supporting algorithms for the detection, tracking, and prediction of Web browsing behavior using information available from the address line of the browser.
The biometric is the set of activity patterns that can be captured in the browser, including the timing of clicks, the type of page visited, the length of a session, the revisit rate, etc.
Behavioral Web Analytics
Key Objectives
  Identify, extract, and analyze features of Web behavior collected in a user study.
  Investigate structured prediction methods to authenticate users based on their Web browsing behavior.
  Develop a genre palette to categorize webpages for authentication and identification purposes.
  Analyze a large clickstream dataset obtained from comScore, Inc.
Status
  Browser extensions for tracking and monitoring user Web behavior completed and deployed in an ongoing user study.
  Completed detailed analysis of the initial user study dataset and identified key features of Web browsing behavior.
  Completed user authentication approach using an ensemble of one-class SVMs and the random subspace method (best FRR: 11%, best FAR: 7%; average FRR: 17%, average FAR: 18%).
  Investigation of spatio-temporal models of Web browsing behavior with structured prediction methods under way.
Team Members
  Principal Investigator: Myriam Abramson, Code 5584
  David W. Aha, Code 5514
  Steve Russell, Code 5584

Clickstream data
Clickstream data: UserId, Time, URL visited, browsing agent
[Diagram labels: Click, Server, Internet Access Log, Clickstream data]

Human Subject Research
12 volunteers!

Web Behavior Features
Sessions: series of consecutive clicks delimited by pauses of 30 minutes or longer
Global session features
  Session duration
  Session length
  Day-of-week
  Time-of-day
  Number of unique hosts
Time-variant distributions
  Time-between-revisit distribution
  Pause distribution
  Burstiness distribution
  Genre distribution [1]
[1] http://www.diffbot.com
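As a minimal sketch of the session definition above, the following Python splits a click list wherever the pause reaches 30 minutes and computes the global session features. The `Click` record, the function names, and the feature keys are hypothetical; "session length" is assumed here to mean the number of clicks, which the slide does not state.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from urllib.parse import urlparse

SESSION_GAP = timedelta(minutes=30)  # pause that delimits sessions, per the slide


@dataclass
class Click:  # hypothetical record; field names are illustrative
    user_id: str
    time: datetime
    url: str


def split_sessions(clicks, gap=SESSION_GAP):
    """Group clicks into sessions separated by pauses of `gap` or longer."""
    sessions = []
    for click in sorted(clicks, key=lambda c: c.time):
        if sessions and click.time - sessions[-1][-1].time < gap:
            sessions[-1].append(click)   # pause shorter than gap: same session
        else:
            sessions.append([click])     # first click, or pause >= gap
    return sessions


def session_features(session):
    """Global features of one session (assumed definitions)."""
    start, end = session[0].time, session[-1].time
    return {
        "duration_s": (end - start).total_seconds(),
        "length": len(session),              # assumed: number of clicks
        "day_of_week": start.weekday(),      # 0 = Monday
        "hour_of_day": start.hour,
        "unique_hosts": len({urlparse(c.url).netloc for c in session}),
    }


clicks = [
    Click("u1", datetime(2013, 9, 19, 9, 0), "http://example.com/a"),
    Click("u1", datetime(2013, 9, 19, 9, 5), "http://example.org/b"),
    Click("u1", datetime(2013, 9, 19, 10, 0), "http://example.com/c"),
]
sessions = split_sessions(clicks)
features = [session_features(s) for s in sessions]
```

In this toy input the third click arrives 55 minutes after the second, so it starts a second session.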

Time-variant distributions
Time-between-revisit: time between webpage revisits within a certain timeframe
Pauses: time interval between 2 consecutive clicks
Burstiness: difference between 2 consecutive pauses
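The three definitions above can be sketched directly in Python. This is an illustration under assumptions: timestamps are plain seconds, the revisit window and the histogram bin edges are made-up values, and the function names are hypothetical.

```python
def pauses(times):
    """Intervals between consecutive clicks (times in seconds, ascending)."""
    return [b - a for a, b in zip(times, times[1:])]


def burstiness(times):
    """Differences between consecutive pauses."""
    p = pauses(times)
    return [b - a for a, b in zip(p, p[1:])]


def revisit_gaps(visits, window):
    """Time between successive visits to the same URL, kept if <= window."""
    last_seen, gaps = {}, []
    for t, url in visits:
        if url in last_seen and t - last_seen[url] <= window:
            gaps.append(t - last_seen[url])
        last_seen[url] = t
    return gaps


def histogram(values, edges):
    """Normalized histogram over half-open bins [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else counts


times = [0, 10, 40, 45]
visits = [(0, "a"), (10, "b"), (40, "a"), (45, "b")]

click_pauses = pauses(times)                          # [10, 30, 5]
click_bursts = burstiness(times)                      # [20, -25]
gaps = revisit_gaps(visits, window=60)                # [40, 35]
pause_dist = histogram(click_pauses, [0, 20, 60])     # two bins
```

Each list would then be binned with `histogram` to obtain the per-user distribution used as a feature vector.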

User Authentication Task: One-class SVMs
Unsupervised learning problem
Like clustering but solves a discriminative problem (self or not self)
Moves the data to a high-dimensional space with a kernel (e.g., Gaussian kernel)
LibSVM: takes the origin as the only support vector from the complement class
Authentication metrics: false rejection rate (FRR) and false acceptance rate (FAR)
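To make the kernel and the two metrics concrete, here is a small sketch. It shows the standard Gaussian (RBF) kernel and computes FRR and FAR from accept/reject decisions; it does not train an SVM. The function names and the `gamma` default are illustrative assumptions, and the decision lists are made-up toy data.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2); gamma is an illustrative default."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def frr_far(genuine_accepted, impostor_accepted):
    """Return (FRR, FAR) from accept/reject decisions.

    genuine_accepted: one bool per session of the true user
    impostor_accepted: one bool per session of any other user
    """
    frr = sum(not a for a in genuine_accepted) / len(genuine_accepted)
    far = sum(bool(a) for a in impostor_accepted) / len(impostor_accepted)
    return frr, far


genuine = [True, True, False, True]             # 1 of 4 genuine sessions rejected
impostor = [False, False, True, False, False]   # 1 of 5 impostor sessions accepted
frr, far = frr_far(genuine, impostor)
```

FRR counts genuine sessions that were wrongly rejected, and FAR counts impostor sessions that were wrongly accepted, so lowering one typically raises the other.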

Ensemble Learning: Random Subspace (Abramson et al., FLAIRS-26)
Varies the set of features of an ensemble of learners (one-class SVMs) for diversity
  Pool of learners with different feature sets
  Select subset of learners with weighted sampling on internal 2-fold cross-validation
  Weighted vote
Findings:
  No best feature(s) across all volunteers (a profile-based approach should work)
  Shorter time spans with high resolution are better discriminators
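The pool, selection, and voting steps above might be sketched as follows. This is not the authors' implementation: to stay dependency-free, a simple centroid-distance learner stands in for the one-class SVM, the 2-fold score is assumed to be the acceptance rate on held-out self data, and sampling is done with replacement. All names and parameter values are hypothetical.

```python
import math
import random


class CentroidLearner:
    """Stand-in one-class learner (NOT an SVM): accept points near the centroid."""

    def __init__(self, features):
        self.features = features  # indices of the features this learner sees

    def _project(self, x):
        return [x[i] for i in self.features]

    def fit(self, rows):
        pts = [self._project(r) for r in rows]
        self.center = [sum(col) / len(pts) for col in zip(*pts)]
        # Radius = largest training distance, so every training point is accepted.
        self.radius = max(math.dist(p, self.center) for p in pts)
        return self

    def accepts(self, x):
        return math.dist(self._project(x), self.center) <= self.radius


def two_fold_weight(features, rows):
    """Internal 2-fold CV on self data: mean acceptance rate of the held-out fold."""
    half = len(rows) // 2
    folds = [(rows[:half], rows[half:]), (rows[half:], rows[:half])]
    rates = []
    for train, held_out in folds:
        learner = CentroidLearner(features).fit(train)
        rates.append(sum(learner.accepts(x) for x in held_out) / len(held_out))
    return sum(rates) / 2


def build_ensemble(rows, n_features, pool_size=10, subspace=2, keep=5, seed=0):
    rng = random.Random(seed)
    # Pool of learners, each on a random subset of the features.
    pool = [sorted(rng.sample(range(n_features), subspace)) for _ in range(pool_size)]
    weights = [two_fold_weight(f, rows) + 1e-6 for f in pool]  # avoid all-zero weights
    # Weighted sampling (with replacement) of the learners to keep.
    chosen = rng.choices(range(pool_size), weights=weights, k=keep)
    return [(CentroidLearner(pool[i]).fit(rows), weights[i]) for i in chosen]


def ensemble_accepts(ensemble, x):
    """Weighted vote: accept when accepting learners hold more than half the weight."""
    total = sum(w for _, w in ensemble)
    yes = sum(w for learner, w in ensemble if learner.accepts(x))
    return yes > total / 2


rng = random.Random(1)
self_rows = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(30)]  # toy self data
ensemble = build_ensemble(self_rows, n_features=4)
genuine_ok = ensemble_accepts(ensemble, self_rows[0])
impostor_ok = ensemble_accepts(ensemble, [10.0, 10.0, 10.0, 10.0])
```

Swapping `CentroidLearner` for a real one-class SVM would only require an object with the same `fit` and `accepts` methods.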

Empirical results

Associative Patterns of Web Browsing Behavior (Abramson et al., AAAI Fall Symp)

Temporal ordering matters!
Shuffling clicks and then partitioning them into training and test sets preserves the original distribution and gives 100% prediction accuracy using Hamming distance on the NRL study dataset!
But preserving the temporal order of the clicks gives only 75% prediction accuracy.
[Figure labels: Volunteer 1 Train, Volunteer 1 Test, Volunteer 2 Test]
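The contrast between the two evaluation protocols can be illustrated with a short sketch. It assumes each user's behavior has already been encoded as a fixed-length binary pattern (the encoding is not described on the slide); the function names and toy profiles are hypothetical.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sum(x != y for x, y in zip(a, b))


def identify(pattern, profiles):
    """Return the user whose stored profile is closest in Hamming distance."""
    return min(profiles, key=lambda user: hamming(pattern, profiles[user]))


def temporal_split(clicks, train_fraction=0.5):
    """Train on the earliest clicks, test on the later ones (order preserved)."""
    cut = int(len(clicks) * train_fraction)
    return clicks[:cut], clicks[cut:]


def shuffled_split(clicks, train_fraction=0.5, seed=0):
    """Shuffle first, so train and test are drawn from the same distribution."""
    mixed = clicks[:]
    random.Random(seed).shuffle(mixed)
    return temporal_split(mixed, train_fraction)


profiles = {
    "volunteer_1": [1, 0, 1, 1, 0],   # toy binary patterns
    "volunteer_2": [0, 1, 0, 0, 1],
}
test_pattern = [1, 0, 1, 0, 0]
best_match = identify(test_pattern, profiles)

clicks = list(range(10))              # stand-in for time-ordered clicks
early, late = temporal_split(clicks)
train, test = shuffled_split(clicks)
```

With `temporal_split` the test set contains only behavior later than anything seen in training, which is the harder and more realistic setting the slide reports.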

Hopfield Identification Approaches
Temporal sessions, prediction accuracy (%)
Identification    NRL study         comScore
method            1st    Top 2      1st    Top 2
Tournament        75     83         72     75
All-pairs         75     100        73     81
Hamming           75     100        72     79
Tournament Approach
No significant difference with Hamming distance metric
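For readers unfamiliar with Hopfield networks, the following is a generic textbook sketch of associative recall with Hebbian weights. It is not the authors' configuration, and it does not implement the tournament or all-pairs schemes, which the slide does not define. The patterns are made-up bipolar vectors.

```python
def train_hopfield(patterns):
    """Hebbian weights for bipolar (+1/-1) patterns, with a zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w


def recall(w, state, max_sweeps=10):
    """Asynchronous unit-by-unit updates until the state stops changing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            field = sum(w[i][j] * state[j] for j in range(len(state)))
            new = 1 if field >= 0 else -1
            if new != state[i]:
                state[i], changed = new, True
        if not changed:
            break
    return state


stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],    # toy pattern for one user
    [1, -1, 1, -1, 1, -1, 1, -1],    # toy pattern for another user
]
weights = train_hopfield(stored)

noisy = [-1, 1, 1, 1, -1, -1, -1, -1]   # first pattern with one unit flipped
restored = recall(weights, noisy)
```

The network settles on the stored pattern nearest the noisy input, which is why a Hopfield-based identifier can behave similarly to a nearest-profile search under Hamming distance.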

Future Work
Temporal predictive models (CRFs)
Genre classification with categories pertinent to identification/authentication
Robustness of predictive analytics
  Concept drift (context change)
  Label noise
  Partially-labelled sequences
Intent recognition, e.g., evasive behavior