STEPHAN JOU, CTO
ISSA TORONTO
What's Behind Big Data and Behavioral Analytics
Hey. I'm Stephan Jou
CTO at Interset
Previously: IBM's Business Analytics CTO Office
Big data analytics, visualization, cloud, predictive analytics, data mining, neural networks, mobile, dashboarding and semantic search
M.Sc. in Computational Neuroscience and Biomedical Engineering, and a dual B.Sc. in Computer Science and Human Physiology, all from the University of Toronto
Email: sjou@interset.com
Twitter: @eeksock
Catching Bad Guys With Math
Threat Detection (Insider and Compromised Machine Attack) Through the Science of Behavioral Analytics
Who Is This?
Lessons:
There were limited systems in place, and we still do not know all that he took
His actions were highly anomalous
- Volumes of data
- Access to improper accounts
- Usage of USB storage devices
There was plenty of evidence and time, if only it was visible!
Who Are These Two?
Lessons:
Disgruntled insider employees can be a risk
What were the anomalies?
Copied 16,000 documents within five days of receiving severance
There was plenty of evidence and time, if only it was visible!
And This Guy?
Lessons:
Most attacks are from users/identities with proper access
Attacker stayed under the radar for years
Third parties (US Intelligence) most often uncover the attack
What were the anomalies?
Accessing data not related to his job
Moving data in ways that same-role users were not, over time
Money problems
There was plenty of evidence and time, if only it was visible!
And these guys?
"if we do this right, we will make a million dollars each"
"we could have already sold them for Bitcoins which would have been untraceable if we did it right. It could have already been easily an easy 50 grand."
Lessons:
Make sure your partners are secure
Hacked (SQL Injection) a partner with a weak network
Stole user names and passwords
Identities & machines are entities
They acted in highly anomalous ways
Moved large amounts of data
Moved data to exfiltration points
At four companies and the US Army!
There was plenty of evidence and time, if only it was visible!
How Do You Catch the Authorized User?
75% of material loss is via insiders with approved access
In 70% of IP theft cases, insiders steal information within 30 days of announcing their resignations
62% of employees believe it is acceptable to transfer work documents to personal devices or cloud-based file sharing services, even if a company policy prohibits it
60% of employees believe information they had been involved in developing is theirs, regardless of the IP protection policy of the company
51% of employees say their company does not strictly enforce policies, so they feel it is more than OK to take corporate data
20% of loss involved collaboration with one or more employees
Source: Symantec & 2011 Cyber Watch Survey, Carnegie Mellon University CERT Program
Enterprise
Where's Bad Waldo
2014 Interset, a FileTrek Company
Kung Fu Move #1: Big Data
Source: OliverMunday.com
The Four V's of Big Data (Sorry)
Data types: Transactional, Machine, Social, Reputation
The four V's: Volume, Velocity, Variety, Veracity
Kung Fu Move #2: Math
Categories: New Methods, Traditional, New Data
Adaptive Analysis: Responding to context
Continual Analysis: Responding to local change/feedback
Optimization under Uncertainty: Quantifying or mitigating risk
Optimization: Decision complexity, solution speed
Predictive Modeling: Causality, probabilistic, confidence levels
Simulation: High fidelity, games, data farming
Forecasting: Larger data sets, nonlinear regression
Alerts: Rules/triggers, context sensitive, complex events
Query/Drill Down: In memory data, fuzzy search, geo spatial
Ad hoc Reporting: Query by example, user defined reports
Standard Reporting: Real time, visualizations, user interaction
Entity Resolution: People, roles, locations, things
Relationship, Feature Extraction: Rules, semantic inferencing, matching
Annotation and Tokenization: Automated, crowd sourced
Source: Competing on Analytics, Davenport and Harris, 2007
Venn Diagram of Data Science
"Hacking" meaning computer science skills
The problem: if you choose the wrong math, you will have false positives and an ineffective system
Source: Drew Conway, http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Standard Thresholds Approach
A Pattern for Increased Monitoring for Intellectual Property Theft by Departing Insiders, Andrew Moore et al., Carnegie Mellon, 2011
The Threshold Approach Challenge
Behavioral Analytics: A simple example
Edward Snowden is copying an unusually large number of sensitive files to an external USB drive.
User: Edward Snowden was a contractor and sysadmin with privileged access
Activity: The volume of copying is large, compared to Snowden's past 30 days, and compared to other analysts
Asset: These files have a high risk and importance value
Method: USB drives are marked as high risk channels
Use Appropriate Math to Assemble the Data
$$R_{\text{behavior}} = P(\text{event} \mid y)\, w_y \left( \frac{w_u \sum_{u \in U} 2^{i} R_u[i] + w_f \sum_{f \in F} 2^{j} R_f[j] + w_m \sum_{m \in M} 2^{k} R_m[k]}{w_u + w_f + w_m} \right)$$
Term labels: Activity = P(event | y) w_y; User = the w_u sum; File = the w_f sum; Method = the w_m sum
Risk scores are percentages between 0% (no risk) and 100% (extreme risk)
P(event | y) is the probability that the behavior occurred, either observed or predicted
Aggregate risk values combine risks associated with the activity, people, assets and end points
Model based on Expected Utility Theory and the standard risk model (Risk = Probability * Impact)
Mathematical weighting is used to tune and train the model for specific activities, people, assets and end points on a per-behavior pattern basis
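To make the Risk = Probability * Impact idea concrete, here is a minimal Python sketch. It is a simplification and not Interset's implementation: it assumes a single risk value in [0, 1] for each of the user, file and method, and the function name `behavior_risk` and its parameters are hypothetical.

```python
def behavior_risk(p_event, w_activity, user_risk, file_risk, method_risk,
                  w_user=1.0, w_file=1.0, w_method=1.0):
    """Return a risk score in percent (0 = no risk, 100 = extreme risk).

    Assumption: risk = probability of the event * activity weight * impact,
    where impact is the weighted average of the user, file and method risks.
    """
    for value in (p_event, w_activity, user_risk, file_risk, method_risk):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probability, activity weight and risks must lie in [0, 1]")
    total_weight = w_user + w_file + w_method
    if total_weight <= 0:
        raise ValueError("entity weights must sum to a positive number")
    # Weighted average keeps the impact inside [0, 1].
    impact = (w_user * user_risk + w_file * file_risk
              + w_method * method_risk) / total_weight
    return 100.0 * p_event * w_activity * impact


if __name__ == "__main__":
    # Privileged user, sensitive files, USB drive marked as a high risk channel.
    print(round(behavior_risk(0.9, 1.0, 0.8, 0.9, 1.0), 1))
```

Dividing by the sum of the entity weights keeps the score bounded, so the weights can be tuned per behavior pattern without pushing the result past 100%.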
Important Questions
Who or what is behaving abnormally?
Who is stealing my stuff?
Where is my important, at risk stuff?
Who is going to leave the company?
Some Simple Anomaly Models
Who or what is behaving abnormally?
- Person Name is accessing information during unusual working hours.
- Person Name accessed a storage volume, path, an unusually large number of times.
- Person Name accessed an important file type an unusually large number of times.
- Riskiest Users
Who is going to steal my stuff?
- Person Name accessed an abnormally large amount of data.
- Person Name performed an abnormally large number of file exits.
Where is my important, at risk stuff?
- Riskiest Files
Who is going to leave the company?
More Sophisticated Anomaly Models
Who or what is behaving abnormally?
- Person Name is using an unexpected file, filename.
- Person Name is touching an unexpected set of files.
- Person Name is consistently accessing higher amounts of data than similar users.
- Person Name is consistently accessing an important file type more than similar users.
- Person Name is accessing information during different working times compared to similar users.
- An application accessed an unexpected file type.
Who is going to steal my stuff?
- Person Name has accessed an unusual amount of total file value.
- Person Name is consistently performing more file exits than similar users.
- Person Name's amount of file exits varies more than similar users.
- Person Name has replicated a large amount of source code.
Where is my important, at risk stuff?
- Highest at-risk machines, file shares, and source code repositories
- The file, Filename, is highly valuable compared to similar files.
- The following source code projects are most at-risk.
- Similar users visualization
- Similar files visualization
- Similar machines visualization
Who is going to leave the company?
- Person Name is hoarding an unusual amount of source code.
- Person Name has been accessing unexpected source code repositories.
- Person Name is engaging in job search activities.
- The proportion of time spent by Person Name on non-work activities has changed.
- Person Name has emailed themselves.
Computing Probability of an Anomalous Event
Each term in the aggregate behavior risk equation has analytics behind it
Highly anomalous activities, compared to baseline, should result in a high value
How to compute the probability of an anomalous event?
$$R_{\text{behavior}} = P(\text{event} \mid y)\, w_y \left( \frac{w_u \sum_{u \in U} 2^{i} R_u[i] + w_f \sum_{f \in F} 2^{j} R_f[j] + w_m \sum_{m \in M} 2^{k} R_m[k]}{w_u + w_f + w_m} \right)$$
Model: Unusual volumes
Computes probability that a value in a given hour is anomalous
- Bayesian approach
Explicitly models both normal and abnormal distributions
- Gaussian, Gamma
Estimators for both normal and abnormal based on observation
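The Bayesian two-distribution idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production model: it assumes the normal hourly volume is Gaussian and the abnormal volume is Gamma (the slide names both distributions but not which is which), fits the Gamma by the method of moments, and uses a hypothetical prior of 1% for abnormal hours. All function names are made up for the example.

```python
import math
import statistics


def gaussian_pdf(x, mean, stdev):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / stdev
    return math.exp(-0.5 * z * z) / (stdev * math.sqrt(2.0 * math.pi))


def gamma_pdf(x, shape, scale):
    """Density of a Gamma distribution, computed in log space for stability."""
    if x <= 0:
        return 0.0
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)


def fit_gamma_moments(samples):
    """Method-of-moments estimate of (shape, scale) for a Gamma distribution."""
    mean = statistics.fmean(samples)
    variance = statistics.variance(samples)
    return mean * mean / variance, variance / mean


def p_anomalous(x, normal_samples, abnormal_samples, prior_abnormal=0.01):
    """Posterior probability, by Bayes' rule, that hourly volume x is abnormal."""
    mean = statistics.fmean(normal_samples)
    stdev = statistics.stdev(normal_samples)
    shape, scale = fit_gamma_moments(abnormal_samples)
    weighted_abnormal = prior_abnormal * gamma_pdf(x, shape, scale)
    weighted_normal = (1.0 - prior_abnormal) * gaussian_pdf(x, mean, stdev)
    evidence = weighted_abnormal + weighted_normal
    # If both likelihoods underflow to zero, fall back to the prior.
    return weighted_abnormal / evidence if evidence > 0 else prior_abnormal


if __name__ == "__main__":
    baseline = [90, 95, 100, 105, 110, 98, 102]   # files per hour, typical
    incidents = [400, 600, 900, 1500]             # files per hour, known bad
    for volume in (100, 800):
        print(volume, p_anomalous(volume, baseline, incidents))
```

Modeling the abnormal distribution explicitly, rather than thresholding on the normal one alone, is what lets the output be read as a probability and fed into an aggregate risk score.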
Example: Modeling unusual times
Monitor, for each user, start times of when a file or window is brought into focus
Active times used as input into Gaussian kernel density estimators
Times that contain 95% of activity deemed to be normal
P(y is bad) at a given time is the ratio of expected activity to the 95% activity line
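The unusual-times model can be sketched with a hand-rolled Gaussian kernel density estimator. The slide does not spell out the exact ratio, so this sketch makes an assumption: the "95% activity line" is taken to be the density level that 95% of observed start times meet or exceed, and the score is one minus the ratio of the estimated density to that line, clipped at zero. Times are hours on a 0 to 24 scale and wrap-around at midnight is ignored. The bandwidth and all names are hypothetical.

```python
import math


def kde(x, samples, bandwidth):
    """Gaussian kernel density estimate at x."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm


def density_threshold(samples, bandwidth, coverage=0.95):
    """Density level that `coverage` of the observed times meet or exceed."""
    densities = sorted((kde(s, samples, bandwidth) for s in samples), reverse=True)
    index = max(0, math.ceil(coverage * len(densities)) - 1)
    return densities[index]


def p_unusual_time(hour, samples, bandwidth=0.5, coverage=0.95):
    """Score in [0, 1]: 0 inside the normal region, near 1 far outside it."""
    threshold = density_threshold(samples, bandwidth, coverage)
    ratio = kde(hour, samples, bandwidth) / threshold
    return max(0.0, 1.0 - ratio)


if __name__ == "__main__":
    # Hypothetical focus-event start times for one user, in hours of the day.
    starts = [9.0, 9.5, 10.0, 10.5, 11.0, 13.0, 14.0, 15.0, 16.0, 17.0]
    for hour in (10.0, 3.0):
        print(hour, p_unusual_time(hour, starts))
```

A kernel density estimate fits each user's own schedule, so an early riser and a night owl each get their own notion of normal.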
Model: Unusual Working Days
User 1
- Regularly works six days a week (takes Sundays off)
- Slight dip during lunches
User 2
- Works five days a week
- Particularly active on Thursdays
Model: Unusual Working Hours
User 1
- Starts work fairly early in morning
- Early lunch break
- Sometimes works past midnight
User 2
- Doesn't work as long hours as User 1
- 9-to-5er
- Has occasionally worked a little bit after 8pm
Model: Clustering Unusual Entities
Clusters are created based on observed behaviors of a target set of entities
- Users, Machines, Assets
Clusters are created for like behaviors & outliers are anomalous
- User actions
- Access to data
- Applications open/run
- File actions
Reduce False Positives
Increase risk of an entity (e.g. user) based on probability, severity, risk and recency of observed behavioral events (anomalies, violations, exfiltrations)
Allows real-time aggregation or correlation of multiple event models
Reduces false positives and noise
Example: John Sneakypants is accessing an unusual, important network share at a time of day he was almost never active at before and took from a source code project that has been inactive for months and just copied an unusual amount of sensitive files to a USB drive
Risk scores: 25, 46, 80, 96
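One way to picture how several weak signals add up to one strong one is the sketch below. It is an assumed scheme, not the formula behind the scores on the slide: events are combined with a noisy-OR, each event contributes probability times severity, and older events fade with a hypothetical half-life of 7 days. The `Event` class and `entity_risk` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Event:
    probability: float  # probability the event is anomalous, in [0, 1]
    severity: float     # impact if it is real, in [0, 1]
    age_days: float     # how long ago the event was observed


def entity_risk(events, half_life_days=7.0):
    """Combine events into a risk score in percent using a noisy-OR."""
    remaining = 1.0  # probability that none of the events is a real problem
    for event in events:
        recency = 0.5 ** (event.age_days / half_life_days)
        contribution = event.probability * event.severity * recency
        remaining *= 1.0 - contribution
    return 100.0 * (1.0 - remaining)


if __name__ == "__main__":
    observed = [
        Event(0.5, 0.5, 0.0),  # unusual, important network share
        Event(0.6, 0.5, 0.0),  # unusual time of day
        Event(0.9, 0.8, 0.0),  # took from a long-inactive source code project
        Event(0.9, 0.9, 0.0),  # copied sensitive files to a USB drive
    ]
    for count in range(1, len(observed) + 1):
        print(count, round(entity_risk(observed[:count]), 1))
```

With this kind of aggregation a single odd event stays low on the list, while a run of related events on the same entity rises quickly, which is how correlation cuts noise.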
Real World Example
Analyzed a large semiconductor developer community (>20,000 developers) to look for behavioral indicators of risk
Identified 2 known source code thieves and leavers
Identified 11 previously unknown threats
- 2 confirmed: terminated
- 1 confirmed: is currently under investigation
- 8 Chinese employees replicating 600,000 to nearly 15,000,000 files per day. Currently under investigation
Visualization of Interset Cluster
Leaver 1
Dots = source code projects
Lines connecting dots = developers using those projects
Effective Behavioral Analytics
Bad
- Rules-based alerts alone
- Classification systems alone
- Simple mean/standard deviation based thresholds, generic anomaly detection
- Hard decision boundaries
-> Flood of alerts, hard to deploy, scale and maintain
Good
- Probability-based anomaly + cost-based models
- Machine learning models
- Robust models (handle outliers, big data, respond to change)
- Numerical scores
-> Less noise, easier to deploy and scale, ability to focus on top n incidents, POI, etc.
Pulling it all together
Big Data Analytics in Security
Adaptive Analysis: Responding to context
Continual Analysis: Responding to local change/feedback
Optimization under Uncertainty: Quantifying or mitigating risk
Optimization: Decision complexity, solution speed
Predictive Modeling: Causality, probabilistic, confidence levels
Simulation: High fidelity, games, data farming
Forecasting: Larger data sets, nonlinear regression
Alerts: Rules/triggers, context sensitive, complex events
Query/Drill Down: In memory data, fuzzy search, geo spatial
Ad hoc Reporting: Query by example, user defined reports
Standard Reporting: Real time, visualizations, user interaction
Entity Resolution: People, roles, locations, things
Relationship, Feature Extraction: Rules, semantic inferencing, matching
Annotation and Tokenization: Automated, crowd sourced
We are here.
Source: Competing on Analytics, Davenport and Harris, 2007
Future of Big Data Analytics in Security
Intelligent Sensors and Ubiquitous Data Sources
- Desktops and Servers
- Mobile
- Cloud
- Social Networks
- Open Data, External Data, IOCs
- Reputation and Risk Services
Enterprise to Global Systems
Behavioral and Threat Analytics Platform
- Forensic Analysis
- Risk Modeling
- Anomaly Detection
- Entity Resolution
- Behavioral Simulation
- Behavioral Prediction
- Threat Response Optimization
Advanced Threat Detection and Response
- What happened?
- How many, how often?
- Where is the risk and threat?
- How can this threat be contained?
- How can we prevent this?
- What will happen next?
- What is the best possible response to this threat?
Thank You! Questions?
Upload your logs, try out our math
Cloud-hosted Threat Analysis