BIG DATA & Forensics
Katrin Franke, PhD
Norwegian Information Security Laboratory, Gjøvik, Norway
Computational Forensics
Katrin Franke
Professor of Computer Science, 2010
PhD in Artificial Intelligence, 2005
MSc in Electrical Engineering, 1994
Industrial Research and Development (20+ years): Financial Services and Law Enforcement Agencies
Courses, Tutorials and Post-graduate Training: Police, BSc, MSc, PhD
Founding Chair, IAPR/TC6 Computational Forensics
IAPR* Young Investigator Award, 2009
* International Association for Pattern Recognition
kyfranke.com
NISlab @ Gjøvik University College
52 persons: 12 permanent staff, 15 part-timers, 2 postdocs, 20 Ph.D. students, and 3 administrative staff
4 study programs: B.Sc. (40), M.Sc. (80) and Ph.D. in Information Security; B.Sc. (30) in Network and System Administration
1 national research school: Computer and Information Security (COINS)
Externally funded projects: NFR, EU FP7, NIST
2 focus laboratories: BiometricsLab and TestimonLab
Joint Forces: Center for Cyber & Information Security
Norwegian National Security Authority (NSM)
Directorate of Police (Politidirektoratet)
National Criminal Investigation Service (Kripos)
National Police Computing and Material Service (PDMT)
Norwegian National Authority for Investigation and Prosecution of Economic and Environmental Crime (Økokrim)
Norwegian Police Security Service (PST)
Police Academy (Politihøgskolen)
National ID Centre (NID)
Norwegian Cyber Force (Cyberforsvaret)
Norwegian Defence Research Establishment (FFI)
Telenor, Pricewaterhouse Coopers (PwC), Statkraft, Statnett and Eidsiva, Oppland County
Publicly announced: 11 June 2013
NISlab Working Areas
Biometrics: User Authentication, BTA Protocol
Forensics: Forensic Readiness, Incident Response, Investigation/Analysis
Security Management: Risk-based Design, Security Economics, System/Adversary Modeling, Human Factors, Policies
Security Technology: Software Security, System Administration, Network and Critical Infrastructure Protection
Testimon (lat. evidence): Computational & Digital Forensics: Fraud Detection, Analysis and Prevention
Underlying thoughts
"Without deviation from the norm, progress is not possible." [Frank Zappa]
"Trust is good, control is better." [Lenin]
"Freedom is the state in which an individual cannot be exposed to the despotism of others." [Anonymous]
"Tell me and I forget, teach me and I may remember, involve me and I learn." [Benjamin Franklin]
Internet Adoption
247 billion emails per day
234 million websites
5 billion mobile-phone users
50 billion smart things with sensing and communication capabilities that collect data
BIG Data Phenomenon: Volume, Velocity, Variety
Cyber Crime
Cyber Crime Offenses & Costs
Report of the Belgian Economic and Financial Crimes Division (DJF)
Online crime complaints and dollar loss in the United States (IC3, 2010)
European Commission, Directorate-General Home Affairs, Directorate Internal Security, Unit A.2: Organised Crime
RAND Corporation, "Feasibility Study for a European Cybercrime Centre," Technical Report 1218, 2012. Prepared for the EC.
German Federal Criminal Police Office, Annual Situation Report on Cybercrime, 2009 and 2010
Types of Cyber Crime
Alkaabi, A., G. M. Mohay, A. J. McCullagh and A. N. Chantler (2010). "Dealing with the problem of cybercrime", Conference Proceedings of the 2nd International ICST Conference on Digital Forensics & Cyber Crime, 4-6 October 2010, Abu Dhabi.
Forensic Science
Forensic methods consist of multi-disciplinary approaches to perform the following tasks:
  Investigate and reconstruct a crime scene or the scene of an accident,
  Collect and analyze the trace evidence found,
  Identify, classify, quantify, and individualize persons, objects, and processes,
  Establish linkages, associations and reconstructions, and
  Use those findings in the prosecution or the defense in a court of law.
So far, forensic science has mostly dealt with previously committed crime; a greater focus is now on preventing future crime.
Challenges & Demands in Forensic Investigations
Challenges
  Tiny pieces of evidence are hidden in a mostly chaotic environment,
  Trace study to reveal specific properties,
  Traces found will never be identical,
  Reasoning and deduction have to be performed on the basis of partial knowledge, approximations, uncertainties and conjectures.
Demands
  Objective measurement and classification,
  Robustness and reproducibility,
  Secure against falsifications.
Strengthening Forensic Science in the United States: A Path Forward
Committee on Identifying the Needs of the Forensic Sciences Community, National Research Council
ISBN: 0-309-13131-6, 352 pages, 6 x 9 (2009)
This PDF is available from the National Academies Press at: http://www.nap.edu/catalog/12589.html
Cyber Crime and Forensics
The knowledge and intuition of the human examiner play a central role in daily casework.
Courtroom forensic testimony is often criticized by defense lawyers as lacking a scientific basis.
Evidence is increasingly data-intensive and widely distributed:
  It is common practice to seize all data carriers, which amounts to many terabytes of data.
  Seized data is enriched with data available on the Internet, social networks, etc.
The huge amount of data, tight operational times, and data linkage pose challenges.
Implement Legal Framework and Standards
Add Efficiency and Intelligence to Investigations
Computational Forensics, aka applying Artificial Intelligence in Forensic Sciences
Computational Forensics - Objectives
Study and development of computational methods to
  assist in basic and applied research, e.g. to establish or prove the scientific basis of a particular investigative procedure,
  support the forensic examiner in their daily casework.
Modern crime investigation shall profit from the hybrid intelligence of humans and machines.
Computational Forensics - Definition
It is understood as the hypothesis-driven investigation of a specific forensic problem using computers, with the primary goal of discovery and advancement of forensic knowledge.
CF works towards:
1. In-depth understanding of a forensic discipline,
2. Evaluation of the basis of a particular scientific method, and
3. A systematic approach to forensic sciences by applying techniques of computer science, applied mathematics and statistics.
It involves modeling and computer simulation (synthesis) and/or computer-based analysis and recognition.
Computational vs. Computer (Digital) Forensics
Computational Forensics uses computational sciences to study any type of evidence:
  Computer forensics
  Crime Scene Investigation
  Forensic paleography
  Forensic anthropology
  Forensic chemistry
Computer Forensics studies digital evidence:
  File-system forensics
  Live-system forensics
  Mobile-device forensics
  etc.
Requirement of Forensic-Computing Infrastructure
KEY FEATURES: security - scalability - flexibility
Cell-Level Security (one element of the ecosystem's end2end trust assurance framework)
Unprecedented Scale (tens of PBs)
Multi-Structured Data Analytics
Automated Indexing @Ingest
msec. Ingest rates
3-in-1 database: Column, Document & Graph Store
Statistics, SQL plus Full-Text & Graph Search
WANDA Architecture
Plug-In Concept
Testimon FDS 3 (Forensic-Data Store & Secure Services)
[Architecture diagram. Component labels as they appear on the slide: FDS 3 end2end TA, Encryption at Rest, Encryption-in-Motion, e2eta Audit, Policy & Labeling Engines, e2eta IdM Integration, Data Structures, Documents (JSON), ReLIfE, Languages, FDS 3 CORE, Analytics, Graphs, Thrift, FDS 3 Iterators, Interfaces, Processing, D3, Demos, Indexing, Tools, FDS 3 Data Loaders, Flume, FDS 3 Ingest, Lucene, PDS-QL, MapReduce Connector, Pig Connector, Apache Accumulo, Advanced Analytics, Hadoop Distributed File System (HDFS), Commodity Hardware, Private Cloud, Public Cloud]
Requirement of Adapted Computational Methods
Proactive, ultra-large-scale forensic investigations: Computational Forensics
  Situation-aware methods
  Quantified, measurable indicators
  Adaptive, self-organizing models
  Distributed, cooperative, autonomous
Computational Intelligence
  NN: Neural Networks (brain)
  FL: Fuzzy Logic (reasoning under imprecision, uncertainty, partial truth)
  EC: Evolutionary Computation (natural evolution)
Data-driven Approaches: BIG DATA Analytics
[Figure: Inter-relation of feature complexity and expected recognition accuracy (Franke 2005)]
Application Example: Network Intrusion Detection
Data: 10% of the overall (5 million instances) KDD CUP 99 test data set for intrusion detection systems, which contains normal traffic and 4 attack classes (DoS, Probe, U2R, R2L).
Consider 4 data subsets of the KDD CUP 99:

  Data Set          Number of Instances
  Normal & DoS      488,736
  Normal & Probe    138,391
  Normal & U2R       97,330
  Normal & R2L       98,404

Feature selection: Opt-CFS & Opt-mRMR
Classifiers: C4.5 & Bayesian Network
[Figures: Number of Selected Features; Achieved Recognition Performance]
Reference: Nguyen, Franke, Petrović (2009-2012)
Towards a Generic Feature-Selection Measure for Intrusion Detection Hai Thanh Nguyen, Katrin Franke and Slobodan Petrović Norwegian Information Security Laboratory (NISlab) Gjøvik University College www.nislab.no
Model for Pattern Recognition
Classification: Test pattern -> Preprocessing -> Feature Measurement -> Classification
Training: Training pattern -> Preprocessing -> Feature Extraction / Selection -> Learning
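The two paths of this model can be made concrete with a minimal sketch. It is an illustration only: the scaling step, the nearest-centroid learner, and all function names (`preprocess`, `select_features`, `learn`, `classify`) are assumptions chosen for brevity, not components named on the slides.

```python
from collections import defaultdict
from math import dist


def preprocess(pattern):
    """Preprocessing: scale a raw pattern by its largest magnitude (assumed choice)."""
    peak = max(abs(v) for v in pattern) or 1.0
    return [v / peak for v in pattern]


def select_features(pattern, selected):
    """Feature extraction / selection: keep only the chosen feature indices."""
    return [pattern[i] for i in selected]


def learn(training_patterns, labels, selected):
    """Learning: compute one centroid per class (nearest-centroid classifier)."""
    groups = defaultdict(list)
    for pattern, label in zip(training_patterns, labels):
        groups[label].append(select_features(preprocess(pattern), selected))
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def classify(test_pattern, centroids, selected):
    """Classification: same preprocessing and features, then nearest centroid."""
    features = select_features(preprocess(test_pattern), selected)
    return min(centroids, key=lambda label: dist(features, centroids[label]))


# Toy data, invented for illustration: three raw measurements per pattern.
training = [[10, 1, 5], [8, 1, 4], [1, 9, 5], [2, 10, 4]]
labels = ["normal", "normal", "attack", "attack"]
model = learn(training, labels, selected=[0, 1])
print(classify([9, 2, 5], model, selected=[0, 1]))
```

The point of the sketch is that the test path must reuse exactly the preprocessing and the feature subset fixed during training; the feature-selection slides that follow concern how that subset is chosen.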
Feature Selection Methods
Wrapper Methods
Filter Methods
  Correlation Feature Selection (CFS) measure
  Minimal-Redundancy-Maximal-Relevance (mRMR) measure
  Generic Feature Selection measure (GeFS)
Embedded Methods
Motivation
"A lot of popular algorithms are not principled and it is difficult to understand what problem they seek to solve and how optimally they solve it." (Isabelle Guyon, 2005)
Many feature selection algorithms exist and perform well in many applications. Should we delay creating new ones and instead try to gain a better understanding, e.g. regarding:
  The ability of feature selection measures to generalize.
  The impact of feature selection methods, such as filter methods, on the accuracy of classifiers.
There is a need for more effective procedures that ensure globally optimal feature subsets.
Our Research Focus
1. Generalization of several feature selection measures.
2. Optimization to derive globally optimal feature subsets.
We consider the CFS measure (Hall, 1999) and the mRMR measure (Peng, 2005) for intrusion detection because:
  Filter methods are usually used to select features from high-dimensional data sets, such as those in intrusion detection systems.
  Both the relevance of features and the relationship between features are considered.
  The relevance and relationship are usually characterized in terms of correlation (CFS) or mutual information (mRMR).
CFS and mRMR Feature Selection
Correlation feature selection (CFS) measure
  Class-feature correlation
  Feature-feature correlation
Feature-selection measure based on mutual information (mRMR)
  Class-feature mutual information
  Feature-feature mutual information
M. Hall. Correlation Based Feature Selection for Machine Learning. Doctoral Dissertation, University of Waikato, Department of Computer Science, 1999.
H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on PAMI, Vol. 27, No. 8, pp. 1226-1238, 2005.
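The slide's formulas did not survive in the text, so the following is a minimal sketch of the two measures in their commonly published forms: Hall's merit, k * mean(r_cf) / sqrt(k + k(k-1) * mean(r_ff)), and the mRMR difference criterion, mean relevance minus mean pairwise redundancy. The function names, argument layout, and toy numbers are assumptions, and the correlations and mutual-information values are taken as precomputed, nonnegative inputs.

```python
from itertools import combinations
from math import sqrt


def cfs_merit(class_corr, feat_corr, subset):
    """CFS merit of a feature subset.

    class_corr[i]   : class-feature correlation of feature i
    feat_corr[i][j] : feature-feature correlation (symmetric matrix)
    Equivalent to k * mean(r_cf) / sqrt(k + k(k-1) * mean(r_ff)).
    """
    k = len(subset)
    relevance = sum(class_corr[i] for i in subset)
    redundancy = sum(feat_corr[i][j] for i, j in combinations(subset, 2))
    return relevance / sqrt(k + 2.0 * redundancy)


def mrmr_score(class_mi, feat_mi, subset):
    """mRMR score: mean class-feature MI minus mean feature-feature MI.

    The redundancy term averages over all ordered pairs in the subset,
    including i == j (an assumption about the exact normalisation).
    """
    k = len(subset)
    relevance = sum(class_mi[i] for i in subset) / k
    redundancy = sum(feat_mi[i][j] for i in subset for j in subset) / (k * k)
    return relevance - redundancy


# Toy values, invented for illustration.
class_corr = [0.8, 0.6, 0.1]
feat_corr = [[1.0, 0.5, 0.0],
             [0.5, 1.0, 0.0],
             [0.0, 0.0, 1.0]]
for subset in ([0], [0, 1], [0, 1, 2]):
    print(subset, round(cfs_merit(class_corr, feat_corr, subset), 4))
```

Both measures reward features that relate strongly to the class and penalise features that relate strongly to each other; they differ in the statistic used (correlation versus mutual information) and in how the two terms are combined (ratio versus difference).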
Generic Feature Selection (GeFS)
Question: Can the CFS measure and the mRMR measure be fused and generalized into a generic feature selection measure?
Definition 1: A generic feature selection (GeFS) measure is defined as follows: [equation not preserved in the slide text]
Proposition 1: The CFS and the mRMR measures are instances of the GeFS measure.
Proposition 2: Feature selection by means of the GeFS measure is a polynomial mixed 0-1 fractional programming (PM01FP) problem.
Problem Transformation
Chang's method for solving PM01FP
  Linearize the PM01FP problem into a mixed 0-1 linear programming (M01LP) problem.
  Number of variables & constraints: n^2
  Solve with the Branch and Bound algorithm.
Our method for solving PM01FP
  Linearize the PM01FP problem differently into a mixed 0-1 linear programming (M01LP) problem.
  Number of variables & constraints: 4n+1
  Solve with the Branch and Bound algorithm.
C.-T. Chang. On the polynomial mixed 0-1 fractional programming problems. European Journal of Operational Research, vol. 131, issue 1, pages 224-227, 2001.
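To illustrate what "globally optimal" means here, the sketch below writes the CFS merit over a 0-1 indicator vector x and compares exhaustive enumeration with a greedy forward search. This is not the linearization or the Branch and Bound procedure from the slides: brute force is swapped in because it is short and exact for tiny n, and the function names and toy correlations are assumptions.

```python
from itertools import combinations, product
from math import sqrt


def cfs_objective(x, class_corr, feat_corr):
    """CFS merit over a 0-1 vector x (x[i] = 1 keeps feature i).

    Squaring this value gives a ratio of polynomials in x, which is why
    the selection task can be posed as a 0-1 fractional program.
    """
    n = len(x)
    numerator = sum(class_corr[i] * x[i] for i in range(n))
    denominator = sum(x) + 2.0 * sum(feat_corr[i][j] * x[i] * x[j]
                                     for i, j in combinations(range(n), 2))
    return numerator / sqrt(denominator)


def exhaustive_best(class_corr, feat_corr):
    """Global optimum by enumerating all 2^n - 1 non-empty subsets (small n only)."""
    n = len(class_corr)
    candidates = (x for x in product((0, 1), repeat=n) if any(x))
    return max(candidates, key=lambda x: cfs_objective(x, class_corr, feat_corr))


def greedy_forward(class_corr, feat_corr):
    """Heuristic: add the single best feature until the merit stops improving."""
    n = len(class_corr)
    x, best = [0] * n, 0.0
    while True:
        trials = []
        for i in range(n):
            if not x[i]:
                trial = x.copy()
                trial[i] = 1
                trials.append((cfs_objective(trial, class_corr, feat_corr), trial))
        if not trials:
            break
        score, trial = max(trials)
        if score <= best:
            break
        best, x = score, trial
    return tuple(x)


# Toy case, invented: feature 0 is individually strongest but redundant
# with features 1 and 2, which are mutually uncorrelated.
class_corr = [0.7, 0.65, 0.65]
feat_corr = [[1.0, 0.9, 0.9],
             [0.9, 1.0, 0.0],
             [0.9, 0.0, 1.0]]
print("greedy:    ", greedy_forward(class_corr, feat_corr))
print("exhaustive:", exhaustive_best(class_corr, feat_corr))
```

On this toy input the greedy search commits to feature 0 and stops, while enumeration finds the better pair of complementary features. Enumeration grows as 2^n, which is the motivation for reformulating the problem so that a linear-programming solver with Branch and Bound can handle realistic feature counts.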
Application Example: Malicious Code Detection
Static analysis
System artifacts
Dynamic analysis
Debugging
Analyzing malicious content: PDFs, JavaScripts, Office documents, Shell code, Network traffic
Behavioral Malware Analysis (dynamic) via Information-based Dependency Matching: 98.4% detection rate
Malicious PDF detection (data set: 7,454 unique benign and 16,296 unique malicious PDFs): 97.7% detection rate
Reference: Sand, Kittilsen, Franke (2011-2013)
Application Example: Author Identification from Text-based Communications
Determining the authorship of an anonymous text
Enron dataset: real emails of Enron employees; contains 255,636 emails from 87,474 authors.
Reference: Chitrakar, Franke (2011-2013)
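The slide does not say which method was used, so the following is only a stand-in to show the shape of the task: each known author gets a character-trigram profile, and the anonymous text is attributed to the author whose profile is closest by cosine similarity. The technique, the function names, and the sample texts are all assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Character n-gram counts after lowercasing and whitespace normalisation."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def attribute(anonymous_text, known_texts):
    """Return the known author whose n-gram profile is most similar."""
    profile = char_ngrams(anonymous_text)
    return max(known_texts,
               key=lambda author: cosine(profile, char_ngrams(known_texts[author])))


# Invented sample texts, not taken from the Enron dataset.
known = {
    "alice": "Please find attached the quarterly report. Kind regards, Alice",
    "bob": "hey guys, gonna be late again lol, see ya at the game",
}
print(attribute("Please find attached the revised report. Kind regards", known))
```

With tens of thousands of candidate authors, as in the Enron figures above, a pairwise comparison like this would need both stronger features and a scalable index; the sketch only fixes the input and output of the problem.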
Demand: Automatization, Standardization, and Benchmarking
Increase Efficiency and Effectiveness
Perform Method / Tool Testing regarding their Strengths/Weaknesses and their Likelihood Ratio
Gather, manage and extrapolate data, and synthesize new Data Sets on demand
Establish and implement Standards for data, work procedures and journal processes
Fulfillment of Daubert Standard
http://en.wikipedia.org/wiki/daubert_standard
Demand: Joint Research & Development
Education and training
Revealing the state of the art in *each* domain
Sources of information on events, activities and financing opportunities
International forum for peer review and exchange, e.g., IWCF workshops
Performance evaluation, benchmarking, proof and standardization of algorithms
Resources in the form of data sets, software tools, and specifications, e.g. data formats
Demand: Legal Framework
Law as framework for ICT
  Evidence acquisition and storage
  Culture, social behaviors, privacy aspects
  Cross-jurisdiction cooperation, European / International cyberlaw
Law as content of ICT
  Automation, programming of legal rules
  Methods for dimensionality reduction: loss of relevant information
  Questions on extracted numerical parameters: loss of information due to inappropriate features
  Reliability of applied computational method / tool
  Dealing with final conclusions based on wrong computational results
Perspectives on Forensics & Digital Evidence Legal / Regulations Technological / Security / Archival Knowledge / Capacity Building / Training Public Awareness (pedagogical methods) Organizational / Information Management / Procedures
Cloud Forensics - Vision
Reactive -> Proactive
Discrete Event -> Continuous Monitoring
Accountability/Auditing -> Forensic Readiness
Forensic Readiness Retro-fitted -> Forensics-by-Design
Concluding Remarks
"It is better to know some of the questions than all of the answers." [James Thurber]
"We all do better when we work together. Our differences do matter, but our common humanity matters more." [Bill Clinton]
"You are never given a wish without also being given the power to make it come true." [R. Bach]