ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION
Francine Forney, Senior Management Consultant, Fuel Consulting, LLC
May 2013



DATA AND ANALYSIS INTERACTION

Understanding the content, accuracy, source, and completeness of data is critical to the successful analysis of intelligence information. Without comprehensive and systematic documentation of all relevant data, an analyst cannot create effective search strategies or make accurate intelligence assessments. The problem is compounded by automated tools and techniques, which essentially treat all data as equal.

Although data sets are the foundation of all intelligence analysis, they are the least understood and most overlooked aspect of the process. In part this may be attributed to the assumption that all data acquired through legitimate sources are essentially the same, albeit influenced by bias stemming from the analyst's individual experience and knowledge of specific intelligence disciplines. The human tendency to favor what we know and understand affects the analyst's choices at each stage of analysis and may therefore affect the accuracy of the overall assessment. This paper presents insights into ways to improve data characterization and thus the accuracy of intelligence analysis.

DATA: THE BEGINNING

Data characterization begins long before the analyst sees the data or any tool manipulates it: it starts at the source and continues through transmission, ingest, formatting, standardization, processing, documentation, and the methods of manipulation, presentation, search, and analysis. It involves a myriad of skill disciplines, including those of the intelligence collection manager, data manager, software engineer, extract-transform-load technician, hardware engineer or architect, documentation specialist, infrastructure manager, analytic tools and techniques implementer, computer support specialist, and, last but not least, the intelligence analyst.

The multiple influences on the data set before the analyst sees it are, in fact, part of the problem; there is an assumption that the analyst need only define his or her analytic requirements and tools, and that other specialists can thereafter meet those specifications using best judgment. Understanding data, like analysis, is an interactive process. It is impossible to define data requirements successfully without analyzing and understanding the variety of potential sources of similar, if not identical, data, particularly given the ever-expanding global communications infrastructure. Like data characterization itself, this is not a static process. Rather, it is an iterative one that involves all the individuals who touch the data or make any decision affecting the data available to the analyst.

DATA CHARACTERIZATION PROCESS PHASES

The first phase of data characterization involves determining what detailed information should be systematically retained for all acquired intelligence data. As noted, this is not an incidental phase, and it may change over time as techniques evolve and knowledge is gained about the value of specific data and relational correlations across data sets. This data documentation phase must include the participation of the end users of the data--the intelligence analysts as well as the technical specialists. Also important is ensuring that the skill sets of the intelligence

analysts involved are representative of the types of analysis performed by the organization: current threat analysis; strategic or long-term trend analysis; combat support; situational awareness or alerts for the newest information; target watch listing; geo-locational or geospatial support; and so on. Each organization will have a subset of these analytic functions, and while some data characterization documentation requirements will overlap, others will be of unique value for a given function. As a result, the priority of what is most important will change accordingly. Examples of data documentation that should be retained include the following:

- date of data collection and date of data delivery
- source of data
- confidence factor for data source (direct observation, second- or third-hand, analytic assumption, document-derived, collection bias, etc.)
- data set completeness
- size of data set
- data attributes contained in the data (phone numbers, names, passport serials, etc.)
- whether specific data fields are standardized and, if so, which standard is employed
- countries or nationalities represented and quantities of each attribute
- specific restrictions on data handling (time limitations, U.S. person, etc.)
- classification of data
- analytic category of data (travel, financial, identity, biometric, etc.)
- potential redundancy of data source
- frequency of data delivery (live streaming, daily, weekly, etc.)
- any observed operational, system, or processing issues relevant to analysts
- graphical displays of data that enhance the analyst's ability to grasp the characteristics of a large data volume quickly (heat maps, bar charts of geographical coverage, etc.)
- any other information that would help the analyst interpret the data accurately
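A minimal sketch of how documentation items like those above might be captured as a structured record per data set. The field names and the example feed are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetCharacterization:
    """Illustrative per-data-set documentation record."""
    source: str                   # originating collector or feed
    collected_on: date            # date of data collection
    delivered_on: date            # date of data delivery
    source_confidence: str        # e.g. "direct observation", "second-hand"
    record_count: int             # size of data set
    attributes: list = field(default_factory=list)          # e.g. ["name", "passport"]
    standardized_fields: dict = field(default_factory=dict) # field -> standard used
    handling_restrictions: list = field(default_factory=list)
    classification: str = "UNCLASSIFIED"
    delivery_frequency: str = "daily"   # "streaming", "daily", "weekly", ...
    known_issues: list = field(default_factory=list)

# An entry an ingest technician might record (hypothetical feed):
travel_feed = DatasetCharacterization(
    source="border-crossing feed (hypothetical)",
    collected_on=date(2013, 4, 1),
    delivered_on=date(2013, 4, 2),
    source_confidence="document-derived",
    record_count=125_000,
    attributes=["name", "passport", "nationality"],
    standardized_fields={"date": "ISO 8601"},
)
print(travel_feed.source_confidence)  # document-derived
```

Keeping such records alongside each data set gives both analysts and technical specialists a shared, queryable picture of what a repository actually contains.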

The second, though non-sequential, phase of data characterization consists of determining how the data may be manipulated by the analyst and what tool or technique will be employed to help the analyst derive knowledge from the data. Ensuring that the data is processed and maintained in a way that extracts the maximum intelligence value requires an understanding of how an analyst will search data repositories, correlate key data attributes across diverse data sets, identify new, timely data facts, or create relational linkages among a variety of attributes or data sets. While there is no guarantee that important intelligence facts will not be missed, the probability that intelligence assessments will be incomplete increases if data characterization is not comprehensive or if analytic functions and techniques are not tailored to the data.

Methods of data manipulation include both manual and automated tools and techniques. An analyst manually creates a search query by determining how to structure a question so that it retrieves the subset of relevant data needed to contribute to an intelligence assessment. While the tool may be composed of algorithms that automatically process a search query, it is the analyst who must build the query so that it returns all the relevant data. That process could include using variations in the spelling of a name or using wild-card symbols. Some tools may return name-spelling variations or minor misspellings via fuzzy logic, but others will not. Consequently, when first ingesting and processing data sets that contain personal or place names, it is important to determine how name variations will be handled and how much automation will be built into the capabilities.
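A minimal sketch of the kind of fuzzy name matching described above, using Python's standard-library `difflib`. The names and the threshold value are illustrative; real systems use purpose-built name-matching algorithms:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude similarity score between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_search(query: str, names: list, threshold: float = 0.8) -> list:
    """Return names whose similarity to the query meets the threshold.

    Lowering the threshold returns more candidates (fewer misses, more
    false positives); raising it does the opposite.
    """
    return [n for n in names if name_similarity(query, n) >= threshold]

watchlist = ["Mohammed al-Rashid", "Muhammad al-Rashid", "John Smith"]
print(fuzzy_search("Mohamed al-Rashid", watchlist, threshold=0.8))
# ['Mohammed al-Rashid', 'Muhammad al-Rashid']
```

The single `threshold` parameter is where the organization's risk tolerance shows up in code: it directly trades missed matches against spurious ones.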

Another important factor in these decisions is a sound understanding of the level of risk acceptable to the organization. Is it critical not to miss any possibilities (false negatives), or is it more important not to return too many false positives in search query results? For example, the former could mean missing a potential terrorist because of a name misspelling, while the latter could return more possible candidates than an analyst can sort through. Each judgment regarding risk has an associated cost, and these must be balanced in the data characterization and processing stages.

An example of automated tool manipulation is the use of entity resolution tools to correlate similar attributes across diverse data sets. In this instance, the effectiveness of the correlation will depend in part on the standardization employed for identical attributes incorporated into different data sets. Standardization, or normalization, should be as universal as possible and established when data is ingested and formatted. While software may compensate for some variations, it is best to establish normalization criteria as early as feasible to enhance the effectiveness of entity resolution tools; otherwise, legitimate correlations could be precluded (by variation in calendars or date formats, for example) when trying to identify a set of activities within a given timeframe.

The use of relational tools will also pose challenges for data specialists, not least of which is gaining some understanding of the reliability of the data's source. Although it is optimal for those closest to the actual data collection to judge the likely validity of the "raw" data facts, too frequently this judgment is not made by the intelligence collector, for a variety of reasons. Consequently, the analyst is left to sort out the validity of relationships made by

automated tools and to deal with any obvious conflicts. An example is variation in a passport number: only one is likely valid for the same country and date, and in such a circumstance it matters whether one number may have been garbled in a long chain of communications while another is derived from an actual scanned document. Finding ways to flag such data with accuracy indicators is critical to determining the confidence level the analytic conclusions deserve. This principle also applies to the history of the data: an analyst may need to know whether the data are "raw"--not previously manipulated by tools, techniques, or other analysts--or instead derived from automatically created relationships (tool-derived) or other analysts' assertions. The more this type of information can be tracked along with the data, the more likely the analyst will be able to make accurate intelligence assessments.

KNOWLEDGE BASE

As noted, analytic or tool-based assertions are different from actual "raw" data. The latter is what is generally subjected to data characterization; the former are derived data or intelligence assertions. These too should be stored, given their value to other analysts, particularly when an analyst is looking for "non-obvious" personal or organizational relationships (connecting the dots), long-term trend analysis, historical context, or a myriad of other analytic functions. Such derived data facts or assertions should be maintained in a knowledge base that is as widely accessible as clearances, accesses, and analytic roles permit across a broad spectrum of intelligence and law enforcement organizations. The ability of analysts to build on the knowledge acquired by their compatriots is essential to advancing analytic success against a highly dynamic and decentralized set of evolving intelligence targets.
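A minimal sketch of how a knowledge base might distinguish raw facts from tool-derived and analyst-asserted ones, carrying accuracy indicators and lineage with each entry. The enum values, record shape, and sample facts are illustrative assumptions:

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    RAW = "raw"                   # as collected; untouched by tools or analysts
    TOOL_DERIVED = "tool"         # e.g. an entity-resolution correlation
    ANALYST_ASSERTED = "analyst"  # an analyst's judgment or linkage

@dataclass
class KnowledgeBaseEntry:
    fact: str
    provenance: Provenance
    derived_from: list   # lineage: identifiers of the records this rests on
    confidence: str      # accuracy indicator, e.g. "scanned document"

# A raw fact and an assertion derived from it (sample data):
raw = KnowledgeBaseEntry(
    fact="passport K1234567 observed on scanned document",
    provenance=Provenance.RAW,
    derived_from=[],
    confidence="scanned document",
)
linked = KnowledgeBaseEntry(
    fact="passport K1234567 linked to subject X",
    provenance=Provenance.TOOL_DERIVED,
    derived_from=["raw-001"],
    confidence="fuzzy match",
)

# An analyst can filter the knowledge base down to raw facts only:
entries = [raw, linked]
raw_only = [e for e in entries if e.provenance is Provenance.RAW]
```

Because every derived entry names the records it rests on, a conflicting assertion can be traced back to its source and weighed against the accuracy indicator attached there.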

CONCLUDING OBSERVATIONS

Comprehensive data characterization for raw data, combined with knowledge bases for derived data assertions, will continue to grow in importance as data proliferate and analytic resources are constrained by budgets and available experience. Understanding and making sense of all that data ultimately contributes to the effectiveness of the analytic process. Data characterization is not the most exciting aspect of the analytic cycle, nor is it all that is necessary, but it is the foundation for all that follows. The ongoing challenge in the intelligence world is not just to acquire all the relevant information but to manage and track it once it is acquired; we all understand the danger of possessing the "golden nuggets" yet being unable to find them or use them effectively to get the answers critical to thwarting national security threats and navigating dangerous environments. Data characterization alone is not enough, but it is a huge step forward and one that we cannot afford to minimize or overlook.