ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION
Francine Forney, Senior Management Consultant, Fuel Consulting, LLC
May 2013



DATA AND ANALYSIS INTERACTION

Understanding the content, accuracy, source, and completeness of data is critical to the successful analysis of intelligence information. Without comprehensive and systematic documentation of all relevant data, an analyst cannot create effective search strategies or make accurate intelligence assessments. The problem is compounded by automated tools and techniques, which essentially treat all data as equal.

Although data sets are the foundation of all intelligence analysis, they are the least understood and most overlooked aspect of the process. In part this may be attributed to the assumption that all data acquired through legitimate sources are essentially the same, albeit influenced by bias stemming from the analyst's individual experience and knowledge of specific intelligence disciplines. The human tendency to favor what we know and understand affects the analyst's choices at each stage of analysis and may therefore affect the accuracy of the overall assessment. This paper presents insights into ways to improve data characterization and thus the accuracy of intelligence analysis.

DATA: THE BEGINNING

Data characterization begins long before the analyst sees the data or any tool manipulates it: it starts at the source and continues through transmission, ingest, formatting, standardization, processing, documentation, and the methods of manipulation, presentation, search, and analysis. It involves a myriad of skill disciplines, including those of the intelligence collection manager, data manager, software engineer, extract-transform-load technician, hardware engineer or architect, documentation specialist, infrastructure manager, analytic tools and techniques implementer, computer support specialist, and, last but not least, the intelligence analyst.

The multiple influences on the data set before the analyst sees it are, in fact, part of the problem; there is an assumption that the analyst need only define his or her analytic requirements and tools, and that other specialists can thereafter meet those specifications using best judgment. Understanding data, like analysis, is an interactive process. It is impossible to define data requirements successfully without analyzing and understanding the variety of potential sources of similar, if not identical, data, particularly given the ever-expanding global communications infrastructure. Like data characterization itself, this is not a static process. Rather, it is an iterative one that involves all the individuals who touch the data or make any decision affecting the data available to the analyst.

DATA CHARACTERIZATION PROCESS PHASES

The first phase of data characterization involves determining what detailed information should be systematically retained for all acquired intelligence data. As noted, this is not an incidental phase, and it may change over time as techniques evolve and knowledge is gained about the value of specific data and relational correlations across data sets. This data documentation phase must include the participation of the end users of the data--the intelligence analysts as well as the technical specialists. Also important is ensuring that the skill sets of the intelligence

analysts involved are representative of the types of analysis performed by the organization: current threat analysis; strategic or long-term trend analysis; combat support; situational awareness or alerts for the newest information; target watch listing; geo-locational or geospatial support; and so on. Each organization will have a subset of these analytic functions, and while some data characterization documentation requirements will overlap, others will be of unique value for a given function. As a result, the priority of what is most important will change accordingly. Examples of data documentation that should be retained include the following:

- date of data collection and date of data delivery
- source of data
- confidence factor for data source (direct observation, second- or third-hand, analytic assumption, document-derived, collection bias, etc.)
- data set completeness
- size of data set
- data attributes contained in the data (phone numbers, names, passport serials, etc.)
- whether specific data fields are standardized and, if so, which standard is employed
- countries or nationalities represented and quantities of each attribute
- specific restrictions on data handling (time limitations, U.S. person, etc.)
- classification of data
- analytic category of data (travel, financial, identity, biometric, etc.)
- potential redundancy of data source
- frequency of data delivery (live streaming, daily, weekly, etc.)
- any observed operational, system, or processing issues relevant to analysts
- graphical displays of data that enhance the analyst's ability to grasp the characteristics of a large data volume quickly (heat maps, bar charts of geographical coverage, etc.)
- any other information that would help the analyst interpret the data accurately
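A minimal sketch of how documentation items like those above might be captured as a structured record per data set. The field names and the example feed are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetCharacterization:
    """Illustrative per-data-set documentation record."""
    source: str                   # originating collector or feed
    collected_on: date            # date of data collection
    delivered_on: date            # date of data delivery
    source_confidence: str        # e.g. "direct observation", "second-hand"
    record_count: int             # size of data set
    attributes: list = field(default_factory=list)          # e.g. ["name", "passport"]
    standardized_fields: dict = field(default_factory=dict) # field -> standard used
    handling_restrictions: list = field(default_factory=list)
    classification: str = "UNCLASSIFIED"
    delivery_frequency: str = "daily"   # "streaming", "daily", "weekly", ...
    known_issues: list = field(default_factory=list)

# An entry an ingest technician might record (hypothetical feed):
travel_feed = DatasetCharacterization(
    source="border-crossing feed (hypothetical)",
    collected_on=date(2013, 4, 1),
    delivered_on=date(2013, 4, 2),
    source_confidence="document-derived",
    record_count=125_000,
    attributes=["name", "passport", "nationality"],
    standardized_fields={"date": "ISO 8601"},
)
print(travel_feed.source_confidence)  # document-derived
```

Keeping such records alongside each data set gives both analysts and technical specialists a shared, queryable picture of what a repository actually contains.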

The second, though non-sequential, phase of data characterization consists of determining how the data may be manipulated by the analyst and what tool or technique will be employed to help the analyst derive knowledge from the data. Ensuring that the data is processed and maintained in a way that extracts the maximum intelligence value requires an understanding of how an analyst will search data repositories, correlate key data attributes across diverse data sets, identify new, timely data facts, or create relational linkages among a variety of attributes or data sets. While there is no guarantee that important intelligence facts will not be missed, the probability that intelligence assessments will be incomplete increases if data characterization is not comprehensive or if analytic functions and techniques are not tailored to the data.

Methods of data manipulation include both manual and automated tools and techniques. An analyst manually creates a search query by determining how to structure a question so that it retrieves the subset of relevant data needed to contribute to an intelligence assessment. While the tool may be composed of algorithms that automatically process a search query, it is the analyst who must build the query so that it returns all the relevant data. That process could include using variations in the spelling of a name or using wild-card symbols. Some tools may return name-spelling variations or minor misspellings via fuzzy logic, but others will not. Consequently, when first ingesting and processing data sets that contain personal or place names, it is important to determine how name variations will be handled and how much automation will be built into the capabilities.
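A minimal sketch of the kind of fuzzy name matching described above, using Python's standard-library `difflib`. The names and the threshold value are illustrative; real systems use purpose-built name-matching algorithms:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude similarity score between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_search(query: str, names: list, threshold: float = 0.8) -> list:
    """Return names whose similarity to the query meets the threshold.

    Lowering the threshold returns more candidates (fewer misses, more
    false positives); raising it does the opposite.
    """
    return [n for n in names if name_similarity(query, n) >= threshold]

watchlist = ["Mohammed al-Rashid", "Muhammad al-Rashid", "John Smith"]
print(fuzzy_search("Mohamed al-Rashid", watchlist, threshold=0.8))
# ['Mohammed al-Rashid', 'Muhammad al-Rashid']
```

The single `threshold` parameter is where the organization's risk tolerance shows up in code: it directly trades missed matches against spurious ones.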

Another important factor in these decisions is a sound understanding of the level of risk acceptable to the organization. Is it critical not to miss any possibilities (false negatives), or is it more important not to return too many false positives in search query results? For example, the former could mean missing a potential terrorist because of a name misspelling, while the latter could return more possible candidates than an analyst can sort through. Each judgment regarding risk has an associated cost, and these must be balanced in the data characterization and processing stages.

An example of automated tool manipulation is the use of entity resolution tools to correlate similar attributes across diverse data sets. In this instance, the effectiveness of the correlation will depend in part on the standardization employed for identical attributes incorporated into different data sets. Standardization, or normalization, should be as universal as possible and established when data is ingested and formatted. While software may compensate for some variations, it is best to establish normalization criteria as early as feasible to enhance the effectiveness of entity resolution tools; otherwise, legitimate correlations could be precluded (by variation in calendars or date formats, for example) when trying to identify a set of activities within a given timeframe.

The use of relational tools will also pose challenges for data specialists, not least of which is gaining some understanding of the reliability of the data's source. Although it is optimal for those closest to the actual data collection to judge the likely validity of the "raw" data facts, too frequently this judgment is not made by the intelligence collector, for a variety of reasons. Consequently, the analyst is left to sort out the validity of relationships made by

automated tools and to deal with any obvious conflicts. An example is variation in a passport number: only one is likely valid for the same country and date, and in such a circumstance it matters whether one number may have been garbled in a long chain of communications while another is derived from an actual scanned document. Finding ways to flag such data with accuracy indicators is critical to determining the confidence level the analytic conclusions deserve. This principle also applies to the history of the data: an analyst may need to know whether the data are "raw"--not previously manipulated by tools, techniques, or other analysts--or instead derived from automatically created relationships (tool-derived) or other analysts' assertions. The more this type of information can be tracked along with the data, the more likely the analyst will be able to make accurate intelligence assessments.

KNOWLEDGE BASE

As noted, analytic or tool-based assertions are different from actual "raw" data. The latter is what is generally subjected to data characterization; the former are derived data or intelligence assertions. These too should be stored, given their value to other analysts, particularly when an analyst is looking for "non-obvious" personal or organizational relationships (connecting the dots), long-term trend analysis, historical context, or a myriad of other analytic functions. Such derived data facts or assertions should be maintained in a knowledge base that is as widely accessible as clearances, accesses, and analytic roles permit across a broad spectrum of intelligence and law enforcement organizations. The ability of analysts to build on the knowledge acquired by their compatriots is essential to advancing analytic success against a highly dynamic and decentralized set of evolving intelligence targets.
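A minimal sketch of how a knowledge base might distinguish raw facts from tool-derived and analyst-asserted ones, carrying accuracy indicators and lineage with each entry. The enum values, record shape, and sample facts are illustrative assumptions:

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    RAW = "raw"                   # as collected; untouched by tools or analysts
    TOOL_DERIVED = "tool"         # e.g. an entity-resolution correlation
    ANALYST_ASSERTED = "analyst"  # an analyst's judgment or linkage

@dataclass
class KnowledgeBaseEntry:
    fact: str
    provenance: Provenance
    derived_from: list   # lineage: identifiers of the records this rests on
    confidence: str      # accuracy indicator, e.g. "scanned document"

# A raw fact and an assertion derived from it (sample data):
raw = KnowledgeBaseEntry(
    fact="passport K1234567 observed on scanned document",
    provenance=Provenance.RAW,
    derived_from=[],
    confidence="scanned document",
)
linked = KnowledgeBaseEntry(
    fact="passport K1234567 linked to subject X",
    provenance=Provenance.TOOL_DERIVED,
    derived_from=["raw-001"],
    confidence="fuzzy match",
)

# An analyst can filter the knowledge base down to raw facts only:
entries = [raw, linked]
raw_only = [e for e in entries if e.provenance is Provenance.RAW]
```

Because every derived entry names the records it rests on, a conflicting assertion can be traced back to its source and weighed against the accuracy indicator attached there.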

CONCLUDING OBSERVATIONS

Comprehensive data characterization for raw data, combined with knowledge bases for derived data assertions, will continue to grow in importance as data proliferate and analytic resources are constrained by budgets and available experience. Understanding and making sense of all that data ultimately contributes to the effectiveness of the analytic process. Data characterization is not the most exciting aspect of the analytic cycle, nor is it all that is necessary, but it is the foundation for all that follows. The ongoing challenge in the intelligence world is not just to acquire all the relevant information but to manage and track it once it is acquired; we all understand the danger of possessing the "golden nuggets" yet being unable to find them or use them effectively to get the answers critical to thwarting national security threats and navigating dangerous environments. Data characterization alone is not enough, but it is a huge step forward and one that we cannot afford to minimize or overlook.