Cognitive and Organizational Challenges of Big Data in Cyber Defense

Nathan Bos & John Gersh
Johns Hopkins University Applied Physics Laboratory
nathan.bos@jhuapl.edu, john.gersh@jhuapl.edu

The cognitive and organizational challenges associated with Big Data have not received much research attention. We have begun an interview study of analysts who work in the computer network (cyber) defense (CND) area and have experienced changes in data scale affecting their analytical work. We used a qualitative inquiry method, starting with relatively open-ended questions. Our interview protocol also asked analysts to describe critical incidents related to data use, and probed for previously identified cognitive biases that may affect analysis in this domain.

What does Big Data mean in CND?

In an important sense, data size has not changed for CND analysts: they've always had more than they can handle. This is the cognitive equivalent of an aspect of computer scientists' original concept of big data: data too big to be memory-resident (Cox & Ellsworth, 1997). Raw data in CND can't fit in analysts' long-term memory (let alone working memory!). Analysts have always required cognitive artifacts (Norman, 1992) to deal with their data. While one could start with words on paper, we could consider the (computer) spreadsheet the ur-artifact in this domain. CND analysts still use spreadsheets in some cases to handle log entries from computers or from defensive systems such as intrusion-detection systems (IDSs). Today, analysts' cognitive artifacts are much more capable at accessing, correlating, and presenting data. Nonetheless, the presentation may still rely on lists of events and objects. These artifacts have always played the same role: providing a representation that reduces the size and complexity of data to something that human cognition can handle. As a result, analysts' view of data, and more importantly, their view of the world they're trying to understand, are defined by the tools they use.
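The data-reducing role of such artifacts can be illustrated with a minimal sketch. The log format, field positions, and alert names below are invented for illustration; the point is only that a few lines of tooling collapse many raw entries into a summary a human can actually scan, much as an analyst's spreadsheet pivot would.

```python
from collections import Counter

# Hypothetical IDS log lines (format assumed for this sketch only):
# timestamp, source IP, alert name
log_entries = [
    "2014-03-01T10:02:11 10.0.0.5 PortScan",
    "2014-03-01T10:02:14 10.0.0.5 PortScan",
    "2014-03-01T10:05:40 192.168.1.9 MalwareBeacon",
    "2014-03-01T10:07:02 10.0.0.5 PortScan",
]

# Reduce the raw entries to per-alert counts: a representation small
# enough for human cognition to handle, at the cost of hiding detail.
counts = Counter(line.split()[2] for line in log_entries)
for alert, n in counts.most_common():
    print(alert, n)
```

The same reduction that makes the data tractable also defines, and limits, what the analyst sees of the underlying events.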
The cyber defense domain has particular attributes that affect the relationship between analysts and their data: important data-driven decisions must be made in real time or near-real time; the domain is non-physical, so almost all thinking is about abstractions; and information requirements and sensor development are driven by external actors: their capabilities, tactics, and strategies. This leads to a cycle of growth in data size: new threat capabilities and strategies → a new defensive strategy → more and more diverse sensors with more, faster, and more diverse data → more complex technology to handle new kinds of and bigger data → new threat capabilities in response.

Methods

We present some initial findings based on N=6 analysts representing multiple organizations. Our qualitative analysis involves identifying common challenge themes and coding interview notes. We use the common Big Data dimensions of Volume, Variety, and Velocity both as a way to structure interview questions and as tags for responses. We found, though, that analysts think more in terms of the challenge themes than in those data dimensions. The dimensions' utility lies more in understanding how data size may affect analysts' thinking than in characterizing how analysts view their domain. We additionally coded issues as to whether they were primarily Technological, Cognitive, or Organizational challenges. (Second-level coding categories are not represented in this abstract.) The challenge themes, each coded by type of issue (Cognitive, Technological, or Organizational) and by Big Data dimension (Volume, Velocity, or Variety), are: Tools and Automation; Archiving; Monitoring alerts; Cognitive bias: WYSIATI; Cognitive bias: Judgments of risk; Pace of work; and Increasing coordination costs.

Tools and Automation

Characteristics of the CND domain, its data, and the technology analysts use come together to raise complex challenges to analysts' use of data to understand what's happening in their networks. In the past, threats were seen as intrusions, attempts to break through a defensive perimeter; IDSs issued alerts on attempts, and anti-virus software issued alerts on malware observed on computers. Data were seen as events with attributes happening on an individual computer. For a network, data also include packets, objects with attributes. These alerts were seen as high-level data (events); packets (objects) and, in some cases, log entries were seen as the raw data that generated them. Evolving threats have added complexity to the situation.
Advanced persistent threats (APTs) represent an adversary's long-term presence in computers or other devices on a network through a variety of means. From that presence the adversary attempts to extract valuable information and exfiltrate it. Threat activity extends over time and involves coordinated action among many pieces of hidden software. Nevertheless, analysts can still view their data as a collection of events and objects. These events and objects, however, are now represented at a higher level of abstraction. Examples might be malware-exfiltrating-a-file or a-command-and-control-message. Each of these is evidenced by a collection of lower-level events like log entries or packets. This is the sense in which analysts' data size has not changed: they still deal with all the alerts and objects that they can handle. Now, though, analysts' tools present them with alerts and objects that are more distant from the data. Here "distance" refers to their cognitive artifacts' representational distance from what was previously considered fundamental data. (This distance is also relative to the current technology. "Distance from data" really means distance from what we formerly saw as the raw data for our analysis. Even in the past, for example, analysts were becoming distant from the individual bits making up a packet.) Senior analysts refer to this distance in terms of novice analysts' lack of understanding of what alerts really mean: the specific kinds of lower-level events that generate them and their significance for the network and for the organizational mission. Similar concerns apply to forensic analyses. Even with more time to investigate a problem, senior analysts feel that novices may not have the skill to drill down from more abstract data to more concrete data and then understand the data's connections and implications. These issues are exacerbated by an increase in the variety of the underlying sensor data and by the complexity of relationships among different kinds of data. (We found that variety and its concomitant complexity were the most challenging data dimensions for analysts' work.) Dealing with this variety (as well as data volume and velocity) has required automation of increased complexity and span of action. For example, automation in an IDS involves the recognition of a particular set of attributes (a signature) in a sequence of packets.
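A minimal sketch of that kind of signature recognition follows. This is a toy, not a real IDS: the packet fields, values, and signature shown are invented for illustration, and a real signature would also match on payload content and packet ordering.

```python
# A "signature" here is a set of attribute values that must all be
# present in a single packet for an alert to fire. All field names
# and values below are hypothetical.
signature = {"dst_port": 4444, "flags": "SYN"}

packets = [
    {"src": "10.0.0.8", "dst_port": 80,   "flags": "SYN"},
    {"src": "10.0.0.8", "dst_port": 4444, "flags": "SYN"},  # full match
    {"src": "10.0.0.8", "dst_port": 4444, "flags": "ACK"},
]

def matches(packet, signature):
    """True if every attribute in the signature appears in the packet."""
    return all(packet.get(k) == v for k, v in signature.items())

# Scan the packet sequence and raise an alert per matching packet.
alerts = [p for p in packets if matches(p, signature)]
print(len(alerts))  # only the second packet matches the full signature
```

Each alert this produces is an event one level of abstraction above the packets, which is precisely the representational distance discussed above.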
Automation for dealing with APTs may require correlation among several kinds of data appearing in a particular sequence over an extended period of time. This evolution of automation brings with it the potential for changes in analysts' roles and for the operational errors that have been observed in using automation in other domains, for example in aviation or health care (Parasuraman & Wickens, 2008). Careful design of analysts' tools can help to prevent such errors (Norman, 1990). Automated detection of (and perhaps response to) significant CND events has tended to put analysts into a supervisory role over the automation, which can be a significant change for them and their organizations. Analysts referred to a new rule-writing skill requirement. In addition to responding to alerts or analyzing situations, some analysts must now define rules for correlating complex sets of data and for alerting to complicated situations. Developing the skill needed to formalize one's understanding of situations from data sets, and finding a place for skilled rule-writers in organizations, are new issues facing CND teams. In other cases rules may be invisible to analysts because they're inaccessibly embedded in commercial tools, which may constrain the analysts' supervisory role. (Analysts may become distant from their analytics as well as from their data.) Various conceptual models have been developed for thinking about cognition in the supervisory control of dynamic systems. Concepts from such models include (among others) users' comparing rules' actions to expected outcomes (Hamill & Gersh, 1992) and the role, in joint (human-system) cognitive systems, of an artificial construct of knowledge about the world (Hollnagel & Woods, 2005). Such concepts from other domains may prove useful in building similar models for analysts' interactions with big data.

Archiving

When asked about challenges of data volume, most analysts we talked to first volunteered information related to archiving. For example, despite ever-cheaper storage, it is not considered cost-effective to store full-packet data for very long. Different kinds of data and metadata are archived for different lengths of time. Organizations set archiving policies for different data types, with implications for what kinds of investigations can be conducted in the future.

Monitoring alerts

First-tier monitoring tasks are often based on automated alerts generated by monitoring tools. Increases in network traffic contribute to sometimes-overwhelming numbers of alerts. Analysts use experience and intuition to make risk judgments about alerts and to decide which require further investigation. Changes in data volume and velocity require constant adaptation of these judgments (e.g., when a type of alert previously judged worthy of investigation becomes 100x more prevalent), which presents challenges for both novice and experienced analysts.

Cognitive bias: WYSIATI

Psychologist Daniel Kahneman describes a cognitive bias he calls What You See Is All There Is (Kahneman, 2011). This occurs when an analyst makes an implicit assumption that the available data are complete, and may make incorrect inferences from the absence of data (i.e., treating absence of evidence as evidence of absence). Logically, however, the size of a dataset is not an indicator of its completeness. Increasing data size may lead analysts to make more errors of this type.

Cognitive bias: Judgments of risk

Novice analysts tend to react to automated alerts as if each one represents a significant threat. They see the cyber world as an inherently dangerous, risky place.
Judgments of probability and risk are the source of well-known biases (Kahneman, 2011). These biases can affect assessment of threat likelihood both in rapid response to alerts and in more deliberate investigation of incidents. Treating every alert as a real threat leads novices to file more false-alarm reports. More experienced analysts tend to think in terms of the explicit relationship between event data and automated alert criteria and thresholds. They realize that these settings might produce many false alarms, and they react accordingly. Novices work to prove that each alert represents a threat; experts work to prove that it doesn't. This has been attributed to data-size-driven distance from data and automation: novices don't have a clear picture of the relationships among events, data, and automation. The progression from novice to expert may be described as movement along a human-system receiver operating characteristic (ROC) curve (Sorkin & Woods, 1985).
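That movement along the ROC curve can be made concrete with a small signal-detection sketch. The suspicion scores and threat labels below are fabricated for illustration; the point is that shifting the decision criterion trades hits against false alarms, and the novice-to-expert progression looks like a shift toward a more conservative operating point.

```python
# Toy signal-detection model of alert triage. Each alert carries a
# "suspicion score"; a few correspond to real threats, most are benign.
# All numbers are fabricated for illustration.
threats     = [0.9, 0.7, 0.6, 0.4]            # scores of real threats
non_threats = [0.5, 0.3, 0.3, 0.2, 0.1, 0.1]  # scores of benign alerts

def operating_point(threshold):
    """Return (hit rate, false-alarm rate) for a decision threshold."""
    hits = sum(s >= threshold for s in threats) / len(threats)
    false_alarms = sum(s >= threshold for s in non_threats) / len(non_threats)
    return hits, false_alarms

# Novice-like liberal criterion: every alert looks like a threat.
print(operating_point(0.25))  # (1.0, 0.5): all hits, many false alarms
# Expert-like conservative criterion: fewer false alarms, some misses.
print(operating_point(0.55))  # (0.75, 0.0)
```

Both criteria lie on the same ROC curve; experience changes where on the curve the analyst operates, not the underlying discriminability of the data.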
Pace of work

Increasing velocity of data corresponds to an increasing rate of data collection. However, analysts report that the expected pace of work is dictated by the organization, not the data. Organizations respond to increased variety by expanding staff, changing analytical priorities, and, when possible, adopting new tools. In response to increased volume, organizations often deprioritize certain types of attacks (e.g., nuisance software) and devote less time to open-ended exploration of the data. In the words of one analyst, "I used to spend more time hunting."

Increasing coordination costs

The easiest way to scale up is to recruit more novice analysts and put them to work at Tier 1 analytical work: monitoring traffic and triaging alerts. Expanding an organization incurs well-understood increases in levels of management, training costs, and coordination costs between analysts, across shifts and specializations. Coordination also increases the value of standardized notation and procedures, which are not always present in the relatively new field of cyber defense.

References

Cox, M., & Ellsworth, D. (1997). Managing big data for scientific visualization. ACM SIGGRAPH: International Conference on Computer Graphics and Interactive Techniques.
Hamill, B. W., & Gersh, J. R. (1992). Decision-making performance in rule-based supervisory control: Empirical development of a cognitive process model. Presented at the Joint Directors of Laboratories Basic Research Group Symposium on C2 Research.
Hollnagel, E., & Woods, D. D. (2005). Joint Cognitive Systems: Foundations of Cognitive Systems Engineering. Boca Raton, FL: Taylor and Francis.
Kahneman, D. (2011). Thinking, Fast and Slow. New York, NY: Farrar, Straus and Giroux.
Norman, D. A. (1990). The 'problem' with automation: Inappropriate feedback and interaction, not over-automation. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences, 327(1241), 585–593.
Norman, D. A. (1992). Design principles for cognitive artifacts. Research in Engineering Design, 4(1), 43–50.
Parasuraman, R., & Wickens, C. D. (2008). Humans: Still vital after all these years of automation. Human Factors, 50(3), 511–520.
Sorkin, R., & Woods, D. D. (1985). Systems with human monitors: A signal detection analysis. Human-Computer Interaction, 1(1), 49–75.