Data Science and Big Data: Below the Surface and Implications for Governance
Randy Soper
The views expressed are those of the author and do not reflect the official position or policy of the Defense Intelligence Agency, the Department of Defense or its components, or the United States Government.
A (Typical?) Data Science/Big Data Story
From Scott Adams's Dilbert:
- Pointy-Haired Boss: technically (and managerially) clueless, always chasing the latest buzzword
- Dogbert: high-paid consultant, questionable ethical framework
A (Typical?) Data Science/Big Data Story
Pointy-Haired Boss: "Companies seem to be really excited about the 'Big Data' thingy. Maybe I should contract out for some of that?"
Dogbert the Data Scientist
A (Typical?) Data Science/Big Data Story
Dogbert the Data Scientist: "I used big data and machine learning to build a predictive analytics capability for your inventory flows for just-in-time delivery, and I also developed a dashboard based on customer sentiment analysis of social media feeds to push alerts to your sales staff about real-time regional trends of interest in your product line."
A (Typical?) Data Science/Big Data Story
Is the P.H.B. happy??? Of course!!!
A (Typical?) Data Science/Big Data Story
But (beyond our understanding of the personalities involved here):
- There are concepts we need to understand
- And questions we should be asking
What's Really Going On? Let's Unpack This
Dogbert the Data Scientist: "I used big data and machine learning to build a predictive analytics capability for your inventory flows for just-in-time delivery, and I also developed a dashboard based on customer sentiment analysis of social media feeds to push alerts to your sales staff about real-time regional trends of interest in your product line."
What's Really Going On? Let's Unpack This
What's a data scientist?
Data Science is a Team Sport
- Subject matter knowledge/domain expert
- IT skills (development/infrastructure)
- Statistics/mathematical skills
- The Unicorn (rare and wonderful): the overlap of all three
Source: The Data Science Venn Diagram by D. Conway; Booz, Allen, Hamilton; and others
What's Really Going On? Let's Unpack This
Is this what we need? What's our requirement?
Start with the Requirements
- Data science/big data is about infrastructure, data, data pre-processing and aggregation, analytic tools, data scientists, analytic techniques, and an actionable deliverable product
- Buy systems, buy data, buy tools, hire talent?
- The data and the tools are the shiny objects
- First step: what are my business objectives?
- These should drive everything (architecture, data, tools)
What's Really Going On? Let's Unpack This
What's big data?
Big Data is???
- Big data may be more than just a lot of data
- Big data isn't just unstructured data/NoSQL/Hadoop (although these are frequently powerful components!)
- Big data is fundamentally about the three (four) Vs: Volume, Variety, Velocity, (Veracity)
The Vs of Big Data
- Volume: corporate data warehousing; log data, sensor data (IoT); social media; document corpus; speech, image, video; etc.
- Variety: structured, semi-structured, unstructured
- Velocity: rate of ingest, rate of analysis, decision automation
- Veracity: untrusted/unknown source, untreated data (doesn't define big data like the others, but frequently accompanies it)
Not Only SQL (NoSQL)
- Relational Database Management System (RDBMS): emphasis on ACID properties (Atomicity, Consistency, Isolation, and Durability)
- NoSQL:
  - Schema-free (V = variety!)
  - High performance (no joins!)
  - Scalable (V = volume!)
- NoSQL does not address the velocity V
- Examples: CouchDB, Accumulo
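To make the "schema-free" point concrete, here is a minimal sketch contrasting a fixed relational schema with a toy document store. It uses Python's built-in sqlite3 for the relational side; the "document store" is only a plain dict holding JSON strings, an assumption for illustration and not the API of CouchDB, Accumulo, or any real NoSQL product. The table and field names are hypothetical.

```python
import json
import sqlite3

# Relational side: the schema is declared up front and every row must fit it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (1, "Alice"))


def insert_with_extra_field():
    """Try to store a field the schema does not know about."""
    try:
        conn.execute(
            "INSERT INTO customers (id, name, twitter) VALUES (?, ?, ?)",
            (2, "Bob", "@bob"),
        )
        return True
    except sqlite3.OperationalError:
        # The table has no 'twitter' column, so the insert is rejected.
        return False


relational_accepts_new_field = insert_with_extra_field()

# Schema-free side: a toy key-value document store (hypothetical, just a dict).
# Each document carries its own fields, so variety needs no schema change.
document_store = {}
document_store["cust:1"] = json.dumps({"name": "Alice"})
document_store["cust:2"] = json.dumps(
    {"name": "Bob", "twitter": "@bob", "tags": ["vip"]}
)

# Field names found in each stored document.
fields_per_document = {
    key: sorted(json.loads(doc)) for key, doc in document_store.items()
}
```

The trade-off is that the document side gives up the consistency guarantees the relational side enforces; the sketch only shows the variety benefit, not performance or scale.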
Hadoop / MapReduce
(Diagram: master node and cluster)
- Distributed computation on commodity hardware (Intel/AMD x86 processors) across a cluster, against key-value pair operations
- Data/compute collocated
- Scalable, schema-free: suitable for NoSQL computation
- Redundant storage: resistant to node failure
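The key-value style of computation can be sketched with the classic word count. This is a single-process illustration of the map, shuffle, and reduce steps only; it is not the Hadoop API, and in a real cluster the map and reduce tasks would run on different nodes next to the data. The function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) key-value pair for every word."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big value", "data science"])
```

Because each map call touches one document and each reduce call touches one key, both steps can be spread across many machines, which is where the scalability comes from.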
Big Data vs. Data Science
- Data science: application of IT capability, domain knowledge, and/or statistics to obtain new business value from (conglomerations of) data
- Big data: data challenges involving one or more of the Vs
All but the most pedestrian of big data problems use data science. Not all data science problems involve the Vs of big data.
What's Really Going On? Let's Unpack This
What are data science techniques?
The Work of Data Science
- Data acquisition
  - Internal (policy, transfer)
  - Purchase
  - Stream, e.g., social media, exposed via Application Programming Interface (API)
- Data manipulation: extract, transform, load (ETL); aggregation
  - Data lake
  - Natural language parsing (including sentiment analysis)
- Statistics
  - Characteristics: ordinal/Likert data, mixed input categories, geospatial data, binary/yes-no results, etc.
  - Special regressions (e.g., logistic)
- Numerical techniques, including supervised/unsupervised machine learning
  - Random forests, clustering, Bayesian analysis, deep-learning neural nets, Monte Carlo simulations
- Visualization, sense-making
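As one small instance of the techniques listed, sentiment analysis can be sketched with a word-list (lexicon) score. This is a deliberately simple stand-in, not the method behind any particular product: the word lists below are hypothetical, and production systems typically use trained models rather than fixed lists.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"love", "great", "fast", "happy"}
NEGATIVE = {"hate", "slow", "broken", "late"}


def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def label(text):
    """Turn the numeric score into a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `label("I love the fast delivery")` scores two positive words and no negative ones, so it is labeled positive. A list-based score misses sarcasm and negation ("not great"), which is one reason results from such tools need scrutiny.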
Data Science Ecosystem
- Product delivery: written report; alerts/dashboards; exposed API; analytic tools and discoverable data
- Analytic tools: machine learning; natural language processing; regression; visualization
- Data conditioning & aggregation: data lake; ETL; manual munging & wrangling; parsing; tagging
- Data sources: owned batch data; owned streaming data (log, sensor, etc.); external/purchased data (batch); external streaming data
- Infrastructure
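The data conditioning layer can be illustrated with a tiny extract, transform, load (ETL) pass. This is a sketch under stated assumptions: the raw CSV text, the column names, and the dict standing in for an aggregated store are all hypothetical, and real pipelines would read from files, databases, or streams.

```python
import csv
import io

# Hypothetical raw batch data with inconsistent casing, stray spaces,
# and one record with a missing measurement.
RAW = "region,units\n east ,10\nWest,\nEAST,5\n"


def extract(text):
    """Extract: parse raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize region names and drop incomplete records."""
    clean = []
    for row in rows:
        units = (row["units"] or "").strip()
        if not units:
            continue  # skip records with no unit count
        clean.append({"region": row["region"].strip().lower(), "units": int(units)})
    return clean


def load(rows, store):
    """Load: aggregate unit counts per region into the target store."""
    for row in rows:
        store[row["region"]] = store.get(row["region"], 0) + row["units"]
    return store


totals = load(transform(extract(RAW)), {})
```

Here the two "east" records are merged and the incomplete "West" record is dropped. Choices like that (drop, impute, or flag) are exactly the kind of conditioning decision that benefits from being documented.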
Social Media as Customer Data
- Twitter exposes 1% of all tweets on a public, no-charge API
- 100% of tweets are available as a live stream through a paid service
- If you tweet, you are Twitter's product!
- Companies use this for real-time, geolocated information about customer (and competitor) behavior
Comparative Word Clouds of ISACA International and ISACA NCAC Official Twitter Feeds
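A word cloud is essentially a picture of term frequencies, so the comparison can be sketched by counting words per feed. The posts below are made-up placeholders, not the actual ISACA feeds, and the stopword list is a small hypothetical one; drawing the cloud itself would need a plotting library and is left out.

```python
import re
from collections import Counter

# Hypothetical stopword list: common words that carry little meaning.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "on", "is", "at", "our"}


def term_frequencies(posts, top_n=5):
    """Return the top_n most frequent non-stopword terms across the posts."""
    words = []
    for post in posts:
        words.extend(
            word for word in re.findall(r"[a-z]+", post.lower())
            if word not in STOPWORDS
        )
    return Counter(words).most_common(top_n)


# Placeholder posts standing in for two different feeds.
feed_a = ["Join our audit conference", "Audit and risk webinar on governance"]
feed_b = ["Local chapter meeting on risk", "Chapter training for audit"]

top_a = term_frequencies(feed_a, top_n=1)
top_b = term_frequencies(feed_b, top_n=1)
```

Comparing the top terms side by side shows at a glance what each feed emphasizes, which is what the two clouds convey visually.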
What's Really Going On? Let's Unpack This
What are big data/data science products or enterprise delivery options?
Variety of Mechanisms to Deliver Data Science Value
- Streaming information, processed at scale in real time?
  - May want to consider real-time alerting for immediate decisions
  - But need to make sure the decision-making framework and personnel are prepared to capitalize
- Other, more traditional options may be just as viable
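Real-time alerting on a stream can be sketched as a sliding-window check: compare each new value with the average of the most recent ones and raise an alert on a spike. The window size, the spike factor, and the sample counts are assumptions chosen for illustration, not recommended settings.

```python
from collections import deque


def stream_alerts(counts, window=3, factor=2.0):
    """Yield (index, count, baseline) when a count exceeds factor x recent average."""
    recent = deque(maxlen=window)  # holds only the last `window` values
    for index, count in enumerate(counts):
        if len(recent) == window:
            baseline = sum(recent) / window
            if baseline > 0 and count > factor * baseline:
                yield index, count, baseline
        recent.append(count)


# Hypothetical per-minute mention counts for a product in one region.
alerts = list(stream_alerts([10, 12, 11, 40, 13, 12]))
```

With these numbers only the jump to 40 triggers an alert, against a baseline of 11.0. Tuning the window and factor is a business decision as much as a technical one, since every alert needs someone prepared to act on it.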
Predictive Analytics
- Business/government moving from using data for retrospective understanding (patterns, sense-making, visualization) to predictive tools for proactive response
- Predictive models built from statistical analysis
- Still primarily a future state
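The step from retrospective to predictive use can be shown with the simplest statistical model: fit a straight line to past values by least squares and extend it one step. The weekly unit figures are hypothetical, and a straight-line trend is an assumption that real inventory data may well violate.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(xs, ys, next_x):
    """Predict y at next_x from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return slope * next_x + intercept


# Hypothetical history: units shipped in weeks 1 through 4.
weeks = [1, 2, 3, 4]
units = [100, 110, 120, 130]
next_week = forecast(weeks, units, 5)
```

The history grows by 10 units a week, so the fitted line projects 140 for week 5. The same fit applied to noisy or seasonal data would still return a number, which is why the model's assumptions deserve review before it drives decisions.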
Some Thoughts on Governance/Controls
Warning: big data may mean new data sources, data sharing, and data policies
Data Science and Big Data Can Mean
- Unprecedented sharing of data
- Unprecedented accumulations of data
- Investment to purchase data
- Corporate recognition of increased business value in data and in more kinds of data
- Direct sale of or exposure of data or direct derivatives
What are your use-case specific best controls?
Data Science and Big Data Can Mean
"I want to lay out three things that the private sector can do today that will protect them from the vast majority of attacks, from the Chinese and elsewhere. One: Patch IT software obsessively. Two: Segment your data. A single breach shouldn't give attackers access to a mother lode of proprietary data. Three: Pay attention to the threat bulletins that DHS and FBI put out. And, if there's a fourth commandment, it's this: Teach folks what spear phishing looks like."
- Director of National Intelligence Clapper at the 2015 International Conference on Cyber Security
What are your use-case specific best controls?
Some Thoughts on Governance/Controls
Warning: uncontrolled data science = development tools in production environment against live data
Doing Data Science
Many commercial and open source data science/big data capabilities are available, focused on log-file analysis, visualization, business analytics, data integration, democratization of analytics, etc.
...it's about playing with the data!
- Initially and for evolution/maintenance, data scientists will want to bring flexible analytics to real business data
- The domain of business value discovery for data science
- Tools: Excel; Hadoop (MapReduce, Pig, Hive); MATLAB; Python; R
Some Thoughts on Governance/Controls
Warning: application of ill-defined/broad concepts could lead to inconsistent/non-repeatable results in key business processes
Some Thoughts on Governance/Controls
Warning: analysis of RoI on start-up big data/data science efforts is especially challenging, but needs to be baked in
Questions?
Backup
Governance/Controls Bonus: What's this Internet of Things (IoT)???
Imagine your car navigation, calendar, clock, and coffee maker having the ability to communicate. You have a high-priority, early morning meeting. Your navigation system knows that there's a major traffic accident and your commute will be longer than normal. Therefore your clock automatically resets your wakeup alarm earlier and your coffee maker resets your auto-brew time earlier to get you to your meeting on time!
Now ask: what are the IT security implications of this degree of connectedness?