Data Science and Big Data: Below the Surface and Implications for Governance




Randy Soper
The views expressed are those of the author and do not reflect the official position or policy of the Defense Intelligence Agency, the Department of Defense or its components, or the United States Government.

A (Typical?) Data Science/Big Data Story
From Scott Adams's Dilbert:
Pointy-Haired Boss: technically (and managerially) clueless, always chasing the latest buzzword
Dogbert: high-paid consultant, questionable ethical framework

A (Typical?) Data Science/Big Data Story
Pointy-Haired Boss: "Companies seem to be really excited about the Big Data thingy. Maybe I should contract out for some of that?"
Dogbert the Data Scientist

A (Typical?) Data Science/Big Data Story
Dogbert the Data Scientist: "I used big data and machine learning to build a predictive analytics capability for your inventory flows for just-in-time delivery, and I also developed a dashboard based on customer sentiment analysis of social media feeds to push alerts to your sales staff about real-time regional trends of interest in your product line."

A (Typical?) Data Science/Big Data Story
Is the P.H.B. happy??? Of course!!!

A (Typical?) Data Science/Big Data Story
But (beyond our understanding of the personalities involved here):
There are concepts we need to understand
And questions we should be asking

What's Really Going On? Let's Unpack This
What's a data scientist?

Data Science is a Team Sport
Subject matter knowledge/domain expert
IT skills (development/infrastructure)
Statistics/mathematical skills
The Unicorn (rare and wonderful): one person with all three
(The Data Science Venn Diagram by D. Conway; Booz Allen Hamilton; and others)

What's Really Going On? Let's Unpack This
Is this what we need? What's our requirement?

Start with the Requirements
Data science/big data is about infrastructure, data, data pre-processing and aggregation, analytic tools, data scientists, analytic techniques, and an actionable deliverable product
Buy systems, buy data, buy tools, hire talent? The data and the tools are the shiny objects
First step: what are my business objectives? These should drive everything (architecture, data, tools)

What's Really Going On? Let's Unpack This
What's big data?

Big Data is???
Big data may be more than just a lot of data
Big data isn't just unstructured data/NoSQL/Hadoop (although these are frequently powerful components!)
Big data is fundamentally about the three (four) V's: Volume, Variety, Velocity, (Veracity)

The V's of Big Data
Volume: corporate data warehousing; log data, sensor data (IoT); social media; document corpus; speech, image, video; etc.
Variety: structured, semi-structured, unstructured
Velocity: rate of ingest, rate of analysis, decision automation
Veracity: untrusted/unknown source, untreated data (doesn't define big data like the others, but frequently accompanies it)

Not Only SQL (NoSQL)
Relational Database Management System (RDBMS): emphasis on ACID properties (Atomicity, Consistency, Isolation, and Durability)
NoSQL: schema-free (V = variety!); high performance (no joins!); scalable (V = volume!)
NoSQL does not address the velocity V
NoSQL examples: CouchDB, Accumulo
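The contrast on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: Python's standard-library sqlite3 stands in for an RDBMS, and a plain list of dictionaries stands in for a schema-free store (it is not CouchDB or Accumulo). The table, field names, and records are made up.

```python
import sqlite3

# Relational side: a fixed schema, with atomic transactions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, sku TEXT NOT NULL, qty INTEGER NOT NULL)"
)

def insert_orders(rows):
    """Insert all rows or none (atomicity); return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany("INSERT INTO orders (id, sku, qty) VALUES (?, ?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False

ok = insert_orders([(1, "A-100", 5), (2, "B-200", 3)])
# The second row here violates NOT NULL on qty, so the whole batch is rolled back.
rejected = insert_orders([(3, "C-300", 1), (4, "D-400", None)])
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# Schema-free side: each record carries its own shape (the "variety" V).
documents = []
documents.append({"id": 1, "sku": "A-100", "qty": 5})
documents.append({"id": 2, "tweet": "love the new product", "geo": [38.9, -77.0]})
fields_seen = sorted({key for doc in documents for key in doc})
```

The relational table refuses the malformed batch as a unit, while the document list accepts records of any shape and leaves validation to the application.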

Hadoop / MapReduce
(Diagram: master node and cluster)
Distributed computation on commodity hardware (Intel/AMD x86 processors) across a cluster, expressed as operations on key-value pairs
Data/compute collocated
Scalable, schema-free; suitable for NoSQL computation
Redundant storage; resistant to node failure
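To make the key-value idea concrete, here is a single-process sketch of the MapReduce pattern as a word count. It does not use Hadoop or any Hadoop API; the function names and input strings are hypothetical, and a real cluster would run the map and reduce steps in parallel on the nodes that hold the data.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) key-value pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(splits):
    """Run map over every split, shuffle by key, then reduce each key."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

splits = ["big data big value", "data science"]  # each string plays the role of one split
counts = word_count(splits)
```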

Big Data vs. Data Science
Data science: application of IT capability, domain knowledge, and/or statistics to obtain new business value from (conglomerations of) data
Big data: data challenges involving one or more of the V's
All but the most pedestrian of big data problems use data science. Not all data science problems involve the V's of big data.

What's Really Going On? Let's Unpack This
What are data science techniques?

The Work of Data Science
Data acquisition
  Internal (policy, transfer)
  Purchase
  Stream, e.g., social media, exposed via Application Programming Interface (API)
Data manipulation: extract, transform, load (ETL); aggregation
  Data lake
  Natural language parsing (including sentiment analysis)
Statistics
  Characteristics: ordinal/Likert data, mixed input categories, geospatial data, binary/yes-no results, etc.
  Special regressions (e.g., logistic)
Numerical techniques, including supervised/unsupervised machine learning
  Random forests, clustering, Bayesian analysis, deep-learning neural nets, Monte Carlo simulations
Visualization, sense-making
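As one small example of the natural language parsing item above, sentiment analysis can be sketched with a word lexicon. This is a deliberately naive illustration, not a production method: the lexicon, the scoring rule, and the sample posts are all assumptions made up for the sketch.

```python
import re

# Hypothetical, tiny lexicon for illustration only.
POSITIVE = {"love", "great", "good", "fast"}
NEGATIVE = {"hate", "broken", "bad", "slow"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Return (positive - negative) / total tokens, in [-1, 1]; 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return (positive - negative) / len(tokens)

posts = [
    "Love the new model, great battery",
    "Delivery was slow and the box arrived broken",
]
scores = [sentiment_score(post) for post in posts]
```

A lexicon approach ignores negation, sarcasm, and context, which is one reason the statistical and machine learning techniques listed on the slide are used in practice.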

Data Science Ecosystem (layers, top to bottom)
Product delivery: written report; alerts/dashboards; exposed API; analytic tools and discoverable data
Analytic tools: machine learning; natural language processing; regression; visualization
Data conditioning & aggregation: data lake; ETL; manual munging & wrangling; parsing; tagging
Data sources: owned batch data; owned streaming data (log, sensor, etc.); external/purchased data (batch); external streaming data
Infrastructure

Social Media as Customer Data
Twitter exposes 1% of all tweets on a public, no-charge API
100% of tweets are available as a live stream through a paid service: if you tweet, you are Twitter's product!
Companies use this for real-time, geolocated information about customer (and competitor) behavior

(Figure: comparative word clouds of the ISACA International and ISACA NCAC official Twitter feeds)

What's Really Going On? Let's Unpack This
What are big data/data science products or enterprise delivery options?

Variety of Mechanisms to Deliver Data Science Value
Streaming information, processed at scale in real time? May want to consider real-time alerting for immediate decisions
But need to make sure the decision-making framework and personnel are prepared to capitalize
Other, more traditional options may be just as viable
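A rough sketch of what real-time alerting on a stream might look like: keep a rolling window of recent values and raise a flag when their mean crosses a threshold. The class name, window size, threshold, and the stream of scores are all hypothetical choices for illustration.

```python
from collections import deque

class RollingAlert:
    """Flag when the mean of the last `window` values falls below `threshold`."""

    def __init__(self, window, threshold):
        self.values = deque(maxlen=window)  # old values drop off automatically
        self.threshold = threshold

    def update(self, value):
        """Add one streamed value; return True if an alert should fire."""
        self.values.append(value)
        window_full = len(self.values) == self.values.maxlen
        mean = sum(self.values) / len(self.values)
        return window_full and mean < self.threshold

monitor = RollingAlert(window=3, threshold=-0.25)
stream = [0.1, -0.3, -0.4, -0.5, 0.6]  # e.g., per-post sentiment scores for one region
alerts = [monitor.update(value) for value in stream]
```

The code is the easy part. As the slide notes, an alert only has value if someone is ready to act on it.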

Predictive Analytics
Business/government is moving from using data for retrospective understanding (patterns, sense-making, visualization) to predictive tools for proactive response
Predictive models are built from statistical analysis
Still primarily a future state
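The simplest case of a predictive model built from statistical analysis is a straight-line trend fitted by ordinary least squares and extended one step ahead. The sketch below assumes a made-up weekly sales history; real inventory forecasting would need seasonality, uncertainty estimates, and validation against held-out data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

weeks = [1, 2, 3, 4, 5]
units_sold = [100, 110, 120, 130, 140]  # made-up history, for illustration only
intercept, slope = fit_line(weeks, units_sold)
forecast_week_6 = intercept + slope * 6  # retrospective fit used for a forward-looking estimate
```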

Some Thoughts on Governance/Controls
Warning: big data may mean new data sources, data sharing, and data policies

Data Science and Big Data Can Mean
Unprecedented sharing of data
Unprecedented accumulations of data
Investment to purchase data
Corporate recognition of increased business value in data and in more kinds of data
Direct sale of or exposure of data or direct derivatives
What are your use-case specific best controls?

Data Science and Big Data Can Mean
"I want to lay out three things that the private sector can do today that will protect them from the vast majority of attacks, from the Chinese and elsewhere. One: Patch IT software obsessively. Two: Segment your data. A single breach shouldn't give attackers access to a mother lode of proprietary data. Three: Pay attention to the threat bulletins that DHS and FBI put out. And, if there's a fourth commandment, it's this: Teach folks what spear phishing looks like."
- Director of National Intelligence Clapper at the 2015 International Conference on Cyber Security
What are your use-case specific best controls?

Some Thoughts on Governance/Controls
Warning: uncontrolled data science = development tools in production environment against live data

Doing Data Science
Many commercial and open source data science/big data capabilities are available, focused on log-file analysis, visualization, business analytics, data integration, democratization of analytics, etc.

...it's about playing with the data!
Initially and for evolution/maintenance, data scientists will want to bring flexible analytics to real business data
This is the domain of business value discovery for data science
Tools: Excel; Hadoop (MapReduce, Pig, Hive); MATLAB; Python; R

Some Thoughts on Governance/Controls
Warning: application of ill-defined/broad concepts could lead to inconsistent/non-repeatable results in key business processes

Some Thoughts on Governance/Controls
Warning: analysis of RoI on start-up big data/data science efforts is especially challenging, but needs to be baked in

Questions?

Backup

Governance/Controls Bonus
What's this Internet of Things (IoT)???
Imagine your car navigation, calendar, clock, and coffeemaker having the ability to communicate. You have a high-priority, early-morning meeting. Your navigation system knows that there's a major traffic accident and your commute will be longer than normal. Therefore your clock automatically resets your wakeup alarm earlier and your coffeemaker resets your auto-brew time earlier to get you to your meeting on time!
Now ask: what are the IT security implications of this degree of connectedness?