1 Big Data Little Impact? Raising the bar for "Data Science" Alexander Lang
2 Let s set the record straight Big Data is irrelevant Big Data Analytics is relevant Is the Big in Big Data Analytics reallyrelevant?
3 Astron Uses streaming analytics to deliver insights from the world s largest radio telescope 99% faster identification of relevant data and images, making information available to astronomers in minutes rather than several days Analyzes >1 exabyte of data daily twice the amount generated by global daily Internet traffic Government Integrates data from >3,000 dishes and antennas to form the largest and fastest radio telescope in the world Solution components Software IBM InfoSphere Streams IBM SPSS Modeler The transformation: Streaming analytics analyzes huge volumes of data-in-motion to gain insight from the world s largest Telescope.
4 Vestas Turns climate into capital with Big data Energy & Utilities 97% decrease in response times for wind forecasting information Cuts cost per kilowatt hour from wind energy increasing customer s return on investment 40% reduction in energy consumption, reducing IT footprint while increasing power Solution components Software IBM InfoSphere BigInsights Enterprise Edition Hardware IBM System x idataplex dx360 M3 IBM System Storage DS5300 The transformation: Analyzing petabytes of wind data to pinpoint optimal turbine placement, maximizes power generation and reduces energy costs. In our development strategy, we see growing our library in the range of 18 to 24 petabytes of data. And while it s fairly easy to build that library, we needed to make sure that we could gain knowledge from that data. Lars Christian Christensen, vice president, Vestas Wind Systems
5 Asian Telco reduces billing costs and improves customer satisfaction For some companies, Twitter s data volumes are a thing of the past Real-time mediation and analysis of 5 Billion Call Detail Records per day Data processing time reduced from 12 hrs to 1 min Hardware cost reduced to 1/8 th Proactively address issues (e.g. dropped calls) impacting customer satisfaction IBM Corporation
6 Everything works. We re done. End of Story?
7 Doing Analytics is NOT mainstream yet (Source: Rexer Data Miner Survey 2013,
8 Over 25% of analytic models are never or rarely deployed (Source: Rexer Data Miner Survey 2013)
9 and when they re deployed, they re deployed late (Source: Rexer Data Miner Survey 2013)
10 It s time to change that
11 A newroleemerges: The Data Scientist Integrate data across sources Extract features Visualize data Deal withbig Data StatisticsKnowledge KnowsStatistics/ Mining-Tools (SPSS, SAS, R,...) Understands the domain to ensure that modelsarerelevant and actionable
12 Data Scientist: Job Descriptions(kaggle.com) Qualification: Ph.D. in Statistics, Machine Learning, Computer Science or other quantitative data science field Master s or higher in computer science, artificial intelligence, statistics, or similar Solid understanding of applied statistics, predictive modeling, data mining, machine learning, and other quantitative methodology Tasks: Domain Expertise? Training on the job! The Lead Data Scientist works across the enterpriseto shape Bridgepoint s strategy by analyzing, and interpreting complex data setsand building innovative products or services that utilize big data. Use analytics/statistics to answer business questions; appropriately selecting the relevant analytical technique, creating meaningful data visualizationsand representations, and effectively communicating the data storyand resulting recommendations. Passionfor digging into large data sets and extracting knowledge through analysis and visualization
13 CDOs primary focus is to drive business value from data The Chief Data Officer: The executive for Data Science Data leverage Find ways to use existing data assets Data enrichment Augment data by combining internal and external data Data protection Protect data as an asset Data monetization Find new avenues of earnings and revenue Data upkeep Manage the health of data undergovernance 13 Source: The new hero of big data and analytics: The Chief Data Officer, IBM Institute for Business Value IBM
14 Technology Imperatives Make it easy for everybody to have a meaningful conversation with data Make it easy for Data Scientists to create Big Data Analytics applications
15 Data Scientistsarehardtocomeby howcansoftware help? Automatic feature extraction Visualization advice Automated selection oftheproper statistical approach IBM Watson Analytics Built in semantic knowledgefordata in particular domains Specialized Applications for particular domains: IBM SocialMedia Analytics, IBM PredictiveMaintenance andquality,...
16 End Goal: A meaningful conversation with data
17 Example: Improving responses to a campaign A regional insurance branch office has conducted a promotional campaign offering new insurance products to their existing customers. Customer responses have been recorded together with existing data on customer demographics and their insurance profile information. The branch manager wonders What are the key factors that influence whether customers respond to the campaign? How can I improve my marketing, going forward?
18 The data set: your everyday wide data problem 9000 customers 48 attributes for each customer Some data may be wrong, some data is incomplete
19 Enter: IBM Watson Analytics Don t believe what I say. Try it out! https://watson.analytics.ibmcloud.com/
20 Step 1:: Create a workbook Select the data set Select the variable to explain
21 Step 2: Watson Analytics identifies key influence factors Watson Analytics applies several statistics & data mining models and selects the ones that yield the best explanation Users can select the explanation that works best for them Watson Analytics also surfaces other insights about the data set
22 Step 3-1: Insurance Agent explores insights further 17 of the 49 attributes have an effect on campaign response
23 Step 3-2: Key success factors for campaign response Focus on the high gain rules
24 The Aha moments. The campaign mainly reasonatedwith costomersliving in suburbs, not rural or metropolitan areas and mostly with older customers (Years since License Obtained > 10) and only with customers that don t have many claims filed (0 to 3) Next time: Get more info on what customers in rural areas need (a better deal on SUVs?) Think twice before sending offers out to frequent claimers
25 Step 3-3: Statistical details for those who need it
26 Other insights: example
27 But -wait a minute! Is this reallythe best predictive model that explains campaign responses? Probably not. It s the best model for an insurance branch manager, who has never heard the word predictive before It s waaaybetter than Plan B : gut feel
29 The Data Set: The American Time Use Survey Available at Data on Americans, collected over 10 years, on how much time they spend in categories such as Sleeping Housework Food & Drink Prep Caring for Children Playing with Children Shopping Eating and Drinking Socializing & Relaxing Television
30 Start with a question.
31 Watson Analytics selects the best suited data set Watson Analytics presents a chart that may help to answer my question
32 Dynamically computed insights, based on the current selection
34 Visualization Coach, suggesting a better visualization type
36 Built-in data semantics: salary ~ weekly earnings
37 Not really useful Insights guide me to the question I really wanted to ask
38 Grad School still pays off Yay!
39 IBM Social Media Analytics (SMA) Analytics Application for Marketing, Sales, Brand Management, Data Scientists AnalyzesSocialMedia Content (twitter, facebook, blogs, messageboards, reviews, video comments, news) for share of voice, sentiment, customer demographics, behavior, evolving topics in multiple languages Both historical analysis as well as continuous updates(every 20 minutes) Configuration Analysis
40 Inside IBM Social Media Analytics Welcome Page Configuration Reporting Project Project Project Project Project Blogs Message boards Twitter News Facebook Hadoop Analytics Platform Evolving Topics Text Analytics IBM BigInsights 2.1 Influencer Identification DB2 ESE Cognos BI Reviews Video 40 E2E Orchestration IBM Social Media Analytics
41 Hadoop/ Warehouse / Big Data? All ourusersseeis: Application Users don t care about the processing platform (but Data Scientists, Analysts do...)
42 IBM SocialMedia Analytics: Identifying author demographics Gender Identified through cues from the author s first name, the author s nickname and the author content Isauthormarriedora parent Identified in author content through trigger terms and text analysis rules 42
43 IBM SocialMedia Analytics: Identifying author behavior Users ofa certainproductorservice Authors mentioning my X and other expressions Whatproductfeaturesarerelevant forthem? Author name Recommenders/ Detractors youshouldusex / stayawayfromx Prospective users Potential sales leads for 1:1 engagement 43
44 Big Data Little Impact? The key Big Data impact: organizations finally realize that their data has untapped potential Big Data can mean: many small data problems in parallel Data size does NOT matter it s the actionabilityof the analysis that counts Big data will have a big impact only if we can make data analytics become mainstream
45 Backup the second piece of the technology equation Make it easy for everybody to have a meaningful conversation with data Make it easy for Data Scientists to create Big Data Analytics applications
46 IBM BlueMix - Sign Up at https://www.ng.bluemix.net Open-standards, cloud-based platform for building, managing, and running apps of all types including analytic apps
47 IBM s big data platform All Data New/Enhanced Applications Operational data zone Real-time Data Processing & Analytics Landing, Exploration and Archive data zone Deep Analytics data zone EDW and data mart zone What action should I take? Decision management What is happening? Discovery and exploration Cognitive Fabric What could happen? Predictive analytics and modeling Why did it happen? Reporting and analysis Information Integration & Governance Systems Security Storage On premise, Cloud, As a service
48 InfoSphere Streams Bringing the Data to the Analytics Continuous ingestion Continuous analysis Filter / Sample Infrastructure provides services for Scheduling analytics across hardware hosts, Establishing streaming connectivity Transform Annotate Correlate Classify Achieve scale: By partitioning applications into software components By distributing across stream-connected hardware hosts 48 Where appropriate: Elements can be fused together for lower communication latency 2013 IBM Corporation
49 InfoSphere Streams Studio Eclipse-based tool that enables developers to create stream applications IBM Corporation
50 InfoSphere Streams Runtime Meters Company Filter Usage Model Temp Action Optimizing scheduler assigns jobs to hosts, and continually manages resource allocation Usage Contract Text Extract Season Adjust Daily Adjust Commodity hardware laptop, blades or high performance clusters x86 host x86 host x86 host x86 host IBM Corporation
51 InfoSphere Streams Runtime Optimizing scheduler assigns PEs to hosts, and continually manages resource allocation Commodity hardware laptop, blades or high performance clusters Dynamically add hosts and jobs New jobs work with existing jobs Meters Meters Company Filter Usage Model Temp Action Usage Contract Text Extract Season Adjust Daily Adjust Text Extract Degree History Compare History Store History x86 host x86 host x86 host x86 host x86 host IBM Corporation
52 University of Ontario Institute of Technology (UOIT) Detects Neonatal Patient Symptoms Sooner Performing real-time analytics using physiological data from neonatal babies Continuously correlates data from medical monitors to detect subtle changes and alert hospital staff sooner Early warning gives caregivers the ability to proactively deal with complications Significant benefits: Helps detect life threatening conditions up to 24 hours sooner Helps detect life threatening conditions up to 24 hours sooner Lower morbidity and improved patient care 2013 IBM Corporation
53 IBM s view of a big data platform Deep Analytics Zone All Data New/Enhanced Applications Operational data zone Real-time Data Processing & Analytics Landing, Exploration and Archive data zone Deep Analytics data zone EDW and data mart zone What action should I take? Decision management What is happening? Discovery and exploration Cognitive Fabric What could happen? Predictive analytics and modeling Why did it happen? Reporting and analysis Information Integration & Governance Systems Security Storage On premise, Cloud, As a service
54 IBM InfoSphereBigInsights: 100% Open Source Hadoop + Value-Adds Scalable New nodes can be added on the fly Affordable Massively parallel computing on commodity servers Flexible Hadoop is schema-less, and can absorb any type of data Fault Tolerant Through MapReduce software framework Performance & Reliability Compression, Flexible Scheduler Enterprise Hardening of Hadoop Adaptive MapReduce GPFS Deep Analytics Text Analytics Big R Big SQL Enterprise Integration Connectors to other Enterprise SW 54
55 IBM InfoSphere BigInsights Big R Explore, visualize, transform, and model big data using familiar R syntax and paradigm R Clients Pull data summaries to R client Scale out R Partitioningof large data ( divide ) Data sources R Packages 1 Parallelcluster execution of pushed down R code ( conquer ) All of this from within the R environment (Map/Reduce is hidden from you) Almost any R package can run in this environment R Packages 2 Embedded R Execution Or, push R functions right on the data
56 Big R Machine Learning API Big R functions Inspired by R s Algorithm bigr.lm() lm() Linear regression bigr.glm() glm() Generalized Linear Models bigr.logistic.regression() glm() Multi-class Logistic Regression bigr.kmeans() kmeans() K-means clustering bigr.naive.bayes() naivebayes() Naïve Bayes classifier bigr.svm() svm() Support Vector Machine classifier bigr.univariatestats() - Central tendency, dispersion, skewness, kurtosis bigr.bivariatestats() - Pearson s correlation, F-test. bigr.sample() sample() Uniform sample by percentage, exact number of samples, or partitioned sampling. bigr.transform() - Recoding, dummy-coding, binning, scaling, and null value imputation * Machine learning components based on System ML accessible as Big R function in current beta release
57 Don t believe me, try it out for yourself! BigInsights on IBM BlueMix Aimed at developers - No cluster management expertise required IBM Analytics for Hadoop in minutes on the Cloud
For Big Data Analytics There s No Such Thing as Too Big The Compelling Economics and Technology of Big Data Computing March 2012 By: 4syth.com Emerging big data thought leaders Forsyth Communications 2012.
1 Contents Introduction. 1 View Point Phil Shelley, CTO, Sears Holdings Making it Real Industry Use Cases Retail Extreme Personalization. 6 Airlines Smart Pricing. 9 Auto Warranty and Insurance Efficiency.
what can businesses do to capture the full potential of big data? helping companies observe and assess their data sets, identify potential revenues and mitigate challenges contents introduction 3 identify
An Oracle White Paper March 2013 Big Data Analytics Advanced Analytics in Oracle Database Advanced Analytics in Oracle Database Disclaimer The following is intended to outline our general product direction.
How to embrace Big Data A methodology to look at the new technology Contents 2 Big Data in a nutshell 3 Big data in Italy 3 Data volume is not an issue 4 Italian firms embrace Big Data 4 Big Data strategies
Customer Cloud Architecture for Big Data and Analytics Executive Overview Using analytics reveals patterns, trends and associations in data that help an organization understand the behavior of the people
Big Data & Analytics for Semiconductor Manufacturing 半 導 体 生 産 におけるビッグデータ 活 用 Ryuichiro Hattori 服 部 隆 一 郎 Intelligent SCM and MFG solution Leader Global CoC (Center of Competence) Electronics team General
Big Data in Big Companies Date: May 2013 Authored by: Thomas H. Davenport Jill Dyché Copyright Thomas H. Davenport and SAS Institute Inc. All Rights Reserved. Used with permission Introduction Big data
WHITEPAPER Big data in banking for marketers How to derive value from big data B NK 2020 PAGE 1 INNOVATION LAB INNOVATION LAB FOREWORD In Marketing & Sales the main strategic goals are to acquire new customers,
DATAMEER USE CASES EBOOK Top Five High-Impact Use Cases for Big Data Analytics You ve been collecting data for years. Learn how to use it to grow your business and gain a competitive edge. INTRODUCTION
IBM Software Thought Leadership White Paper June 2013 The top five ways to get started with big data 2 The top five ways to get started with big data Big data: A high-stakes opportunity Remember what life
Front cover Building Big Data and Analytics Solutions in the Cloud Characteristics of big data and key technical challenges in taking advantage of it Impact of big data on cloud computing and implications
NESSI White Paper, December 2012 Big Data A New World of Opportunities Contents 1. Executive Summary... 3 2. Introduction... 4 2.1. Political context... 4 2.2. Research and Big Data... 5 2.3. Purpose of
Hurwitz ViCtOrY index Advanced Analytics: The Hurwitz Victory Index Report SAP Hurwitz Index d o u b l e v i c t o r Marcia Kaufman COO and Principal Analyst Daniel Kirsch Senior Analyst Table of Contents
SAP BusinessObjects Business Intelligence SAP BusinessObjects Business Intelligence 4.0 Solutions Empowering the Real-Time, Mobile, Social, and Global Enterprise SAP BusinessObjects Business Intelligence
IBM Software Business Analytics Big Data Business Analytics for Big Data Unlock value to fuel performance 2 Business Analytics for Big Data Contents 2 Introduction 3 Extracting insights from big data 4
TABLE OF CONTENTS Introduction... 3 The Importance of Triplestores... 4 Why Triplestores... 5 The Top 8 Things You Should Know When Considering a Triplestore... 9 Inferencing... 9 Integration with Text
INTELLIGENT BUSINESS STRATEGIES W H I T E P A P E R Architecting A Big Data Platform for Analytics By Mike Ferguson Intelligent Business Strategies October 2012 Prepared for: Table of Contents Introduction...
IBM Industries White paper Business analytics in the cloud Driving business innovation through cloud computing and analytics solutions 2 Business analytics in the cloud Contents 2 Abstract 3 The case for
Research Memorandum 94 July 2014 Creating business value from Big Data and business analytics: organizational, managerial and human resource implications Hull University Business School Prof Richard Vidgen
Global Headquarters: 5 Speen Street Framingham, MA 01701 USA P.508.872.8200 F.508.935.4015 www.idc.com W H I T E P A P E R B i g D a t a : W h a t I t I s a n d W h y Y o u S h o u l d C a r e Sponsored
TABLE OF CONTENTS 4 WHAT IS PREDICTIVE ANALYTICS? 6 WHO USES PREDICTIVE ANALYTICS AND HOW? 7 HOW DOES PREDICTIVE ANALYTICS HELP AUTOMATE DECISIONS? 8 HOW DOES PREDICTIVE ANALYTICS DIFFER FROM DATA MINING
TDWI research First Quarter 2014 BEST PRACTICES REPORT Predictive Analytics for Business Advantage By Fern Halper Co-sponsored by: tdwi.org TDWI research BEST PRACTICES REPORT First Quarter 2014 Predictive
Text Analytics: The Victory Index Report SAS VICTORY Index d o u b l e v i c t o r Fern Halper, Ph.D Partner and Principal Analyst Marcia Kaufman COO and Principal Analyst Daniel Kirsh Senior Analyst Table
OPEN DATA CENTER ALLIANCE : sm Big Data Consumer Guide SM Table of Contents Legal Notice...3 Executive Summary...4 Introduction...5 Objective...5 Big Data 101...5 Defining Big Data...5 Big Data Evolution...7