Industry Perspective: Big Data and Big Data Analytics. David Barnes, Program Director, Emerging Internet Technologies, IBM Software Group
What is Big Data?
The Adjacent Possible
Inexpensive disk + Increased processing power + Data Warehouse + The Web + X = Big Data, where X = sensors used to gather climate information, posts to social media sites, digital pictures and videos, transaction records, cell phone GPS signals, and more.
161 exabytes of data were created in 2006, 3 million times the amount of information contained in all the books ever written. In 2010 the number reached 988 exabytes. IDC estimates that 1.8 zettabytes were created and replicated in 2011. © 2010 IBM Corporation
Every day, people create the equivalent of 2.5 quintillion bytes of data from sensors, mobile devices, online transactions, and social networks. Every month people send one billion Tweets and post 30 billion messages on Facebook. 90% (or more) of the world's data is unstructured.
The true nature of information
Unstructured Data: is noisy; is oftentimes dirty; is often full of valuable information.
The Big Data Imperative
Big Data has swept into every industry and business function. Businesses need to put the power of Big Data analytics in the hands of their business employees. "Data Scientist" is somewhat misleading.
"Leaders in every sector will have to grapple with the implications of big data, not just a few data-oriented managers." (McKinsey Global Institute)
Big Data Business Patterns: Computational Journalism; Chief Legal Officer; Retail Business Planner; IT Systems Management; Pharma - Clinical Trials; Business Fraud Detection; Evidence Based Medicine; Web Archiving; ...
Today's Problem
Data is growing at a compound annual rate of 60% per year.
Storage capacity continues to increase dramatically, but storage access speeds have not kept up.
At a transfer speed of 500 MB/sec, 1 terabyte of data requires ~30 minutes to read from a single drive.
Enter Map/Reduce
Automates the mechanisms of large-scale distributed computation (i.e., work distribution, load balancing, replication, failure/recovery).
Divide and conquer: 1 terabyte split among 100 drives requires ~20 seconds to read.
The M/R parallel processing model provides a cost-effective framework for a new generation of analytic applications on unstructured or semi-structured data.
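The read-time figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and that the drives read in perfect parallel with no coordination overhead; the function name is illustrative only.

```python
# Back-of-the-envelope read times for the figures quoted on the slide.
# Assumptions: decimal units and ideal parallelism across drives.

TERABYTE_MB = 1_000_000      # 1 TB expressed in MB (decimal units)
TRANSFER_MB_PER_SEC = 500    # per-drive transfer speed quoted on the slide


def read_time_seconds(total_mb: float, drives: int = 1) -> float:
    """Time to read total_mb when the data is split evenly across drives."""
    per_drive_mb = total_mb / drives
    return per_drive_mb / TRANSFER_MB_PER_SEC


single = read_time_seconds(TERABYTE_MB)          # one drive
parallel = read_time_seconds(TERABYTE_MB, 100)   # 100 drives in parallel

print(f"1 drive:    {single:.0f} s (about {single / 60:.0f} min)")
print(f"100 drives: {parallel:.0f} s")
```

Under these assumptions a single drive needs 2,000 seconds (roughly half an hour) and 100 drives need 20 seconds, which matches the slide's rounded figures.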
Requirement: A New Class of Big Data Applications. Big Data analytics must be brought to the line-of-business user: leverage easy-to-use manipulation metaphors; use natural language technologies for analytics; provide rich visualizations to quickly identify insights.
Buyer Sentiment Analysis Demo
Social Media: Chilean Earthquake 2010. The 2010 Chilean earthquake was the fifth largest earthquake in recorded history. The affected areas suffered major devastation: buildings, airports, hospitals, prisons, bridges, and roads were severely damaged. Land-based communications systems suffered major outages. The wireless 3G infrastructure remained intact and operational. Sharenomics - Rise of Social Economy
Social Media: Chilean Earthquake 2010. Social networking on wireless networks was a major form of communication. Extreme Blue students collected 226 million Tweets, then analyzed and categorized them by incidence type and location. Tweets included questions such as "Can I get food?", "Can I get gas?", and "Are the bridges down?", as well as images. The results were visualized. Completed in ~12 weeks.
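Categorizing Tweets by incidence type and location fits the map/reduce pattern naturally. The sketch below is not the students' actual pipeline, which the slides do not describe. The keyword lists, category names, and sample Tweets are all made up for illustration, and naive substring matching stands in for real text analysis.

```python
from collections import Counter
from functools import reduce

# Hypothetical keyword lists; the real categories are not given in the slides.
CATEGORY_KEYWORDS = {
    "food": ("food", "water"),
    "fuel": ("gas", "fuel"),
    "infrastructure": ("bridge", "road"),
}


def map_tweet(tweet: dict) -> Counter:
    """Map step: emit a count of 1 per (category, location) the tweet matches."""
    text = tweet["text"].lower()
    counts = Counter()
    for category, keywords in CATEGORY_KEYWORDS.items():
        # Naive substring match; it would also hit words like "broad".
        if any(word in text for word in keywords):
            counts[(category, tweet["location"])] += 1
    return counts


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial count tables."""
    return left + right


# Invented sample data, for illustration only.
tweets = [
    {"text": "Can I get food?", "location": "Concepcion"},
    {"text": "Can I get gas?", "location": "Santiago"},
    {"text": "Are the bridges down?", "location": "Concepcion"},
    {"text": "Where can I find food and water?", "location": "Concepcion"},
]

totals = reduce(reduce_counts, map(map_tweet, tweets), Counter())
for (category, location), n in sorted(totals.items()):
    print(category, location, n)
```

Because each Tweet is mapped independently and the merge is associative, the same two functions could in principle run over partitions of a much larger collection on separate machines.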
Big Data = Volume, Variety and Velocity. Volume - Scale from terabytes to zettabytes. Variety - Relational and non-relational data types from an ever-expanding variety of sources. Velocity - Streaming data and large volume data movement.
The Supercomputer is based on over 1,200 high powered IBM System X servers and can perform 150 trillion calculations per second -- equivalent to 30 million calculations per Danish citizen per second. Vestas expects its data sets will grow to 20-plus petabytes over the next four years.
Seton Healthcare Family: Reducing CHF readmission to improve care
IBM Content and Predictive Analytics for Healthcare uses the same type of natural language processing as IBM Watson, enabling us to leverage information in new ways not possible before. We can access an integrated view of relevant clinical and operational information to drive more informed decision making and optimize patient and operational outcomes.
Business Challenge: Seton Healthcare strives to reduce the occurrence of high cost Congestive Heart Failure (CHF) readmissions by proactively identifying patients likely to be readmitted on an emergent basis.
What's Smart? The IBM Content and Predictive Analytics for Healthcare solution will help to better target and understand high-risk CHF patients for care management programs by:
- Utilizing natural language processing to extract key elements from unstructured History and Physical, Discharge Summaries, Echocardiogram Reports, and Consult Notes
- Leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data
- Providing an interface through which providers can intuitively navigate, interpret and take action
Smarter Business Outcomes: Seton will be able to proactively target care management and reduce readmission of CHF patients. Teaming unstructured content with predictive analytics, Seton will be able to identify patients likely for readmission and introduce early interventions to reduce cost and mortality.
IBM solution: IBM Content and Predictive Analytics for Healthcare; IBM Cognos Business Intelligence; IBM BAO solution services
© 2011 IBM Corporation
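To give a feel for what extracting key elements from unstructured clinical notes involves, here is a deliberately simple sketch that uses regular expressions. It is not the IBM product's method, which relies on full natural language processing. The patterns, function names, and sample note are hypothetical, and real notes contain far more phrasing variants than these patterns cover.

```python
import re
from typing import Optional

# Hypothetical patterns for two predictors: ejection fraction and smoking.
LVEF_PATTERN = re.compile(
    r"(?:LVEF|ejection fraction)\D{0,20}?(\d{1,2})\s*%", re.IGNORECASE
)
NON_SMOKER_PATTERN = re.compile(
    r"\b(non-?smoker|never smoked|denies (?:smoking|tobacco))\b", re.IGNORECASE
)
SMOKER_PATTERN = re.compile(r"\b(smoker|smokes|tobacco use)\b", re.IGNORECASE)


def extract_lvef(note: str) -> Optional[int]:
    """Return the ejection fraction percentage if the note states one."""
    match = LVEF_PATTERN.search(note)
    return int(match.group(1)) if match else None


def extract_smoking(note: str) -> Optional[bool]:
    """Return True/False for smoking status, or None if it is not mentioned."""
    # Check negations first, because "non-smoker" also contains "smoker".
    if NON_SMOKER_PATTERN.search(note):
        return False
    if SMOKER_PATTERN.search(note):
        return True
    return None


# Invented sample note, for illustration only.
note = "Patient is a smoker and lives alone. Echo shows LVEF of 35%."
print(extract_lvef(note), extract_smoking(note))
```

Even this toy version shows why negation and phrasing variety make free-text extraction harder than reading a structured field.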
IBM Content and Predictive Analytics for Healthcare: The Seton CHF Readmission Solution
Raw Information: Unstructured Data (Cerner Clinical Documentation: History and Physical, Discharge Summary, Echocardiogram) and Structured Data (Avega Cost Data, DSS Admission History, DSS Procedure History, Cerner Clinical Events)
IBM Content and Predictive Analytics:
- Content Analytics: Natural Language Processing; Medical Fact and Relationship Extraction (Annotation); Trend, Pattern, Anomaly, Deviation Analysis
- Predictive Analytics: Predictive Scoring and Probability Analysis
- Health Integration Framework: Data Warehouse and Model; Master Data Management; Advanced Case Management
IBM Watson for Healthcare: Confirm hypotheses or seek alternative ideas with confidence based responses from learned knowledge*
Analyzed and Visualized Information, via Dynamic Multimode Interaction: Search and Visually Explore (Mine); Monitor, Dashboard and Report (Cognos BI); Question and Answer*
Custom Solutions; Partners (HLI); Specialized Research; Business Analytics
Callouts: utilizing natural language processing to extract key elements from unstructured History and Physical and Discharge Summary; leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data; providing an interface through which providers can intuitively navigate, interpret and take action.
What Really Causes Readmissions at Seton: Key Findings
The Data We Thought Would Be Useful Wasn't: 113 candidate predictors from structured and unstructured data sources. Structured data was less reliable than unstructured data, which increased the reliance on unstructured data.
New Unexpected Indicators Emerged
Highly Predictive Model: 18 accurate indicators or predictors (see next slide); 49% at 20th percentile, 97% at 80th percentile

Predictor Analysis
Predictor                  % Encounters, Structured Data   % Encounters, Unstructured Data
Ejection Fraction (LVEF)   2%                              74%
Smoking Indicator          35% (65% accurate)              81% (95% accurate)
Living Arrangements        <1%                             73% (100% accurate)
Drug and Alcohol Abuse     16%                             81%
Assisted Living            0%                              13%
Visualizing the Results: Readmissions Dashboard
The Cognos dashboard reporting system can help in monitoring the key clinical, operational and financial metrics and, more importantly, in tracking down the top-priority cases for case management.
1. Clinical Statistics: admission count, readmission count and readmission rate
2. Operational Statistic: counts of different length of stay periods
3. Financial Statistic: total direct cost by total admission and by readmission
4. Mortality: mortality rate
5. Average length of stay
6. Average direct cost by total admission and by readmission only
7. PA Model Score: distribution of propensity of readmission
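Several of the dashboard metrics listed above are simple aggregates over encounter records. The sketch below shows one way they could be computed. The `Encounter` fields, the `dashboard_metrics` function, and the sample values are assumptions made for illustration and do not reflect the actual Cognos data model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Encounter:
    """Hypothetical encounter record; field names are illustrative."""
    patient_id: str
    length_of_stay_days: int
    direct_cost: float
    is_readmission: bool


def dashboard_metrics(encounters: List[Encounter]) -> Dict[str, float]:
    """Compute counts, readmission rate, and averages for the dashboard."""
    if not encounters:
        raise ValueError("at least one encounter is required")
    readmissions = [e for e in encounters if e.is_readmission]
    return {
        "admission_count": len(encounters),
        "readmission_count": len(readmissions),
        "readmission_rate": len(readmissions) / len(encounters),
        "average_length_of_stay": mean(e.length_of_stay_days for e in encounters),
        "average_direct_cost": mean(e.direct_cost for e in encounters),
        # Average cost over readmissions only; 0.0 when there are none.
        "average_readmission_cost": (
            mean(e.direct_cost for e in readmissions) if readmissions else 0.0
        ),
    }


# Invented sample data, for illustration only.
sample = [
    Encounter("p1", 3, 1000.0, False),
    Encounter("p2", 5, 2000.0, True),
    Encounter("p3", 4, 1500.0, False),
    Encounter("p2", 8, 3500.0, True),
]
metrics = dashboard_metrics(sample)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Mortality rate and the predictive model score distribution are left out because they need inputs (outcomes and model scores) that this toy record does not carry.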
USC Annenberg School of Communications
InfoSphere Streams
Big Data Platform Vision: Bringing Big Data to the Enterprise
Big Data Solutions: Client and Partner Solutions
Big Data User Environments: Developers, End Users, Administrators
Big Data Enterprise Engines: Streaming Analytics, Internet Scale Analytics
Open Source Foundational Components: Hadoop, MapReduce, HDFS, HBase, Pig, Lucene, Jaql
Agents and Integration: Data Warehouse (InfoSphere Warehouse); Warehouse Appliances (Netezza); Master Data Mgmt (InfoSphere MDM); Database (DB2); Analytics (SPSS); Business Intelligence (Cognos); Marketing (Unica)