Big Data Analytics. Harvard-Smithsonian Center for Astrophysics, Data Science Training for Librarians, April 4, 2013. David Dietrich, EMC Education Services
"I'll go into a company and say, 'What data problems can we solve?' We get blank looks," he says. When he asks, instead, what things can help a company lose money and make money, usually two out of three are problems that data can solve. Anthony Goldbloom, CEO of Kaggle
Agenda: In other words; Level setting on Big Data; Emerging Need for Advanced Analytics; Tools, Technology, & Skill Development
Using Social Graphs to Map the Spread of Innovation Ideas
Examples of Big Data Analytics
Big Data Analytics: Industry Examples
  Health Care: Reduce time needed to detect pandemics, provide vaccines where they are needed most
  Telecommunications: Improve customer churn prediction with social media data
  Financial Services: Accelerate lending decisions using big data sources
  Industry Verticals: Financial, Medical, Government, Internet, Phone/TV, Retail
Big Data Improves Healthcare: Traditional Approach to Distributing Vaccines
  Traditional Approach: Vaccine distribution is usually based on regional population or first-come, first-served. Agencies wait for the reports from hospitals and agencies before distributing vaccines.
  Challenges with Traditional Approach: Distribution methods do not focus on patients most in need of vaccines. Waiting for the reports can take 3-6 months. During the wait, the pandemic may get out of control.
Big Data Improves Healthcare: New Approach to Distributing Vaccines
  Health agencies can now use social networks, such as Twitter, to detect pandemic trends in near real-time:
  1. Search tweets with certain keywords such as flu, vaccine and immunization to find potential patients.
  2. Look through these patients' social networks to identify their infection patterns.
  3. Make maps of people tweeting to find out pandemic trends on a global or local scale.
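The keyword search and mapping steps above can be sketched roughly as follows. This is a minimal illustration, not how any health agency actually does it: the keywords come from the slide, but the sample tweets, the `location` field, and the function names are made up for the example.

```python
import re
from collections import Counter

# Keywords taken from the slide; everything else here is hypothetical.
KEYWORDS = ("flu", "vaccine", "immunization")
PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)


def matching_tweets(tweets):
    """Return the tweets whose text mentions any keyword as a whole word."""
    return [t for t in tweets if PATTERN.search(t["text"])]


def mentions_by_location(tweets):
    """Count keyword-matching tweets per location, a crude input for a map."""
    return Counter(t["location"] for t in matching_tweets(tweets))


# Invented sample data standing in for a real tweet stream.
sample = [
    {"text": "Home sick with the flu again", "location": "Boston"},
    {"text": "Got my vaccine today", "location": "Boston"},
    {"text": "Great game last night", "location": "Chicago"},
    {"text": "Influenza season is here", "location": "Denver"},
]

counts = mentions_by_location(sample)
print(dict(counts))
```

Whole-word matching is a deliberate simplification: it misses "Influenza" in the last tweet, which hints at why real keyword lists need more care.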
Churn Analysis for Mobile Telco
  Definitions: Churn is the term used to describe customer attrition or loss. Churn rate is the number of participants who discontinue their use of a service divided by the average number of total participants during a period.
  Reasons to churn: easy to switch provider; inadequate services; quality of service; plenty of attractive offers; customer dissatisfaction; difficult to manage the customer data.
  Can we predict churn? If so, how?
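The churn rate definition above translates directly into a small calculation. The sketch below assumes the "average number of total participants" can be approximated by the mean of the customer counts at the start and end of the period; the slide does not say how the average is taken, and the numbers are invented.

```python
def churn_rate(customers_lost, customers_at_start, customers_at_end):
    """Churn rate = customers lost / average customer count over the period.

    Assumption: the average is the mean of the start and end counts.
    """
    average_customers = (customers_at_start + customers_at_end) / 2
    if average_customers <= 0:
        raise ValueError("average customer count must be positive")
    return customers_lost / average_customers


# Hypothetical period: 50 customers lost against a steady base of 1,000.
rate = churn_rate(customers_lost=50, customers_at_start=1000, customers_at_end=1000)
print(f"{rate:.1%}")  # prints 5.0%
```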
Churn Analysis for Mobile Telco
  Synopsis: A mobile telco company was losing customers and wanted to understand why.
  Business challenge: Proactively detect mobile phone customers at risk of canceling contracts (customer churn) to retain customers and protect revenue.
  Traditional Approach to Churn Analysis: look at spending patterns; review recurrent problems.
  Approach with Big Data: analyze call history data; treat call history as a social network.
  Figure: Cell phone history portrayed as a social network.
Example of Cell Phone Cancellation Outbreak: Month 1
Example of Cell Phone Cancellation Outbreak: Month 2
Example of Cell Phone Cancellation Outbreak: Month 3
Example of Cell Phone Cancellation Outbreak: Month 4
Using Social Network Analysis to Improve Churn Prediction
  High-risk cell phone churners can now be identified in 1 hour, saving $40 MM in the first year.
  If we had known two customers' calling networks, could we have prevented five more from leaving?
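Treating call history as a social network can be sketched in a few lines. The example below is illustrative only and is not the telco's actual model: the customer names, call records, and the 0.5 risk threshold are all made up, and "share of contacts who have already churned" is just one simple signal assumed here.

```python
from collections import defaultdict


def build_call_graph(call_records):
    """Build an undirected graph mapping each customer to the set of their contacts."""
    graph = defaultdict(set)
    for caller, callee in call_records:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def churn_exposure(graph, churned):
    """For each active customer, the fraction of their contacts who already churned."""
    exposure = {}
    for customer, contacts in graph.items():
        if customer in churned or not contacts:
            continue
        exposure[customer] = len(contacts & churned) / len(contacts)
    return exposure


# Invented call records: (caller, callee) pairs.
calls = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan"), ("dan", "eve")]
churned = {"ann", "bob"}

graph = build_call_graph(calls)
exposure = churn_exposure(graph, churned)

# Hypothetical rule: flag customers with at least half their contacts gone.
at_risk = sorted(c for c, share in exposure.items() if share >= 0.5)
print(at_risk)  # prints ['cat']
```

Here "cat" is flagged because two of her three contacts have cancelled, which mirrors the month-by-month outbreak pattern shown in the preceding slides.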
Underwriting Risk: Traditional Approach to Loan Processing
  Figure labels: Traditional Underwriting Risk Level; Big Data Enabled Underwriting Risk Level; Traditional Data Leveraged; Big Data Leveraged
Underwriting Risk: Big Data Enabled Loan Processing, Streamlined Process
  Figure labels: Traditional Underwriting Risk Level; Big Data Enabled Underwriting Risk Level; Traditional Data Leveraged; Big Data Leveraged
Underwriting Risk: Big Data Enabled Loan Processing, Shorter Time to Decision
  Figure labels: Traditional Underwriting Risk Level; Big Data Enabled Underwriting Risk Level; Traditional Data Leveraged; Big Data Leveraged
  Chart: average time to close a home loan application (stages: pre-approval, underwriting, closing), comparing today with big data enabled; values shown: 2-3 weeks, 3-4 weeks; ~30% improvement
Big Data
Big Data
  Key Characteristics: Large Volumes; New Sources; Low Latencies
  Implications for the Enterprise: New Platforms; New Roles; New Techniques
What's Driving the Data Deluge?
  Mobile Sensors
  Video Surveillance
  Social Media
  Oil Exploration: oil rigs generate 25,000 data points per second
  Smart Grids: reading smart meters every 15 minutes is 3,000x more data intensive
  Medical Imaging
  Video Rendering
  Gene Sequencing: cost to sequence one genome has fallen from $100M in 2001 to $4K in 2013
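The smart meter figure above can be sanity-checked with a little arithmetic. The sketch below assumes the baseline is one manual meter reading per month and that a month has 30 days; neither assumption is stated on the slide.

```python
MINUTES_PER_DAY = 24 * 60
READ_INTERVAL_MINUTES = 15  # from the slide
DAYS_PER_MONTH = 30         # assumption: a 30-day billing month

# Readings produced by one meter when it is read every 15 minutes.
readings_per_day = MINUTES_PER_DAY // READ_INTERVAL_MINUTES
readings_per_month = readings_per_day * DAYS_PER_MONTH

print(readings_per_day, readings_per_month)  # prints 96 2880
```

Under those assumptions a 15-minute interval yields 2,880 readings per month in place of one, which is on the order of the 3,000x quoted.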
What is Big Data? Big data: datasets so large they break traditional IT infrastructures.
Four Main Types of Data Structures
  Structured Data
  Semi-Structured Data (example: View Source)
  Quasi-Structured Data (example: http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651)
  Unstructured Data (example: The Red Wheelbarrow, by William Carlos Williams)
Opportunities for a New Approach to Analytics: Big Data Ecosystem
  1. Data Devices
  2. Data Collectors
  3. Data Aggregators
  4. Data Users/Buyers
  Diagram labels: Individual, Law Enforcement, Analytic Services, Government, Medical, Information Brokers, Internet, Advertising, Marketers, Websites, Employers, Media, Phone/TV, Retail, Catalog Co-Ops, Media Archives, Banks, Credit Bureaus, Financial, Government, List Brokers, Delivery Service, Private Investigators/Lawyers
Industries Are Broadly Embracing Data Science
  Retail: CRM Customer Scoring; Store Siting and Layout; Fraud Detection / Prevention; Supply Chain Optimization
  Advertising & Public Relations: Demand Signaling; Ad Targeting; Sentiment Analysis; Customer Acquisition
  Financial Services: Algorithmic Trading; Risk Analysis; Fraud Detection; Portfolio Analysis
  Media & Telecommunications: Network Optimization; Customer Scoring; Churn Prevention; Fraud Prevention
  Manufacturing: Product Research; Engineering Analytics; Process & Quality Analysis; Distribution Optimization
  Energy: Smart Grid; Exploration
  Government: Market Governance; Counter-Terrorism; Econometrics; Health Informatics
  Healthcare & Life Sciences: Pharmaco-Genomics; Bio-Informatics; Pharmaceutical Research; Clinical Outcomes Research
Emerging Need for Advanced Analytics
Business Drivers for Advanced Analytics
  Current Business Problems Provide Opportunities for Organizations to Become More Analytical & Data Driven
  1. Driver: Desire to optimize business operations. Examples: sales, pricing, profitability, efficiency
  2. Driver: Desire to identify business risk. Examples: customer churn, fraud, default
  3. Driver: Predict new business opportunities. Examples: upsell, cross-sell, best new customer prospects
  4. Driver: Comply with laws or regulatory requirements. Examples: Anti-Money Laundering, Fair Lending, Basel II
Big Data Requires New Approaches to Analytics: Business Intelligence versus Data Science
  Data Science (Predictive Analytics and Data Mining)
    Typical Techniques and Data Types: Optimization, predictive modeling, forecasting, statistical analysis. Structured/unstructured data, many types of sources, very large data sets.
    Common Questions: What if..? What's the optimal scenario for our business? What will happen next? What if these trends continue? Why is this happening?
  Business Intelligence
    Typical Techniques and Data Types: Standard and ad hoc reporting, dashboards, alerts, queries, details on demand. Structured data, traditional sources, manageable data sets.
    Common Questions: What happened last quarter? How many did we sell? Where is the problem? In which situations?
  Chart axes: Analytical Approach (Explanatory to Exploratory) versus Time (Past to Future)
Tools, Technology, & Skill Development
Data Science is a Team Sport: Key Roles for a Successful Analytic Project
  Business User; Project Sponsor; Project Manager; Business Intelligence Analyst; Database Administrator (DBA); Data Engineer; Data Scientist
Data Analytics Lifecycle: 1. Discovery; 2. Data Prep; 3. Model Planning; 4. Model Building; 5. Communicate Results; 6. Operationalize
Leveraging Data Science Throughout the Organization
  Sales: Identify associations between items frequently purchased together
  Marketing: Clustering analysis to group similar customers together
  Finance: Apply regression analysis to predict starting salaries
  Human Resources: Use decision trees to predict employee turnover
  R & D: Text analysis of log files for service and security analysis
  Customer Support: Classify support requests for intelligent routing
  Manufacturing: Run simulations to optimize complex process flows
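As a small illustration of the Sales example (associations between items frequently purchased together), the sketch below counts how often pairs of items appear in the same basket. The basket data is invented, and plain pair counting is only the simplest form of association analysis, not necessarily the technique a real project would use.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Invented purchase data: one list of items per transaction.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
support = top_count / len(baskets)  # share of baskets containing the pair
print(top_pair, support)  # prints ('bread', 'butter') 0.75
```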
Data Sources for Analytic Projects
  Diagram of Internal Data Sources and External Data Sources; labels: Social Media, Customer Demographics, Mfg, On-line Portal, CRM System, Marketing and Sales, Customer Support, Business, HR, R&D, Finance, ERP System, Sales Lead Repository, Economic Indicators
Tools and Technologies for Big Data Analytics
  Table columns: Domain; Free/Open Source; Commercial
  Domains: Statistical Analysis and Data Mining; NoSQL; Natural Language Processing
Evolution of Big Data Analytics: Embedding Analytical Intelligence; Info Computing; Distributed Computing; Standalone analytics; Simplifying Big Data
Growth of Data Scientist Opportunities
  Job Trends from Indeed.com
  "A significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data."
  By 2018...the United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.
  Average "data scientist" salaries for job postings nationwide are 55% higher than average salaries for all job postings nationwide.
  Source: McKinsey Global Institute, Big data: The next frontier for innovation, competition and productivity, May 2011
People & Skills: Three Key Roles of the New Data Ecosystem
  Deep Analytical Talent (Data Scientists): projected U.S. talent gap of 140,000 to 190,000
  Data Savvy Professionals: projected U.S. talent gap of 1.5 million
  Technology & Data Enablers
  Note: Figures above reflect a projected talent gap in the US in 2018, as shown in the McKinsey May 2011 article Big Data: The next frontier for innovation, competition, and productivity
Profile of a Data Scientist: Quantitative; Technical; Curious & Creative; Skeptical; Communicative & Collaborative
Skills Matrix, Based on Recent Students
  Chart axes: Quantitative Skills versus Technical Ability
  Groups plotted: Quantitative Analysts, Statisticians, Business and data analysts; Data Scientists; Recent STEM Grads; Business Intelligence Professionals, IT
Data Science and Big Data Analytics Course and EMCDSA Certification
  Course Overview Details: Open curriculum; Practitioner's approach; Enables immediate participation on analytics projects; Prepares for EMC Proven Professional Data Science Associate (EMCDSA) Certification
Two New EMC Data Science Courses for Business Transformation
  Business Leaders (90 min, new): Introducing Data Science and Big Data Analytics for Business Transformation
  Heads of Data Science Teams (1 day, new): Data Science and Big Data Analytics for Business Transformation
  Aspiring Data Scientists (5 days): Data Science and Big Data Analytics
EMC Academic Alliance: Provides students with competitive edge
  Partners with colleges and universities to prepare students for roles in data science and cloud computing
  1,000+ institutions in 60+ countries
  Provides unique open courseware at no cost
  Program resources: free faculty readiness training; instructor materials; online faculty and student communities; discount certification exam vouchers
Specific Data Science Skills & Traits (diagram with five numbered items and an EDW label): Apply data science methods in their current roles
Other Ways to Learn about Big Data Analytics
  Formal Training: EMC Data Science & Big Data Analytics course; STEM graduate programs and certificates; conferences on analytics (Strata, PAW, ACM, ACL, INFORMS); free Massive Open Online Courses (MOOCs), 6-12 week online courses from edX, Coursera, Udacity, Udemy, iTunes U, Khan Academy
  Informal Training: Look for opportunities to try out your skills; your day job provides this. Offer to help on projects, opportunistically. Every team is looking for people with these skills right now.
Leverage The Wisdom of Crowds: Social Media; Volunteer to help; Try Contests (Kaggle, InnoCentive)
Key Takeaways
  Analyzing big data provides significant opportunity for deriving new value.
  To do this, individuals will need to step up to the challenges and opportunities that Big Data and advanced analytics provide.
  Look for opportunities to grow your skills and drive new value as a Data Scientist, Data-Savvy Librarian...
Closing Thoughts: How will you use Big Data Analytics?
  Do you want to...
  Map the flow of ideas in research literature?
  Use citation networks to identify the most influential researchers?
  Predict award-winning research papers?
  Increase collaboration with researchers and faculty?
  Challenge traditional thinking using analytics?
Questions?
  David Dietrich, @imdaviddietrich
  Additional Resources:
  1. My Blog on Data Science & Big Data Analytics: http://infocus.emc.com/author/david_dietrich/
  2. Blog on applying Data Analytics Lifecycle to measuring innovation data: http://stevetodd.typepad.com/my_weblog/data-science-andbig-data-curriculum/
  3. EMC Education Services curriculum on Data Science & Big Data Analytics: http://education.emc.com/guest/campaign/data_science.aspx