1 Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich
2 Goal of Today What is Big Data? introduce all major buzz words What is not Big Data? get a feeling for opportunities & limitations
3 Answering Tough Questions Problem: sales for lollipops are going down Data: all sales data by customer, region, time, Information: lollipops bought by people older than 25 (but eaten by people younger than 10) Knowledge: moms believe: lollipops = bad teeth Value: dentists advertise your lollipops
4 Answering Tough Questions Problem: sales for lollipops are going down Data: all sales data by customer, region, time, Information: lollipops bought by people older than 25 Knowledge: moms believe: lollipops = bad teeth Value: dentists advertise your lollipops
5 Why is this difficult? You need more data than your data warehouse. you need more data that you have logs, Twitter feeds, blogs, customer surveys, You need to ask the right questions. data alone is silent You need technology and organization that help you concentrate on asking the right questions.
6 Big Data (Step 1) You! (TB) 6
7 Big Data (Step 2) You! (PB) Your Friends (TB) 7
8 Big Data (Step 3) You! (PB) The World (EB) Your Friends (TB/PB) 8
9 Limitations of State of the Art Manual Analysis does not scale inflexible + data loss Business Processes Storage Network data is dead ETL into RDBMS 9 Archive
10 What needs to be done? (Technology) Take Steps 0 to 3 Step 0: Data Warehouses (relational Databases) Step 1: Data Warehouses + Hadoop (HDFS) Step 2: Business Processes + Analytics + Exchange Step 3: BP + Analytics + Exchange + Real-Time 10
11 What needs to be done? (Technology) Take Steps 0 to 3 Step 0: Data Warehouses (relational Databases) Step 1: Data Warehouses + Hadoop (HDFS) Step 2: Business Processes + Analytics + Exchange Step 3: BP + Analytics + Exchange + Real-Time 11
12 What needs to be done? (Technology) Take Steps 0 to 3 Step 0: Data Warehouses (relational Databases) Step 1: Data Warehouses + Hadoop (HDFS) Step 2: Data Warehouses + Hadoop + XML (Standards) Step 3: BP + Analytics + Exchange + Real-Time 12
13 What needs to be done? (Technology) Take Steps 0 to 3 Step 0: Data Warehouses (relational Databases) Step 1: Data Warehouses + Hadoop (HDFS) Step 2: Data Warehouses + Hadoop + XML (Standards) Step 3: Data Warehouses + Hadoop + XML +? 13
14 What needs to be done? (Organiation) Static Business Model -> Agile Business Model You and your customers adapt to each other No more data silos (ownership of data is distributed) You allocate resources on demand Execute Business Process -> Data Science You think about experience you have made 14
15 What is Big Data? Three alternative perspectives philosophical business technical (Ultimately, it is a buzz word for everybody.)
16 Philosophical What is more valuable, if you had to pick one? experience or intelligence? Traditional (computer) science: logic! [intelligence] understand the problem, build model / algorithm answer question from implementation of model New science: statistics! [experience] collect data answer question from data (what did others do?)
17 Quote Go to school because nobody can take away your knowledge. (Judith Kossmann, 1984) Nobody ever said that going to school makes you smarter!!!
18 Statistics vs. Logic? Find a spouse? Should Adam bite into the apple? 1 + 1? Cure for cancer? How to treat a cough? Should I give Donald a loan? Premium for fire insurance? When should my son come home? Which book should I read next? Translate from German to English.
19 (IMHO) Solution Find a spouse? I do not want to know! Should Adam bite into the apple? If you believe 1 + 1? Definition Cure for cancer? I do not know. Maybe. How to treat a cough? Yes. (Google Insight) Should I give Donald a loan? Yes. (e.g., Schufa) Premium for fire insurance? Yes. (e.g., SwissRe) When should my son come home? No! But Which book should I read next? Yes. (Amazon) Translate from German to English. Yes. (Google Transl.)
20 Data Science, 4 th Paradigm New approach to do science Step 1: Collect data Step 2: Generate Hypotheses Step 3: Validate Hypoteheses Step 4: (Goto Step 1 or 2) Why is this a good approach? it can be automated: no thinking, less error Why is this a bad approach? how do you debug without a ground truth?
21 Is bigger = smarter? Yes! tolerate errors discover the long tail and corner cases machine learning works much better 21
22 Is bigger = smarter? Yes! tolerate errors discover the long tail and corner cases machine learning works much better But! more data, more error (e.g., semantic heterogeneity) with enough data you can prove anything still need humans to ask right questions 22
23 Big Data Success Story Google Translate you collect snipets of translations you match sentences to snipets you continuously debug your system Why does it work? there are tons of snipets on the Web there is a ground truth that helps to debug system
24 Big Data Farce (only a joke) Which lane is fastest in a traffic jam? you ask people where they go and whether happy (maybe, you even use a GPS device) you conclude that left lane is fastest Why is this stupid? because there is no ground truth! you will get a conclusion because Big Data always gives an answer. But, it does not make sense! getting more data does not help either
25 How to play lottery in Napoli Step 1: You visit (and pay) oracles they tell you which numbers to play Step 2: You visit (and pay) interpreters they explain what oracles told you Step 3: After you lost, you visit (and pay) analyst they explain why oracles and interpreters were right goto Step 1 Lessons learned life is try and error; trying keeps the system running [Luciano de Crescenzo: Thus Spake Bellavista]
26 What is Big Data? Business Perspective it is a new business model People pay with data e.g. Facebook, Google, Twitter: use service, give data Google sells your data to advertisers (you pay advertisers indirectly) e.g., 23andMe, Amazon: pay service + give data sells data and uses data to improve service
27 Business Perspective Bank keeps your money securely (kind of ) puts your money at work (lends it to others), interest you keep ownership of data and take it when needed Databank keeps your data securely (kind of ) puts your data at work: interest or better service (you keep ownership of data: hopefully to come) Swiss Banks 2.0???
28 Technical Perspective (us!) You collect all data the more the better -> statistical relevance, long tail keeping all is cheaper than deciding what to keep You decide independently what to do with data run experiments on data when question arises Huge difference to traditional information systems design upfront what data to keep and why!!! (e.g., waterfall model of software engineering!)
29 Volume: data at rest Consequences it is going to be a lot of data Speed: data in motion it is going to arrive fast Diversity: data in many formats it is going to come in different shapes (e.g., different versions, different sources) Complexity: You want to do something interesting SQL will not be enough
30 Alternative Definition (Gartner, IBM) Volume: same as before Velocity: same as speed Variety: same as diversity Veracity: data in doubt you do not know exactly what you have
31 Overview Introduction What is Big Data? Data Warehouses VOLUME The old world of Big Data Cloud Computing VOLUME The infrastructure to collect and process Big Data Map Reduce COMPLEXITY The new world of Big Data (programming model)
32 Semi-structured Data Overview (ctd.) DIVERSITY The new world of Big Data (data model) Collect first think later Streaming Data SPEED Making Big Data fast Other Topics visualization, data cleaning, security, crowdsourcing, Applications (Guest Lecture)
33 Why now? Mega-trend: All data is digital, digitally born! 70 years ago: computers for + 15 years ago: disks cheaper than paper 7 years ago: Internet has eyes and ears Because we can 40 years of databases -> volume 40 years of Moore s law -> complexity years of statistics -> it is only counting enough optimisms that we get the rest done, too Because we reached dead end with logic (?)
34 Because we can Really? Yes! all data is digitally born storage capacity is increasing counting is embarrassingly parallel 34
35 Because we can Really? Yes! all data is digitally born storage capacity is increasing counting is embarrassingly parallel But, data grows faster than energy on chip value / cost tradeoff unknown ownership of data unclear (aggregate vs. individual) I believe that all these but s can be addressed 35
36 Utiliy & Cost Functions of Data Utility Cost Noise / Error Noise / Error 36
37 Utiliy & Cost Functions of Data Utility Cost curated curated malicious random random malicious Noise / Error Noise / Error 37
39 What is good enough? Utility Cost curated curated Noise / Error Noise / Error 39
40 What you have learnt today? a number of buzz words, some cool examples you should survive any discussion with your boss some motivation to come back next week approach all buts of because we can learn some of the technologies do some sort of Big Data project this semester
Database Systems Journal vol. IV, no. 3/2013 31 Big Data Challenges Alexandru Adrian TOLE Romanian American University, Bucharest, Romania email@example.com The amount of data that is traveling across
For Big Data Analytics There s No Such Thing as Too Big The Compelling Economics and Technology of Big Data Computing March 2012 By: 4syth.com Emerging big data thought leaders Forsyth Communications 2012.
ericsson White paper Uen 284 23-3264 February 2015 Next-generation data center infrastructure MAKING HYPERSCALE AVAILABLE In the Networked Society, enterprises will need 10 times their current IT capacity
3 Big Data: Challenges and Opportunities Roberto V. Zicari Contents Introduction... 104 The Story as it is Told from the Business Perspective... 104 The Story as it is Told from the Technology Perspective...
Q&A: Esri's Jack Dangermond on cloud, big data and Apple vs Google map wars The company today unveiled ArcGIS Online organizational subscriptions Sharon Machlis June 14, 2012 (Computerworld) GIS pioneer
Big Data Analytics ALTERYX SPECIAL EDITION by Michael Wessler, OCP & CISSP Big Data Analytics For Dummies, Alteryx Special Edition Published by John Wiley & Sons, Inc. 111 River St. Hoboken, NJ 07030-5774
Big Data: Powering the Next Industrial Revolution Author: Abhishek Mehta April 2011 p2 Executive Summary Data is a key raw material for a variety of socioeconomic business systems. Unfortunately, the ability
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 4, Number 1 (2014), pp. 33-40 International Research Publications House http://www. irphouse.com /ijict.htm Big Data
An Oracle White Paper March 2013 Big Data Analytics Advanced Analytics in Oracle Database Advanced Analytics in Oracle Database Disclaimer The following is intended to outline our general product direction.
Big Data for the Next Big Idea in Financial Services Understanding Customers, Global Economies and Human Welfare with Analytics CONCLUSIONS PAPER Insights from a panel discussion at the SAS Financial Services
OPEN DATA CENTER ALLIANCE : sm Big Data Consumer Guide SM Table of Contents Legal Notice...3 Executive Summary...4 Introduction...5 Objective...5 Big Data 101...5 Defining Big Data...5 Big Data Evolution...7
Predictive Analytics Guest was Eric Siegel Sponsored by Related Podcast: Transcription of Podcast Joseph Dager: Welcome everyone, this is Joe host of the Business901 podcast with me today is Eric Siegel,
National Association of State Personnel Executives 859.244.8182 firstname.lastname@example.org www.naspe.net BRINGING MODERN RECRUITING SYSTEMS TO STATE GOVERNMENTS INTRODUCTION State governments manage a large workforce
Global Headquarters: 5 Speen Street Framingham, MA 01701 USA P.508.872.8200 F.508.935.4015 www.idc.com W H I T E P A P E R B i g D a t a : W h a t I t I s a n d W h y Y o u S h o u l d C a r e Sponsored
Big Data: Challenges and Opportunities Roberto V. Zicari Goethe University Frankfurt This is Big Data. Every day, 2.5 quintillion bytes of data are created. This data comes from digital pictures, videos,
32 Big Data: present and future Big Data: present and future Mircea Răducu TRIFU, Mihaela Laura IVAN University of Economic Studies, Bucharest, Romania email@example.com, firstname.lastname@example.org
A Forrester Consulting Thought Leadership Paper Commissioned By SAP Real-Time Data Management Delivers Faster Insights, Extreme Transaction Processing, And Competitive Advantage June 2013 Table Of Contents
INTELLIGENT BUSINESS STRATEGIES W H I T E P A P E R Architecting A Big Data Platform for Analytics By Mike Ferguson Intelligent Business Strategies October 2012 Prepared for: Table of Contents Introduction...
Age of Big Data and Smart Cities: Privacy Trade- Off Kaoutar Ben Ahmed #1, Mohammed Bouhorma #2, Mohamed Ben Ahmed #3 1 Ph.D. Student, 2 Professor, 3 Assistant Professor, # Laboratory of Informatics, Systems
COULD VS. SHOULD: BALANCING BIG DATA AND ANALYTICS TECHNOLOGY The business world is abuzz with the potential of data. In fact, most businesses have so much data that it is difficult for them to process
David Chappell October 2012 WINDOWS AZURE DATA MANAGEMENT CHOOSING THE RIGHT TECHNOLOGY Sponsored by Microsoft Corporation Copyright 2012 Chappell & Associates Contents Windows Azure Data Management: A
Eindhoven, August 2014 Big Data Opportunities for the Retail Sector A Model Proposal by M.G.H. (Marcel) van Eupen BSc Industrial Engineering & Management Science TU/e 2014 Student identity number 0715154
Change the world with data. We ll show you how. strataconf.com Sep 25 27, 2013 Boston, MA Oct 28 30, 2013 New York, NY Nov 11 13, 2013 London, England 2013 O Reilly Media, Inc. O Reilly logo is a registered
GLOBAL FINANCIAL SERVICES COMPLIANCE & RISK MANAGEMENT with Bombuz A Big Data & Semantic Web Solution Insigma Hengtian Software Ltd. Bayshore Management Consultants, LLC Table of Contents OVERVIEW... 1
Time Matters for the PI Attorney 6/14/2006 Robert Gray GrayLint Enterprises, Inc. Personal Injury Case Management...1 Time Matters Things You Might Like...2 Time Matters Things You Might Not Like...3 Other
April 2013 Operational Intelligence: What It Is and Why You Need It Now Sponsored by Splunk Contents Introduction 1 What Is Operational Intelligence? 1 Trends Driving the Need for Operational Intelligence