Statistics, Big Data and Data Science!?
Prof. Dr. Göran Kauermann
Ludwig-Maximilians-Universität Munich, Germany

Statistics, Big Data and Data Science
Statistics: Founded around 1900 with the seminal work of Pearson and later Fisher
Big Data: The Big Topic with the three (four) V's
Data Science: Proposed by Cleveland (2001, 2005): Learning from Data: Unifying Statistics and Computer Science
Statistics
Statistics is (the) science that pertains to the collection, analysis, interpretation and presentation of data. (Wikipedia)
Statistics: the first 100 years
Eras: Statistical Foundations; Statistical Modelling; Statistics and Big Data?
Topics along the timeline (1900, 1950, 2000, 2015): likelihood inference, statistical tests, ANOVA, linear regression, EDA etc., generalised regression, computational statistics, MCMC, R-Project, smooth regression, data mining, inference in Big Data, computational statistics, data science
Is statistics ready for the next century?
Statistics in Germany
Statistics has prospered in Germany over the last 10 years
TU Dortmund and LMU Munich (BA and MA)
HU/FU/TU Berlin, Bielefeld, Göttingen (MA)
Ulm, Bremen, Heidelberg, Bamberg, Trier, Mainz, Magdeburg (special programs)
Mathematics departments and economics departments
Are the German statisticians ready for the next century?
Big Data: Everybody talks about it!
"Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it." (Dan Ariely, 2013)
Big Data: The Buzzword
Financial Times Magazine (March 2014):
Big Data is a vague term for a massive phenomenon that has rapidly become an obsession with entrepreneurs, scientists, governments and the media.
As with so many buzzwords, big data is a vague term, often thrown around by people with something to sell.
Is Big Data the new gold rush?

Big Data: The four V's
Big Data are classified with the four V's:
Volume: Big Data are large in size
Variety: Big Data are complex
Velocity: Big Data arrive at high speed and high resolution
Veracity: Big Data may not be reliable (bias issues)
Big Data: From Data to Knowledge
Wired Magazine (June 2008):
The End of Theory: The data deluge makes the scientific method obsolete.
The End of Theory: With enough data, the numbers speak for themselves.
Big Data, is this the end of statistics?

[Figure: Gartner's Hype Cycle. Source: Gartner Blog Network]
Big Data: The Two Extreme Opinions
The two views about Big Data:
With enough data we don't need theory and we can explain the world.
Big Data is just a hype and will die out sooner or later.
Big Data, a challenge or the end of statistics?

Big Data: End of Statistics?
Let's answer the question with Big Data: https://www.google.com/trends/
Google Trends records which keywords are searched in Google, when, where, etc.
Big Data: End of Theory?
Is Big Data the death of Statistics?
"Statisticians have spent the past 200 years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster and cheaper these days, but we must not pretend that the traps have all been made safe." (Financial Times Magazine, Tim Harford, 28.3.2014)
Data Scientists: Why are they needed by the industry?

Big Data: Example 1
[Figure. Source: Lazer et al., 2014, Science, Vol. 343]
Big Data: Example 1, Google Flu Trends
The trend worked nicely, but then it failed, since:
Correlation does not imply causation
Determining what causes what needs a model and data

Big Data: Example 2, Price Elasticity Estimation
Research project with a large German airline
Problem: Estimation of price elasticity
Huge (!!) database containing prices and ticket sales
Regression model: Ticket Sales = s(price) + error
Big Data: Example 2, Price Elasticity Estimation
Problem: The price is NOT exogenous!!
Demand depends on price and price depends on demand
The data-based price elasticity is overestimated
The problem is well known in econometrics

Big Data: Example 3, Big Computing versus Sampling
Big Data often demand Big Computing
Information, however, can also be retrieved from a sample
Example: Network data (e.g. Facebook)
Statisticians know how to sample
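Returning to the price elasticity problem of Example 2, a small simulation may help to show why regressing sales on price goes wrong when price reacts to demand. The sketch below is not the airline analysis: the linear demand curve, the pricing rule and all parameter values are assumptions made up for illustration, and a straight line stands in for the smooth function s(price).

```python
import random


def simulate_market(n, b=2.0, d=0.2, seed=0):
    """Simulate prices and sales when price and demand determine each other.

    Assumed (hypothetical) structural equations:
        sales = a - b * price + u     (demand curve, true slope is -b)
        price = c + d * sales + v     (seller raises price when sales are high)
    """
    rng = random.Random(seed)
    a, c = 100.0, 10.0
    prices, sales = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 5.0)  # demand shock
        v = rng.gauss(0.0, 3.0)  # pricing shock
        # Solve the two equations jointly for the observed price and sales.
        p = (c + d * (a + u) + v) / (1.0 + d * b)
        q = a - b * p + u
        prices.append(p)
        sales.append(q)
    return prices, sales


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


true_slope = -2.0
prices, sales = simulate_market(20_000, b=-true_slope)
naive_slope = ols_slope(prices, sales)
print(f"true slope: {true_slope:.2f}, naive regression slope: {naive_slope:.2f}")
```

In this assumed setup the naive slope lies well above the true value of -2, because days with a positive demand shock carry both higher prices and higher sales. More data does not remove this bias; its direction and size depend on the pricing rule chosen here.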
Big Data: Example 3
The Sonntagsfrage (the German "Sunday question" opinion poll) asks roughly 1,000 people about their political views
Sample: 1,000 out of about 60 million
Margin of error (standard deviation)
Why is it better to ask just 1,000 people and not 60 million, if possible?
Asking everyone diminishes the sampling error, but sampling bias occurs
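The margin of error for the Sonntagsfrage sample can be sketched with the usual standard error of a proportion, sqrt(p(1-p)/n). The code assumes a simple random sample, a party share of 50% (the worst case), a 95% normal quantile of 1.96 and no finite-population correction. The 2-percentage-point bias used at the end is a made-up number, included only to illustrate why a huge but biased data set need not be better.

```python
import math


def standard_error(n, p=0.5):
    """Standard error of an estimated proportion from a simple random sample."""
    return math.sqrt(p * (1.0 - p) / n)


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error (normal approximation)."""
    return z * standard_error(n, p)


def root_mean_squared_error(n, bias, p=0.5):
    """Total error: systematic bias and sampling error combined."""
    return math.sqrt(bias ** 2 + standard_error(n, p) ** 2)


# Unbiased poll of 1,000 respondents.
moe_poll = margin_of_error(1_000)
rmse_poll = root_mean_squared_error(1_000, bias=0.0)

# Hypothetical data set covering 60 million people, with an assumed bias of 0.02.
rmse_big = root_mean_squared_error(60_000_000, bias=0.02)

print(f"margin of error, n = 1,000: {moe_poll:.3f}")
print(f"RMSE, unbiased n = 1,000:   {rmse_poll:.4f}")
print(f"RMSE, biased n = 60 million: {rmse_big:.4f}")
```

Under these assumptions the poll of 1,000 has a margin of error of roughly 3 percentage points, while the biased full-population figure stays about 2 points off no matter how large n becomes.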
We conclude: Big Data and Statistics
Big Data does not make theory (thinking) obsolete.
Big Data analytics needs statistical thinking and reasoning
But: Statistics also needs to tackle Big Data issues

Big Data and Statistics
David Spiegelhalter: "Complete bollocks. Absolute nonsense. There are a lot of small data problems that occur in big data. They don't disappear because you've got lots of the stuff. They get worse."
Big Data and Statistics
Other statements:
David Hand: "We have a new resource here. But nobody wants data. What they want are the answers."
Patrick Wolfe: "It's the wild west right now. People who are clever and driven will twist and turn and use every tool to get sense out of these data sets, and that's cool. But we're flying a little bit blind at the moment."

Big Data: A further (single) Statistician's View
Statistics needs more involvement in the Big Data wave
Statistical ideas and models are useful and need to be scaled up
The old statistics is not dying out (p-values and small samples remain useful)
A new paradigm: Approximate data analysis may be better than optimal fitting procedures
(Göran Kauermann, 2016)
Statistics versus Data Scientists
What is Data Science?
Cleveland (2001): Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics
Data Science = Statistics of tomorrow?
or
Data Science = Statistics carried out by non-statisticians?
Statistics and Data Scientists
[Timeline figure with labels: 1900, Statistics, 1950, Computer Science, Data Science, 2000]
Statistics and Data Scientists
[Timeline figure with labels: 1900, Statistics, Data Science, 1950, Computer Science]

Quotes from Cleveland
"Computer scientists, waking up to the value of the information stored, processed and transmitted by today's computing environments, have attempted to fill the void. One current of work is data mining. But the benefit to the data analyst has been limited, because the knowledge among computer scientists about how to think of and approach the analysis of data is limited, just as the knowledge of computing environments by statisticians is limited. A merger of the knowledge bases would produce a powerful force for innovation."
Data Science: What is it about?
Data Science combines informatics and statistics in order to extract information from real data.
"Data Science is a blend of Red-Bull-fuelled hacking and espresso-inspired statistics" (Mike Driscoll, CEO Metamarkets)
Data Scientists: What do they do?
Retrieve information from data
Apply machine learning tools
Deal with data confidentiality
Communicate the results
Use statistical models
Source: C. O'Neil, R. Schutt (2014), Doing Data Science, O'Reilly Media Inc., USA.
Statistics and Computer Science
The stereotypes:
Computer Scientists predict and forecast
Statisticians model and interpret
But both tackle the question: How can we make the data speak?

Data Science
The definition of Data Science is not consolidated
We consider Data Science as 50% Statistics and 50% Informatics (Computer Science)
Master in Data Science at LMU (Elite Network of Bavaria)
Data Science @ LMU
Program starts Oct 2016
International program
50% Statistics and 50% Informatics
www.datascience-munich.de

Challenges in Data Science
Collaboration: Big Data occur outside of statistics/informatics
Training: More master programs in Data Science
Consolidation: Data Science is Data Science
Statistics, Big Data and Data Science
Statistics and Computer Science merged into Data Science
Big Data are the driving force
Classical Statistics remains important
New challenges in Statistics/Informatics

Challenges in Statistics
Do we need optimal solutions?
  Approximate inference, smart and real-time computing
  Parallel computing
Do we need asymptotic statistics?
  We have large n, so why bother about mathematical asymptotics?
  What does n → ∞ really mean?
Challenges in Statistics
Do we need significance tests?
  Model selection is important
  Significance versus relevance
Do we need statistical models at all?
  Stochastic character remains in big data
  Simple stochastic models are too simple

Challenges in Statistics
Do we need correlation?
  Dependence structure is relevant
  Copula or more complex models
Do we need linear models at all?
  Linear models and linear procedures are fast
  Linear approximations are often sufficient
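The point "significance versus relevance" can be made concrete with a two-sided z-test for a mean. The sketch assumes a known standard deviation of 1 and an effect of 0.01 standard deviations, a value chosen arbitrarily to stand for a practically irrelevant difference. It evaluates the p-value at the expected test statistic rather than at a simulated one.

```python
import math


def two_sided_p_value(effect, n, sd=1.0):
    """Two-sided p-value of a z-test when the observed mean equals `effect`."""
    z = effect * math.sqrt(n) / sd
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


tiny_effect = 0.01  # assumed: too small to matter in practice

p_small = two_sided_p_value(tiny_effect, n=100)
p_large = two_sided_p_value(tiny_effect, n=1_000_000)

print(f"n = 100:       p = {p_small:.3f}")
print(f"n = 1,000,000: p = {p_large:.3g}")
```

The same tiny effect is far from significant with 100 observations and overwhelmingly significant with a million, which is why a small p-value in large data says little about practical relevance.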
The Statistical Approach
After all: The statistical paradigm remains
[Diagram with labels: Questions, Data, Model, Answers, Estimates]

Many thanks