Big Data Hope or Hype? David J. Hand Imperial College, London and Winton Capital Management Big data science, September 2013 1
Google Trends on "big data". Google search, 1 Sept 2013: 1.6 billion hits on "big data".
What is big data? Various definitions: data which are too extensive to permit iterative analysis, so that one-pass analysis is necessary; data sets which standard database tools cannot handle; data sets which are so large they require new forms of processing; a data set which exceeds 20% of the RAM of a given machine.
Some big data stories. The Large Hadron Collider: a petabyte (10^15 bytes) per second. Sequencing the human genome: 3.3 billion base pairs. Social network analysis: 2.5 quintillion (2.5 × 10^18) bytes per day. Climate modelling: Coupled Model Intercomparison Project, 5th phase: more than 2 petabytes. Google Translate: statistical machine translation; 200 billion words from UN documents.
Why now? Automatic data capture (often secondary); simulations (e.g. meteorology, physics); exponential growth in computer memory.
But it's not new! It's media rebranding. 1994: Wal-Mart, with over 7 billion transactions per year; 1997: AT&T, with over 70 billion long-distance phone call records per year; 1990s: Mobil Oil, over 100 terabytes of data; 2000: in just a few months the Sloan Digital Sky Survey collected more data than had previously been collected in the entire history of astronomy.
Why is it exciting? A new world, according to many! McKinsey: "we are on the cusp of a tremendous wave of innovation, productivity, and growth, as well as new modes of competition and value capture, all driven by big data as consumers, companies, and economic sectors exploit its potential".
Some see big data as a paradigm shift in science: "Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves." Chris Anderson, Wired, in an article called "The end of theory: the data deluge makes scientific method obsolete".
But he was wrong: the numbers don't speak for themselves. There are two kinds of models: data-driven and substantive.
Data-driven models. Based purely on empirical relationships in the data, e.g. in credit scoring the model of choice is a logistic regression tree: the population is partitioned into segments on empirical grounds, and a different logistic regression model is built in each segment. No underlying theory: no psychology, prospect theory, behavioural finance, etc.
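The idea of one logistic regression per segment can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used in any real scorecard: the data are simulated, the single split point is supplied by hand rather than learned from the data, and the names `fit_logistic`, `fit_segmented` and `simulate` are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=1.0, steps=300):
    """Fit p(y=1|x) = sigmoid(a + b*x) by batch gradient ascent on the log-likelihood."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def fit_segmented(rows, split):
    """rows holds (segment_var, x, y); fit one model on each side of the split."""
    low = [(x, y) for s, x, y in rows if s < split]
    high = [(x, y) for s, x, y in rows if s >= split]
    return {
        "low": fit_logistic([x for x, _ in low], [y for _, y in low]),
        "high": fit_logistic([x for x, _ in high], [y for _, y in high]),
    }


def simulate(n=2000, seed=0):
    # Simulated population (an assumption): the effect of x on the outcome
    # has opposite sign in the two segments.
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        s = rng.random()
        x = rng.uniform(-2.0, 2.0)
        slope = 1.5 if s < 0.5 else -1.5
        y = 1 if rng.random() < sigmoid(slope * x) else 0
        rows.append((s, x, y))
    return rows


models = fit_segmented(simulate(), split=0.5)
for name, (a, b) in models.items():
    print(name, "intercept=%.2f" % a, "slope=%.2f" % b)
```

A single logistic regression fitted to the pooled data would see the two opposite effects largely cancel; the segmented fit recovers a positive slope in one segment and a negative slope in the other, with no theory of why they differ.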
Data-driven models are not new, e.g. segmented regression in credit scoring in the 1960s. Data-driven models are good for prediction and anomaly detection, which is why they are so heavily used in some domains. But data-driven models don't provide insight.
Substantive models. Are essentially theories, e.g. Newton's Laws of Motion; necessary for understanding, e.g. to detect dark matter from galaxy rotation; lack of insight has its dangers. [Figure: yearly series in billions (axis 0 to 40), 1966 to 2000. Sources: BBA, CCRG (Access cards only added from 1974, Building Societies from 1996).]
So it's too much to say "Out with every theory of human behavior" (Anderson). It depends what you are using the models for: prediction or understanding.
Big data needs: computer science, for manipulating data (sorting, adding, selecting, aggregating, concatenating, etc.); statistics, for extracting information from data. Most of the problems we want to solve are inferential. We don't want to make a statement about the data we have, but about: data we might get tomorrow (e.g. economic forecasting); the population from which our data were drawn (e.g. astronomical databases); a true value, which we have observed with measurement error (e.g. gene expression data); data we might have had if things had been different (e.g. social policy).
Big data risks. Big data are often collected as a side effect of some other exercise: the definitions may not match; definitions may change over time if the data are administrative. Data quality: good for one purpose, not for another; the computer is a necessary intermediary. Selection bias: different observational (automatic data capture) sources have different biases; problem of selecting on the basis of the response variable; crime maps example: Direct Line Insurance survey: selective reporting of incidents for fear of impact on house prices. Multiple testing: everything significant.
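The multiple testing risk is easy to demonstrate by simulation. The sketch below is an illustration under assumptions of my own choosing (1,000 tests, 50 observations each, a z-test with known unit variance, and the hypothetical name `count_false_positives`): every sample is pure noise, so every "significant" result is a false positive. The Bonferroni correction shown is a standard remedy and is not taken from the slides.

```python
import math
import random


def two_sided_p(z):
    # Two-sided p-value for a standard normal test statistic.
    return math.erfc(abs(z) / math.sqrt(2.0))


def count_false_positives(n_tests=1000, n_obs=50, alpha=0.05, seed=1):
    """Test 'mean = 0' on n_tests samples of pure noise; count rejections."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        # z statistic for the sample mean when the variance is known to be 1.
        z = (sum(sample) / n_obs) * math.sqrt(n_obs)
        if two_sided_p(z) < alpha:
            hits += 1
    return hits


# Uncorrected: roughly 5% of the 1,000 null tests come out "significant".
false_positives = count_false_positives()

# Bonferroni: divide the significance level by the number of tests.
bonferroni_hits = count_false_positives(alpha=0.05 / 1000)

print("uncorrected:", false_positives, "bonferroni:", bonferroni_hits)
```

With enough variables to test, some will always look significant by chance alone; the corrected threshold removes almost all of them, at the cost of power.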
New tools needed. Wikipedia says "the challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization". While true, this is mostly talking about computational housekeeping tools rather than knowledge extraction tools: it's talking about data juggling rather than inference. [Even "analysis" in the above quote refers to Hadoop, missing the point.]
But there are implications for inference. Visualisation: but familiar tools may be inadequate.
Iteration too slow: use simple models (e.g. regression instead of logistic regression). Splitting and screening (not really taking advantage of the big data), e.g. the LHC: 1 petabyte per sec, online filter reduces by a factor of 10,000, further selection by a factor of 100. Anomaly detection. Streaming data.
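When iteration over the data is too slow, summaries have to be computed in a single pass as the data stream by. As one possible illustration (the slides do not name a specific method), the sketch below uses Welford's online algorithm to maintain a running mean and variance without storing the stream; the names `RunningStats` and `summarise` are hypothetical.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


def summarise(stream):
    """Consume any iterable exactly once, keeping only three numbers in memory."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats


# A generator stands in for a data stream that cannot be revisited.
stats = summarise(float(v) for v in [2, 4, 4, 4, 5, 5, 7, 9])
print(stats.n, stats.mean, stats.variance)
```

Each observation is seen once and then discarded, so memory use does not grow with the size of the stream.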
Big data does not mean the end of small data. Power law for data set size: the probability of observing a data set of size n is inversely related to a power of n. There are vastly more small data sets than very large ones.
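Written out, the power law statement might look as follows. The exponent symbol α is an assumption introduced here for illustration; the slides give no value for it.

```latex
% N is the size of a data set; \alpha > 0 is an unspecified exponent (assumption).
P(N = n) \propto n^{-\alpha}, \qquad \alpha > 0,
\qquad\text{so that}\qquad
\frac{P(N = 10n)}{P(N = n)} = 10^{-\alpha}.
```

For example, if α were 2, data sets ten times larger would be a hundred times rarer.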
The data mining experience. Most unusual structures in large data sets: arise because of data errors; turn out to be known about beforehand; are uninteresting, e.g. the discovery that in a time series of data, maxima and minima alternate, or the discovery that in the US about half the married people are male.
Summary. Big data has great potential. Big data does not mean the end of small data. Big is not necessarily good, useful, valuable, or interesting. Data is not knowledge. It is possible to be data rich but information poor.
The future is not big data, but what you learn from it.
Thanks!