Big Data Hope or Hype?




Big Data Hope or Hype?
David J. Hand
Imperial College, London and Winton Capital Management
Big data science, September 2013

Google trends on big data
Google search, 1 Sept 2013: 1.6 billion hits on "big data"

What is big data?
Various definitions:
- data which are too extensive to permit iterative analysis: one-pass analysis is necessary;
- data sets which standard database tools cannot handle;
- data sets which are so large they require new forms of processing;
- a data set which exceeds 20% of the RAM of a given machine.
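
The first definition, where only a single pass over the data is feasible, can be illustrated with a running mean and variance. This is a minimal sketch using Welford's online algorithm, which the slides do not name; it is chosen here only as a familiar one-pass technique, and the function name and toy stream are assumptions.

```python
from typing import Iterable, Tuple

def one_pass_mean_variance(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance), reading each value exactly once."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance

# A generator stands in for a stream that is too large to hold in memory.
stream = (float(i % 10) for i in range(1_000_000))
count, mean, variance = one_pass_mean_variance(stream)
print(count, round(mean, 3), round(variance, 3))
```

Only three numbers are kept in memory regardless of how long the stream is, which is the property the "one pass" definition relies on.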

Some big data stories
- The Large Hadron Collider: 1 petabyte (10^15 bytes) per second;
- Sequencing the Human Genome: 3.3 billion base pairs;
- Social network analysis: 2.5 quintillion (10^18) bytes per day;
- Climate modelling: Coupled Model Intercomparison Project, 5th phase: more than 2 petabytes;
- Google Translate: statistical machine translation; 200 billion words from UN documents.

Why now?
- automatic data capture (often secondary)
- simulations (e.g. meteorology, physics)
- exponential growth in computer memory

But it's not new! It's media rebranding
- 1994: Wal-Mart, with over 7 billion transactions per year;
- 1997: AT&T, with over 70 billion long-distance phone call records per year;
- 1990s: Mobil Oil, over 100 terabytes of data;
- 2000: in just a few months the Sloan Digital Sky Survey collected more data than had previously been collected in the entire history of astronomy.

Why is it exciting?
A new world, according to many!
McKinsey: "we are on the cusp of a tremendous wave of innovation, productivity, and growth, as well as new modes of competition and value capture, all driven by big data as consumers, companies, and economic sectors exploit its potential"

Some see big data as a paradigm shift in science:
"Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves."
Chris Anderson, Wired, in an article called "The end of theory: the data deluge makes scientific method obsolete".


But he was wrong: the numbers don't speak for themselves
There are two kinds of models:
- data driven
- substantive

Data driven models
- Based purely on empirical relationships in the data
- e.g. in credit scoring the model of choice is a logistic regression tree:
  - the population is partitioned into segments on empirical grounds
  - different logistic regression models are built in each segment
- No underlying theory
- No psychology, prospect theory, behavioural finance, etc.
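
The segment-then-fit idea can be sketched as follows. This is an illustration only: a real logistic regression tree learns its partition from the data, whereas here the split (`age < 40`) is fixed by hand, and the records, field names, and gradient-descent settings are all hypothetical.

```python
import math
from collections import defaultdict

def fit_logistic(xs, ys, steps=2000, lr=0.1):
    """Fit p(y=1|x) = sigmoid(a + b*x) by plain gradient descent."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - y
            grad_b += (p - y) * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def fit_segmented(records, segment_of):
    """Partition records by segment_of(record), then fit one model per segment."""
    groups = defaultdict(lambda: ([], []))
    for rec in records:
        xs, ys = groups[segment_of(rec)]
        xs.append(rec["x"])
        ys.append(rec["default"])
    return {seg: fit_logistic(xs, ys) for seg, (xs, ys) in groups.items()}

# Hypothetical toy data: the same predictor x relates to default in
# opposite directions in the two segments.
pattern = [(-2.0, 0), (-1.0, 0), (0.0, 1), (1.0, 0), (2.0, 1)]
records = [{"age": 25, "x": x, "default": y} for x, y in pattern]
records += [{"age": 55, "x": -x, "default": y} for x, y in pattern]

def segment_of(rec):
    # Hand-chosen split, an assumption made for this sketch.
    return "under_40" if rec["age"] < 40 else "40_plus"

models = fit_segmented(records, segment_of)
for name, (a, b) in sorted(models.items()):
    print(name, round(a, 3), round(b, 3))
```

A single pooled model would average the two opposite slopes away; fitting per segment recovers them, with no theory of why the segments differ.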

Data driven models are not new
- e.g. segmented regression in credit scoring in the 1960s
- Data driven models are good for prediction and anomaly detection, which is why they are so heavily used in some domains
- But data driven models don't provide insight

Substantive models
- Are essentially theories, e.g. Newton's Laws of Motion
- Necessary for understanding, e.g. to detect dark matter from galaxy rotation
- Lack of insight has its dangers

[Figure: time series, vertical axis in billions (0 to 40), horizontal axis 1966 to 2000. Sources: BBA, CCRG (Access cards only added from 1974, Building Societies from 1996)]

So it's too much to say "Out with every theory of human behavior" (Anderson)
It depends what you are using the models for:
- prediction
- understanding

Big data needs
- Computer science for manipulating data: sorting, adding, selecting, aggregating, concatenating, etc.
- Statistics for extracting information from data

Most of the problems we want to solve are inferential. We don't want to make a statement about the data we have, but about:
- data we might get tomorrow (e.g. economic forecasting);
- the population from which our data were drawn (e.g. astronomical databases);
- a true value, which we have observed with measurement error (e.g. gene expression data);
- data we might have had if things had been different (e.g. social policy).

Big data risks
- big data are often collected as a side effect of some other exercise:
  - the definitions may not match
  - definitions may change over time if the data are administrative
- data quality (good for one purpose, not for another; the computer is a necessary intermediary)
- selection bias:
  - different observational and automatic data capture sources have different biases
  - problem of selecting on the basis of the response variable
  - crime maps example: Direct Line Insurance survey: selective reporting of incidents for fear of impact on house prices
- multiple testing; everything significant
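
The multiple testing risk can be illustrated with a small simulation. This is a sketch under stated assumptions: every data set is pure noise, the test is a two-sided z-test with known unit variance, and the Bonferroni correction shown for comparison is not mentioned in the slides. The sample sizes and the seed are arbitrary.

```python
import math
import random

def two_sided_p_value(sample):
    """Two-sided z-test p-value for mean 0, assuming known unit variance."""
    n = len(sample)
    z = sum(sample) / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
n_tests, n_obs, alpha = 1000, 50, 0.05

# Every sample is pure noise, so any "significant" result is a false alarm.
p_values = [
    two_sided_p_value([random.gauss(0.0, 1.0) for _ in range(n_obs)])
    for _ in range(n_tests)
]

naive_hits = sum(p < alpha for p in p_values)
bonferroni_hits = sum(p < alpha / n_tests for p in p_values)
print(naive_hits, bonferroni_hits)
```

With 1000 tests at the 5% level, roughly 50 false alarms are expected even though nothing real is present; the corrected threshold removes almost all of them at the cost of power.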

New tools needed
Wikipedia says the challenges include "capture, curation, storage, search, sharing, transfer, analysis, and visualization".
While true, this is mostly talking about computational housekeeping tools rather than knowledge extraction tools: it's talking about data juggling rather than inference.
[Even "analysis" in the above quote refers to Hadoop, missing the point]

But there are implications for inference
- visualisation: but familiar tools may be inadequate

- iteration too slow: use simple models (e.g. regression instead of logistic regression)
- splitting and screening (not really taking advantage of the big data), e.g. the LHC: 1 petabyte per second, online filter reduces by a factor of 10,000, further selection by a factor of 100
- anomaly detection
- streaming data
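
The LHC screening figures can be checked with a line of arithmetic. The variable names are illustrative, and a petabyte is taken as 10^15 bytes, as on the earlier slide.

```python
raw_rate_bytes_per_s = 10**15     # 1 petabyte per second, as quoted
online_filter_factor = 10_000     # reduction by the online filter
further_selection_factor = 100    # reduction by further selection

total_reduction = online_filter_factor * further_selection_factor
kept_bytes_per_s = raw_rate_bytes_per_s // total_reduction

print(total_reduction)    # 1000000
print(kept_bytes_per_s)   # 1000000000
```

So about one byte in a million survives screening, which leaves roughly a gigabyte per second to analyse.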

Big data does not mean end of small data
- Power law for data set size: the probability of observing a data set of size n is inversely related to a power of n
- There are vastly more small data sets than very large ones
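
A rough sense of what such a power law implies: if the probability of size n is proportional to n**(-alpha), the ratio of probabilities for two sizes depends only on their ratio. The slides give no exponent, so the values of alpha below are purely hypothetical, as are the two example sizes.

```python
def relative_frequency(n_small: float, n_large: float, alpha: float) -> float:
    """How many times more probable size n_small is than n_large
    when P(n) is proportional to n**(-alpha)."""
    return (n_large / n_small) ** alpha

# Compare a thousand-record data set with a billion-record one.
for alpha in (0.5, 1.0, 2.0):
    print(alpha, relative_frequency(1e3, 1e9, alpha))
```

Even with a modest exponent, small data sets outnumber very large ones by orders of magnitude.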

The data mining experience
Most unusual structures in large data sets:
- arise because of data errors
- turn out to be known about beforehand
- are uninteresting
  - e.g. the discovery that in a time series of data, maxima and minima alternate
  - e.g. the discovery that in the US about half the married people are male

Summary
- Big data has great potential
- Big data does not mean the end of small data
- Big is not necessarily good, useful, valuable, or interesting
- Data is not knowledge
- It is possible to be data rich but information poor

The future is not big data, but what you learn from it

Thanks!