Challenges, Tools and Examples for Big Data Inference


Challenges, Tools and Examples for Big Data Inference
Jean-François Plante, HEC Montréal
Closing Conference: Statistical and Computational Analytics for Big Data
June 12th, 2015

What is Big Data?
Dan Ariely from Duke University:

What is Big Data?

Overview of the Opening Conference and Bootcamp
Held at Fields, January 12 to January 23.
35 scientific talks.
Covering all themes of the Big Data program, one theme per day.
An overview paper is being prepared by the postdoctoral fellows and longer-term visitors at the Fields Institute.

Themes of the Program
Week one:
o Introductory Lectures and Overview
o Inference
o Environmental Science
o Optimization
Week two:
o Visualization
o Social Policy
o Health Policy
o Deep Learning
o Networks and Machine Learning

Why Do We Talk About Big Data?
Because we can! (Technology makes it possible.)
Because Big Data allows us to observe and measure human behaviours and events.
Because we can measure new things that are otherwise hard or impossible to evaluate.
Because imperfect, large, unstructured or hard-to-handle data may still contain valuable information that we should not dismiss.

Example #1: Measuring the Effect of Nutrition
David Buckeridge, McGill University, with INSPQ
Diet is known as an important factor in the study of disabilities, but very little is known about people's nutritional behaviour.
Nielsen: information about all products sold by grocery and corner stores (from about 10% of all outlets) at the 3-digit postal code level. Products are matched by UPC to obtain nutrition information.
Loyalty programs: purchases at the household level. These can be combined with medical records of disabilities (e.g. diabetes).
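
The matching step described above (sales records joined to nutrition facts by UPC, then aggregated by postal area) can be sketched as follows. This is only an illustration: the UPC codes, postal prefixes, sugar values and the function name `sugar_by_area` are all hypothetical and do not come from the Nielsen or INSPQ data.

```python
from collections import defaultdict

# Hypothetical lookup table: UPC -> grams of sugar per unit sold (made-up values).
nutrition_by_upc = {"0001": 39.0, "0002": 0.0, "0003": 12.5}

# Hypothetical sales records: (3-digit postal prefix, UPC, units sold).
sales = [
    ("H3A", "0001", 120),
    ("H3A", "0002", 80),
    ("H2X", "0001", 40),
    ("H2X", "0003", 200),
    ("H2X", "9999", 5),  # UPC with no nutrition record
]

def sugar_by_area(sales, nutrition_by_upc):
    """Join sales to nutrition facts by UPC and total sugar sold per postal area.

    Returns (totals, unmatched_units) so that records without a nutrition
    match are counted instead of being dropped silently.
    """
    totals = defaultdict(float)
    unmatched_units = 0
    for area, upc, units in sales:
        grams = nutrition_by_upc.get(upc)
        if grams is None:
            unmatched_units += units
            continue
        totals[area] += grams * units
    return dict(totals), unmatched_units

totals, unmatched_units = sugar_by_area(sales, nutrition_by_upc)
```

Tracking the unmatched units matters here because incomplete matching is itself a potential source of bias in the aggregated figures.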

Example #2: Predicting Insurgencies
Shane Reese, Brigham Young University
Insurgencies and riots are frequent in South America: 100s or 1000s in each country every year.
4 years of Twitter messages from South America.
The massive database is stored on a Hadoop file system.
Gold standard for insurgencies: GSR.
Occurrence of an insurgency is predicted by the volume of tweets, the presence of some keywords, and an increase in the use of The Onion Router (TOR), an online service to anonymize tweets.

Challenges from Volume
Methods fail on available computers: they do not scale well.
Exploratory data analysis is still crucial, but it is harder and more complex to perform.
Special infrastructure may be needed (e.g. a cluster for distributed data), using languages we are not typically trained in.
Asymptotics fail: the relative link between n and p is different (e.g. $n/p \to k < \infty$ as $n \to \infty$).
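
One way to cope with data that do not fit on a single machine is to compute small summaries on each chunk and merge them afterwards. The sketch below is not from the talk: it uses the standard pairwise update for the mean and variance, and the names `Summary`, `summarize` and `merge` are illustrative.

```python
from dataclasses import dataclass
from functools import reduce

@dataclass
class Summary:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def variance(self) -> float:
        """Unbiased sample variance (NaN when fewer than two points)."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

def summarize(chunk):
    """Single pass over one chunk (Welford's update), as one worker would do."""
    s = Summary()
    for x in chunk:
        s.n += 1
        delta = x - s.mean
        s.mean += delta / s.n
        s.m2 += delta * (x - s.mean)
    return s

def merge(a, b):
    """Combine two partial summaries without revisiting the raw data."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)

# Three chunks standing in for data stored on three different nodes.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
total = reduce(merge, map(summarize, chunks))
```

Because `merge` only needs three numbers per chunk, the same pattern fits a map-reduce style of computation on a cluster.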

Challenges from Variety
New types of data are available and must be included in the analysis:
o Text
o Images
o Sound
o Video
o Networks
Data may be heterogeneous:
o Patrick Brown, UofT: spatial data with postal codes and census areas that do not match and vary through time.
o Bo Li, U. of Illinois: reconstructing temperature data from many proxies that vary through time (tree rings, pollen, ice cores, etc.).

Challenges Related to Veracity
Data were collected for a purpose other than the one we want to use them for. They are observational, thus typically not from the population of interest, which leads to bias.
Data quality is hard to maintain in large administrative databases.
o Lisa Lix, U. Manitoba: models to improve the quality.
Bias may be induced by model selection.
o Richard Lockhart, SFU: inference from the LASSO.
o Ejaz Ahmed, Brock U.: bias from small signals forced to 0.
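
A small simulation can illustrate how model selection alone induces bias. In this sketch (an illustration under simple assumptions, not a method from the talks), all p candidate variables are pure noise with true mean 0, yet the one selected for having the largest sample mean looks like a real effect. The function name `selected_effect` and all settings are hypothetical.

```python
import random

def selected_effect(p, n, rng):
    """Simulate p pure-noise variables (true mean 0) with n observations each
    and return the largest sample mean, i.e. the 'effect' picked by selection."""
    best = float("-inf")
    for _ in range(p):
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        best = max(best, sample_mean)
    return best

rng = random.Random(0)  # fixed seed so the run is reproducible
replicates = [selected_effect(p=50, n=20, rng=rng) for _ in range(200)]

# Average estimate of the selected effect; the true value is 0 for every variable.
avg_selected = sum(replicates) / len(replicates)
```

The average of the selected estimates sits clearly above 0, which is why naive inference after selection overstates effects.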

Challenges from Velocity
Velocity is often a challenge when real-time decisions or predictions must be made.
Inference is often done on fixed data, in which case velocity is not the main issue.
A notable exception: models designed to make online predictions must be able to produce those predictions fast.
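
For the online setting mentioned above, a model can be updated one observation at a time so that each prediction costs only a few arithmetic operations. The following is a minimal sketch using stochastic gradient descent on a linear model; the class name `OnlineLinearModel`, the learning rate and the simulated stream are assumptions made for illustration.

```python
import random

class OnlineLinearModel:
    """Linear model updated one observation at a time (stochastic gradient descent)."""

    def __init__(self, n_features, learning_rate=0.05):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return self.b + sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        """Predict first, then take one gradient step on the squared error.
        Returns the error of the prediction made before the update."""
        err = self.predict(x) - y
        self.b -= self.lr * err
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        return err

# Simulated stream: y = 0.5 + 2*x1 - 1*x2 + noise (made-up coefficients).
rng = random.Random(1)
model = OnlineLinearModel(n_features=2)
squared_errors = []
for _ in range(5000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    y = 0.5 + 2.0 * x[0] - 1.0 * x[1] + rng.gauss(0, 0.1)
    squared_errors.append(model.update(x, y) ** 2)
```

Each update touches every weight once, so the cost per observation is linear in the number of features and no past data need to be stored.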

Solution #1: Building More Complex Models
With more data available, there is the possibility of fitting a much more complex model.
Deep learning is a very successful example of the power of more complex models (e.g. talk of Ruslan Salakhutdinov, UofT).
Many layers of latent variables.
Generates features automatically.
Demo:
o Finding similar images.
o Generating captions for images.

Solution #2: Assuming Sparsity
High-dimensional data may have a lower-dimensional underlying structure.
Sometimes, the dimension of a model may even exceed the sample size!
Assuming sparsity (i.e. that most coefficients are 0) is a possible solution.
The LASSO assumes that only some variables contribute to the signal. A penalty controls the number of null parameters (indirectly, by controlling their magnitude).
Regularization (a penalty to control the coefficients) is used for other models as well, including deep learning models.
Random projections map a high-dimensional space to a smaller space where distances are (almost) preserved.
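
To make the LASSO idea concrete, here is a minimal sketch of coordinate descent for the objective (1/(2n))·||y − Xβ||² + λ·||β||₁, with no intercept. It is a generic textbook algorithm, not the implementation used by any of the speakers; the simulated data, the value of λ and the function names are assumptions. The example deliberately uses more variables (p = 60) than observations (n = 40).

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward 0 by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (beta_j removed).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            diff = new - beta[j]
            if diff != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * diff
                beta[j] = new
    return beta

# Simulated sparse problem: only the first 3 of 60 coefficients are non-zero.
rng = random.Random(0)
n, p = 40, 60
true_beta = [3.0, -2.0, 1.5] + [0.0] * (p - 3)
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(xj * bj for xj, bj in zip(row, true_beta)) + rng.gauss(0, 0.1) for row in X]

beta_hat = lasso_cd(X, y, lam=0.3)
n_nonzero = sum(1 for b in beta_hat if b != 0.0)
```

The soft-thresholding step is what sets many coefficients exactly to 0; it also shrinks the retained coefficients toward 0, which is the source of the bias mentioned under veracity.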

Solution #3: Non-Convex Optimization
Regularization with convex functions is easy to optimize, but non-convex penalties give estimates that behave better.
Statistical problems do not tend to be adversarial, and it is possible to give guarantees of convergence.
Martin Wainwright, UC Berkeley: there is no point in optimizing beyond statistical precision. Local optima within a certain range of the global solution are acceptable.
Optimization for distributed data (and infrastructure).
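
The slide does not name a specific non-convex penalty. As one illustration, the sketch below compares the univariate solution under the convex L1 penalty (soft thresholding) with the one under the minimax concave penalty (MCP), assuming a standardized predictor and a concavity parameter gamma > 1. The function names are illustrative.

```python
def soft_threshold(z, lam):
    """Univariate solution under the convex L1 penalty: shrinks every value by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def mcp_threshold(z, lam, gamma=3.0):
    """Univariate solution under the non-convex MCP penalty (gamma > 1).

    Small inputs are set to 0 as with the L1 penalty, but inputs larger than
    gamma * lam are returned unchanged, so large signals are not shrunk.
    """
    if abs(z) <= gamma * lam:
        return soft_threshold(z, lam) / (1.0 - 1.0 / gamma)
    return z

# Compare both rules on a small, a moderate and a large input.
inputs = [0.5, 2.0, 5.0]
l1_estimates = [soft_threshold(z, lam=1.0) for z in inputs]
mcp_estimates = [mcp_threshold(z, lam=1.0, gamma=3.0) for z in inputs]
```

For the large input, the L1 rule still subtracts lam while the MCP rule leaves the value untouched, which is the sense in which non-convex penalties reduce the bias of the estimates.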

Solution #4: Developing New Visualization Tools
Two examples:
1. Papilio: Sheelagh Carpendale, U. of Calgary.

2. Sofia Olhede, UCL: Network histogram

Solution #5: Developing New Asymptotics
The assumption that $n \to \infty$ while p is fixed is often violated. Classical results may not apply.
New asymptotic results are not only useful for developing methodology; they also help us better understand the structure and behaviour of high-dimensional problems.
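
A standard illustration of why the regime matters (added here as an example, not taken from the talk) is the sample covariance matrix. Assume the rows of the n × p matrix X have i.i.d. entries with mean 0, variance 1 and finite fourth moment, so that the true covariance is the identity. With p fixed, the sample covariance converges to the identity. When n/p tends to a constant k ≥ 1, its extreme eigenvalues do not converge to 1:

```latex
% Classical regime: p fixed, n -> infinity
\hat{\Sigma} = \tfrac{1}{n} X^{\top} X \;\longrightarrow\; I_p

% Proportional regime: n/p -> k with 1 <= k < infinity (Marchenko-Pastur edges)
\lambda_{\min}(\hat{\Sigma}) \;\longrightarrow\; \left(1 - \tfrac{1}{\sqrt{k}}\right)^{2},
\qquad
\lambda_{\max}(\hat{\Sigma}) \;\longrightarrow\; \left(1 + \tfrac{1}{\sqrt{k}}\right)^{2}

% Example: k = 4 gives eigenvalues spread over [0.25, 2.25] although every true eigenvalue is 1
```

Even with four times as many observations as variables, the sample eigenvalues spread over a wide interval, so methods that rely on the sample covariance matrix (such as principal component analysis) need adjusted theory.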

Big Data as a Game Changer
Sallie Keller's analogy with Hubble: Big Data allows us to observe phenomena that were always there, but that we could not observe with previous technologies.
Applied sciences: the cost of research is shifting from data acquisition to data storage and analysis.
Data as a resource: in Business or in Urban Analytics, data are a resource that you must exploit to remain competitive.
Multidisciplinarity gives a big boost.

Statistics vs Computer Science
The Computer Science community has developed infrastructure and tools that make Big Data possible.
What can statisticians bring?
o A bigger focus on inference.
o A good intuition about potential sources of bias.
o A good understanding of stochasticity.
o Strategies to deal with noise (vs signal).
From Steve Scott, Google: statisticians talk to humans, and the brain needs very low-dimensional input for interpretation; computer scientists talk to computers, for which such low-dimensional input is not a requirement.

Conclusion: A Few Words of Wisdom
Knowledge and wisdom about inference are still valid. We should not dismiss what we already know because of the promises of Big Data.
Big Data traps according to David Buckeridge:
o Hubris: seeing big data as a solution in isolation, rather than as potential added value to existing methods and theory.
o Dazzle: starting with the data and looking for problems, rather than defining a problem and then finding the data.
The hype around the term Big Data will probably fade, but the new challenges will remain.