Statistical Challenges with Big Data in Management Science




Arnab Kumar Laha, Indian Institute of Management Ahmedabad

Analytics vs Reporting
[Figure: competitive advantage plotted against degree of intelligence, rising from reporting to analytics. Each level is paired with the question it answers.]
- Standard reports: What happened?
- Ad hoc reports: How many, how often, where?
- Query/drill down: Where exactly is the problem?
- Alerts: What actions are needed?
- Statistical models: Why is this happening?
- Forecasting: What if these trends continue?
- Predictive analytics: What will happen next?
- Prescriptive analytics (decision optimization): What's the best that can happen?

How big is Big Data?
Every day, we create 2.5 quintillion (10^18) bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.

Big Data
The use of large-scale business and administrative data sets for secondary purposes, other than those for which the data was originally collected.

4 Vs of Big Data
- High Volume
- High Velocity
- Large Variety
- Poor Veracity

Key Enablers for Big Data
+ Increase in storage capabilities
+ Increase in processing power
+ Availability of data
+ Cheaper hardware
+ Better value for money for businesses

Big Data: Volume
Enterprises are acquiring very large volumes of data from a variety of sources.
Some examples of use:
- Sentiment analysis: terabytes of tweets are created each day, which can be used for improved product sentiment analysis.
- Predicting power consumption: billions of annual meter readings can be used to better predict power consumption, say every hour or minute.

Big Data: Velocity
For time-sensitive processes such as catching fraud, preventing accidents, or giving life-saving medication, big data must be used as it streams into an enterprise in order to maximize its value. The portion of the data that can be used for each decision is small.
Some examples of use:
- Scrutinize millions of credit card transactions each day to identify potential fraud.
- Analyze billions of daily call detail records in real time to predict customer churn faster.
- In the ICU, analyze blood chemistry and ECG readings in real time to deliver life-saving medication.

Big Data: Variety
Big data can be of any type: structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
Some examples of use:
- Monitor live video feeds from surveillance cameras to identify potential threats.
- Utilize image, audio, video and web information about a customer to give better product usage training, safety tips and recommendations.

Big Data: Veracity
Accuracy is a big concern in Big Data. There is no easy way to separate good data from bad.
Some concerns:
- Among thousands of hotel reviews, which ones are authentic and which are not?
- How can the truth be found from thousands of product reviews?
- How can a rumor be distinguished from an informed communication?

Challenge of Big Data: Scalability of Algorithms for Statistical Computations
Some very useful tools of statistical analysis are presently not implemented with scalable algorithms. For example, if the median is computed using the simple bubble sort algorithm, which has algorithmic complexity O(N^2), it may take a very large amount of time. Statistical methods that use the median or other quantiles as their base will all face this challenge. Better sorting algorithms such as radix sort or heap sort may help.
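
The point about the median can be made concrete with a small sketch. Note that it swaps in a different technique from the sorting algorithms named on the slide: quickselect, which finds the middle element without fully sorting the data and runs in expected O(N) time. The function names `quickselect` and `median` are illustrative choices, not part of the original talk.

```python
import random


def quickselect(values, k, rng=None):
    """Return the k-th smallest element (0-based) in expected O(N) time."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch reproducible
    items = list(values)
    while True:
        pivot = items[rng.randrange(len(items))]
        lower = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if k < len(lower):
            # The target lies among the elements smaller than the pivot.
            items = lower
        elif k < len(lower) + len(equal):
            # The target is the pivot itself.
            return pivot
        else:
            # Discard the pivot and everything smaller, then adjust the rank.
            k -= len(lower) + len(equal)
            items = [x for x in items if x > pivot]


def median(values):
    """Median computed by selection instead of a full sort."""
    n = len(values)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        return quickselect(values, mid)
    return (quickselect(values, mid - 1) + quickselect(values, mid)) / 2
```

The same `quickselect` call with a different rank `k` gives any other quantile, so quantile-based methods benefit in the same way.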

Challenge of Big Data: Scalability of Algorithms for Statistical Computations
Note that the product of an N x N matrix with an N x 1 vector, when done in the usual manner, takes O(N^2) computations. Since this is a basic operation in many statistical algorithms, these algorithms would not be scalable. The product of two N x N matrices done in the conventional manner has O(N^3) complexity. Algorithms such as the Strassen algorithm or the Coppersmith-Winograd algorithm give better performance.
Moral: Newer scalable algorithms are required for doing computations with Big Data.
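
To give a feel for the gap, the sketch below counts scalar multiplications only; it does not multiply any matrices. It assumes N is a power of two and uses the fact that Strassen's recursion replaces one N x N product with 7 products of size N/2, giving roughly O(N^2.81) multiplications. The function names are illustrative, and the count ignores the extra additions Strassen's method needs.

```python
def naive_multiplications(n):
    """Scalar multiplications used by the conventional N x N matrix product."""
    return n ** 3


def strassen_multiplications(n):
    """Scalar multiplications used by Strassen's recursion, for N a power of two.

    Each step replaces one N x N product with 7 products of size N/2.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    if n == 1:
        return 1
    return 7 * strassen_multiplications(n // 2)


# Print N, the conventional count, and the Strassen count side by side.
for n in (2, 64, 1024):
    print(n, naive_multiplications(n), strassen_multiplications(n))
```

The saving is small for tiny matrices and grows with N, which is why such algorithms matter mainly at Big Data scale.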

Challenge of Big Data: Non-Random Samples
- While the data volume is large, the data is not collected for a specific purpose.
- No random sampling scheme is present in most cases.
- Inferential models of statistics (Frequentist / Bayesian) generally assume a random sample drawn from a population, and may give inaccurate results when used with non-random samples.
- Statistical methods that can handle non-random samples are needed.
- Approaches such as using weights to adjust for sample selection may be useful (Nevo, 2003 JBES).
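
As a rough illustration of weighting, the sketch below uses post-stratification, one simple weighting scheme; it is not necessarily the method of the cited paper. It assumes the population share of each group is known from an outside source. The function name, the group labels and the numbers are all hypothetical.

```python
from collections import defaultdict


def poststratified_mean(records, population_shares):
    """Weight each group's sample mean by its known population share.

    records: iterable of (group, value) pairs from the non-random sample.
    population_shares: dict mapping group -> share of the population (sums to 1).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        totals[group] += value
        counts[group] += 1
    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no sample observations for groups: {sorted(missing)}")
    return sum(share * totals[g] / counts[g] for g, share in population_shares.items())


# Hypothetical sample that over-represents urban customers (8 of 10 records).
sample = [("urban", 100.0)] * 8 + [("rural", 50.0)] * 2
raw_mean = sum(value for _, value in sample) / len(sample)
adjusted = poststratified_mean(sample, {"urban": 0.5, "rural": 0.5})
print(raw_mean, adjusted)
```

The unweighted mean is pulled towards the over-represented group, while the weighted mean reflects the assumed population mix.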

Challenge of Big Data: Mixture Data
- Is there a single population or multiple populations in the data set?
- Most statistical methods are devised for drawing inference about a single population.
- If the data is a mixture of observations from multiple populations, we need to discover the number of populations as well as draw separate inferences for each.
- Algorithms such as Flexible Regression and Classification, and machine learning algorithms like CART / CHAID, attempt to do this.
- More such methods are needed.
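
A very small sketch of separating mixed populations follows. It swaps in one-dimensional k-means (Lloyd's algorithm) rather than the methods named on the slide, and it assumes the number of populations is already known to be two, which sidesteps the harder problem of discovering that number. The function name and the data are hypothetical.

```python
def two_means_1d(values, iterations=100):
    """Split 1-D data into two clusters with Lloyd's algorithm (k-means, k = 2)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("need at least two distinct values")
    centers = [lo, hi]  # start from the extremes so neither cluster is empty
    for _ in range(iterations):
        groups = ([], [])
        for x in values:
            # Assign each observation to the nearer center.
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        new_centers = [sum(g) / len(g) for g in groups]
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    return centers, groups


# Hypothetical measurements drawn from two different populations.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, groups = two_means_1d(data)
print(centers)
```

Once the groups are separated, each can be summarized or modelled on its own, which is the "separate inferences" step the slide refers to.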

Challenge of Big Data: Real Time
- With streaming data, the focus is on speed.
- Real-time problems such as fraud detection need quick analysis.
- Data streams are analyzed in memory, in time windows, before being written to disk.
- Analysis of data stored on disk is time consuming and hence not useful for these kinds of applications.
- So only a tiny part of the Big Data is used for drawing conclusions.

Challenge of Big Data: Real Time
- Statistical methods generally work with the whole data and aim to give the best solution.
- It may not be possible to reach the best solution by analyzing only a part of the data.
- There is a trade-off between speed and accuracy.
- Change of philosophy: "Get a good answer fast!" instead of "Get the best answer".
- Techniques from Statistical Quality Control and Statistical Surveillance can possibly be adapted to deal with real-time analytics on streaming data.
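
The idea of analyzing a stream in an in-memory window with a quality-control style rule can be sketched as follows. This is a minimal illustration under assumptions, not a method from the talk: it keeps only the most recent observations and raises an alarm when a new value lies more than three standard deviations from the window mean, in the spirit of a Shewhart control chart. The class name, window size and thresholds are hypothetical.

```python
from collections import deque
import math


class WindowMonitor:
    """Flag values far from the mean of the most recent `window` observations."""

    def __init__(self, window=50, threshold=3.0, min_history=10):
        self.values = deque(maxlen=window)  # old observations fall out automatically
        self.threshold = threshold
        self.min_history = min_history      # must be at least 2

    def update(self, x):
        """Return True if x is an outlier relative to the window, then store x."""
        alarm = False
        n = len(self.values)
        if n >= self.min_history:
            mean = sum(self.values) / n
            var = sum((v - mean) ** 2 for v in self.values) / (n - 1)
            sd = math.sqrt(var)
            alarm = sd > 0 and abs(x - mean) > self.threshold * sd
        self.values.append(x)
        return alarm


# Hypothetical stream: stable readings followed by one extreme value.
monitor = WindowMonitor(window=20)
stream = [10, 11] * 10 + [50]
alarms = [monitor.update(x) for x in stream]
print(alarms[-1])
```

Each update touches only the window, never the full history, which trades some accuracy for speed in the way the slide describes.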

Google Flu Trends
- In 2009, Google reported that by analyzing flu-related search queries it had been able to detect the spread of the flu as accurately as, and more quickly than, CDCP, USA.
- In Feb 2013, Nature reported that Google Flu Trends wasn't working: it predicted more than double the proportion of doctor visits for influenza-like illnesses compared with CDCP.
- Many reasons are possible, including changes in search behaviour and the emergence of alternative sources of information.

Challenge of Big Data: Variety of Data
- Big Data consists of different kinds of data.
- With more devices becoming internet enabled, the variety of data is large.
- Images, audio, text, social media (Twitter, Facebook), graphs etc. all contribute to the variety.
- The goal is to arrive at good decisions using all the information from all the sources.

Challenge of Big Data: Variety of Data
- Symbolic Data Analysis (SDA) attempts to handle this complex problem.
- Present SDA methods are largely descriptive.
- New methods of quantifying association and uncertainty for these kinds of data are required.
- Statistics on manifolds?

Challenge of Big Data: Data Quality
- Data quality is a big concern.
- Apart from bias, the data can be inaccurate.
- Over time, data definitions as well as the method of collection may change.
- For example, in credit default prediction studies, the set of persons seeking loans from an institution may change over time, depending on the bank's lending behaviour.

Challenge of Big Data: Data Quality
- It may be of interest to know how indebted a customer seeking a loan already is.
- Assume that the customer has taken loans from three financial institutions at different times.
- The names on the loan records for the same person may be Ajay K Bose, Ajoy Bose and Ajoy Kr Bose. How does one know these are the same person?
- Statistical matching may be helpful in cleaning the data and building unique customer profiles.
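
The name-matching problem can be illustrated with approximate string comparison. The sketch below is a deliberately simple stand-in for statistical matching: it uses `difflib.SequenceMatcher` from the Python standard library to score how alike two normalized names are. The helper names and the 0.75 threshold are assumptions made for illustration; a real system would also compare fields such as date of birth or address.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case a name and collapse punctuation and repeated spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def name_similarity(a, b):
    """Similarity score in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_same_person(a, b, threshold=0.75):
    """Treat two records as a candidate match if their names are similar enough."""
    return name_similarity(a, b) >= threshold


# The three spellings from the slide, plus an unrelated hypothetical name.
names = ["Ajay K Bose", "Ajoy Bose", "Ajoy Kr Bose"]
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, "|", second, "|", likely_same_person(first, second))
print(likely_same_person("Vijay Mehta", "Ajay K Bose"))
```

A threshold that is too low merges different customers, and one that is too high splits the same customer into several profiles, so it would need tuning against records whose true identity is known.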

Challenge of Big Data: Protecting Privacy and Confidentiality
- The retail chain Target in the US could figure out that a teenager was pregnant even before her father knew (Source: Forbes, Feb 2012).
- A British bank is able to identify potential money launderers who may be linked to terrorism (Source: Superfreakonomics, Levitt & Dubner).
- Is it ethical for Target to guess the private information of an individual customer?
- What if the algorithm wrongly identifies a person as a potential money launderer? What happens to his or her reputation?

Large p, Small N
- In genomic studies, the number of variables (p) is very large and often exceeds the number of samples (N) by a large margin.
- We need to reduce the dimension of the problem to be able to draw meaningful conclusions.
- Similar situations arise in many organizational strategy studies. For example: what would be the best strategy for organization A, which is successful in the UK, to enter emerging markets such as India?
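
One simple way to reduce dimension when p exceeds N is to screen variables before modelling. The sketch below keeps the variables whose absolute correlation with the response is largest. This is an illustrative technique chosen here, not one named in the talk, and the function names and data are hypothetical. With very few samples some correlations will be large by chance, so the result is a starting point and not a conclusion.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_features(columns, response, keep):
    """Return the `keep` variable names with the largest |correlation| to the response.

    columns: dict mapping variable name -> list of N observed values.
    """
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], response)),
        reverse=True,
    )
    return ranked[:keep]


# Hypothetical data with N = 4 samples and p = 5 variables.
response = [1.0, 2.0, 3.0, 4.0]
columns = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [-1.0, -2.0, -3.0, -4.0],
    "g3": [1.0, -1.0, 1.0, -1.0],
    "g4": [2.0, 2.0, 2.0, 2.0],
    "g5": [0.0, 1.0, 0.0, 1.0],
}
selected = screen_features(columns, response, keep=2)
print(selected)
```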

Summary
Big data poses new challenges to statisticians, both in terms of theory and applications. Some of the challenges include:
- Scalability of statistical computation methods
- Non-random data
- Mixture data
- Real-time analysis on streaming data
- Statistical analysis with multiple kinds of data
- Data quality
- Protecting privacy and confidentiality
- High-dimensional data

Thank you arnab@iimahd.ernet.in