Statistical Challenges with Big Data in Management Science
Arnab Kumar Laha
Indian Institute of Management Ahmedabad
Analytics vs Reporting
[Diagram: the reporting-to-analytics ladder; degree of intelligence and competitive advantage increase up the ladder]
Reporting: standard reports (What happened?), ad hoc reports (How many, how often, where?), query/drill down (Where exactly is the problem?), alerts (What actions are needed?)
Analytics: statistical models (Why is this happening?), forecasting (What if these trends continue?), predictive analytics (What will happen next?), prescriptive analytics / decision optimization (What is the best that can happen?)
How big is Big Data? Every day, we create 2.5 quintillion (10^18) bytes of data: so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.
Big Data: the use of large-scale business and administrative data sets for secondary purposes, i.e., purposes other than those for which the data were originally collected.
4 V's of Big Data: High Volume, High Velocity, Large Variety, Poor Veracity
Key Enablers for Big Data
+ Increase in storage capabilities
+ Increase in processing power
+ Availability of data
+ Cheaper hardware
+ Better value-for-money for businesses
Big data: Volume Enterprises are acquiring very large volumes of data through a variety of sources. Some examples of use: - Sentiment analysis: terabytes of tweets are created each day, which can be used for improved product sentiment analysis - Predicting power consumption: billions of annual meter readings can be used for better prediction of power consumption, say every hour or minute
Big data: Velocity For time-sensitive processes such as catching fraud, preventing accidents, or giving life-saving medication, big data must be used as it streams into an enterprise in order to maximize its value; the window of data that can be used for decision making is small. Some examples of use: - Scrutinize millions of credit card transactions each day to identify potential fraud - Analyze billions of daily call detail records in real time to predict customer churn faster - In the ICU, analyze blood chemistry / ECG readings in real time to deliver life-saving medication
Big data: Variety Big data can be of any type: structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when these data types are analyzed together. Some examples of use: - Monitor live video feeds from surveillance cameras to identify potential threats - Use image, audio, video and web information about a customer to give better product usage training, safety tips and recommendations
Big data: Veracity Accuracy is a big concern in Big Data; there is no easy way to segregate good data from bad. Some concerns: - Among thousands of hotel reviews, which ones are authentic and which are not? - How does one find out the truth from thousands of product reviews? - How does one distinguish a rumour from an informed communication?
Challenge of Big Data: Scalability of Algorithms for Statistical Computations Some very useful tools of statistical analysis are presently not implemented with scalable algorithms. For example, if the median is computed via the simple bubble sort algorithm, it may take a very long time (algorithmic complexity O(N^2)). Statistical methods that use the median or other quantiles as their base will all face this challenge. Using better sorting algorithms such as radix sort or heap sort, or a linear-time selection algorithm, may help.
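As an illustration (a minimal sketch, not tied to any particular package), a selection algorithm finds the median in O(N) expected time, avoiding a full sort entirely:

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element of xs (0-indexed) in O(N) expected
    time, versus O(N log N) for a full sort or O(N^2) for bubble sort."""
    xs = list(xs)
    while True:
        pivot = random.choice(xs)
        lows = [x for x in xs if x < pivot]
        pivots = [x for x in xs if x == pivot]
        highs = [x for x in xs if x > pivot]
        if k < len(lows):
            xs = lows                      # answer lies in the lower part
        elif k < len(lows) + len(pivots):
            return pivot                   # the pivot is the k-th smallest
        else:
            k -= len(lows) + len(pivots)   # recurse into the upper part
            xs = highs

def median(xs):
    """Median via selection; averages the two middle elements for even N."""
    n = len(xs)
    if n % 2 == 1:
        return quickselect(xs, n // 2)
    return 0.5 * (quickselect(xs, n // 2 - 1) + quickselect(xs, n // 2))
```

The same idea extends to any quantile by choosing k accordingly.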
Challenge of Big Data: Scalability of Algorithms for Statistical Computations Note that the product of an NxN matrix with an Nx1 vector, when done in the usual manner, takes O(N^2) computations. Since this operation is basic to many statistical algorithms, those algorithms would not be scalable. The product of two NxN matrices done in the conventional manner has O(N^3) complexity; algorithms such as the Strassen and Coppersmith-Winograd algorithms give better asymptotic performance. Moral: newer scalable algorithms are required for doing computations with Big Data.
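To make the Strassen idea concrete: it multiplies 2x2 blocks with 7 scalar multiplications instead of 8, and applying this recursively to sub-blocks gives O(N^log2(7)) ≈ O(N^2.81) instead of O(N^3). A minimal sketch of the base case:

```python
def strassen_2x2(A, B):
    """Strassen's scheme for a 2x2 product: 7 multiplications instead of 8.
    In the full algorithm the scalars a..h are themselves matrix blocks."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]
```

Note the savings only pay off at scale: each recursion level trades one block multiplication for many extra additions.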
Challenge of Big Data: Non-Random Samples While the data volume is large, the data are not collected for a specific purpose, and no random sampling scheme is present in most cases. Inferential models of statistics (frequentist or Bayesian) generally assume a random sample drawn from a population and may give inaccurate results when used with non-random samples. The need is for statistical methods that can handle non-random samples; approaches such as using weights to adjust for sample selection may be useful (Nevo, 2003, JBES).
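A minimal sketch of the weighting idea, assuming the probability of each unit appearing in the sample is known (a strong assumption in practice; the Nevo-style approach estimates such weights from auxiliary data):

```python
def weighted_mean(values, inclusion_probs):
    """Inverse-probability-weighted mean: units over-represented in a
    non-random sample (high inclusion probability) are down-weighted."""
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

For example, a respondent who was twice as likely to enter the sample counts half as much as one sampled with certainty.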
Challenge of Big Data: Mixture Data Is there a single population or multiple populations in the data set? Most statistical methods are devised for drawing inference about a single population. If the data are a mixture of observations from multiple populations, we need to discover the number of populations as well as draw separate inferences for each. Algorithms such as flexible regression and classification, and machine learning algorithms like CART/CHAID, attempt to do this; more such methods are needed.
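A minimal EM sketch for separating a two-component univariate Gaussian mixture (illustrative only, on synthetic data; real mixture analysis must also select the number of components, e.g. via BIC):

```python
import numpy as np

def em_two_gaussians(x, n_iter=200):
    """EM for a two-component univariate Gaussian mixture: alternate
    responsibilities (E-step) and weighted parameter updates (M-step)."""
    mu = np.array([x.min(), x.max()], dtype=float)  # crude initialization
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from each component
        dens = (pi / (sigma * np.sqrt(2 * np.pi)) *
                np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted mixing weights, means, and sds
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(10, 1, 500)])
pi, mu, sigma = em_two_gaussians(x)  # recovers means near 0 and 10
```

With the two populations discovered, separate inferences can then be run on each component.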
Challenge of Big Data: Real Time Streaming data: the focus is on speed. Real-time problems such as fraud detection need quick analysis. Data streams are analyzed in memory, in time windows, before being written to disk; analysis of data stored on disk is time-consuming and hence not useful for these kinds of applications. So only a tiny part of the Big Data is used for drawing conclusions.
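One standard one-pass technique for keeping a small, analyzable summary of an unbounded stream is reservoir sampling (a generic sketch, not specific to any system):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of size k from a stream of unknown
    length, using O(k) memory and touching each item exactly once."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)            # fill the reservoir first
        else:
            j = rng.randrange(i + 1)       # item i survives with prob k/(i+1)
            if j < k:
                reservoir[j] = x
    return reservoir
```

Statistics computed on the reservoir then stand in for statistics on the full stream, at the cost of sampling error.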
Challenge of Big Data: Real Time Statistical methods generally work with the whole data set and aim to give the best solution. It may not be possible to obtain the best solution by analyzing only a part of the data: there is a trade-off between speed and accuracy. Change of philosophy: "Get a good answer fast!" instead of "Get the best answer". Techniques from statistical quality control and statistical surveillance can possibly be adapted to deal with real-time analytics on streaming data.
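As one example of adapting statistical process control, a one-sided CUSUM detector processes each observation once and flags an upward shift quickly (the reference value k and threshold h below are illustrative choices):

```python
def cusum(stream, target, k=0.5, h=5.0):
    """One-sided CUSUM from statistical process control: returns the index
    of the first observation at which the cumulative deviation above
    `target` (less allowance k) exceeds threshold h, or None if no alarm.
    O(1) work per observation, so it suits streaming data."""
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + (x - target - k))  # accumulate upward deviations
        if s > h:
            return i                        # alarm: process has shifted
    return None
```

The trade-off the slide describes appears directly in h: a lower threshold gives faster alarms but more false positives.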
Google Flu Trends In 2009, Google reported that by analyzing flu-related search queries it had been able to detect the spread of the flu as accurately as, and more quickly than, the CDCP (USA). In February 2013, Nature reported that Google Flu Trends wasn't working: it predicted more than double the proportion of doctor visits for influenza-like illness that the CDCP did. Many reasons are possible, including changes in search behaviour, the emergence of alternative sources of information, etc.
Challenge of Big Data: Variety of Data Big Data consists of different kinds of data, and with more devices becoming internet-enabled the variety is large. Images, audio, text, social media (Twitter, Facebook), graphs, etc. all contribute to the variety. The goal is to arrive at good decisions using all the information from all the sources.
Challenge of Big Data: Variety of Data Symbolic Data Analysis (SDA) attempts to handle this complex problem, but present SDA methods are largely descriptive. New methods of quantifying association and uncertainty for these kinds of data are required. Statistics on manifolds?
Challenge of Big Data: Data Quality Data quality is a big concern: apart from bias, the data can be inaccurate, and over time the data definitions as well as the methods of collection may change. For example, in credit default prediction studies, the population of persons seeking loans from an institution may change over a period of time based on the bank's past behaviour in granting loans.
Challenge of Big Data: Data Quality It may be of interest to know to what extent a customer seeking a loan is already indebted. Assume the customer has taken loans from three financial institutions at different times. The names on the loan records for the same person may be Ajay K Bose, Ajoy Bose and Ajoy Kr Bose: how does one know these are the same person? Statistical matching may be helpful in cleaning the data and building unique customer profiles.
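A crude sketch of the matching idea using character-level similarity (real statistical matching, e.g. Fellegi-Sunter record linkage, models match probabilities across several fields instead):

```python
from difflib import SequenceMatcher

def likely_same(a, b, threshold=0.6):
    """Flag two name strings as a probable match when their character-level
    similarity ratio (after case-folding) exceeds a tunable threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
```

On the slide's example, "Ajay K Bose" and "Ajoy Kr Bose" score well above the threshold, while an unrelated name does not; in practice the threshold trades false merges against missed matches.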
Challenge of Big Data: Protecting Privacy and Confidentiality The US retail chain Target could figure out that a teenager was pregnant even before her father knew (source: Forbes, Feb 2012). A British bank is able to identify potential money launderers who may be linked to terrorism (source: Superfreakonomics, Levitt & Dubner). Is it ethical for Target to infer the private information of an individual customer? What if the algorithm identifies the wrong person as a potential money launderer? What happens to his or her reputation?
Large p, Small N In genomic studies, the number of variables (p) is very large and often exceeds the number of samples (N) by a big margin. We need to reduce the dimension of the problem to be able to draw meaningful conclusions. Similar situations arise in many organizational strategy studies: what would be the best strategy for organization A, which is successful in the UK, to enter emerging markets such as India?
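A minimal sketch of one common dimension-reduction step, principal component analysis via the SVD, on synthetic p >> N data (the dimensions below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 1000                   # far fewer samples than variables
X = rng.normal(size=(N, p))

# PCA via SVD of the centered data; with p > N, at most N-1 informative
# directions exist, so full_matrices=False keeps only the N components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S                    # samples projected onto the components
```

The 1000-dimensional observations are thereby represented in at most N-1 = 19 coordinates; downstream modelling then works in that reduced space.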
Summary Big data poses new challenges to statisticians in terms of both theory and applications. Some of the challenges include:
- Scalability of statistical computation methods
- Non-random samples
- Mixture data
- Real-time analysis of streaming data
- Statistical analysis with multiple kinds of data
- Data quality
- Protecting privacy and confidentiality
- High-dimensional data
Thank you arnab@iimahd.ernet.in