Challenges, Tools and Examples for Big Data Inference
Jean-François Plante, HEC Montréal
Closing Conference: Statistical and Computational Analytics for Big Data
June 12th, 2015
What is Big Data?
Dan Ariely from Duke University: "Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it."
Overview of the Opening Conference and Bootcamp
Held at the Fields Institute, January 12 to January 23.
35 scientific talks covering all themes of the Big Data program, one theme per day.
An overview paper is being prepared by the postdoctoral fellows and longer-term visitors at the Fields Institute.
Themes of the Program
Week one:
o Introductory Lectures and Overview
o Inference
o Environmental Science
o Optimization
Week two:
o Visualization
o Social Policy
o Health Policy
o Deep Learning
o Networks and Machine Learning
Why Do We Talk About Big Data?
Because we can! (Technology makes it possible.)
Because Big Data allows us to observe and measure human behaviours and events.
Because we can measure new things that are otherwise hard or impossible to evaluate.
Because imperfect, large, unstructured or hard-to-handle data may still contain valuable information that we should not dismiss.
Example #1: Measuring the Effect of Nutrition
David Buckeridge, McGill University, with the INSPQ.
Diet is known to be an important factor in the study of disabilities, but very little is known about people's nutritional behaviour.
Nielsen: information about all products sold by grocery and corner stores (from about 10% of all outlets) at the 3-digit postal code level, matched with UPC codes for nutritional content.
Loyalty programs: purchases at the household level.
These sources can be combined with medical records of disabilities (e.g. diabetes).
Example #2: Predicting Insurgencies
Shane Reese, Brigham Young University.
Insurgencies and riots are frequent in South America: hundreds or thousands in each country every year.
Four years of Twitter messages from South America; the massive database is stored on a Hadoop file system.
Gold standard for insurgencies: GSR.
The occurrence of an insurgency is predicted by the volume of tweets, the presence of certain keywords, and an increase in the use of The Onion Router (TOR), an online service used to anonymize tweets.
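As a minimal sketch only: the slide describes predicting a binary event from tweet volume, keyword counts and TOR usage. The synthetic data, feature names and the choice of logistic regression below are my own illustration, not the model actually used by Reese's group.

```python
# Sketch: a binary classifier of the general kind described above,
# fitted to synthetic data (all numbers here are made up).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
tweet_volume = rng.poisson(200, n)      # daily tweet counts for a region
keyword_hits = rng.poisson(5, n)        # tweets containing protest keywords
tor_increase = rng.normal(0, 1, n)      # change in TOR usage (standardized)

# Synthetic "ground truth": event probability rises with all three signals
logit = -6 + 0.01 * tweet_volume + 0.2 * keyword_hits + 0.8 * tor_increase
insurgency = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([tweet_volume, keyword_hits, tor_increase])
X_tr, X_te, y_tr, y_te = train_test_split(X, insurgency, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```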
Challenges from Volume
Methods fail on available computers: they do not scale well.
Exploratory Data Analysis is still crucial, but it is harder and more complex to perform.
Special infrastructure may be needed (e.g. a cluster for distributed data), using languages we are not typically trained in.
Asymptotics fail: the relative link between n and p is different (e.g. n/p → k < ∞ as n → ∞).
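To illustrate how distributed data changes the workflow, here is a minimal split/combine sketch (not tied to any particular cluster software): least squares fitted by accumulating the sufficient statistics X'X and X'y over chunks, as one would across worker nodes.

```python
# Sketch: map-reduce-style least squares. The chunking is simulated in
# memory; on a real cluster each chunk would live on a different node.
import numpy as np

rng = np.random.default_rng(1)
p = 5
beta_true = rng.normal(size=p)

def make_chunk(n_rows):
    X = rng.normal(size=(n_rows, p))
    y = X @ beta_true + rng.normal(scale=0.5, size=n_rows)
    return X, y

# "Map": each worker computes its local sufficient statistics
XtX = np.zeros((p, p))
Xty = np.zeros(p)
for _ in range(20):                 # 20 chunks standing in for 20 workers
    X, y = make_chunk(10_000)
    XtX += X.T @ X
    Xty += X.T @ y

# "Reduce": a single small linear solve on the combined statistics
beta_hat = np.linalg.solve(XtX, Xty)
print(np.round(beta_hat - beta_true, 4))   # close to zero
```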
Challenges from Variety
New types of data are available and must be included in the analysis:
o Text
o Images
o Sound
o Video
o Networks
Data may be heterogeneous:
o Patrick Brown, UofT: spatial data indexed by postal codes and census areas, which do not match and vary through time.
o Bo Li, U. of Illinois: reconstructing temperature data from many proxies that vary through time (tree rings, pollen, ice cores, etc.).
Challenges Related to Veracity
Data were collected for a purpose other than the one we want to use them for.
They are observational, thus typically not from the population of interest, which induces bias.
Data quality is hard to maintain in large administrative databases.
o Lisa Lix, U. Manitoba: models to improve data quality.
Bias may be induced by model selection.
o Richard Lockhart, SFU: inference from the LASSO.
o Ejaz Ahmed, Brock U.: bias from small signals forced to 0.
Challenges from Velocity
Velocity is often a challenge when real-time decisions or predictions must be made.
Inference is often done on fixed data, so velocity is usually not the main issue.
A notable exception: models designed to make online predictions have to produce those predictions fast.
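A minimal sketch of online prediction, assuming scikit-learn's SGDRegressor: the model is updated with partial_fit as mini-batches arrive, so predictions stay available without refitting on the full history.

```python
# Sketch: a model updated on streaming data; each incoming mini-batch
# triggers an incremental update rather than a full refit.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(2)
beta = np.array([1.0, -2.0, 0.5])
model = SGDRegressor(learning_rate="constant", eta0=0.01)

for t in range(200):                      # 200 mini-batches arriving over time
    X = rng.normal(size=(50, 3))
    y = X @ beta + rng.normal(scale=0.1, size=50)
    model.partial_fit(X, y)               # incremental update

print(np.round(model.coef_, 2))           # approaches [1.0, -2.0, 0.5]
```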
Solution #1: Building More Complex Models
With more data available, it becomes possible to fit much more complex models.
Deep learning is a very successful example of the power of more complex models (e.g. the talk of Ruslan Salakhutdinov, UofT):
o Many layers of latent variables.
o Generates features automatically.
Demo:
o Finding similar images.
o Generating captions for images.
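As a small illustration of stacking several layers of latent variables, here is a sketch using scikit-learn's MLPClassifier on the digits data; the image-retrieval and captioning demos mentioned above use far larger architectures than this.

```python
# Sketch only: a small multi-layer network whose hidden layers learn
# features automatically from the raw pixel inputs.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(128, 64, 32),  # three hidden layers
                    max_iter=500, random_state=0)
net.fit(X_tr, y_tr)
print("test accuracy:", round(net.score(X_te, y_te), 3))
```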
Solution #2: Assuming Sparsity
High-dimensional data may have a lower-dimensional underlying structure. Sometimes the dimension of a model may even exceed the sample size!
Assuming sparsity (i.e. that most coefficients are 0) is a possible solution.
The LASSO assumes that only some variables contribute to the signal; a penalty controls the number of null parameters (indirectly, by controlling their magnitude).
Regularization (a penalty to control the coefficients) is used for other models as well, including deep learning models.
Random projections map a high-dimensional space to a smaller space where distances are (almost) preserved.
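Two small sketches of these ideas, using scikit-learn: a lasso fit where the penalty drives most coefficients to zero, and a Gaussian random projection under which pairwise distances are roughly preserved. The specific sizes and penalty value are illustrative choices.

```python
# Sketch: (1) the lasso recovers a sparse signal; (2) a random projection
# maps 1000-dimensional points to 50 dimensions with distances roughly kept.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.random_projection import GaussianRandomProjection
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(3)

# (1) Sparsity: only 5 of 200 coefficients are truly nonzero
n, p = 100, 200
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 3.0
y = X @ beta + rng.normal(size=n)
lasso = Lasso(alpha=0.5).fit(X, y)        # the penalty alpha controls sparsity
print("nonzero coefficients:", np.sum(lasso.coef_ != 0))

# (2) Random projection: distances before vs. after projecting
Z = rng.normal(size=(30, 1000))
proj = GaussianRandomProjection(n_components=50, random_state=0)
Z_low = proj.fit_transform(Z)
off_diag = ~np.eye(30, dtype=bool)
ratios = pairwise_distances(Z_low)[off_diag] / pairwise_distances(Z)[off_diag]
print("distance ratios: mean %.2f, sd %.2f" % (ratios.mean(), ratios.std()))
```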
Solution #3: Non-Convex Optimization
Regularization with convex functions is easy to optimize, but non-convex penalties offer better behaviour of the estimates.
Statistical problems do not tend to be adversarial, and it is possible to give guarantees of convergence.
Martin Wainwright, UC Berkeley: there is no point in optimizing beyond statistical precision; local optima within that range of the global solution are acceptable.
Optimization for distributed data (and infrastructure).
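The slide names no specific non-convex penalty, so as one concrete illustration here is a sketch comparing the lasso's soft-thresholding rule with the thresholding rule of the MCP penalty, whose estimates are (nearly) unbiased for large signals.

```python
# Sketch: one-dimensional thresholding rules for the lasso (convex) and the
# MCP penalty (non-convex, chosen here only as an example).
import numpy as np

def soft_threshold(z, lam):
    """Lasso solution to min_b 0.5*(z-b)^2 + lam*|b|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def mcp_threshold(z, lam, gamma=3.0):
    """MCP solution to the same one-dimensional problem (gamma > 1)."""
    z = np.asarray(z, dtype=float)
    small = np.abs(z) <= gamma * lam
    return np.where(small,
                    soft_threshold(z, lam) / (1.0 - 1.0 / gamma),
                    z)                    # large signals are not shrunk

z = np.array([0.5, 1.5, 3.0, 6.0])
print("soft:", soft_threshold(z, lam=1.0))  # nonzero survivors reduced by lam
print("mcp: ", mcp_threshold(z, lam=1.0))   # values with |z| > gamma*lam untouched
```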
Solution #4: Developing New Visualization Tools
Two examples:
1. Papillio: Sheelagh Carpendale, U. of Calgary.
2. Sofia Olhede, UCL: the network histogram.
Solution #5: Developing New Asymptotics
The assumption that n → ∞ while p is fixed is often violated, so classical results may not apply.
New asymptotic results are not only useful for developing methodology; they also help us better understand the structure and behaviour of high-dimensional problems.
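A small simulation, offered only as an illustration of why p growing with n changes the picture: the eigenvalues of a sample covariance matrix spread far from 1 when p/n stays constant, even though the true covariance is the identity.

```python
# Sketch: sample-covariance eigenvalues under classical vs. high-dimensional
# asymptotics. With p fixed and n large they concentrate near 1; with p/n
# held constant they spread out (Marchenko-Pastur-type behaviour).
import numpy as np

rng = np.random.default_rng(4)

def eig_range(n, p):
    X = rng.normal(size=(n, p))            # true covariance is the identity
    S = X.T @ X / n                        # sample covariance matrix
    vals = np.linalg.eigvalsh(S)
    return vals.min(), vals.max()

print("p fixed (n=10000, p=10):     ", np.round(eig_range(10_000, 10), 2))
print("p/n constant (n=1000, p=500):", np.round(eig_range(1_000, 500), 2))
```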
Big Data as a Game Changer
Sallie Keller's analogy with Hubble: Big Data allows us to observe phenomena that were always there, but that we could not observe with previous technologies.
Applied sciences: the cost of research is shifting from data acquisition to data storage and analysis.
Data as a resource: in business or in urban analytics, data are a resource that you must exploit to remain competitive.
Multidisciplinarity gives a big boost.
Statistics vs Computer Science
The Computer Science community has developed the infrastructure and tools that make Big Data possible. What can statisticians bring?
o A bigger focus on inference.
o A good intuition about potential sources of bias.
o A good understanding of stochasticity.
o Strategies to deal with noise (vs. signal).
From Steven Scott, Google: statisticians talk to humans, and the brain needs very low-dimensional input for interpretation; computer scientists talk to computers, for which such low-dimensional input is not a requirement.
Conclusion: A Few Words of Wisdom
Knowledge and wisdom about inference are still valid: we should not dismiss what we already know because of the promises of Big Data.
Big Data traps, according to David Buckeridge:
o Hubris: seeing Big Data as a solution in isolation, rather than as potential added value to existing methods and theory.
o Dazzle: starting with the data and looking for problems, rather than defining a problem and then finding the data.
The hype around the term Big Data will probably fade, but the new challenges will remain.