Making data predictive - why reactive just isn't enough

Andrew Peterson, Ph.D.
Principal Data Scientist
Soltius NZ, Ltd.

New Zealand 2014 Big Data and Analytics Forum
18 August, 2014

Caveats and disclaimer: This is a large and complex topic and I cannot do it justice in the time available, so I will focus on my own perspectives (i.e., opinions) and experience. As such, I am not presenting or representing the views or opinions of any current or past company or organisation I may have worked with or for. I'm also going to brush aside all technical details, despite how important they really are. So please keep that in mind if I say something you disagree with. We can chat about details later if that would be helpful for you.

Introduction

When we talk about predictive analytics, most people probably think about using computers to predict an event or the state of a system at some future point in time. But there are other types of predictions that we can make using analytics that don't necessarily involve time directly, although they may indirectly. These include things like estimating the output of a production line if certain control variables are changed, understanding how a change in hospital staffing levels may affect patient occupancy, or estimating the risk of a person developing a specific heritable disease. Answers to these types of problems are predictive because they involve estimating the state of a system under conditions that have not yet been observed, even though they may not necessarily involve explicitly projecting the state of a system forward in time.

Going beyond analysing past usage data to estimate future trends

In my opinion, and in very general terms, the principal feature that distinguishes predictive analytics from non-predictive analytics is statistical and mathematical modelling. Consequently, to move from non-predictive analytics to predictive analytics requires at least the following two steps:

1. The development and use of a statistical or mathematical model to generalise historical patterns among input variables and output variables for the system under study.
2. The use of that model to estimate a new, unobserved output state from a set of at least partly unobserved input states.

The principal requirement for predictive analytics is therefore a model, that is, a mathematical representation of the nature and the magnitude of relationships among the input and output variables. The model itself is an abstraction of reality. But if the system under study is amenable to predictive modelling (many systems are not), and the modelling is done well, then the model will enhance our understanding of the system through its representation of the relationships among variables. And with enhanced understanding comes greater opportunities for effective management and control of the target system. In other words, enhanced understanding empowers us to make better decisions. This, to me, is the real benefit that comes from predictive analytics.
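To make the two steps above concrete, here is a minimal sketch in Python. It is only an illustration: the production-line data are simulated, the names (control, output, new_control) are hypothetical, and a straight line fitted by least squares stands in for whatever model a real problem would call for.

```python
import numpy as np

# Hypothetical historical data (simulated for illustration only):
# a control setting on a production line and the output observed at that setting.
rng = np.random.default_rng(1)
control = np.linspace(10.0, 50.0, 40)
output = 3.0 + 0.8 * control + rng.normal(0.0, 1.0, control.size)

# Step 1: generalise the historical pattern with a model.
# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
slope, intercept = np.polyfit(control, output, deg=1)

# Step 2: use the model to estimate the output for an input state
# that was never observed in the historical data.
new_control = 37.3
predicted = intercept + slope * new_control

print(f"fitted model: output = {intercept:.2f} + {slope:.2f} * control")
print(f"estimated output at control = {new_control}: {predicted:.2f}")
```

The fitted slope and intercept are the "nature and magnitude" of the relationship in this toy case. Whether a straight line is an adequate abstraction is exactly the kind of question the rest of the modelling process has to answer.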

poorly and the cost of doing so can be very high. Candidates for this type of work should have a comprehensive understanding of concepts such as:

a. Statistical independence
b. Autocorrelation
c. Multicollinearity
d. Heteroscedasticity
e. Pseudoreplication
f. Nonstationarity
g. Statistical power
h. Spurious regression
i. Conditional probabilities
j. Overfitting
k. Etc., etc.

Data snooping - the curse of data mining: Data snooping can take many forms but in essence it means poking around data and making many different comparisons until some interesting relationship is found, then reporting that relationship as if it is real and meaningful. But, unless there has been appropriate statistical control or validation against independent data, there is a high risk that the "interesting" relationship is spurious and simply the outcome of random chance.

To illustrate, I generated 100 random time series using an ARIMA(1,1,0) model with an AR term of 0.1 (examples are shown in the figure below). The key point to understand here is that each separate time series has no causal relationship with any other time series in the set, and that each series wanders up and down randomly in a manner similar to some stock prices. The only thing they have in common is their starting value and the statistical properties of their random wandering.

If we calculate all 4950 pairwise correlations between these series, then roughly half of the correlations will have a magnitude greater than 0.5 (that is, above +0.5 or below -0.5). Most analysts would probably agree that a correlation of that magnitude should warrant further investigation. However, we already know that all the series in this example are random and independent of each other. Yet our analysis has found roughly 2000 pairs of variables that suggest potentially interesting, if not important, relationships.
Even though this is a deliberate and contrived example, it nevertheless illustrates the point that the more variables we have in our data and the more relationships we investigate, the more we need to expect that we will find spurious relationships that have no causal or business basis. The take-home message here is to be wary of those who emphasise the simplicity of finding "hidden relationships and opportunities in masses of data" without detailing the precautions they take to ensure those relationships are real.
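The experiment described above can be approximated with the short Python sketch below. It is not the code used for the original figure. The series length (500 steps), the noise scale, the starting value and the random seed are all assumptions, so the exact count of "strong" correlations will differ from the numbers quoted in the text.

```python
import numpy as np

def simulate_arima_110(n_series, n_steps, ar=0.1, start=100.0, seed=0):
    """Simulate independent ARIMA(1,1,0) series.

    The first differences follow an AR(1) process with coefficient `ar`;
    cumulatively summing them gives the integrated (wandering) series.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_series, n_steps))
    diffs = np.zeros_like(noise)
    diffs[:, 0] = noise[:, 0]
    for t in range(1, n_steps):
        diffs[:, t] = ar * diffs[:, t - 1] + noise[:, t]
    return start + np.cumsum(diffs, axis=1)

def pairwise_correlations(series):
    """Return the correlation for every distinct pair of rows in `series`."""
    corr = np.corrcoef(series)                      # rows are treated as variables
    upper = np.triu_indices(series.shape[0], k=1)   # each pair once, no diagonal
    return corr[upper]

series = simulate_arima_110(n_series=100, n_steps=500)
corrs = pairwise_correlations(series)
n_pairs = corrs.size
n_strong = int(np.sum(np.abs(corrs) > 0.5))

print(f"{n_pairs} pairwise correlations")
print(f"{n_strong} have magnitude greater than 0.5, "
      f"although every series is independent of the others")
```

Re-running with different seeds or series lengths changes the count, but a large share of apparently strong correlations among unrelated series is the typical result.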

we still have sufficient statistical power to identify patterns that will have real business value. But we forgot one very important step in the model building process - we still need to search for the best combination of predictor variables, and every separate combination of those will require another 3750 separate model fittings!

This example is deliberately simplistic. There are sophisticated methods for reducing the total number of times a model needs to be fitted, but the important point is that the total number of fittings required to build a production-ready model is usually much higher than many people realise. And the more data we use, the longer the model development process will take.

4. Model testing - the path to true enlightenment: This is, in my opinion, the single most important phase of predictive modelling. It is when we try to simulate a production setting by using the model to make predictions based on hold-out data that the model has never seen before. These predictions are then compared to the actual historical outcomes from the hold-out data and it is this comparison, and this comparison only, that we use to evaluate the effectiveness of our model.

But there's a catch - the technically correct way to use hold-out data is to only ever use it once. Say we build a model that does not perform well on our hold-out data. We start again, but to test the new model, we need a new set of hold-out data. If we don't have new hold-out data, all we do is enter a process of optimising our models for a single set of hold-out data. Consequently, model testing is where big data, or at least lots of data, really can be beneficial. In an ideal scenario, we would want to test our models many times using independent sets of hold-out data to get a good understanding of model performance before we risk putting them into production.

But we are still not finished...

5. Model maintenance: Heraclitus (535 BC - 475 BC) is credited with saying "There is nothing permanent except change". What was true then remains true today - we must expect any system we model to change over time, possibly to the extent that our models begin to fail. This is particularly relevant in situations where we are using models to manage or control a system. In this case, the actions of management and control ARE changing the system, so we need to be particularly vigilant about the performance of such models over time. In the best case scenario, our models may simply need to be retrained or retuned on a regular basis to accommodate change. But the system may also change to such a degree that our original modelling paradigm may no longer be relevant and we are forced to go through the entire modelling process again.

A classic example of a model failing over time is the Google flu predictor. The folks at Google are clever, and I'm confident their model was developed using sound statistical principles. But its release to the public apparently influenced the incidence of the term "flu" appearing on the internet to such an extent that it caused the model to seriously overestimate the next flu outbreak.

6. Integration and model deployment: As mentioned previously, it doesn't matter how good a predictive model is if it remains inaccessible within an organisation. In order for predictive analytics to provide the greatest return on investment, it is essential that people within the organisation feel empowered in their roles through using the insight the models provide. When people feel empowered in this manner, the culture of the organisation will begin to embrace the use of predictive models as a valuable tool that helps people do their jobs better. This type of empowerment comes when the output from the models is reliable, easily accessible, easily interpreted, and meaningful in the context of people's daily roles and responsibilities.
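Returning to the model maintenance step above, one simple way to stay vigilant is to track the model's recent prediction error and compare it with the error measured when the model was first tested. The Python sketch below is a hypothetical illustration only: the function names, the 30-observation window and the 1.5x tolerance are assumptions, not recommendations, and the data are invented so that the system shifts part-way through.

```python
import numpy as np

def rolling_mae(actual, predicted, window):
    """Mean absolute error over each consecutive block of `window` observations."""
    errors = np.abs(np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float))
    kernel = np.ones(window) / window
    return np.convolve(errors, kernel, mode="valid")

def needs_retraining(actual, predicted, baseline_mae, window=30, tolerance=1.5):
    """Flag the model when its most recent rolling error exceeds
    `tolerance` times the error measured at testing time (`baseline_mae`)."""
    recent = rolling_mae(actual, predicted, window)
    return bool(recent[-1] > tolerance * baseline_mae)

# Invented example: the system behaves as the model expects for 100 periods,
# then shifts to a new level that the model knows nothing about.
actual = np.concatenate([np.full(100, 10.0), np.full(60, 14.0)])
predicted = np.full(160, 10.5)   # the model keeps predicting the old behaviour
baseline_mae = 0.5               # error observed when the model was tested

flag_before_shift = needs_retraining(actual[:100], predicted[:100], baseline_mae)
flag_after_shift = needs_retraining(actual, predicted, baseline_mae)

print("retrain before the shift?", flag_before_shift)
print("retrain after the shift? ", flag_after_shift)
```

A check like this only signals that performance has degraded. Deciding whether retraining is enough, or whether the modelling paradigm itself is no longer relevant, still requires judgement.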

take into account the availability of suitably skilled staff or contractors in addition to licensing and maintenance fees, etc.

4. Always subject predictive models to rigorous testing using hold-out data that was not used during any stage of model development. This is the only way to understand if a model can truly perform in a production environment.

5. Ensure the model output is reliable, easy to access, easy to interpret, and that it helps empower people in their daily roles.
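As a rough illustration of the hold-out testing recommended in point 4, the Python sketch below sets the hold-out data aside before any modelling, fits a model on the remaining data only, and then scores the hold-out set once. The data are simulated, the variable names are hypothetical, and the straight-line model and 50-observation hold-out size are assumptions chosen for brevity.

```python
import numpy as np

# Simulated data for illustration only.
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 10.0, 200)
y = 2.0 + 1.5 * x + rng.normal(0.0, 1.0, 200)

# Set the hold-out data aside BEFORE any model development takes place.
shuffled = rng.permutation(x.size)
holdout_idx, train_idx = shuffled[:50], shuffled[50:]

# Develop the model using the training data only.
coeffs = np.polyfit(x[train_idx], y[train_idx], deg=1)

# Evaluate once on data the model has never seen.
holdout_pred = np.polyval(coeffs, x[holdout_idx])
holdout_rmse = float(np.sqrt(np.mean((y[holdout_idx] - holdout_pred) ** 2)))

print(f"hold-out RMSE: {holdout_rmse:.2f}")
```

If this result led us to revise the model, a fresh hold-out set would be needed for the next evaluation. Otherwise we would be tuning the model to this particular hold-out sample.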

