Making data predictive: why reactive just isn't enough
Andrew Peterson, Ph.D.
Principal Data Scientist, Soltius NZ, Ltd.
New Zealand 2014 Big Data and Analytics Forum, 18 August 2014

Caveats and disclaimer: This is a large and complex topic and I cannot do it justice in the time available, so I will focus on my own perspectives (i.e., opinions) and experience. As such, I am not presenting or representing the views or opinions of any current or past company or organisation I may have worked with or for. I am also going to brush aside all technical details, despite how important they really are. So please keep that in mind if I say something you disagree with. We can chat about details later if that would be helpful for you.

Introduction

When we talk about predictive analytics, most people probably think about using computers to predict an event or the state of a system at some future point in time. But there are other types of predictions we can make using analytics that don't necessarily involve time directly, although they may do so indirectly. These include things like estimating the output of a production line if certain control variables are changed, understanding how a change in hospital staffing levels may affect patient occupancy, or estimating the risk of a person developing a specific heritable disease. Answers to these types of problems are predictive because they involve estimating the state of a system under conditions that have not yet been observed, even though they may not explicitly project the state of the system forward in time.

Going beyond analysing past usage data to estimate future trends

In my opinion, and in very general terms, the principal feature that distinguishes predictive analytics from non-predictive analytics is statistical and mathematical modelling. Consequently, moving from non-predictive analytics to predictive analytics requires at least the following two steps: 1.
The development and use of a statistical or mathematical model to generalise historical patterns among the input variables and output variables of the system under study. 2. The use of that model to estimate a new, unobserved output state from a set of at least partly unobserved input states.
The principal requirement for predictive analytics is therefore a model: a mathematical representation of the nature and magnitude of the relationships among the input and output variables. The model itself is an abstraction of reality. But if the system under study is amenable to predictive modelling (many systems are not), and the modelling is done well, then the model will enhance our understanding of the system through its representation of the relationships among variables. And with enhanced understanding come greater opportunities for effective management and control of the target system. In other words, enhanced understanding empowers us to make better decisions. This, to me, is the real benefit that comes from predictive analytics.
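As a minimal sketch of these two steps, consider a toy version of the hospital staffing example above. The data values, the linear form of the model, and the staffing level of 15 are all hypothetical and chosen purely for illustration:

```python
import numpy as np

# Hypothetical historical records: a control variable (staffing level)
# and an observed output (patient occupancy). Values are invented.
staffing = np.array([10, 12, 14, 16, 18, 20], dtype=float)
occupancy = np.array([52, 60, 71, 80, 88, 101], dtype=float)

# Step 1: generalise the historical pattern with a simple linear model.
# polyfit returns the coefficients highest degree first: [slope, intercept].
slope, intercept = np.polyfit(staffing, occupancy, deg=1)

# Step 2: use the model to estimate an unobserved output state for an
# input level that was never actually observed (here, 15 staff).
predicted = slope * 15 + intercept
```

A real system would rarely be this clean, but the pattern is the same: a fitted mathematical representation of the input-output relationships, then an estimate for conditions not yet observed.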
Benefits of looking forward: developing frameworks for better business decisions

As previously mentioned, one of the most important benefits that predictive analytics can provide, when it is done well, is to assist organisations in making better management and control decisions. Put slightly differently, good predictive models can help organisations make important business and operational decisions proactively instead of reactively. But the true business value of being able to do this can only be assessed on a case-by-case basis, and requires a thorough analysis of an organisation's business needs and opportunities. The principal reason most organisations will consider engaging in predictive analytics is the presumption that it will enhance the organisation's financial position, either by increasing revenue or by reducing costs. Consequently, an organisation must be able to rationally and unemotionally evaluate the true costs, the realistic benefits, and the potential risk of failure before implementing a programme in predictive analytics. Incorporating predictive analytics into a framework for making better business decisions requires the information provided by the models to be disseminated effectively to those who can use it to help the organisation achieve its strategic goals. Therefore, an organisation will need a comprehensive technological, strategic, and operational framework for deploying predictive models and for disseminating the insight they generate. Exactly what this framework should look like will depend on the organisation, its core business, and its objectives. But its importance should not be underestimated, because even the best predictive model will be next to useless if it cannot empower people or systems to work more effectively.
Predictive modelling pitfalls, challenges, and integration with existing enterprise software

The following is a non-exhaustive list of issues I would advise any organisation to consider in detail before investing in predictive analytics. Some of these issues will have more or less relevance to different organisations, but I present them here because I have seen organisations, teams, and individuals trip over them in the past (and sometimes spectacularly).

Pitfalls

1. Failing to understand the true costs, the realistic benefits, the risk of failure, and the likely impact each of these may have on the organisation: The costs of implementing predictive analytics often go well beyond software and hardware, and can include training costs, costs associated with low productivity as staff learn new technologies and programming languages, costs associated with developing production-ready models, costs associated with maintaining and retuning models over time, etc. Understanding the realistic benefits of predictive analytics requires engaging people with substantive experience in the discipline who can evaluate a given business problem and the data that's available in order to provide an informed opinion of the likelihood of success. Some applications of predictive analytics are so well established and so general across different business sectors (e.g. some market analysis models) that identifying realistic benefits is relatively straightforward. But the more innovation that's required in building a predictive model, the less certain the outcome will be. Simply put, some systems can't be modelled effectively, at least not to the extent that an organisation could benefit from the effort. Consequently, an organisation must understand the impact that failure may have, not only in terms of direct costs, but also in terms of the organisation's strategic goals, etc. 2.
Failing to employ or engage people who truly understand the domain: Predictive analytics is a specialised discipline, and the devil really is in the details. It is very easy to do predictive analytics poorly, and the cost of doing so can be very high. Candidates for this type of work should have a comprehensive understanding of concepts such as:
a. Statistical independence
b. Autocorrelation
c. Multicollinearity
d. Heteroscedasticity
e. Pseudoreplication
f. Non-stationarity
g. Statistical power
h. Spurious regression
i. Conditional probabilities
j. Overfitting
k. Etc., etc.
3. Data snooping, the curse of data mining: Data snooping can take many forms, but in essence it means poking around data and making many different comparisons until some interesting relationship is found, then reporting that relationship as if it is real and meaningful. But unless there has been appropriate statistical control or validation against independent data, it is highly probable that the "interesting" relationship is spurious and simply the outcome of random chance. To illustrate, I generated 100 random time series using an ARIMA(1,1,0) model with an AR term of 0.1 (examples are shown in the figure below). The key point to understand here is that each separate time series has no causal relationship with any other time series in the set, and that each series wanders up and down randomly in a manner similar to some stock prices. The only things they have in common are their starting value and the statistical properties of their random wandering. If we calculate all 4950 pairwise correlations between these series, then roughly half of the correlations will be greater than ±0.5. Most analysts would probably agree that a correlation greater than ±0.5 should warrant further investigation. However, we already know that all the series in this example are random and independent of each other. Yet our analysis has found roughly 2000 pairs of variables that suggest potentially interesting, if not important, relationships.
Even though this is a deliberate and contrived example, it nevertheless illustrates the point that the more variables we have in our data and the more relationships we investigate, the more we need to expect that we will find spurious relationships that have no causal or business basis. The take-home message here is to be wary of those who emphasise the simplicity of finding "hidden relationships and opportunities in masses of data" without detailing the precautions they take to ensure those relationships are real.
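This experiment is straightforward to reproduce. The sketch below reimplements it with plain NumPy rather than a formal ARIMA routine; the series length of 500 and the random seed are my own assumptions, as the talk does not specify them:

```python
import numpy as np

rng = np.random.default_rng(42)
n_series, n_obs, phi = 100, 500, 0.1  # series length is an assumption

# ARIMA(1,1,0): the *differences* follow an AR(1) process with
# coefficient 0.1, and each series is their cumulative sum, so every
# series is an independent near-random walk from the same start.
series = np.empty((n_series, n_obs))
for i in range(n_series):
    diffs = np.empty(n_obs)
    diffs[0] = rng.standard_normal()
    for t in range(1, n_obs):
        diffs[t] = phi * diffs[t - 1] + rng.standard_normal()
    series[i] = np.cumsum(diffs)

# All 100 * 99 / 2 = 4950 pairwise correlations between the series.
corr = np.corrcoef(series)
pairs = corr[np.triu_indices(n_series, k=1)]

spurious = float(np.mean(np.abs(pairs) > 0.5))
print(f"{pairs.size} pairs, {spurious:.0%} with |r| > 0.5")
```

Despite every series being independent by construction, a large share of the pairwise correlations exceed ±0.5, which is exactly the data snooping trap described above.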
4. Forgetting the "Science" in "Data Science": In my opinion, the term "Science" appears in the title "Data Science" for a very good reason: to emphasise the need to take a rigorous, hypothesis-driven approach to system modelling and data analysis. Such an approach would avoid the data snooping problem just mentioned, because a Data Scientist would work with an organisation's subject matter experts to develop a range of hypotheses about the business problem under study. In addition, the Data Scientist would develop appropriate modelling and statistical frameworks for testing those hypotheses before running any type of analysis. Having said that, it is appropriate to "poke around" in data when certain conditions are met, such as segregating the data into multiple sets, where one set is used only for exploratory analysis, another set is used only for model development and validation, and a third set is used only for model testing. Other approaches can include using strict statistical criteria for making multiple comparisons, etc. The critical point, though, is that building a valid predictive model that can return business value in a production setting is not straightforward, and requires the rigorous application of the scientific method in order to maximise success.
5. Overloading on expensive and/or complex technology that may not be necessary: This can be avoided through informed, rational, and pragmatic choices that constrain costs while providing sufficient flexibility for projected growth.

Challenges

1. Data quality vs. quantity: This is paramount: if you can't trust your data, then it doesn't matter how much of it you have, because you will never be able to trust any forecasts derived from it. However, data quality generally follows a continuum from useful to useless, or good to bad. But how "good" does your data need to be in order to answer the questions that are important to your organisation?
This can be addressed through careful simulation, and is often referred to in statistics as a power analysis. That is, given data with a certain amount of variability, how large does our sample need to be in order to identify an effect of a certain size, with a specific level of confidence, when that effect really does exist? There is, however, a pragmatic flip-side to the data quality issue: Is it better to have an approximate answer to an important question or a precise answer to a trivial one? Can we make an informed trade-off between data quality and business value? A robust business case for predictive analytics should address these types of questions.
2. Controlled experiments vs. observational studies: Models based on controlled experiments will always provide more understanding about a system than models based on observational data from the same system. But controlled experiments may be difficult, expensive, or even impossible to perform in many situations. If a controlled experiment can't be done, then data quality and the design and interpretation of the analysis become even more important.
3. Big Data vs. sufficient data:
a. Is more data really better? To the best of my understanding, there is no statistical theory showing that more data is always better; what matters is having sufficient data for the problem at hand.
b. Statistical significance vs. practical significance: Sufficient data means having enough data to identify patterns and trends that will have real business implications. In general, the more data we have, the greater our ability to identify smaller and smaller patterns in our data, if those patterns truly exist. But at what point does a pattern or trend become so small that it has no practical business implications at all, even if it is statistically significant? And why would we use the volumes of data required to identify very small patterns if they have no practical significance for the business? Again, a robust business case for predictive analytics should address this type of question. It also leads directly to the big old elephant lounging in the Big Data Analytics room that many people seem to ignore: computation time.
c. Computation time: Fitting predictive models to data takes computer time, and the more data we use in those models, the longer the time required to fit them.
Identifying the best model for a given data set requires fitting the model many times, often thousands of times! For example, a support vector machine may have 3 tuning parameters. We need to find the best combination of those parameters for our data. We decide to test all combinations of 5 different values for each tuning parameter, giving a total of 125 different parameter combinations. We want a good estimate of the model error because it will be used to make important business decisions, so we use a 10-fold cross-validation, repeated 3 times, for each of the 125 parameter combinations, giving a total of 3750 separate model fittings. We also thought it best to use 5 million records in our data file, even though a power analysis estimated that 500,000 records would be sufficient for our business requirements. With 5 million records it may take 5 minutes to fit the model once using a single core on our server. This would result in a total fitting time of 312.5 hours, or roughly 13 days. But we are smart, we have good software, and we know how to parallelise the solution. With 13 cores we can get the solution time down to about 1 day; with 26 cores we can get it down to about half a day, etc. If we assume the computation time is directly proportional to the size of the data (though it often scales nonlinearly), then using 500,000 records instead of 5 million would reduce our total single-core computation time from 312.5 hours down to 31.25 hours. This is still a significant period of time, but much more acceptable when parallelised, particularly when we still have sufficient statistical power to identify patterns that will have real business value. But we forgot one very important step in the model building process: we still need to search for the best combination of predictor variables, and every separate combination of those will require another 3750 separate model fittings! This example is deliberately simplistic. There are sophisticated methods for reducing the total number of times a model needs to be fitted, but the important point is that the total number of fittings required to build a production-ready model is usually much higher than many people realise. And the more data we use, the longer the model development process will take.
4. Model testing, the path to true enlightenment: This is, in my opinion, the single most important phase of predictive modelling. It is when we try to simulate a production setting by using the model to make predictions based on hold-out data that the model has never seen before. These predictions are then compared to the actual historical outcomes from the hold-out data, and it is this comparison, and this comparison only, that we use to evaluate the effectiveness of our model. But there's a catch: the technically correct way to use hold-out data is to only ever use it once. Say we build a model that did not perform well on our hold-out data. We start again, but to test the new model, we need a new set of hold-out data. If we don't have new hold-out data, all we do is enter a process of optimising our models for a single set of hold-out data. Consequently, model testing is where big data, or at least lots of data, really can be beneficial. In an ideal scenario, we would want to test our models many times using independent sets of hold-out data to get a good understanding of model performance before we risk putting them into production. But we are still not finished. 5.
Model maintenance: Heraclitus (c. 535–475 BC) is attributed with saying "There is nothing permanent except change". What was true then remains true today: we must expect any system we model to change over time, possibly to the extent that our models begin to fail. This is particularly relevant in situations where we are using models to manage or control a system. In this case, the actions of management and control ARE changing the system, so we need to be particularly vigilant about the performance of such models over time. In the best-case scenario, our models may simply need to be retrained or retuned on a regular basis to accommodate change. But the system may also change to such a degree that our original modelling paradigm is no longer relevant and we are forced to go through the entire modelling process again. A classic example of a model failing over time is Google's flu predictor. The folks at Google are clever, and I'm confident their model was developed using sound statistical principles. But its release to the public apparently influenced the incidence of the term "flu" appearing on the internet to such an extent that it caused the model to seriously overestimate the next flu outbreak.
6. Integration and model deployment: As mentioned previously, it doesn't matter how good a predictive model is if it remains inaccessible within an organisation. For predictive analytics to provide the greatest return on investment, it is essential that people within the organisation feel empowered in their roles through using the insight the models provide. When people feel empowered in this manner, the culture of the organisation will begin to embrace predictive models as a valuable tool that helps people do their jobs better. This type of empowerment comes when the output from the models is reliable, easily accessible, easily interpreted, and meaningful in the context of people's daily roles and responsibilities.
One of the great benefits of the recent popularisation of big data and analytics is the wealth of products available today that make the integration and deployment of predictive analytics easier than it has ever been. All the leading commercial vendors in the ERP, data warehousing, business intelligence, and analytics markets have frameworks for deploying models and presenting the results to business users, typically through some form of web interface. Many of these vendors also have cloud-based services targeted at small and medium businesses, providing a compelling basis for smaller organisations to share in the benefits that well-designed predictive modelling can provide. Nevertheless, the choice of software for building and deploying models is an important part of any business case for predictive analytics. At a coarse level, there is a choice between commercial products and open source products. Fundamentally (though not strictly true in all cases), at the level of computer code operating on data and returning numerical results, it can be argued that there is not a lot to distinguish the commercial products from the open source products: they all perform pretty much as expected. One of the real differences, though, lies in the level of technical and product support that is available. Generally speaking, help will usually be available quickly if something goes wrong with a commercial product. This is not necessarily true, nor false, for open source products. But an organisation that chooses open source products may need to engage technical experts in those products more frequently than it would for a commercial product. Contractors and consultants who specialise in open source products continue to grow in number, so the open source pathway need not be constraining or unwise.
It is up to the individual organisation to determine what it values most (for example, service versus cost) and the longer-term costs and benefits that may be associated with that preference (for example, bespoke solutions based on open source products and built by consultants or contractors may not provide adequate long-term flexibility or upgradeability). Again, the more aware an organisation can be regarding these alternatives, the better placed it will be to make good technical and business choices.

Summary

There are a number of important issues that organisations need to consider carefully before embarking on a programme of predictive analytics. Some of those issues have been touched on here, but many have not. And while these issues may seem daunting, the intent of this discussion has not been to dissuade people from investigating the utility of predictive analytics for their organisation. Instead, the intent has been to assist those people by exposing at least some of the issues they will need to consider. Predictive analytics is likely to become a necessary function for many organisations. In public sectors such as healthcare, this will be driven by continued reductions in funding and increases in public and regulatory expectations. In the private sector, increasing competition will drive many businesses in this direction. Even though the benefits that an organisation may realise through the careful and considered application of predictive analytics could be substantial, it is worth keeping in mind that for every publicised success story, there are likely to be numerous expensive and/or embarrassing failures which we simply never hear about. It is this type of failure that I hope the following five take-home summary points will help minimise:
1. Understand your business requirements and develop a robust business case for implementing predictive analytics.
2. Get the right people involved early in the journey. 3.
Get the right technology for your organisation's requirements. Be rational and pragmatic, but make sure your choices are capable of supporting your projected growth. Technology choices also need to take into account the availability of suitably skilled staff or contractors, in addition to licensing and maintenance fees, etc.
4. Always subject predictive models to rigorous testing using hold-out data that was not used during any stage of model development. This is the only way to understand whether a model can truly perform in a production environment.
5. Ensure the model output is reliable, easy to access, easy to interpret, and that it helps empower people in their daily roles.
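As a closing illustration of point 4, here is a minimal sketch of the hold-out principle using a toy linear model on synthetic data. The data, the 80/20 split, and the model form are all assumptions for illustration; in a real project the hold-out set would be genuinely untouched historical records, used once:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic historical records: one predictor and a noisy outcome.
x = rng.uniform(0, 10, 1000)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, 1000)

# Set aside hold-out data BEFORE any model development touches it.
idx = rng.permutation(1000)
train, hold = idx[:800], idx[800:]

# Develop the model using the training portion only.
slope, intercept = np.polyfit(x[train], y[train], deg=1)

# Use the hold-out set once: predict, then compare to actual outcomes.
pred = slope * x[hold] + intercept
rmse = float(np.sqrt(np.mean((pred - y[hold]) ** 2)))
```

The hold-out error (here, an RMSE) is the number that simulates production performance; reusing the same hold-out set to steer further model changes would quietly turn it into just another development set.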