Hard Hats for Data Miners: Myths and Pitfalls of Data Mining
By Tom Khabaza
The intrepid data miner runs many risks, including being buried under mountains of data. Some risks are just myths that need to be debunked. Others, however, are real. In this article, I will debunk several of these myths and misconceptions and then describe some problems and pitfalls commonly encountered when conducting data mining, along with steps that you can take to protect yourself from them.

A critical point to note is that data mining is a business process: a way of finding patterns in your data that provide insight you can use to conduct your business more effectively. Data mining also makes predictions to guide customer interactions and other business decisions. You'll see these points reinforced numerous times in the information that follows.

Myths and misconceptions about data mining

Myth #1: Data mining is all about algorithms

A businessperson attending a typical data mining conference or reading its proceedings might form the impression that data mining is all about advanced data analysis algorithms. This misconception might be summarized as follows: "All you need for data mining is good algorithms. The better your algorithms, the better your data mining; advancing the effectiveness of data mining means advancing our knowledge of algorithms."

To hold this view is to misunderstand the data mining process. Data mining is a process consisting of many elements, such as formulating business goals; mapping business goals to data mining goals; acquiring, understanding, and pre-processing the data; evaluating and presenting the results of analysis; and deploying these results to achieve business benefits.

This is not to minimize the importance of new or improved data mining algorithms. The problem occurs when data miners focus too much on the algorithms and ignore the other 90-95 percent of the data mining process.
The consequences of this misconception can be disastrous for a data mining project, possibly resulting in a failure to produce any useful results. Experienced data miners recognize the need for a broader view of the data mining process.
Myth #2: Data mining is all about predictive accuracy

While data mining is not all about data analysis algorithms, there is a part of data mining that is about algorithms. This raises the question, "How can you judge the quality of an algorithm?" You might think that the main criterion would be the predictive accuracy of the models it generates. This view, however, misrepresents the role of algorithms in the data mining process.

It is true that a predictive model should have some degree of accuracy, because this demonstrates that it has truly discovered patterns in the data. However, the usefulness of an algorithm or model is also determined by a number of other properties, one of which is whether the resulting model can be understood by a typical analyst or requires deep technical knowledge.

Data miners who believe that predictive accuracy is the primary criterion of algorithm evaluation might use algorithms that can only be used by technology experts. These algorithms will then play only the most limited role, because data mining is a process that is driven by business expertise; it relies on the input and involvement of non-technical business professionals in order to be successful.

Myth #3: Data mining requires a data warehouse

Business people often think that a data warehouse is a prerequisite for data mining. This is a subtle misconception about the relationship between the two technologies. It is true that data mining can benefit from warehoused data that is well organized, relatively clean, and easy to access. This is particularly true if the warehouse has been constructed with data mining specifically in mind and with knowledge of the requirements of the data mining project. If this has not been the case, however, the warehoused data may be less useful for data mining than the source or operational data. In the worst case, warehoused data may be completely useless (for example, if only summary data are stored).
A more accurate depiction of the relationship between the two would be that data mining benefits from a properly designed data warehouse, and that constructing such a warehouse often benefits from first doing some exploratory data mining.

Myth #4: Data mining is all about vast quantities of data

Early explanations of data mining often began with statements like, "We now collect more data than ever, yet how are we to benefit from these vast data stores?" Focusing on the size of data stores provided a convenient introduction to the topic of data mining, but subtly misrepresented its nature. While there are many large datasets that organizations can benefit from mining, it would be a mistake to believe that these should be the sole focus of data mining. Many useful data mining projects are performed on small or medium-sized datasets; some, for example, contain only a few hundred or a few thousand records.

Subscribing to the erroneous belief that data mining is only appropriate for vast data stores would lead organizations to choose tools that sacrifice usability for scalability when, in fact, both attributes are essential. To quote a customer of a leading data mining tool: "Other data mining tools optimize machine time, but this tool optimizes my time." Whether the datasets are large or small, organizations should choose a data mining tool that optimizes the user's time.
Myth #5: Data mining should be done by a technology expert

Data mining uses advanced technology, and its workings, particularly those of modeling techniques, are unlikely to be understood by the wider IT community. Does this mean that data mining should be conducted only by those who understand every nuance of the technology that is involved?

Quite the opposite is true, due to the paramount importance of business knowledge in data mining. When performed without business knowledge, data mining can produce nonsensical or useless results (see pitfall #4, below), so it is essential that data mining be performed by someone with extensive knowledge of the business problem. Very seldom does the same person also have extensive knowledge of the data mining technology. It is the responsibility of data mining tool providers to ensure that tools are accessible to business users.

Pitfalls of data mining and how to avoid them

Pitfall #1: Buried under mountains of data

Data mining should be an interactive, iterative process in which the analyst applies substantial business knowledge and is "engaged" with the data. However, those who subscribe to myth #4 (that data mining is about vast quantities of data) often suppose that this process must be applied to all of the available data. This can lead to attempts to mine volumes of data for which the available hardware and software cannot provide an acceptable interactive response. In these situations, the data mining process becomes sluggish, and by the time a question is answered, the analyst cannot remember why it was asked.

The way to avoid this pitfall is to employ some form of sampling. For example, if we have a million customers and a 20 percent annual attrition (or "churn") rate, we need not plot our graphs or build our models using the full million examples, or even 500,000. Consider the following questions and answers:

Q: How many churn profiles do we expect to find?
A: Maybe ten
Q: How many examples of each profile do we need?
A: Maybe a thousand

Therefore, a sample of ten or twenty thousand churners and an equivalent number of non-churners is likely to be sufficient for this analysis. Note that this does not mean that data miners will never encounter the need to build models from millions of examples; only that they should not assume that they must do so, just because the data are available.

Pitfall #2: The Mysterious Disappearing Terabyte

This is a common phenomenon, but not always a pitfall. It refers to the fact that, for a given data mining problem, the amount of available and relevant data may be much less than initially supposed. Consider the following scenario: You are a data mining consultant, and your client is a large bank, which wishes to mine its customer data to determine credit risk. The bank holds terabytes of data on its customers and is concerned that the available computing resources may be inadequate to mine this volume of data.

Here's how the situation might unfold. Different types of credit (personal loans, business loans, overdrafts) present different patterns of credit risk, so each data mining project will concentrate on just one type of borrower. The bank's domain experts judge a number of factors to be relevant, and the bank, planning ahead, began collecting data on these factors about 18 months ago. Since then, almost a thousand cases of bad debt have occurred. Thus, the relevant data consist of fewer than a thousand cases of bad debt plus a sample from a plentiful supply of cases of good debt: let's say 3,000 records in all. Somehow, the need to mine terabytes of data has "mysteriously" disappeared.
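The sampling arithmetic from Pitfall #1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `balanced_sample`, the `churned` field, and the synthetic customer list are hypothetical and do not come from any particular data mining tool, and the customer base is scaled down to 100,000 records so the example runs quickly. The same pattern (keep a modest number of the cases of interest, then draw a matching sample of the plentiful class) also fits the bank scenario in Pitfall #2.

```python
import random

def balanced_sample(records, is_positive, per_class, seed=0):
    """Draw up to `per_class` positive records and an equal number of negatives."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    positives = [r for r in records if is_positive(r)]
    negatives = [r for r in records if not is_positive(r)]
    # Never ask for more records than either class can supply.
    n = min(per_class, len(positives), len(negatives))
    return rng.sample(positives, n) + rng.sample(negatives, n)

# Sample-size arithmetic from the questions and answers in Pitfall #1.
expected_profiles = 10
examples_per_profile = 1_000
churners_needed = expected_profiles * examples_per_profile

# Synthetic stand-in for the customer base (an assumption for illustration):
# 100,000 customers with a 20 percent churn rate (every fifth customer churns).
customers = [{"id": i, "churned": i % 5 == 0} for i in range(100_000)]

sample = balanced_sample(customers, lambda c: c["churned"], churners_needed)
print(churners_needed, len(sample))
```

The resulting sample is small enough to explore interactively, which is the point of the pitfall: the analyst decides how much data the question needs instead of defaulting to everything that is available.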
Pitfall #3: Disorganized data mining

Data mining can occasionally, despite the best of intentions, take place in an ad hoc manner, with no clear goals and no idea of how the results will be used. This leads to wasted time and unusable results. To produce useful results, it is critical to have clearly defined business and data mining goals, formulated early in the project, and clearly articulated deployment plans.

A simple way of ensuring this is to use a standard process such as the CRoss-Industry Standard Process for Data Mining (CRISP-DM) [1]. Such a process ensures the correct preparation for data mining and provides a common language for communicating methods and results. Data mining tools should support standard process models.

Pitfall #4: Insufficient business knowledge

On a number of occasions this article has mentioned the crucial role that business knowledge plays in data mining. Without it, organizations can neither achieve useful results nor guide the data mining process towards them. It is sometimes supposed that the end user can reasonably tell the data miner: "Here are the data, please go away, do your data mining, and come back with the answers." If this were to happen, the project would, at best, take many long and costly iterations to produce useful results. At worst, the results would be gibberish, and the project would fail.

This pitfall can only be avoided by involving, at every stage of the data mining process, both the end user and someone with a detailed knowledge of the business. Ideally, the data miner or data mining consultant would have the business knowledge. Lacking it, the data miner should literally sit next to someone with the required business knowledge who understands the question under consideration. For this to work effectively, a highly interactive data mining environment with good response time is required.

Pitfall #5: Insufficient data knowledge

In order to perform data mining, we must be able to answer questions like "What do the codes in this field mean?" and "Can there be more than one record per customer in this table?" In some cases, this information is surprisingly hard to come by. It might be that the data expert has left the organization or moved to another department or, in the case of legacy systems, there may be no data expert at all. This problem is exacerbated when the database or data warehouse management is outsourced: the external supplier is even less motivated than the user organization to maintain this information "just in case it might be needed in future."

There is no simple resolution to this problem. IT departments should be made aware of the need to maintain information about their organization's databases. Also, when a data mining project is proposed, data miners should consider how much data knowledge is available and evaluate any risks caused by its absence or scarcity.

Pitfall #6: Erroneous assumptions, courtesy of the experts

Business and data experts are crucial resources, but this does not mean that the data miner should unquestioningly accept every statement they make. The data miner should seek to confirm the validity of experts' statements. Typical examples of erroneous or misleading statements might include "No customer can hold accounts of both these types," "No case will include more than one event of this type," and "Only the following codes will be present in this field."

Data miners should verify statements like these by examining the data. This is particularly important when processing of the data will depend on their accuracy. Ideally, mistakes in assumptions about data can be spotted before they lead to errors in the treatment of data. Data mining tools should make this easy to accomplish.

Pitfall #7: Incompatibility of data mining tools

The data mining process requires a wide range of capabilities, so it's not unusual that during a single project a wide variety of tools might be used.
This can, however, lead to high overhead costs due to the time and resources required to switch contexts and convert data from one format to another. At its worst, this can lead to the omission of necessary steps in the data mining process and can seriously interfere with the exploratory character of data mining.
The best solution is to use a data mining toolkit that integrates all the required capabilities. However, no toolkit will provide every possible capability, especially when the individual preferences of analysts are taken into account, so the toolkit should also be "open"; that is, able to interface easily with other available tools and third-party options.

Pitfall #8: Locked in the data jail-house

In addition to openness with regard to tools, data mining solutions should also be open with regard to data. Some data mining tools require the data to be held in a proprietary format that is not compatible with commonly used database systems. (This is sometimes referred to as the "data jail-house.") This can result in high overhead costs, due to the need for transferring data into the required format, and lead to difficulty in deploying the results into an organization's operational systems. A good data mining tool will interface with your data via common standards.

Conclusion

Data mining is a business process, requiring extensive business knowledge. It is best practiced by business experts or by data mining experts in close collaboration with business experts.

Data mining uses a variety of techniques and should not focus only on modeling algorithms and their predictive accuracy. Each technique can play a variety of roles. During the data mining process, data miners interact and engage with the data in an iterative fashion.

A standard data mining process model, such as CRISP-DM [1], helps to ensure the correct preparation for and use of data mining. Data mining tools should be evaluated based on their accessibility to business users, their scalability and usability, and their support for standard processes.

Data miners should make intelligent decisions about the amount of data required, assuming neither that all of an organization's data will be relevant nor that all the available data will be required.
Effective data mining requires flexible and interoperable techniques. This requirement is best met by integrated, open toolkits that can interface to data by means of open standards.

References

[1] Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., and Wirth, R. CRISP-DM 1.0 Step-by-step data mining guide, CRISP-DM Consortium, 2000, available at http://www.crisp-dm.org.

© 2005 SPSS Inc. All rights reserved.