
Seth Rhine
Math 382
Shapiro

Data Mining

Anyone can tell you that it takes hard work, talent, and hours upon hours of watching video for a professional sports team to be successful. Finding the leaks in an opponent's strategy is the ultimate goal for the coaches and captains who study in-game footage, allowing them to devise plays and make key decisions in future games. In the National Basketball Association (NBA), coaches have a good share of this work done for them with the help of Advanced Scout, a program that finds patterns in game statistics, images, and the movements of the players themselves. When a pattern emerges from the data, Advanced Scout tells the user why the pattern is significant, pointing the user toward valuable video clips and sparing him many hours in front of in-game footage (Palace, 1996). Such a process is not exclusive to Advanced Scout, or even to the NBA for that matter. Similar processes are used every day by organizations of many kinds, and together they make up a fairly recently named field known as data mining.

Data mining is defined as the process of seeking interesting or valuable information within large databases (Hand et al., 2000, p. 111). At first glance, this definition might seem more like a new name for statistics than a new field in itself. However, data mining is typically performed on data sets far larger than those that traditional statistical methods were designed to analyze. Some of data mining's methods have been used on data sets whose number of data points runs into the billions. Realistically, examining such sets would take too much time, money, and painstaking attention to detail for any human to be expected to do by hand (Hand, p. 113). It is therefore necessary to rely on machines to do most of the dirty work, if not all of it. The mere existence of such data sets is made possible by advances in modern technology, such as faster computers, larger hard drives, and improved database software.

Many of the techniques used by statisticians on smaller data sets of a few hundred samples simply do not hold up when applied to larger sets, and must be improved and expanded upon to mine the data successfully. For instance, a company like Wal-Mart performs over 7 billion transactions annually. Effectively analyzing the buying patterns in a customer purchase database of this size requires much more than manual inspection and classical statistical tactics. Consequently, data mining is quite complex, drawing on notions from statistics, pattern recognition, computer programming, algorithms, machine learning, and many other disciplines (Hand et al., 2000, pp. 111-114).

As for how an organization obtains and uses data, Wal-Mart is a prime example. The multi-billion dollar company uses its history of customer transactions as data from which to develop a marketing strategy, based on the structures that can be derived from that history. Such structures can be seen as either a model or a pattern, both of which are highly sought by data mining programs. A model is an overall summary of a set or subset of data, while a pattern is a smaller structure that may refer to a number of objects that is small relative to the sample size.
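To make the distinction between a model and a pattern concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sales figures are made up, the "model" is taken to be nothing more than the mean and standard deviation of the whole set, and the "pattern" is taken to be the few points lying far from that summary. The name `find_pattern` and the threshold of two standard deviations are hypothetical choices, not anything prescribed by Hand et al.

```python
from statistics import mean, stdev

# Hypothetical daily sales totals (made-up numbers, for illustration only).
daily_sales = [102, 98, 101, 97, 103, 99, 100, 180, 101, 96, 104, 99]

# "Model": an overall summary of the entire data set.
model_mean = mean(daily_sales)
model_std = stdev(daily_sales)


def find_pattern(values, center, spread, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` spreads from the center.

    The result is a "pattern" in the sense used above: a small structure that
    involves only a few objects and is defined relative to the overall model.
    """
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - center) > threshold * spread
    ]


pattern = find_pattern(daily_sales, model_mean, model_std)

print(f"model: mean={model_mean:.1f}, std={model_std:.1f}")
print(f"pattern (unusual days): {pattern}")
```

In this toy data set the model describes all twelve days at once, while the pattern singles out the one day that departs sharply from it.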

Fig. 1 (Hand et al., 2000)

Essentially, patterns are often defined relative to the overall model of the data set from which they are derived. There are many tools in data mining that help find these structures, and a few of them are described in the next few paragraphs. Some of the most important tools for an analyst are clustering, regression, rule extraction, and data visualization.

Clustering is the act of partitioning a data set of many items into smaller subsets whose members show some commonality (Weisstein, 2010). By looking at such clusters, data miners are able to extract statistical models from the data. Regression is defined as a method for fitting a curve through a set of points using some goodness-of-fit criterion (Weisstein, 2010). By examining predefined goodness-of-fit measures, analysts can locate and describe patterns using regression. Rule extraction is the method of using relationships between variables to establish some sort of rule, most likely for use in a marketing strategy. For instance, in a large set of point-of-sale data from a grocery store, it may be observed that customers who bought products A and B typically purchase product C as well. This information could help the grocery store develop a marketing strategy to further increase profits.

Data visualization is also a key element in the success of data mining. The data being mined are so vast that scatter plots and histograms often fall short of representing any information of real value (see Figure 1). For that very reason, analysts concerned with data mining are constantly looking for better ways to represent data graphically, such as the one depicted in Figure 2 (Hand et al., 2000, p. 113).

No matter what tools analysts have at their fingertips, the patterns and models being mined will only be as good as the data from which they are derived. If a database contains biased or incomplete data, the results will often be inaccurate, and it becomes more likely that the patterns found are due to chance alone. Since the source of the data is so large, it is almost certain that some data within the database being mined will be missing or corrupted (Hand, 1998). This is one of the biggest reasons that data mining is looked down upon by some statisticians. Suppose that a tenth of one percent of the records contain missing or corrupted data. In a small sample, the affected records are few enough to be almost negligible. In a large sample of one billion items, however, one million damaged items are hardly something the analyst can ignore. Some data corruption occurs before the data is ever cleaned up for mining, such as when the data is first recorded. Often the people recording the data make mistakes or leave out certain information when filling out forms, using applications or other computer software, and so on (Hand, 1998).

Fig. 2 (Hand et al., 2000)

Another big problem with data mining is that the programs used to discern structures must be given instructions in a language that is well defined to the computer. A computer does not know what to look for in a data set until programmers define exactly what it is looking for. As a consequence, programmers must define precisely what they mean by structure, pattern, usefulness, and so on. In market basket analysis, for example, the programs are told that it is interesting to find pairs of products with very high conditional probabilities. In effect, if the probability of buying product A given that the shopper has already bought product B, P(A | B), is close to 1, the computer will flag the pair as a structure (Hand et al., 2000, pp. 111-116).

Despite the setbacks and criticism that data mining has received over the years, it nonetheless continues to be a part of the global market. To companies like Wal-Mart, Exxon/Mobil, and other Fortune 500 mainstays, data mining is regarded as a valuable marketing tool. In fact, over 40% of the Fortune 500 companies in 2002 said they were developing large data sets with the intent of mining them, and/or programs to help find structures in consumer purchases. Mobil Oil said that it intends to generate and store over 100 terabytes of data concerning oil exploration. Large companies like these generate so much data that it is stored in a data warehouse (Hand et al., 2000, pp. 111-116).

By warehousing their data, companies focus on streamlining data from their various departments. They do this by extracting data from the departments, then categorizing, trimming, and re-storing the data in its new form. For example, an analyst might look at point-of-sale purchases, where each item is recorded with multiple attributes such as its price, its cost, the time it was purchased, and the store it was purchased from. While a lot of this data is useful, the analyst might only want to know how much money a given product is making for the company. To streamline the analyst's work, data warehousing would have already consolidated the items into various categories, making the data more consistent (Fayyad and Uthurusamy, 2002).

Warehousing data gives companies an exciting opportunity to find patterns and create models more readily, and with the storage capacity of computers today, it is a necessary step in the data mining process. But what happens when, in a single day, a company like Wal-Mart records 20 million sales transactions or Google handles 150 million searches? The information derived from this data is certain to be valuable to companies this large, but by the time standard data warehousing and mining procedures have been performed, the information can be relatively useless. Mining a day's worth of data in these cases can take up to a day's worth of time! A solution to this problem, and perhaps one of the biggest players in the future of data mining, is mining massive data streams (Domingos and Hulten, 2003).

Since these companies encounter such a high volume of traffic on any given day, it is important for data mining programmers to focus on new algorithms. Programs meant to analyze a stationary database would take days or even weeks to sift through data of this magnitude. Currently, programmers are trying to create algorithms for systems that are continuously on, processing records at the speed they arrive and incorporating them into the models they are building even if they never see those records again (Domingos and Hulten, 2003). By imposing various bounds and limits on what the program is actually searching for, such programs can mine effectively infinite data in finite time, keeping up with the data despite the massive amount arriving each minute.

Mining such data streams does not come without a cost, however. The data streams coming into these programs are so massive that they could enable analysts to build more advanced models than were previously thought possible. Ironically, because the programs are designed to look at each item of streaming data only once before moving on to the next, they end up mining only the simplest of models (Domingos and Hulten, 2003).

It is also programs like these that are to blame for the backlash against data mining in the past decade. Information derived from data mining does not come without social implications.
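As an aside on the one-pass stream mining described above, the following Python sketch shows what it can mean to fold each record into a model once and then discard it. It is not the algorithm of Domingos and Hulten. It is a deliberately simple stand-in that reuses the market basket rule from earlier in the essay, keeping running counts from which P(A | B) can be estimated at any moment. The class name `StreamingRuleCounter`, its methods, and the sample baskets are all hypothetical. The sketch also makes no attempt to bound memory, which grows with the number of distinct item pairs seen.

```python
from collections import Counter
from itertools import combinations


class StreamingRuleCounter:
    """One-pass co-occurrence counts for estimating P(A | B) from a stream of baskets."""

    def __init__(self):
        self.item_counts = Counter()  # baskets containing each item
        self.pair_counts = Counter()  # baskets containing each unordered pair of items

    def update(self, basket):
        """Fold one basket into the counts; the basket itself is never stored."""
        items = sorted(set(basket))  # de-duplicate and fix an order for pair keys
        self.item_counts.update(items)
        self.pair_counts.update(combinations(items, 2))

    def conditional_probability(self, a, b):
        """Estimate P(a | b) as count(a and b) / count(b) over the baskets seen so far.

        Assumes a and b are distinct items; returns 0.0 if b has not been seen.
        """
        if self.item_counts[b] == 0:
            return 0.0
        pair = tuple(sorted((a, b)))
        return self.pair_counts[pair] / self.item_counts[b]


# Hypothetical stream of point-of-sale baskets, processed one at a time.
stream = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
    ["bread", "milk"],
]

counter = StreamingRuleCounter()
for basket in stream:
    counter.update(basket)  # each record is seen exactly once

p_bread_given_butter = counter.conditional_probability("bread", "butter")
p_butter_given_bread = counter.conditional_probability("butter", "bread")
print(p_bread_given_butter, p_butter_given_bread)
```

Because only counts are kept, the estimate is available at any point in the stream. The price is the one noted above: a single pass over the data supports only a very simple model.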

As Danna and Gandy, Jr. (2002) point out, consumer profiles are created, sorted, and processed, resulting in consumers being graded, sorted, or excluded from opportunities that others enjoy. For instance, suppose mining techniques reveal two types of customers at a bank: high-income customers with a moderate risk of leaving, and low-income customers with almost no risk of leaving. The bank will then cater to the high-income customers, offering special rates on loans or accounts with the full intent of keeping them around. Since the low-income customers have almost no risk of leaving, the bank will continue to offer them the same small incentives that have kept them there in the first place, such as no ATM fees and free checking. The problem is that the high-income customers receive the same benefits as the low-income customers, but also receive special treatment to entice them to stay. Preferential treatment such as this leads to the exclusion that Danna and Gandy, Jr. describe. Critics like them call for regulation of consumer privacy and data mining techniques, a future battle that data mining may well have to suit up for as its popularity increases.

It's no surprise that companies and organizations are interested in the behavior of the data they collect. Whether it be point-of-sale information, NASA photos, basketball statistics, or credit profiles, the data proves to be a valuable asset to the organization that chooses to store and mine it. As algorithms improve and computers become more and more powerful, further advancements in the field of data mining are only to be expected.

Works Cited

Danna, Anthony and Gandy, Jr., Oscar H. All that Glitters is Not Gold: Digging beneath the Surface of Data Mining. Journal of Business Ethics, Vol. 40, No. 4 (Nov., 2002), pp. 373-386. Published by Springer.

Fayyad, Usama and Uthurusamy, Ramasamy. Evolving Data Mining into Solutions for Insights. Communications of the ACM, Vol. 45, No. 8 (Aug., 2002), pp. 28-32. Published by ACM.

Hand, David J. Data Mining: Statistics and More? The American Statistician, Vol. 52, No. 2 (May, 1998), pp. 112-118. Published by American Statistical Association.

Hand, David J.; Blunt, Gordon; Kelly, Mark G.; Adams, Niall M. Data Mining for Fun and Profit. Statistical Science, Vol. 15, No. 2 (May, 2000), pp. 111-126. Published by Institute of Mathematical Statistics.

Palace, Bill. Data Mining. http://www.anderseon.ucla.edu/faculty/jason.frand/teacher/technologies/palace. June, 1996. Accessed on April 2nd, 2010.

Weisstein, Eric W. "Cluster Analysis." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/clusteranalysis.html

Weisstein, Eric W. "Regression." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/regression.html