



Seth Rhine
Math 382
Shapiro

Data Mining

Anyone can tell you that it takes hard work, talent, and hours upon hours of watching video for a professional sports team to be successful. Finding the leaks in an opponent's strategy is the ultimate goal for the coaches and captains who study in-game footage, because it lets them devise plays and make key decisions in future games. In the National Basketball Association (NBA), coaches have a good share of the work done for them already with the help of Advanced Scout, a program that finds patterns in game statistics, images, and the movements of the players themselves. When a pattern emerges from the data, Advanced Scout tells the user why the pattern is significant, leading the user toward valuable video clips and sparing many hours in front of in-game footage (Palace, 1996).

Such a process is not exclusive to Advanced Scout, or even to the NBA. Similar processes are used every day by organizations of many kinds, and together they make up a fairly recently named field known as data mining. Data mining is defined as the process of seeking interesting or valuable information within large databases (Hand et al., 2000, p. 111). At first glance, this definition might seem more like a new name for statistics than a new field. However, data mining is performed on data sets far larger than those that traditional statistical methods were designed to analyze. Some of data mining's methods have been used on data sets whose data points number in the billions. Realistically, inspecting such sets would take too much time, money, and painstaking attention for any human to be expected to do it (Hand, p. 113). Instead, we must rely on machines to do most of the dirty work, if not all of it.

The mere existence of such data sets is made possible by the advancement of modern technology: faster computers, larger hard drives, and improved database software, among other things. Many of the techniques that statisticians use on smaller data sets of a few hundred samples simply do not hold up on larger sets, and must be improved and expanded upon to mine the data successfully. For instance, a company like Wal-Mart performs over 7 billion transactions annually. Effectively analyzing the buying patterns in a customer purchase database of this size requires much more than the human hand and standard statistical tactics. Consequently, data mining is quite complex, drawing on notions from statistics, pattern recognition, computer programming, algorithms, machine learning, and many other disciplines (Hand et al., 2000, pp. 111-114).

As for how an organization obtains and uses data, Wal-Mart is a prime example. The multi-billion dollar company uses its history of customer transactions to develop a marketing strategy based upon the structures that can be derived from that history. Such structures can be seen as either a model or a pattern, both of which are highly sought by data mining programs. A model is an overall summary of a set or subset of data, while a pattern is a smaller structure that may refer to a number of objects that is small relative to the sample size.
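To make the model/pattern distinction concrete, here is a minimal Python sketch of one simple reading of it. The transaction amounts are made up, the function names are hypothetical, and the three-standard-deviation threshold is an assumption chosen for illustration; none of this comes from the cited sources. The "model" is a global summary of every value, and the "pattern" is the small set of values that sit far from that summary.

```python
from statistics import mean, stdev


def summarize(values):
    """Model: a global summary (mean, standard deviation) of all values."""
    return mean(values), stdev(values)


def unusual(values, threshold=3.0):
    """Pattern: the few values lying more than `threshold` standard
    deviations from the global mean (threshold is an assumed cutoff)."""
    mu, sigma = summarize(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]


# Made-up purchase amounts: many ordinary ones and a single large one.
amounts = [20.0] * 50 + [22.0] * 50 + [500.0]

model = summarize(amounts)   # describes the data set as a whole
pattern = unusual(amounts)   # describes a handful of its members
```

In this toy data the model covers all 101 values, while the pattern refers to a single transaction, which matches the idea that a pattern concerns few objects relative to the sample size.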

Fig. 1 (Hand et al., 2000)

Essentially, patterns are often defined relative to the overall model of the data set from which they are derived. Many data mining tools help find these structures, and a few of them are described in the next few paragraphs. Some of the most important tools for an analyst are clustering, regression, rule extraction, and data visualization.

Clustering is the act of partitioning data sets of many random items into smaller subsets whose members show commonality (Weisstein, 2010). By looking at such clusters, data miners are able to extract statistical models from the data fields. Regression is defined as a method for fitting a curve through a set of points using some goodness-of-fit criterion (Weisstein, 2010). By examining predefined goodness-of-fit parameters, analysts can locate and describe patterns using regression. Rule extraction is the method of using relationships between variables to establish some sort of rule, most likely for use in a marketing strategy. For instance, in a large set of point-of-sale data from a grocery store, it may be observed that customers who buy products A and B typically purchase product C as well. This information could help the grocery store develop a marketing strategy to further increase profits.

Data visualization is also a key element in the success of data mining. The samples of data being mined are so vast that scatter plots and histograms often fall short of representing any information of realistic value (see Figure 1). For that reason, analysts concerned with data mining are constantly looking for better ways to represent data graphically, as depicted in Figure 2 (Hand et al., 2000, p. 113).

No matter what tools analysts have at their fingertips, the patterns and models being mined will only be as good as the data from which they are derived. If a database contains biased or incomplete data, the results will often be inaccurate, and there is a large chance that the patterns found are due to chance alone. Since the source of the data is so large, it is almost certain that the database being mined contains missing or corrupted data (Hand, 1998). This is one of the biggest reasons that data mining is looked down upon by some statisticians. Suppose that a tenth of one percent of the sample contains missing or corrupted data. In a small sample, the affected items are almost negligible. In a large sample of one billion items, however, one million damaged items are hardly something the analyst can ignore. Some data corruption occurs before the data is cleaned up for mining, such as when the data is recorded in the first place. Often the people recording the data make mistakes or leave out certain information when filling out forms, using applications or computer software, and so on (Hand, 1998).

Fig. 2 (Hand et al., 2000)

Another big problem with data mining is that the programs used to discern structures must use language that is well defined to the computer. A computer does not know what to look for in the data sets until programmers define exactly what it is looking for. As a consequence, programmers must define exactly what they mean by structure, pattern, usefulness, and so on. In market basket analysis, for example, the programs are told that it is interesting to find products with very high conditional probabilities. In effect, if the conditional probability that a shopper buys product A, given that the shopper has already bought product B, is close to 1, the computer will flag it as a structure (Hand et al., 2000, pp. 111-116).

Despite the setbacks and criticism that data mining has received over the years, it nonetheless continues to be a part of the global market. To companies like Wal-Mart, Exxon/Mobil, and other Fortune 500 mainstays, data mining is revered as a valuable marketing tool. In fact, over 40% of the Fortune 500 companies in 2002 said they were developing large data sets with the intent of mining them, programs to help find structures in consumer purchases, or both. Mobil Oil said that it intends to generate and store over 100 terabytes of data concerned with oil exploration. Large companies like these generate so much data that it is stored in a data warehouse (Hand et al., 2000, pp. 111-116).

By warehousing their data, companies focus on streamlining data from their various departments. They do this by extracting data from the departments, then categorizing, trimming, and re-storing the data in its new form. For example, an analyst might look at point-of-sale purchases, where each item is recorded with multiple attributes such as its price, its cost, the time it was purchased, and the store it was purchased from. While much of this data is useful, the analyst might only want to know how much money a given product is making for the company. To streamline the analyst's work, data warehousing would have already consolidated the items into various categories, making the data more consistent (Fayyad and Uthurusamy, 2002).

Warehousing data gives companies an exciting opportunity to find patterns and create models more readily, and with the storage capacity of computers today, it is a necessary step in the data mining process. But what happens when a company like Wal-Mart records 20 million sales transactions per day, or when Google handles 150 million searches? The information derived from this data is certain to be invaluable to companies this large, but by the time standard data warehousing and mining procedures are performed, the information can be relatively useless. Mining a day's worth of data in these cases can take up to one day's worth of time!

A solution to this problem, and perhaps one of the biggest players in the future of data mining, is mining massive data streams (Domingos and Hulten, 2003). Since these companies encounter such a high volume of traffic on any given day, it is important for data mining programmers to focus on new algorithms. Programs meant to analyze a stationary database would take days or even weeks to sift through data of this magnitude. Currently, programmers are trying to create algorithms for systems that are continuously on, processing records at the speed they arrive and incorporating them into the model being built even if the system never sees them again (Domingos and Hulten, 2003). By imposing various bounds and limits on what the program is actually searching for, such programs can mine infinite data in finite time, keeping up with the data despite the massive amount arriving each minute.

Mining such data streams does not come without a cost, however. The data streams coming in to these programs are so massive that they would enable analysts to create more advanced models than were previously thought possible. Ironically, the programs are designed to look at each item of streaming data only once before moving on to the next, so they end up mining only the simplest of models (Domingos and Hulten, 2003). It is also programs like these that are to blame for the backlash toward data mining in the past decade. Information derived from data mining does not come without social implications.
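The one-pass idea behind stream mining, described above, can be illustrated with a small Python sketch. This is not the method of Domingos and Hulten; it swaps in Welford's online update for a running mean and variance, a standard technique, simply because it shows a model being updated as each record arrives, using constant memory and never revisiting a record. The class and function names are hypothetical.

```python
class RunningStats:
    """Running mean and sample variance, updated one record at a time
    (Welford's online update). Memory use does not grow with the stream."""

    def __init__(self):
        self.n = 0          # records seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one record into the summary; the record is then discarded."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of the records seen so far (0.0 if fewer than 2)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def mine_stream(records):
    """Consume an iterable of numbers exactly once and return its summary."""
    stats = RunningStats()
    for x in records:
        stats.update(x)
    return stats


# A generator stands in for a live stream: each value can be read only once.
stats = mine_stream(x for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

The trade-off the essay mentions is visible here: because each record is seen once and then dropped, only summaries that can be updated incrementally, such as a mean and variance, are available, which is a far simpler model than a stored database would allow.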

As Danna and Gandy, Jr. point out, consumer profiles are created, sorted, and processed, resulting in consumers being graded, sorted, or excluded from opportunities that others enjoy. For instance, suppose mining techniques find two types of customers at a bank: high-income customers with a moderate risk of leaving, and low-income customers with zero risk of leaving. The bank will then cater to the high-income customers, offering special rates on loans or accounts with the full intent of keeping them around. Since the low-income customers have almost no risk of leaving, the bank will continue to offer them the same small incentives that have kept them there in the first place, such as no ATM fees and free checking. The problem is that the high-income customers receive the same benefits as the low-income customers, but also receive special treatment to entice them to stay. Preferential treatment such as this leads to the exclusion that Danna and Gandy, Jr. describe. Critics like them call for regulation of consumer privacy and data mining techniques, a future battle that data mining might very well have to suit up for as its popularity increases.

It's no surprise that companies and organizations are interested in the behavior of the data they collect. Whether it is point-of-sale information, NASA photos, basketball statistics, or credit profiles, the data proves to be a valuable asset to the organization that chooses to store and mine it. As algorithms are improved and computers become more and more powerful, further advancements in the field of data mining are only to be expected.

Works Cited

Danna, Anthony and Gandy, Jr., Oscar H. "All that Glitters is Not Gold: Digging beneath the Surface of Data Mining." Journal of Business Ethics, Vol. 40, No. 4 (Nov., 2002), pp. 373-386. Published by Springer.

Fayyad, Usama and Uthurusamy, Ramasamy. "Evolving Data Mining into Solutions for Insights." Communications of the ACM, Vol. 45, No. 8 (Aug., 2002), pp. 28-32. Published by ACM.

Hand, David J. "Data Mining: Statistics and More?" The American Statistician, Vol. 52, No. 2 (May, 1998), pp. 112-118. Published by American Statistical Association.

Hand, David J.; Blunt, Gordon; Kelly, Mark G.; Adams, Niall M. "Data Mining for Fun and Profit." Statistical Science, Vol. 15, No. 2 (May, 2000), pp. 111-126. Published by Institute of Mathematical Statistics.

Palace, Bill. "Data Mining." http://www.anderseon.ucla.edu/faculty/jason.frand/teacher/technologies/palace. June, 1996. Accessed on April 2nd, 2010.

Weisstein, Eric W. "Cluster Analysis." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/clusteranalysis.html

Weisstein, Eric W. "Regression." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/regression.html