Data Mining

Shahram Hassas
California State University, Northridge

General Terms: Data Mining
Additional Key Words and Phrases: Data Mining

Data mining is described as the method of comparing large volumes of data and looking for more information in them (data being any facts, numbers, or text that can be processed by a computer). It is defined as the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue and cut costs. Today, modern businesses consider data mining an important tool for transforming data into business intelligence, which gives them an informational advantage. A primary reason for using data mining is to assist in the analysis of collections
of observations, finding correlations among the fields and relationships between the facts in databases. It enables companies to determine relationships among internal factors (such as price, cost, product, staff skills, inventory, payroll, accounting, and corporate profits) and external factors (such as economic status, competition, and customer interest and satisfaction).

For example, by using data mining, a grocery chain discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. The analysis also showed that these shoppers typically did their weekly grocery shopping on Saturdays and bought only a few items on Thursdays. The chain used this newly discovered information in various ways to increase revenue; for instance, it could move the beer closer to the diapers and make sure that beer and diapers were sold at full price on Thursdays. This example can be identified as associative mining.

Data mining draws on three fields: classical statistics, artificial intelligence, and machine learning. Classical statistics plays the main role. It supplies concepts such as regression analysis (a method for fitting a curve through a set of points using some goodness-of-fit criterion), standard distributions, standard deviation, variance, discriminant analysis, cluster analysis, and confidence intervals, all of which are used to study data and data relationships.

Artificial intelligence applies human-thought-like processing to statistical problems. AI concepts have been adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS). [?] Machine learning attempts to let software learn about the data it studies, so that future decisions are based on the qualities of the studied data. [?]

Data mining consists of five
major elements in order to work properly:

I. Extract, transform, and load transaction data onto the data warehouse system.
II. Store and manage the data in a multidimensional database system.
III. Provide data access to business analysts and information technology professionals.
IV. Analyze the data using application software.
V. Present the data in a readable format.

WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte data warehouse. WalMart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use the data to identify customer buying patterns at the store display level, to manage local store inventory, and to identify new merchandising opportunities. In 1995, WalMart computers processed over 1 million complex data queries.

The goal of a data mining application is prediction. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships found. The process starts with exploration, or data preparation, which may involve cleaning and transforming the data. It then considers various models and chooses the best one based on predictive performance. Finally, it applies that model to new data in order to generate predictions or estimates of the expected result. For example, the National Basketball Association (NBA) is exploring a data mining application that combines and analyzes the records of basketball games using Advanced Scout software. This program helps to find patterns derived from game statistics, images, and the movements of the players. [?]

Today, data mining applications are available for systems of all sizes, including client/server and PC platforms, and prices range from several thousand dollars for the smallest applications up to
$1 million a terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. NCR has the capacity to deliver applications exceeding 100 terabytes.

Data mining has been used in many areas of science and engineering, such as bioinformatics, genetics, medicine, education, agriculture, electrical power engineering, and law enforcement. In human genetics, data mining is used to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer. This is very important for improving the diagnosis, prevention, and treatment of those diseases. The data mining technique used to perform this task is known as multifactor dimensionality reduction.
http://www.niehs.nih.gov/research/resource/databases/gac/guid.cfm

Agriculture is one of the most recent areas of data mining research, as is electrical power engineering. In electrical power engineering, data mining techniques have been used for condition monitoring of high-voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's insulation. Data mining techniques have also been applied to dissolved gas analysis on power transformers.

Data mining has also been used in education research to study the factors that lead students to engage in behaviors which reduce their learning, and to understand the factors influencing university students. Law enforcement agencies take advantage of data mining as well: the techniques provide information that helps to identify criminal suspects and to catch them by crime type, habit, and other patterns of behavior. In general, data mining
techniques speed up the data analysis process in each of these areas, leaving practitioners more time to work on other projects.

An article in the UCLA Anderson library describes how Blockbuster mines its video rental history database to recommend rentals to individual customers. American Express also suggests products to its cardholders based on analysis of their monthly expenditures.

Data mining also has disadvantages. One of the main ones is privacy. For example, American Express sold its customers' credit card purchase records to another company and, according to the Washington Post, in 1998 CVS sold its patients' prescription purchase records to a different company. Fraud caused by a lack of security is another disadvantage. Companies record and save customers' personal information online, and they may not have sufficient security systems in place to protect that information. Data mining can also lead to improper or even unlawful treatment of customers: since companies have access to customers' records, they may discriminate among customers based on purchase history. If you have spent a lot of money with a company or bought a lot of its products, your call may be answered sooner. So you should not assume that your call is really being answered in the order in which it was received.

Data mining techniques are divided into classical techniques and next-generation techniques. Classical techniques include classes and clustering. Classes are used to locate stored data in predetermined groups. Clustering is the process of grouping data items according to logical relationships or consumer preferences. Next-generation techniques include rules, networks, and trees. In a decision tree, each branch is a classification question and the leaves of the tree are partitions of the dataset with their classification. Decision trees can be viewed as segmentations of the original dataset, where each segment is one of the leaves of the tree. Decision tree technology can be used for exploration of datasets or business problems.
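The description above (branches as questions, leaves as classified partitions) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the paper: the toy customer data, the function names gini, best_split, build, and predict, and the use of Gini impurity to score candidate questions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure partition)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) question that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a question must actually divide the data
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build(rows, labels):
    """Grow a tree: dict nodes are questions, plain labels are leaves."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build(*map(list, zip(*left))),
        "right": build(*map(list, zip(*right))),
    }

def predict(tree, row):
    """Follow the questions from the root down to a leaf."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical customers: (age, monthly spend) -> whether they stay or churn.
rows = [(25, 20), (30, 80), (45, 90), (50, 30), (35, 70), (23, 25)]
labels = ["churn", "stay", "stay", "churn", "stay", "churn"]

tree = build(rows, labels)
print(tree)
print(predict(tree, (40, 85)))  # stay
```

On this toy data a single question on monthly spend separates the two classes, so the tree has one split and two leaves. Reading a fitted tree back in this way is one simple form of dataset exploration.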
This is often done by looking at the predictors and values that are chosen for each split of the tree. Often these predictors provide usable insights or suggest questions that need to be answered. Classification tree analysis is the term used when the predicted outcome is the class to which the data belongs. Regression tree analysis is the term used when the predicted outcome can be considered a real number. CART analysis is a term used to refer to both of the above procedures.

Sometimes data mining may impose patterns on data where none exist. This imposition of irrelevant correlations is called data dredging or data fishing. Data dredging is described as seeking more information from a data set than it actually contains. It results in relationships between variables being announced as significant when, in fact, the data require more study before such an association can be established. Large data sets invariably happen to have some unusual, exciting relationships peculiar to that data. Therefore any conclusions reached in this way are likely to be highly suspect.

In conclusion, data mining is a powerful asset to the organizations and corporations that mine their data. It has been used in different areas of science, engineering, research, education, sports, and law enforcement. Today, modern businesses consider data mining an important tool for transforming data into business intelligence, giving them an informational advantage. Although data mining techniques have made a valuable contribution to modern society, they also carry disadvantages such as privacy issues, weak security, and the misuse of information by some companies based on customer purchase history.

REFERENCES
Weisstein, E. Mathematica Wolfram.
Hand, D. Data Mining: Statistics and More? The American Statistician, Vol. 52.