Business Intelligence and Data Mining


Dr. Hui Xiong, Rutgers University

Learning Objectives
Understand the need for business intelligence systems. Know the characteristics of reporting systems. Know the purpose and role of data warehouses and data marts. Understand fundamental data mining techniques. Know the purpose, features, and functions of knowledge management systems.

The Need for Business Intelligence Systems
According to a study done at the University of California at Berkeley, a total of 403 petabytes of new data were created. 403 petabytes is roughly the amount of all printed material ever written. The printed collection of the Library of Congress is 0.01 petabytes, so 400 petabytes equals 40,000 copies of the print collection of the Library of Congress.

The Need for Business Intelligence Systems (Continued)
The generation of all these data has much to do with Moore's Law. The capacity of storage devices increases as their costs decrease. Today, storage capacity is nearly unlimited. We are drowning in data and starving for information.

Figure 9-1: How Big Is an Exabyte? Source: Used with permission of Peter Lyman and Hal R. Varian, University of California at Berkeley.
Figure 9-2: Hard Disk Storage Capacity. Source: Used with permission of Peter Lyman and Hal R. Varian, University of California at Berkeley.

Business Intelligence Tools
Tools that search business data in an attempt to find patterns are called business intelligence (BI) tools. Reporting tools are programs that read data from a variety of sources, process that data, produce formatted reports, and deliver those reports to the users who need them.

Business Intelligence Tools
In reporting tools, the processing of data is simple: data are sorted and grouped, and simple totals and averages are calculated. Reporting tools are used primarily for assessment. They address questions like: What has happened in the past? What is the current situation? How does the current situation compare to the past?

Business Intelligence Tools (Continued)
Data mining tools process data using statistical techniques, many of which are sophisticated and mathematically complex. Data mining involves searching for patterns and relationships among data. In most cases, data mining tools are used to make predictions. For example, we can use one form of analysis to compute the probability that a customer will default on a loan. Another way to distinguish reporting tools from data mining tools is by the operations they use: reporting tools use simple operations like sorting, grouping, and summing, whereas data mining tools use sophisticated techniques.

Business Intelligence Systems
An information system is a collection of hardware, software, data, procedures, and people. The purpose of a business intelligence (BI) system is to provide the right information, to the right user, at the right time. BI systems help users accomplish their goals and objectives by producing insights that lead to actions.

Business Intelligence Systems (Continued)
A reporting tool can generate a report that shows a customer has canceled an important order. A reporting system, however, alerts that customer's salesperson to this unwanted news, and does so in time for the salesperson to try to alter the customer's decision. A data mining tool can create an equation that computes the probability that a customer will default on a loan. A data mining system uses that equation to enable banking personnel to assess new loan applications.

Reporting Systems
The purpose of a reporting system is to create meaningful information from disparate data sources and to deliver that information to the proper user on a timely basis. Reporting systems generate information from data as a result of four operations: filtering data, sorting data, grouping data, and making simple calculations on the data.
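The four reporting operations named above (filtering, sorting, grouping, and simple calculations) can be sketched in a few lines of Python. This is only an illustration: the trade records, the field names, and the `build_report` helper are assumptions made up for the example, not part of any reporting product.

```python
from collections import defaultdict

# Hypothetical trade records; field names and values are illustrative only.
trades = [
    {"symbol": "AAA", "region": "East", "amount": 120.0},
    {"symbol": "BBB", "region": "West", "amount": 80.0},
    {"symbol": "AAA", "region": "West", "amount": 200.0},
    {"symbol": "CCC", "region": "East", "amount": 40.0},
]

def build_report(records, min_amount):
    """Filter, sort, group, and total the records, in that order."""
    # 1. Filtering: keep only records at or above the threshold.
    kept = [r for r in records if r["amount"] >= min_amount]
    # 2. Sorting: largest amounts first.
    kept.sort(key=lambda r: r["amount"], reverse=True)
    # 3. Grouping: collect the amounts by region.
    groups = defaultdict(list)
    for r in kept:
        groups[r["region"]].append(r["amount"])
    # 4. Simple calculations: a total and an average per group.
    return {
        region: {"total": sum(amounts), "average": sum(amounts) / len(amounts)}
        for region, amounts in groups.items()
    }

report = build_report(trades, min_amount=50.0)
```

Nothing here goes beyond sorting, grouping, and summing, which is the sense in which the slides call reporting operations simple.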

Figure 9-3: Trade Data for NDX.X (NASDAQ 100)
Figure 9-4: Report Based on Trade Data in Figure 9-3

Components of Reporting Systems
Figure 9-5: Components of a Reporting System
A reporting system maintains a database of reporting metadata. The metadata describes the reports, users, groups, roles, events, and other entities involved in the reporting activity. The reporting system uses the metadata to prepare and deliver reports to the proper users on a timely basis.

Figure 9-6: Summary of Report Characteristics

Report Type
In terms of report type, reports can be static or dynamic. Static reports are prepared once from the underlying data, and they do not change. An example is a report of the past year's sales. For dynamic reports, the reporting system reads the most current data and generates the report using that fresh data. Examples are a report on today's sales and a report on current stock prices.

Report Type (Continued)
Query reports are prepared in response to data entered by users. Online analytical processing (OLAP) reports allow the user to dynamically change the report grouping structures.

Report Media
Reports are delivered via many different report media, or channels. Some reports are printed on paper, and others are created in a format like PDF, whereby they can be printed or viewed electronically. Other reports are delivered to computer screens. Companies sometimes place reports on internal corporate Web sites for employees to access.

Report Media (Continued)
Another report medium is a digital dashboard, which is an electronic display customized for a particular user. Vendors like Yahoo! and MSN provide common examples. Users of these services can define the content they want: say, a local weather forecast, a list of stock prices, or a list of news sources. The vendor constructs the display customized for each user.

Report Media (Continued)
Other dashboards are particular to an organization. The organization might have a dashboard that shows up-to-the-minute production and sales activities. Alerts are another form of report. Users can declare that they wish to receive notifications of events, say, via email or on their cell phones. Reports can also be published via a Web service. The Web service produces the report in response to requests from the service-consuming application.

Figure 9-7: Digital Dashboard Example

Report Mode
The report mode can be either push or pull. Organizations send a push report to users according to a preset schedule. Users receive the report without any activity on their part. Users must request a pull report. To obtain a pull report, a user goes to a Web portal or digital dashboard and clicks a link or button to cause the reporting system to produce and deliver the report.

Functions of Reporting Systems
The three functions of reporting systems are authoring, management, and delivery. Report authoring involves connecting to data sources, creating the reporting structure, and formatting the report.

Report Management
The purpose of report management is to define who receives what reports, when, and by what means. Most report management systems allow the report administrator to define user accounts and user groups and to assign particular users to particular groups. Reports that have been created using the report-authoring system are assigned to groups and users.

Report Management (Continued)
Assigning reports to groups saves the administrator work. When a report is created, changed, or removed, the administrator need only change the report assignments to the group. All of the users in the group will inherit the changes. Metadata also indicates what channel is to be used and whether the report is to be pushed or pulled. If the report is to be pushed, the administrator declares whether the report is to be generated on a regular schedule or as an alert.

Report Delivery
The report delivery function of a reporting system pushes reports or allows them to be pulled according to report management metadata. Reports can be delivered via an email server, Web site, XML Web services, or by other program-specific means. The report delivery system uses the operating system and other program security components to ensure that only authorized users receive authorized reports.

Report Delivery (Continued)
The report delivery system also ensures that push reports are produced at appropriate times. For query reports, the report delivery system serves as an intermediary between the user and the report generator. It receives user query data, such as item numbers in an inventory query, passes the query data to the report generator, receives the resulting report, and delivers the report to the user.

Online Analytical Processing
Online analytical processing (OLAP) provides the ability to sum, count, average, and perform other simple arithmetic operations on groups of data. The remarkable characteristic of OLAP reports is that they are dynamic. The viewer of the report can change the report's format; hence the term online.
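The grouping arithmetic behind an OLAP report can be illustrated with a short Python sketch. The sales rows, the field names `store_type` and `region`, and the `summarize` helper are hypothetical; the point is only that the same figure is summed under whichever grouping the viewer selects, and that adding a grouping field splits each total into finer detail.

```python
from collections import defaultdict

# Hypothetical sales rows: (store_type, region, sales). Values are made up.
rows = [
    ("Supermarket", "East", 100),
    ("Supermarket", "West", 150),
    ("Deluxe", "East", 60),
    ("Deluxe", "West", 90),
]
FIELDS = {"store_type": 0, "region": 1}

def summarize(rows, group_by):
    """Sum sales for each combination of the chosen grouping fields."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[FIELDS[name]] for name in group_by)
        totals[key] += row[2]
    return dict(totals)

# The viewer first groups by store type only...
by_type = summarize(rows, ["store_type"])
# ...then regroups the same data by store type and region for more detail.
by_type_region = summarize(rows, ["store_type", "region"])
```

A real OLAP server would precompute and store such totals rather than rescanning the rows on every change of grouping.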

Online Analytical Processing
An OLAP report has measures and dimensions. A measure is the data item of interest. It is the item that is to be summed or averaged or otherwise processed in the OLAP report. A dimension is a characteristic of a measure. Purchase date, customer type, customer location, and sales region are all examples of dimensions.

Online Analytical Processing (Continued)
With an OLAP report, it is possible to drill down into the data. This term means to further divide the data into more detail. Special-purpose products called OLAP servers have been developed to perform OLAP analysis. An OLAP server reads data from an operational database, performs preliminary calculations, and stores the results of those operations in an OLAP database.

Figure 9-13: OLAP Family and Store Location by Store Type
Figure 9-14: Role of OLAP Server and OLAP Database

Data Warehouses and Data Marts
Basic reports and simple OLAP analyses can be made directly from operational data. For the most part, such reports display the current state of the business, and if there are a few missing values or small inconsistencies in the data, no one is too concerned. Operational data are unsuited to more sophisticated analyses, particularly data mining analyses that require high-quality input for accurate and useful results.

Data Warehouses and Data Marts (Continued)
Many organizations choose to extract operational data into facilities called data warehouses and data marts, both of which prepare, store, and manage data specifically for data mining and other analyses. Programs read operational data and extract, clean, and prepare that data for BI processing. The prepared data are stored in a data warehouse database using a data warehouse DBMS, which can be different from the organization's operational DBMS.

Data Warehouses and Data Marts
Figure 9-15: Components of a Data Warehouse
Data warehouses include data that are purchased from outside sources. Metadata concerning the data (its source, its format, its assumptions and constraints, and other facts about the data) is kept in a data warehouse metadata database. The data warehouse DBMS extracts and provides data to business intelligence tools such as data mining programs.

Figure 9-16: Consumer Data Available for Purchase from Data Vendors

Problems with Operational Data (Continued)
Inconsistent data are particularly common for data that have been gathered over time. When an area code changes, for example, the phone number for a given customer before the change will not match the customer's number after the change. Some data inconsistencies arise from the nature of the business activity. Nonintegrated data can cause problems when data come from different management information systems.

Figure 9-17: Problems of Using Transaction Data for Analysis and Data Mining

Data Warehouses Versus Data Marts
The data warehouse takes data from the data manufacturers (operational systems and purchased data), cleans and processes the data, and locates the data on the shelves, so to speak, of the data warehouse. A data mart is a data collection, smaller than the data warehouse, that addresses a particular component or functional area of the business.

Data Warehouse Versus Data Marts (Continued)
Figure 9-18: Data Mart Examples
The data warehouse is like the distributor in the supply chain, and the data mart is like the retail store in the supply chain. Users in the data mart obtain data that pertain to a particular business function from the data warehouse. It is expensive to create, staff, and operate data warehouses and data marts.

Data Mining and Business Intelligence: Knowledge Discovery in Data
Dr. Hui Xiong, Rutgers University

Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehouseed: Web data and e-commerce; purchases at department and grocery stores; bank and credit card transactions. Computers have become cheaper and more powerful. Competitive pressure is strong: companies provide better, customized services for an edge (e.g., in Customer Relationship Management).

Why Mine Data? Scientific Viewpoint
Data are collected and stored at enormous speeds (GB/hour): remote sensors on a satellite; telescopes scanning the skies; microarrays generating gene expression data; scientific simulations generating terabytes of data. Traditional techniques are infeasible for raw data. Data mining may help scientists in classifying and segmenting data and in hypothesis formation.

Mining Large Data Sets: Motivation
There is often information hidden in the data that is not readily evident. Human analysts may take weeks to discover useful information. Much of the data is never analyzed at all.
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts. From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications.]

Scale of Data
Organization        Scale of Data
Walmart             ~ 20 million transactions/day
Google              ~ 8.2 billion Web pages
Yahoo               ~ [illegible] GB Web data/hr
NASA satellites     ~ 1.2 TB/day
NCBI GenBank        ~ 22 million genetic sequences
France Telecom      [illegible] TB
UK Land Registry    18.3 TB
AT&T Corp           26.2 TB
The great strength of computers is that they can reliably manipulate vast amounts of data very quickly. Their great weakness is that they don't have a clue as to what any ...

Why Do We Need Data Mining?
Leverage the organization's data assets. Only a small portion (typically 5% to [illegible]%) of the collected data is ever analyzed. Data that may never be analyzed continue to be collected, at great expense, out of fear that something that may prove important in the future is missing. The growth rate of data precludes the traditional, manually intensive approach.

Why Do We Need Data Mining?
As databases grow, supporting the decision support process with traditional query languages becomes infeasible. Many queries of interest are difficult to state in a query language (the query formulation problem): find all cases of fraud; find all individuals likely to buy a Ford Expedition; find all documents that are similar to this customer's problem.

What Is Data Mining?
Many definitions exist. One: non-trivial extraction of implicit, previously unknown, and potentially useful information from data. Another: exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

What Is (Not) Data Mining?
What is not data mining: looking up a phone number in a phone directory; checking the dictionary for the meaning of a word.
What is data mining: finding that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area); grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com).

Data Mining: Confluence of Multiple Disciplines?
[Figure residue: 20x20 ~ 2^400 ≈ 10^120 patterns]

Data Mining Applications
Market analysis; risk analysis and management; fraud detection and detection of unusual patterns (outliers); text mining (news groups, email, documents) and Web mining; stream data mining; DNA and bio data analysis.

Fraud Detection & Mining Unusual Patterns
Approaches: clustering and model construction for frauds, outlier analysis.
Applications: health care, retail, credit card service.
Auto insurance: ring of collisions.
Money laundering: suspicious monetary transactions.
Medical insurance: professional patients, ring of doctors, and ring of references; unnecessary or correlated screening tests.
Telecommunications: phone call fraud. The phone call model covers destination of the call, duration, and time of day or week. Analyze patterns that deviate from an expected norm.
Retail industry: analysts estimate that 38% of retail shrink is due to dishonest employees.
Anti-terrorism.

Data Mining and Business Intelligence: Data Mining Tasks
[Figure: Data Mining Tasks, illustrated on a sample data table with columns Tid, Refund, Marital Status, Taxable Income, and Cheat (Tids 1 to 15), and the item "Milk".]

Clustering
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups. Inter-cluster distances are maximized; intra-cluster distances are minimized.

Applications of Cluster Analysis
Understanding: group related documents for browsing; group genes and proteins that have similar functionality; group stocks with similar price fluctuations.
Summarization: reduce the size of large data sets.
[Figure: Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.]

Discovered Clusters and Industry Groups
Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed--Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

Clustering: Application 1
Market Segmentation
Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach: Collect different attributes of customers based on their geographical and lifestyle-related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in the same cluster versus those from different clusters.
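The step "find clusters of similar customers" could be carried out with K-means, which the slides mention by name. The following is a minimal sketch of basic K-means, not a production implementation: the customer points (two made-up numeric attributes each), the fixed starting centers, and the fixed iteration count are all assumptions chosen to keep the example deterministic.

```python
import math

def kmeans(points, centers, iterations=10):
    """Basic K-means: alternate nearest-center assignment and centroid update."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        centers = new_centers
    return centers, clusters

# Hypothetical customers described by two scaled attributes.
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(customers, centers=[(0, 0), (10, 10)])
```

In practice the attributes would need scaling to comparable ranges, and the result depends on the starting centers, so several restarts are common.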
Clustering: Application 2
Document Clustering
Goal: find groups of documents that are similar to each other based on the important terms appearing in them.
Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

What Is Not Cluster Analysis?
Simple segmentation: dividing students into different registration groups alphabetically, by last name.
Results of a query: groupings are a result of an external specification, whereas clustering is a grouping of objects based on the data.
Supervised classification: class label information is available.
Association analysis: local versus global connections.

Notion of a Cluster Can Be Ambiguous
How many clusters? [Figure panels: Two Clusters, Four Clusters, Six Clusters.]
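The document clustering approach above calls for a similarity measure based on term frequencies. One common choice, assumed here because the slides do not name a specific measure, is cosine similarity between term-frequency vectors. The three short documents and the whitespace tokenizer are illustrative only.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "amazon rainforest trees river",
    "rainforest river trees wildlife",
    "amazon online shopping books",
]
vectors = [term_frequencies(d) for d in docs]
sim_forest = cosine_similarity(vectors[0], vectors[1])  # share three terms
sim_store = cosine_similarity(vectors[0], vectors[2])   # share one term
```

The two rainforest documents score higher than the rainforest and shopping pair, so a clustering algorithm fed these similarities would tend to group them together.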

Types of Clusterings
A clustering is a set of clusters. There is an important distinction between hierarchical and partitional sets of clusters.
Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.

Partitional Clustering
[Figure panels: Original Points; A Partitional Clustering.]

Hierarchical Clustering
[Figure panels, each over points p1 to p4: Traditional Hierarchical Clustering; Traditional Dendrogram; Non-traditional Hierarchical Clustering; Non-traditional Dendrogram.]

Other Distinctions Between Sets of Clusters
Exclusive versus non-exclusive: in non-exclusive clusterings, points may belong to multiple clusters. This can represent multiple classes or border points.
Fuzzy versus non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1. The weights must sum to 1. Probabilistic clustering has similar characteristics.
Partial versus complete: in some cases, we only want to cluster some of the data.
Heterogeneous versus homogeneous: clusters of widely different sizes, shapes, and densities.

Types of Clusters
Well-separated clusters; center-based clusters; contiguous clusters; density-based clusters; property or conceptual clusters; clusters described by an objective function.

Types of Clusters: Well-Separated
A well-separated cluster is a set of points such that any point in the cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
[Figure: 3 well-separated clusters.]

Types of Clusters: Center-Based
A center-based cluster is a set of objects such that an object in the cluster is closer (more similar) to the center of its cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of the cluster.
[Figure: 4 center-based clusters.]

Types of Clusters: Contiguity-Based
A contiguous cluster (nearest neighbor or transitive) is a set of points such that a point in the cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
[Figure: 8 contiguous clusters.]

Types of Clusters: Density-Based
A density-based cluster is a dense region of points that is separated from other regions of high density by low-density regions. This definition is used when the clusters are irregular or intertwined, and when noise and outliers are present.
[Figure: 6 density-based clusters.]

Types of Clusters: Conceptual Clusters
Shared-property or conceptual clustering finds clusters that share some common property or represent a particular concept.
[Figure: 2 overlapping circles.]

Characteristics of the Input Data Are Important
Type of proximity or density measure: this is a derived measure, but central to clustering.
Sparseness: dictates type of similarity; adds to efficiency.
Attribute type: dictates type of similarity.
Type of data: dictates type of similarity; other characteristics, e.g., autocorrelation.
Dimensionality.
Noise and outliers.
Type of distribution.
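The two kinds of cluster center described above differ in one practical respect: a centroid is an average and need not coincide with any data point, while a medoid is always a member of the cluster. The sketch below assumes one common reading of "most representative point", namely the member with the smallest total distance to the other members; the sample points are made up.

```python
import math

def centroid(points):
    """Coordinate-wise average of all points in the cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """Member point with the smallest total distance to all members (assumed definition)."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# A hypothetical cluster with one far-away member.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (14.0, 0.0)]
c = centroid(cluster)
m = medoid(cluster)
```

Here the distant point pulls the centroid to a location where no data point lies, while the medoid stays on an actual member, which is one reason medoids are sometimes preferred when outliers are present.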

Association Rule Discovery: Definition
Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that will predict the occurrence of an item based on occurrences of other items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Association Analysis: Applications
Market basket analysis: rules are used for sales promotion, shelf management, and inventory management.
Telecommunication alarm diagnosis: rules are used to find combinations of alarms that occur together frequently in the same time period.
Medical informatics: rules are used to find combinations of patient symptoms and complaints associated with certain diseases.

Predictive Modeling: Classification
Find a model for the class attribute as a function of the values of other attributes.
[Table: records with columns Tid, Employed, Level of Education, # years at present address, and Credit Worthy.]
[Figure: Model for predicting credit worthiness. A decision tree that tests Employed, Education (Graduate versus {High School, Undergrad}), and Number of years (> 3 yr / < 3 yr; > 7 yrs / < 7 yrs).]

Classification Example
[Figure: a training set (columns Tid, Employed, Level of Education, # years at present address, Credit Worthy) is used to learn a classifier; the resulting model is applied to a test set of three records (Yes, Undergrad, 7; [illegible], Graduate, 3; Yes, High School, 2) whose Credit Worthy values are unknown.]
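The two rules discovered from the five market-basket transactions above can be checked by counting. The slides do not define how rule strength is measured; the sketch below assumes the two standard measures, support (the fraction of transactions containing all items in the rule) and confidence (among transactions containing the left-hand side, the fraction that also contain the right-hand side). The function names are illustrative.

```python
# The five transactions from the TID/Items table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, fraction that also have the consequent."""
    return (count(antecedent | consequent, transactions)
            / count(antecedent, transactions))

milk_coke = (support({"Milk"}, {"Coke"}, transactions),
             confidence({"Milk"}, {"Coke"}, transactions))
diaper_milk_beer = (support({"Diaper", "Milk"}, {"Beer"}, transactions),
                    confidence({"Diaper", "Milk"}, {"Beer"}, transactions))
```

Under these definitions, {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.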

Examples of Classification Task
Predicting tumor cells as benign or malignant. Classifying credit card transactions as legitimate or fraudulent. Classifying secondary structures of protein as alpha helix, beta sheet, or random coil. Categorizing news stories as finance, weather, entertainment, sports, etc. Identifying intruders in cyberspace.

Classification: Application 1
Fraud Detection
Goal: predict fraudulent cases in credit card transactions.
Approach: Use credit card transactions and the information on the account holder as attributes: when the customer buys, what the customer buys, how often the customer pays on time, etc. Label past transactions as fraud or fair transactions. This forms the class attribute. Learn a model for the class of the transactions. Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 2
Churn Prediction for Telephone Customers
Goal: predict whether a customer is likely to be lost to a competitor.
Approach: Use detailed records of transactions with each of the past and present customers to find attributes: how often the customer calls, where the customer calls, what time of day the customer calls most, financial status, marital status, etc. Label the customers as loyal or disloyal. Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 3
Sky Survey Cataloging
Goal: predict the class (star or galaxy) of sky objects, especially visually faint ones, based on telescopic survey images (from Palomar Observatory), with 23,040 x 23,040 pixels per image.
Approach: Segment the image. Measure image attributes (features), 40 of them per object. Model the class based on these features.
Success story: could find 16 new high red-shift quasars, some of the farthest objects, which are difficult to find!
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies
Class: stages of formation (Early, Intermediate, Late).
Attributes: image features, characteristics of light waves received, etc.
Data size: 72 million stars, 20 million galaxies. Object catalog: 9 GB. Image database: 150 GB.

Classification Techniques
Base classifiers: decision tree based methods; rule-based methods; nearest neighbor; neural networks; Naïve Bayes and Bayesian belief networks; support vector machines.
Ensemble classifiers: boosting, bagging, random forests.

Example of a Decision Tree
[Table: Training Data. Ten records with columns ID, Marital Status, Annual Income, and Defaulted Borrower.]
[Figure: Model: Decision Tree. The splitting attributes include MarSt (branches "Single, Divorced" and "Married") and Income (branches "< 80K" and "> 80K"); the Income "> 80K" branch ends in a YES leaf.]

Another Example of Decision Tree
[Figure: a different tree over the same training data, again using the MarSt and Income splits.]
There could be more than one tree that fits the same data!

Decision Tree Classification Task
[Figure: a training set (columns Tid, Attrib1, Attrib2, Attrib3, Class; Tids 1 to 10) feeds a Learn Model step that produces a decision tree; an Apply Model step then uses the tree to label a test set (Tids 11 to 15) whose Class values are unknown.]

Apply Model to Test Data
Start from the root of the tree. Test data: Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?.
[Figure: the decision tree, traversed one node at a time across successive slides.]

Apply Model to Test Data (Continued)
[Figure: the traversal continues for the test record (Marital Status = Married, Annual Income = 80K) through the MarSt and Income nodes.]
Assign Defaulted to "No".

Decision Tree Induction
Many algorithms: Hunt's Algorithm (one of the earliest); CART; ID3, C4.5; SLIQ, SPRINT.
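Applying a decision tree to a test record, as the slides above do step by step, means following one branch per node until a leaf is reached. The sketch below covers only the part of the tree that survives in this transcription: the MarSt split and the Income split at 80K. Several things are assumptions: the root test on a yes/no attribute was lost in extraction and is omitted, unlabeled leaves are taken to be "No", an income of exactly 80K is sent to the "No" side, and the function name `classify` is hypothetical.

```python
def classify(marital_status, annual_income_k):
    """Predict Defaulted Borrower ("Yes" or "No") from the recoverable subtree.

    Assumptions: leaves whose labels were lost are "No", and an income
    of exactly 80K falls on the "No" side of the Income split.
    """
    # MarSt node: married borrowers reach a leaf directly.
    if marital_status == "Married":
        return "No"
    # Income node, reached by the "Single, Divorced" branch.
    if annual_income_k > 80:
        return "Yes"
    return "No"

# The test record from the slides: Married, 80K.
prediction = classify("Married", 80)
```

For the test record the walk stops at the Married branch, which matches the slide's conclusion that Defaulted is assigned "No".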

Deviation/Anomaly Detection
Detect significant deviations from normal behavior. Applications: credit card fraud detection; network intrusion detection.

Anomaly Detection
Challenges: How many outliers are there in the data? The method is unsupervised, so validation can be quite challenging (just like for clustering). It is like finding a needle in a haystack.
Working assumption: there are considerably more normal observations than abnormal observations (outliers/anomalies) in the data.

Anomaly Detection Schemes
General steps: Build a profile of the normal behavior. The profile can be patterns or summary statistics for the overall population. Use the normal profile to detect anomalies. Anomalies are observations whose characteristics differ significantly from the normal profile.
Types of anomaly detection schemes: graphical and statistical-based; distance-based; model-based.

Graphical Approaches
Boxplot (1-D), scatter plot (2-D), spin plot (3-D). Limitations: time consuming; subjective.

Statistical Approaches
Assume a parametric model describing the distribution of the data (e.g., normal distribution). Apply a statistical test that depends on the data distribution, the parameters of the distribution (e.g., mean, variance), and the number of expected outliers (confidence limit).

Limitations of Statistical Approaches
Most of the tests are for a single attribute. In many cases, the data distribution may not be known. For high-dimensional data, it may be difficult to estimate the true distribution.
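A very simple instance of the statistical approach assumes the values of a single attribute are roughly normally distributed and flags any value whose distance from the mean, measured in standard deviations, exceeds a chosen threshold. The z-score rule, the threshold of 2.5, and the sample values below are assumptions for illustration; the slides do not prescribe a specific test.

```python
import statistics

def zscore_outliers(values, threshold):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical single-attribute data with one extreme value.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
outliers = zscore_outliers(values, threshold=2.5)
```

The example also hints at a limitation: the extreme value inflates both the mean and the standard deviation it is judged against, so in a sample of ten points no z-score can exceed 3 and a threshold of 3 would flag nothing.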

Distance-based Approaches
- Data is represented as a vector of features.
- Three major approaches:
  - Nearest neighbor based
  - Density based
  - Clustering based

Nearest Neighbor Based Approach
- Approach: compute the distance between every pair of data points.
- There are various ways to define outliers:
  - Data points for which there are fewer than p neighboring points within a distance D
  - The top n data points whose distance to the kth nearest neighbor is greatest
  - The top n data points whose average distance to the k nearest neighbors is greatest

Density Based: LOF Approach
- For each point, compute the density of its local neighborhood.
- Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors.
- Outliers are the points with the largest LOF values.
- In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.

Clustering Based
- Basic idea:
  - Cluster the data into groups of different density.
  - Choose points in small clusters as candidate outliers.
  - Compute the distance between candidate points and non-candidate clusters.
  - If candidate points are far from all other non-candidate points, they are outliers.

KDD Process
- Develop an understanding of the application domain: relevant prior knowledge, problem objectives, success criteria, current solution, inventory resources, constraints, terminology, cost and benefits.
- Create target data set: collect initial data, describe, focus on a subset of variables, verify data quality.
- Data cleaning and preprocessing: remove noise, outliers, missing fields, time sequence information, known trends, integrate data.
- Data reduction and projection: feature subset selection, feature construction, discretizations, aggregations.

KDD Process (Continued)
- Selection of data mining task: classification, segmentation, deviation detection, link analysis.
- Select data mining approach.
- Data mining to extract patterns or models.
- Interpretation and evaluation of patterns/models.
- Consolidating discovered knowledge.
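The second nearest-neighbor definition of an outlier (the top n points whose distance to the kth nearest neighbor is greatest) can be sketched as follows. This is a brute-force illustration rather than anything taken from the slides: it assumes Euclidean distance, and the function names and sample points are hypothetical.

```python
import math

def knn_outlier_scores(points, k):
    """Return, for each point, the distance to its k-th nearest neighbor.

    Brute force: compares every pair of points, as the slide describes.
    Requires 1 <= k <= len(points) - 1.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_n_outliers(points, k, n):
    """Return the n points with the largest k-th nearest neighbor distance."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return [points[i] for i in ranked[:n]]

# Hypothetical data: a tight cluster near the origin plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
print(top_n_outliers(points, k=2, n=1))  # prints [(10, 10)]
```

Comparing every pair of points costs time quadratic in the number of points, so this form is only practical for small data sets.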

Knowledge Discovery

Challenges of Data Mining
- Scalability
- Dimensionality
- Complex and Heterogeneous Data
- Data Quality
- Data Ownership and Distribution
- Privacy Preservation
- Streaming Data
- Data from Multiple Sources

Similarities Between Data Miners and Doctors
[Figure labels: Data Characteristics, Data Mining Techniques, Medical Devices, Textbooks]

Commercial and Research Tools
- WEKA
- SAS
- Clementine
- Intelligent Miner: 3.ibm.com/software/data/iminer/
- Insightful Miner

Knowledge Management
- Knowledge management systems concern the sharing of knowledge that is already known to exist, either in libraries of documents, in the heads of employees, or in other known sources.
- Knowledge management (KM) is the process of creating value from intellectual capital and sharing that knowledge with employees, managers, suppliers, customers, and others who need that capital.

Knowledge Management (Continued)
- Knowledge management is a process that is supported by the five components of an information system.
- Its emphasis is on people, their knowledge, and effective means for sharing that knowledge with others.
- The benefits of KM concern the application of knowledge to enable employees and others to leverage organizational knowledge to work smarter.
- KM preserves organizational memory by capturing and storing the lessons learned and best practices of key employees.

Content Management Systems
- Content management systems are information systems that track organizational documents, Web pages, graphics, and related materials.
- Such systems differ from operational document systems in that they do not directly support business operations.
- KM content management systems are concerned with the creation, management, and delivery of documents that exist for the purpose of imparting knowledge.

Content Management Systems (Continued)
- Typical users of content management systems are companies that sell complicated products and want to share their knowledge of those products with employees and customers.
- The basic functions of content management systems are the same as for report management systems: author, manage, and deliver.
- The only requirement that content managers place on document authoring is that the document has been created in a standardized format.

Content Management Problems
- Documents may refer to one another, or multiple documents may refer to the same product or procedure. When one of them changes, others must change as well.
- Some content management systems keep semantic linkages among documents so that content dependencies can be known and used to maintain document consistency.
- Document contents are perishable. Documents become obsolete and need to be altered, removed, or replaced.
- Multinational companies have to ensure document language translations.
Figure 9-23 Document Management at Microsoft.com (as of December 2003)
Source: microsoft.com/backstage/inside.htm (accessed February 2004). Microsoft Corporation. All rights reserved.

Figure 9-24 Reporting Services: United States
Source: Used with permission of Tom Rizzo of Microsoft Corporation.

Figure 9-25 Reporting Services: China
Source: Used with permission of Tom Rizzo of Microsoft Corporation.

Content Delivery
- Almost all users of content management systems pull the contents.
- Users cannot pull content if they do not know it exists.
- The content must be arranged and indexed, and a facility for searching the content devised.
- Documents that reside behind a corporate firewall, however, are not publicly accessible and will not be reachable by Google or other search engines.
- Organizations must index their own proprietary documents and provide their own search capability for them.

KM Systems to Facilitate the Sharing of Human Knowledge
- Nothing is more frustrating for a manager to contemplate than the situation in which one employee struggles with a problem that another employee knows how to solve easily.
- KM systems are concerned not only with the sharing of content, but also with the sharing of knowledge among humans.
  - How can one person share her knowledge with another?
  - How can one person learn of another person's great idea?

KM Systems to Facilitate the Sharing of Human Knowledge (Continued)
- Three forms of technology are used for knowledge sharing among humans:
  - Portals, discussion groups, and email
  - Collaboration systems
  - Expert systems
- Portals: employees can share ideas by posting knowledge on a Web portal, from which managers and employees can pull the knowledge.

Figure 9-26 Technology Support of Sharing Human Knowledge

KM Systems to Facilitate the Sharing of Human Knowledge (Continued)
- Discussion groups
  - Discussion groups allow employees or customers to post questions and queries seeking solutions to problems they have.
  - Oracle, IBM, PeopleSoft, and other vendors support product discussion groups where users can post questions and where employees, vendors, and other users can answer them.
  - Later, the organization can edit and summarize the questions from such discussion groups into frequently asked questions (FAQs).

KM Systems to Facilitate the Sharing of Human Knowledge (Continued)
- Discussion groups (continued)
  - Basic email can also be used for knowledge sharing, especially if email lists have been constructed with KM in mind.
- Two human factors inhibit knowledge sharing:
  - Employees can be reluctant to exhibit their ignorance.
  - Competition exists between employees.
- A KM application may be ill-suited to a competitive group.
- The company may be able to restructure rewards and incentives to foster sharing of ideas among employees.

KM Systems to Facilitate the Sharing of Human Knowledge (Continued)
- Collaboration systems
  - Collaboration systems are information systems that enable people to work together more effectively.
  - The Internet can be used as a broadcast medium for speeches, panel discussions, and other types of meetings.
  - Web broadcasts, because they are digital, can be readily saved and replayed at the viewer's convenience.
  - Web broadcasts can also be made interactive by combining them with discussion group bulletin boards that are live during the broadcast.
  - Video conferencing is another popular form of IT-supported meeting.
  - Video conferencing equipment is expensive and normally is located in selected sites in the organization.

Figure 9-27 Net Meeting Graphic

KM Systems to Facilitate the Sharing of Human Knowledge (Continued)
- Expert systems
  - Expert systems are created by interviewing experts in a given business domain and codifying the rules stated by those experts.
  - Many expert systems were created in the late 1980s and 1990s, and some of them have been successful.
  - Expert systems suffer from three major disadvantages:
    - They are difficult and expensive to develop.
    - They are difficult to maintain.
    - They were unable to live up to the high expectations set by their name.
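To make "codifying the rules stated by those experts" concrete, here is a minimal sketch of rules expressed as condition/conclusion pairs and applied to a set of facts. It is only an illustration: the `Rule` class, the two credit-related rules, and their thresholds are hypothetical and do not come from the slides or from any particular expert system product.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str                          # label for the rule
    condition: Callable[[dict], bool]  # test applied to the known facts
    conclusion: str                    # advice issued when the test passes

# Hypothetical rules, as they might be stated by a credit manager.
RULES = [
    Rule("late_payments", lambda f: f.get("late_payments", 0) >= 3,
         "refer to credit manager"),
    Rule("large_order", lambda f: f.get("order_total", 0) > 10_000,
         "require deposit"),
]

def evaluate(facts, rules=RULES):
    """Return the conclusions of every rule whose condition holds for the facts."""
    return [r.conclusion for r in rules if r.condition(facts)]

print(evaluate({"late_payments": 4, "order_total": 500}))
# prints ['refer to credit manager']
```

Even in this toy form, every change in business policy means editing the rule list, which hints at why the slides describe expert systems as difficult to maintain.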


More information

Q1 Define the following: Data Mining, ETL, Transaction coordinator, Local Autonomy, Workload distribution

Q1 Define the following: Data Mining, ETL, Transaction coordinator, Local Autonomy, Workload distribution Q1 Define the following: Data Mining, ETL, Transaction coordinator, Local Autonomy, Workload distribution Q2 What are Data Mining Activities? Q3 What are the basic ideas guide the creation of a data warehouse?

More information

Chapter 3: Cluster Analysis

Chapter 3: Cluster Analysis Chapter 3: Cluster Analysis 3.1 Basic Concepts of Clustering 3.2 Partitioning Methods 3.3 Hierarchical Methods 3.4 Density-Based Methods 3.5 Model-Based Methods 3.6 Clustering High-Dimensional Data 3.7

More information

Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms

Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms Data Mining Techniques forcrm Data Mining The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets. Extremely large datasets Discovery of the non-obvious Useful knowledge

More information

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.

More information

Customer Analytics. Turn Big Data into Big Value

Customer Analytics. Turn Big Data into Big Value Turn Big Data into Big Value All Your Data Integrated in Just One Place BIRT Analytics lets you capture the value of Big Data that speeds right by most enterprises. It analyzes massive volumes of data

More information

Data Mining for Successful Healthcare Organizations

Data Mining for Successful Healthcare Organizations Data Mining for Successful Healthcare Organizations For successful healthcare organizations, it is important to empower the management and staff with data warehousing-based critical thinking and knowledge

More information

Digging for Gold: Business Usage for Data Mining Kim Foster, CoreTech Consulting Group, Inc., King of Prussia, PA

Digging for Gold: Business Usage for Data Mining Kim Foster, CoreTech Consulting Group, Inc., King of Prussia, PA Digging for Gold: Business Usage for Data Mining Kim Foster, CoreTech Consulting Group, Inc., King of Prussia, PA ABSTRACT Current trends in data mining allow the business community to take advantage of

More information

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

Nine Common Types of Data Mining Techniques Used in Predictive Analytics 1 Nine Common Types of Data Mining Techniques Used in Predictive Analytics By Laura Patterson, President, VisionEdge Marketing Predictive analytics enable you to develop mathematical models to help better

More information

Data Mining. Introduction to Modern Information Retrieval from Databases and the Web. Administrivia

Data Mining. Introduction to Modern Information Retrieval from Databases and the Web. Administrivia Administrivia Data Mining Introduction to Modern Information Retrieval from Databases and the Web Instructor: Kostis Sagonas (MIC, Hus 1, 352) Course home page: http://user.it.uu.se/~kostis/teaching/dm-05/

More information

Data Mining: An Introduction

Data Mining: An Introduction Data Mining: An Introduction Michael J. A. Berry and Gordon A. Linoff. Data Mining Techniques for Marketing, Sales and Customer Support, 2nd Edition, 2004 Data mining What promotions should be targeted

More information

http://datamining.rutgers.edu Financial Fraud Detection and Prevention with Data Mining Techniques Professor Hui Xiong Rutgers Business School

http://datamining.rutgers.edu Financial Fraud Detection and Prevention with Data Mining Techniques Professor Hui Xiong Rutgers Business School Financial Fraud Detection and Prevention with Data Mining Techniques Professor Hui Xiong Rutgers Business School What is Financial Fraud General Violation of Good Behavior in regards to Financial Issues

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Business Intelligence Solutions for Gaming and Hospitality

Business Intelligence Solutions for Gaming and Hospitality Business Intelligence Solutions for Gaming and Hospitality Prepared by: Mario Perkins Qualex Consulting Services, Inc. Suzanne Fiero SAS Objective Summary 2 Objective Summary The rise in popularity and

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

Why is Internal Audit so Hard?

Why is Internal Audit so Hard? Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets

More information

Statistical Challenges with Big Data in Management Science

Statistical Challenges with Big Data in Management Science Statistical Challenges with Big Data in Management Science Arnab Kumar Laha Indian Institute of Management Ahmedabad Analytics vs Reporting Competitive Advantage Reporting Prescriptive Analytics (Decision

More information

Data Mining System, Functionalities and Applications: A Radical Review

Data Mining System, Functionalities and Applications: A Radical Review Data Mining System, Functionalities and Applications: A Radical Review Dr. Poonam Chaudhary System Programmer, Kurukshetra University, Kurukshetra Abstract: Data Mining is the process of locating potentially

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

Machine Learning, Data Mining, and Knowledge Discovery: An Introduction

Machine Learning, Data Mining, and Knowledge Discovery: An Introduction Machine Learning, Data Mining, and Knowledge Discovery: An Introduction AHPCRC Workshop - 8/17/10 - Dr. Martin Based on slides by Gregory Piatetsky-Shapiro from Kdnuggets http://www.kdnuggets.com/data_mining_course/

More information

Nagarjuna College Of

Nagarjuna College Of Nagarjuna College Of Information Technology (Bachelor in Information Management) TRIBHUVAN UNIVERSITY Project Report on World s successful data mining and data warehousing projects Submitted By: Submitted

More information

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD Presenter Proven Performance Since 1995 TDWI helps business and IT professionals gain insight about data warehousing,

More information

Discovering, Not Finding. Practical Data Mining for Practitioners: Level II. Advanced Data Mining for Researchers : Level III

Discovering, Not Finding. Practical Data Mining for Practitioners: Level II. Advanced Data Mining for Researchers : Level III www.cognitro.com/training Predicitve DATA EMPOWERING DECISIONS Data Mining & Predicitve Training (DMPA) is a set of multi-level intensive courses and workshops developed by Cognitro team. it is designed

More information

Master of Science in Health Information Technology Degree Curriculum

Master of Science in Health Information Technology Degree Curriculum Master of Science in Health Information Technology Degree Curriculum Core courses: 8 courses Total Credit from Core Courses = 24 Core Courses Course Name HRS Pre-Req Choose MIS 525 or CIS 564: 1 MIS 525

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Business Intelligence and Decision Support Systems

Business Intelligence and Decision Support Systems Chapter 12 Business Intelligence and Decision Support Systems Information Technology For Management 7 th Edition Turban & Volonino Based on lecture slides by L. Beaubien, Providence College John Wiley

More information

Business Intelligence. A Presentation of the Current Lead Solutions and a Comparative Analysis of the Main Providers

Business Intelligence. A Presentation of the Current Lead Solutions and a Comparative Analysis of the Main Providers 60 Business Intelligence. A Presentation of the Current Lead Solutions and a Comparative Analysis of the Main Providers Business Intelligence. A Presentation of the Current Lead Solutions and a Comparative

More information