How To Mine Data From A Database


Introduction to KDD and data mining
Nguyen Hung Son
This presentation was prepared on the basis of the following public materials:
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, http://www.cs.sfu.ca
2. Gregory Piatetsky-Shapiro, KDnuggets, http://www.kdnuggets.com/data_mining_course/
KDD and DM 1

Lecture plan
  Motivations: why data mining?
  Definitions of data mining
  Examples of applications
  Data mining systems and functionality
  Methods in data mining
  Data mining: a KDD process
  Data mining issues

Motivation: large-scale databases
  Advanced methods in data extraction and data storage techniques
  Growth of many application areas
  More generated data:
    Bank, telecom, other business transactions...
    Scientific data: astronomy, biology, etc.
    Web, text, and e-commerce

Massive data sources
  Huge number of records
    10^6 to 10^12 in the case of databases about celestial objects (astronomy)
  Huge number of attributes (features, measurements, columns)
    Hundreds of variables in patient records corresponding to results of medical examinations

Motivation
  We are drowning in an ocean of data, but we need knowledge.
  PROBLEM: How to get useful information/knowledge from large databases?
  SOLUTION: Data warehouse + data mining


What Is Data Mining?
An iterative and interactive process of discovering novel, valid, useful, comprehensive and understandable patterns and models in MASSIVE data sources (databases).
  Novel: something we are not aware of
  Valid: generalises to the future
  Useful: some reaction is possible
  Understandable: leading to insight
  Iterative: many steps and many passes
  Interactive: a human is part of the system

What Is Data Mining?
Alternative names and their "inside stories":
  Data mining: a misnomer?
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
What is not data mining?
  (Deductive) query processing
  Expert systems or small ML/statistical programs
[Diagram: DATA -> DATA MINING -> PATTERNS]

Evolution of Database Technology
  1960s: Data collection, database creation, IMS and network DBMS
  1970s: Relational data model, relational DBMS implementation
  1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
  1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases

Big Data Examples
  Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
    storage and analysis are a big problem
  AT&T handles billions of calls per day
    so much data that it cannot all be stored; analysis has to be done on the fly, on streaming data
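A rough back-of-the-envelope check of the VLBI figure, as a sketch only. It assumes that all 16 telescopes record continuously for the full 25 days and that decimal units apply (1 Gigabit = 10^9 bits, 1 TB = 10^12 bytes); neither assumption is stated on the slide.

```python
# Rough estimate of the raw VLBI data volume quoted on the slide.
# Assumptions (not from the slide): continuous recording for the whole
# session and decimal units (1 Gbit = 1e9 bits, 1 TB = 1e12 bytes).
TELESCOPES = 16
RATE_BITS_PER_S = 1e9      # 1 Gigabit/second per telescope
SESSION_DAYS = 25

def session_volume_bytes(telescopes, rate_bits_per_s, days):
    """Total bytes produced by all telescopes over one session."""
    seconds = days * 24 * 60 * 60
    return telescopes * rate_bits_per_s * seconds / 8

total = session_volume_bytes(TELESCOPES, RATE_BITS_PER_S, SESSION_DAYS)
print(f"{total / 1e12:.0f} TB per session")
```

Under these assumptions a single session comes to several thousand terabytes, which is why storage and analysis are described as a big problem.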

Largest databases in 2003
  Commercial databases:
    Winter Corp. 2003 Survey: France Telecom has the largest decision-support DB, ~30 TB; AT&T ~26 TB
  Web:
    Alexa internet archive: 7 years of data, 500 TB
    Google searches 4+ billion pages, many hundreds of TB
    IBM WebFountain, 160 TB (2003)
    Internet Archive (www.archive.org), ~300 TB

5 million terabytes created in 2002
  UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data was created in 2002.
    www.sims.berkeley.edu/research/projects/how-much-info-2003/
  The US produces ~40% of new stored data worldwide

Data Growth Rate
  Twice as much information was created in 2002 as in 1999 (~30% annual growth rate)
  Other growth rate estimates are even higher
  Very little data will ever be looked at by a human
  Knowledge Discovery is NEEDED to make sense and use of data.
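The link between "twice as much in three years" and a yearly growth rate can be checked with a short calculation. This is a sketch that assumes constant compound growth between 1999 and 2002, which the slide does not state.

```python
# Annual growth rate implied by "twice as much in 2002 as in 1999".
# Assumption: constant compound growth over the three years.
def annual_growth_rate(ratio, years):
    """Solve (1 + r) ** years == ratio for r."""
    return ratio ** (1.0 / years) - 1.0

rate = annual_growth_rate(2.0, 2002 - 1999)
print(f"implied annual growth: {rate:.1%}")
```

The result is about 26% a year, in the same range as the rounded ~30% quoted on the slide.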


Data Mining Application Areas
  Science: astronomy, bioinformatics, drug discovery, ...
  Business: advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-commerce, targeted marketing, health care, ...
  Web: search engines, bots, ...
  Government: law enforcement, profiling tax cheaters, anti-terror(?)

Data Mining for Customer Modeling
  Customer tasks:
    attrition prediction
    targeted marketing: cross-sell, customer acquisition
    credit risk
    fraud detection
  Industries:
    banking, telecom, retail sales, ...

Customer Attrition: Case Study
  Situation: The attrition rate for mobile phone customers is around 25-30% a year!
  Task: Given customer information for the past N months, predict who is likely to attrite next month. Also, estimate customer value and determine what cost-effective offer to make to this customer.

Customer Attrition Results
  Verizon Wireless built a customer data warehouse
  Identified potential attriters
  Developed multiple, regional models
  Targeted customers with high propensity to accept the offer
  Reduced attrition rate from over 2%/month to under 1.5%/month (huge impact, with >30 M subscribers)
  (Reported in 2003)
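To see why the reduction counts as a huge impact, the quoted figures can be turned into a number of customers per month. The slide only gives bounds ("over 2%", "under 1.5%", ">30 M"), so the values used here are illustrative and the result should be read as a rough lower bound.

```python
# Monthly customers retained by the attrition reduction quoted on the slide.
# The inputs are the slide's boundary values, used here as an illustration.
SUBSCRIBERS = 30_000_000
RATE_BEFORE = 0.02      # 2% of subscribers leaving per month
RATE_AFTER = 0.015      # 1.5% of subscribers leaving per month

def retained_per_month(subscribers, rate_before, rate_after):
    """Customers who no longer leave each month after the reduction."""
    return subscribers * (rate_before - rate_after)

saved = retained_per_month(SUBSCRIBERS, RATE_BEFORE, RATE_AFTER)
print(f"about {saved:,.0f} customers retained per month")
```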

Assessing Credit Risk: Case Study
  Situation: A person applies for a loan
  Task: Should a bank approve the loan?
  Note: People who have the best credit don't need the loans, and people with the worst credit are not likely to repay. The bank's best customers are in the middle.

Credit Risk - Results
  Banks develop credit models using a variety of machine learning methods.
  Mortgage and credit card proliferation are the results of being able to successfully predict if a person is likely to default on a loan
  Widely deployed in many countries

Successful e-commerce Case Study
  A person buys a book (product) at Amazon.com.
  Task: Recommend other books (products) this person is likely to buy
  Amazon does clustering based on books bought:
    customers who bought "Advances in Knowledge Discovery and Data Mining" also bought "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations"
  The recommendation program is quite successful

Unsuccessful e-commerce Case Study (KDD-Cup 2000)
  Data: clickstream and purchase data from Gazelle.com, a legwear and legcare e-tailer
  Q: Characterize visitors who spend more than $12 on an average order at the site
  Dataset of 3,465 purchases, 1,831 customers
  Very interesting analysis by Cup participants
    thousands of hours - $X,000,000 (millions) of consulting
  Total sales -- $Y,000
  Obituary: Gazelle.com out of business, Aug 2000

Genomic Microarrays Case Study
Given microarray data for a number of samples (patients), can we
  Accurately diagnose the disease?
  Predict the outcome for a given treatment?
  Recommend the best treatment?

Example: ALL/AML data
  38 training cases, 34 test cases, ~7,000 genes
  2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
  Use the training data to build a diagnostic model
  [Figure: ALL vs AML]
  Results on test data: 33/34 correct; the 1 error may be a mislabeled case

Security and Fraud Detection - Case Study
  Credit card fraud detection
  Detection of money laundering
    FAIS (US Treasury)
  Securities fraud
    NASDAQ KDD system
  Phone fraud
    AT&T, Bell Atlantic, British Telecom/MCI
  Bio-terrorism detection at Salt Lake Olympics 2002


Architecture of a Typical Data Mining System
[Diagram; components listed in the order they appear:]
  Graphical user interface
  Pattern evaluation
  Data mining engine
  Database or data warehouse server
  Data cleaning & data integration; filtering
  Knowledge base
  Databases; data warehouse

Data Mining: On What Kind of Data?
  Relational databases
  Data warehouses
  Transactional databases
  Advanced DB and information repositories
    Object-oriented and object-relational databases
    Spatial databases
    Time-series data and temporal data
    Text databases and multimedia databases
    Heterogeneous and legacy databases
    WWW

Data Mining Functionalities (1)
  Concept description: characterization and discrimination
    Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
  Association (correlation and causality)
    Multi-dimensional vs. single-dimensional association
    age(x, "20..29") ^ income(x, "20..29K") => buys(x, "PC") [support = 2%, confidence = 60%]
    contains(t, "computer") => contains(t, "software") [1%, 75%]
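The bracketed support and confidence values attached to an association rule can be computed directly from a transaction list. The sketch below uses a small invented set of transactions (not data from the lecture) and the usual definitions: support is the fraction of transactions containing both sides of the rule, and confidence is that fraction divided by the support of the left-hand side.

```python
# Support and confidence of an association rule A => B over a toy
# transaction list. The transactions are invented for illustration.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software"},
    {"printer", "paper"},
    {"paper"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"computer => software "
      f"[support = {rule_support:.1%}, confidence = {rule_confidence:.0%}]")
```

On this toy data, three of the four transactions that contain "computer" also contain "software", which is what the confidence value expresses.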

Data Mining Functionalities (2)
  Classification and prediction
    Finding models (functions) that describe and distinguish classes or concepts for future prediction
    E.g., classify countries based on climate, or classify cars based on gas mileage
    Presentation: decision tree, classification rule, neural network
    Prediction: predict some unknown or missing numerical values
  Cluster analysis
    Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
    Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity

Data Mining Functionalities (3)
  Outlier analysis
    Outlier: a data object that does not comply with the general behavior of the data
    It can be considered as noise or an exception, but is quite useful in fraud detection and rare events analysis
  Trend and evolution analysis
    Trend and deviation: regression analysis
    Sequential pattern mining, periodicity analysis
    Similarity-based analysis
  Other pattern-directed or statistical analyses
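One simple way to flag objects that do not comply with the general behavior of the data is sketched here. The slide does not name a method; this example swaps in the modified z-score based on the median absolute deviation (MAD), with the commonly used constant 0.6745 and cut-off 3.5. The transaction amounts are invented.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []   # no spread to measure against
    return [v for v in values
            if abs(0.6745 * (v - median) / mad) > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
print(mad_outliers(amounts))
```

The median-based score is used instead of the mean and standard deviation because a single extreme value inflates both of those and can hide itself in a small sample.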


Components of a Data Mining Algorithm
  Knowledge representation model
  Evaluation criteria
  Search strategy

Knowledge representation
Using a logical language to describe mined patterns, e.g.:
  Logical formulas
  Decision tree
  Neural networks (!)

Search strategy
  Parameter searching
  Model searching

Are All the Discovered Patterns Interesting?
  A data mining system/query may generate thousands of patterns; not all of them are interesting.
    Suggested approach: human-centered, query-based, focused mining
  Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
  Objective vs. subjective interestingness measures:
    Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
    Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Can We Find All and Only Interesting Patterns?
  Find all the interesting patterns: completeness
    Can a data mining system find all the interesting patterns?
    Association vs. classification vs. clustering
  Search for only interesting patterns: optimization
    Can a data mining system find only the interesting patterns?
    Approaches:
      First generate all the patterns and then filter out the uninteresting ones.
      Generate only the interesting patterns: mining query optimization
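The first approach (generate everything, then filter) can be sketched for the simplest case of rules between two single items, using support and confidence as objective interestingness measures. The baskets and both thresholds are hypothetical values chosen for illustration.

```python
from itertools import permutations

# Invented market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread"},
]

def all_rules(baskets):
    """Generate every rule a => b between two distinct single items,
    as tuples (a, b, support, confidence)."""
    items = sorted(set().union(*baskets))
    n = len(baskets)
    for a, b in permutations(items, 2):
        both = sum(1 for t in baskets if a in t and b in t)
        has_a = sum(1 for t in baskets if a in t)
        yield a, b, both / n, both / has_a

def interesting(rules, min_support, min_confidence):
    """Keep only the rules that pass both thresholds."""
    return [r for r in rules
            if r[2] >= min_support and r[3] >= min_confidence]

kept = interesting(all_rules(baskets), min_support=0.4, min_confidence=0.7)
for a, b, s, c in kept:
    print(f"{a} => {b} [support = {s:.0%}, confidence = {c:.0%}]")
```

Enumerating every candidate grows quickly with the number of items, which is the motivation for the second approach of generating only the interesting patterns.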

Major Data Mining Methods
  Classification: predicting an item class
  Clustering: finding clusters in data
  Associations: e.g. A & B & C occur frequently
  Visualization: to facilitate human discovery
  Summarization: describing a group
  Deviation detection: finding changes
  Estimation: predicting a continuous value
  Link analysis: finding relationships

Related techniques
  Neural networks
  Fuzzy sets
  Rough sets
  Time series analysis
  Bayesian networks
  Decision trees
  Evolutionary programming and GA
  Markov modelling

Example
[Figure: data points plotted on axes Debt and Income]

Linear classification
[Figure: axes Debt and Income; region labelled "no loan"]
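A linear classifier separates the two classes with a straight line in the income/debt plane. The slide does not say how the line is found; the sketch below uses the perceptron learning rule as one possible choice. The applicants, their units, and the +1/-1 label coding (loan / no loan) are all hypothetical.

```python
# Hypothetical applicants as (income, debt) pairs in arbitrary units.
points = [(5, 1), (6, 2), (7, 1), (2, 3), (1, 2), (3, 4)]
labels = [1, 1, 1, -1, -1, -1]   # +1 = loan, -1 = no loan

def train_perceptron(points, labels, epochs=20):
    """Learn weights w and bias b so that sign(w . x + b) matches the labels."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:   # misclassified point
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:       # every training point is on the right side
            break
    return w, b

def predict(point, w, b):
    """Return +1 (loan) or -1 (no loan) for an (income, debt) pair."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else -1

w, b = train_perceptron(points, labels)
print(w, b, predict((4, 1), w, b))
```

The perceptron rule only settles on a separating line when the training data are linearly separable, as they are in this toy set.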

Linear regression
[Figure: regression line fitted to the data; axes Debt and Income]
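A regression line can be fitted by ordinary least squares, sketched here in plain Python. The (income, debt) values are invented, and treating debt as the dependent variable is an assumption made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: income on the x axis, debt on the y axis.
income = [1, 2, 3, 4, 5]
debt = [2, 4, 5, 4, 5]
slope, intercept = fit_line(income, debt)
print(f"debt = {slope:.2f} * income + {intercept:.2f}")
```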

Clustering
[Figure: clusters of data points; axes Debt and Income]
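As an illustration of clustering unlabeled points, here is a minimal k-means sketch. The slide does not name an algorithm, so k-means is a swapped-in example; seeding with the first k points, the fixed iteration count, and the data are all assumptions.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its members."""
    centroids = [points[i] for i in range(k)]   # assumption: first k points as seeds
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, labels

# Hypothetical (income, debt) points forming two visible groups.
points = [(1, 1), (8, 8), (1, 2), (9, 8), (2, 1), (8, 9)]
centroids, labels = kmeans(points, k=2)
print(labels, centroids)
```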

Single threshold (cut)
[Figure: axes Debt and Income; a threshold t on Income separates the "No Loan" and "Loan" regions]
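A single cut on one attribute can be chosen by trying each midpoint between neighbouring values and counting misclassifications. This is a sketch under the assumption that the rule has the form "income above the threshold means loan"; the incomes and decisions are invented.

```python
def best_threshold(incomes, approved):
    """Return the cut on income with the fewest misclassified cases."""
    values = sorted(set(incomes))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        # Predicted "loan" when income > t; count disagreements with the data.
        return sum((x > t) != ok for x, ok in zip(incomes, approved))

    return min(candidates, key=errors)

# Hypothetical past decisions: True = loan, False = no loan.
incomes = [1, 2, 3, 5, 6, 7]
approved = [False, False, False, True, True, True]
t = best_threshold(incomes, approved)
print(f"loan if income > {t}")
```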

Nonlinear classifier
[Figure: axes Debt and Income; regions labelled "No Loan" and "Loan"]

Nearest neighbour
[Figure: axes Debt and Income; regions labelled "No Loan" and "Loan"]
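A nearest-neighbour classifier gives a new case the label of the closest known case. The sketch below uses Euclidean distance on hypothetical (income, debt) pairs; both the distance measure and the data are assumptions, since the slide only shows a picture.

```python
import math

# Hypothetical labelled cases: ((income, debt), decision).
training = [
    ((5, 1), "loan"),
    ((6, 2), "loan"),
    ((7, 1), "loan"),
    ((2, 3), "no loan"),
    ((1, 2), "no loan"),
    ((3, 4), "no loan"),
]

def nearest_neighbour(query, training):
    """Return the label of the training point closest to the query."""
    _, label = min(training, key=lambda item: math.dist(query, item[0]))
    return label

print(nearest_neighbour((6, 1), training))
print(nearest_neighbour((2, 2), training))
```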


KDD

Data Mining: a KDD process

Steps of a KDD Process
1. Learning the application domain: relevant prior knowledge and goals of the application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take 60% of the effort!)
4. Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
5. Choosing the functions of data mining: summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge
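Steps 2, 3, 4, 7 and 8 can be pictured as a chain of small functions. This is a toy sketch only: the records, the field names, the "young" cut-off at 30 and the confidence threshold are all hypothetical, and each function stands in for what would be a substantial task in practice.

```python
# Invented customer records; one has a missing income.
records = [
    {"customer": "a", "age": 25, "income": 30_000, "bought_pc": True},
    {"customer": "b", "age": 27, "income": None,   "bought_pc": True},
    {"customer": "c", "age": 45, "income": 80_000, "bought_pc": False},
    {"customer": "d", "age": 22, "income": 28_000, "bought_pc": True},
    {"customer": "e", "age": 51, "income": 75_000, "bought_pc": False},
    {"customer": "f", "age": 29, "income": 35_000, "bought_pc": False},
]

def select(records):
    """Step 2: keep only the attributes relevant to the task."""
    return [{k: r[k] for k in ("age", "income", "bought_pc")} for r in records]

def clean(rows):
    """Step 3: drop rows with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows):
    """Step 4: reduce age to a single boolean feature."""
    return [{"young": r["age"] < 30, "bought_pc": r["bought_pc"]} for r in rows]

def mine(rows):
    """Step 7: measure the candidate rule young => bought_pc."""
    young = [r for r in rows if r["young"]]
    buyers = sum(r["bought_pc"] for r in young)
    return {"rule": "young => bought_pc",
            "support": buyers / len(rows),
            "confidence": buyers / len(young)}

def evaluate(pattern, min_confidence=0.6):
    """Step 8: keep the pattern only if it is strong enough."""
    return pattern if pattern["confidence"] >= min_confidence else None

result = evaluate(mine(transform(clean(select(records)))))
print(result)
```

Steps 1, 5, 6 and 9 are decisions made by people and so appear here only implicitly, in the choice of attributes, rule and threshold.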

The Goals of Data Mining
  Prediction: to foresee the possible future situation on the basis of previous events.
    Given sales records from previous years, can we predict what amount of goods we need to have in stock for the forthcoming season?
  Description: what is the reason that some events occur?
    What are the reasons that the cars of one producer sell better than equivalent products of other producers?
  Verification: we think that some relationship between entities occurs.
    Can we check if (and how) the threat of cancer is related to environmental conditions?
  Exception detection: there may be situations (records) in our databases that correspond to something unusual.
    Is it possible to identify credit card transactions that are in fact frauds?

Classification of Data Mining Systems
  General functionality
    Descriptive data mining
    Predictive data mining
  Different views, different classifications
    Kinds of databases to be mined
    Kinds of knowledge to be discovered
    Kinds of techniques utilized
    Kinds of applications adapted

A Multi-Dimensional View of Data Mining Classification
  Databases to be mined
    Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
  Knowledge to be mined
    Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
    Multiple/integrated functions and mining at multiple levels
  Techniques utilized
    Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
  Applications adapted
    Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.


Data Mining and Business Intelligence
[Pyramid diagram; the potential to support business decisions increases toward the top. Layers from top to bottom:]
  Making decisions
  Data presentation: visualization techniques
  Data mining: information discovery
  Data exploration: statistical analysis, querying and reporting
  Data warehouses / data marts: OLAP, MDA
  Data sources: paper, files, information providers, database systems, OLTP
[Roles shown alongside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA]

Major Issues in Data Mining (1)
  Mining methodology and user interaction
    Mining different kinds of knowledge in databases
    Interactive mining of knowledge at multiple levels of abstraction
    Incorporation of background knowledge
    Data mining query languages and ad-hoc data mining
    Expression and visualization of data mining results
    Handling noise and incomplete data
    Pattern evaluation: the interestingness problem
  Performance and scalability
    Efficiency and scalability of data mining algorithms
    Parallel, distributed and incremental mining methods

Major Issues in Data Mining (2)
  Issues relating to the diversity of data types
    Handling relational and complex types of data
    Mining information from heterogeneous databases and global information systems (WWW)
  Issues related to applications and social impacts
    Application of discovered knowledge
      Domain-specific data mining tools
      Intelligent query answering
      Process control and decision making
    Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
    Protection of data security, integrity, and privacy

References
  Data Mining: Concepts and Techniques. J. Han and M. Kamber. Morgan Kaufmann, 2000.
  Knowledge Discovery in Databases. G. Piatetsky-Shapiro and W. J. Frawley. AAAI/MIT Press, 1991.
  Data Mining Techniques: For Marketing, Sales, and Customer Support. M. Berry, G. Linoff. (Wiley)
  Advances in Knowledge Discovery and Data Mining. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy. AAAI/MIT Press, 1996.
  Rough Sets in Knowledge Discovery I & II. L. Polkowski, A. Skowron. (Springer)