Introduction to KDD and data mining
Nguyen Hung Son

This presentation was prepared on the basis of the following public materials:
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, http://www.cs.sfu.ca
2. Gregory Piatetsky-Shapiro, KDnuggets, http://www.kdnuggets.com/data_mining_course/

KDD and DM

Lecture plan
- Motivations: why data mining?
- Definitions of data mining
- Examples of applications
- Data mining systems and functionality
- Methods in data mining
- Data mining: a KDD process
- Data mining issues

Motivation: large scale databases
- Advanced methods in data extraction and data storing techniques
- Growth of many application areas
- More generated data:
  - Bank, telecom, other business transactions...
  - Scientific data: astronomy, biology, etc.
  - Web, text, and e-commerce

Massive data sources
- Huge number of records
  - 10^6 to 10^12 in the case of databases about celestial objects (astronomy)
- Huge number of attributes (features, measurements, columns)
  - Hundreds of variables in patient records corresponding to results of medical examinations

Motivation
We are drowning in an ocean of data, but what we need is knowledge.
PROBLEM: How to get useful information/knowledge from large databases?
SOLUTION: Data warehouse + data mining

What Is Data Mining?
An iterative and interactive process of discovering novel, valid, useful, comprehensive, and understandable patterns and models in MASSIVE data sources (databases).
- Novel: something we are not aware of
- Valid: generalises to the future
- Useful: some reaction is possible
- Understandable: leading to insight
- Iterative: many steps and many passes
- Interactive: a human is part of the system

What Is Data Mining?
Alternative names and their "inside stories":
- Data mining: a misnomer?
- Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
What is not data mining?
- (Deductive) query processing
- Expert systems or small ML/statistical programs
[Diagram: DATA -> DATA MINING -> PATTERNS]

Evolution of Database Technology
- 1960s: Data collection, database creation, IMS and network DBMS
- 1970s: Relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases

Big Data Examples
- Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 gigabit/second of astronomical data over a 25-day observation session
  - storage and analysis are a big problem
- AT&T handles billions of calls per day
  - so much data that it cannot all be stored; analysis has to be done on the fly, on streaming data

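To get a feel for the VLBI figures, a back-of-the-envelope calculation helps. The sketch below assumes every telescope records continuously at the full quoted rate for the whole session and that "giga" is decimal (10^9); both are simplifying assumptions, not statements from the slide.

```python
# Rough data volume for one VLBI observation session.
# Assumptions (not from the slide): continuous recording at the full
# rate, decimal prefixes (1 gigabit = 1e9 bits).
TELESCOPES = 16
RATE_BITS_PER_SECOND = 1e9
SESSION_DAYS = 25


def session_volume_bytes(telescopes, rate_bps, days):
    """Total bytes produced by all telescopes over the session."""
    seconds = days * 24 * 60 * 60
    return telescopes * rate_bps * seconds / 8


volume = session_volume_bytes(TELESCOPES, RATE_BITS_PER_SECOND, SESSION_DAYS)
print(f"{volume / 1e15:.2f} PB per session")  # prints "4.32 PB per session"
```

Under these assumptions a single session yields a few petabytes, which is well above the database sizes reported for 2003.
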
Largest databases in 2003
Commercial databases:
- Winter Corp. 2003 Survey: France Telecom has the largest decision-support DB, ~30 TB; AT&T ~26 TB
Web:
- Alexa internet archive: 7 years of data, 500 TB
- Google searches 4+ billion pages, many hundreds of TB
- IBM WebFountain, 160 TB (2003)
- Internet Archive (www.archive.org), ~300 TB

5 million terabytes created in 2002
- UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data was created in 2002.
  www.sims.berkeley.edu/research/projects/how-much-info-2003/
- The US produces ~40% of new stored data worldwide

Data Growth Rate
- Twice as much information was created in 2002 as in 1999 (~30% growth rate)
- Other growth rate estimates are even higher
- Very little data will ever be looked at by a human
- Knowledge discovery is NEEDED to make sense and use of data.

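The link between "twice as much in three years" and the quoted annual growth rate can be checked with a compound-growth calculation. This is only a sketch; the function name is illustrative and the three-year span is taken from the 1999 and 2002 dates above.

```python
def annual_growth_rate(ratio, years):
    """Compound annual growth rate implied by an overall growth ratio."""
    return ratio ** (1.0 / years) - 1.0


# Doubling between 1999 and 2002.
rate = annual_growth_rate(2.0, 2002 - 1999)
print(f"{rate:.1%}")  # prints "26.0%"
```

Doubling over three years corresponds to roughly 26% compounded per year, which is in line with the rounded ~30% figure.
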
Data Mining Application Areas
- Science: astronomy, bioinformatics, drug discovery, ...
- Business: advertising, CRM (customer relationship management), investments, manufacturing, sports/entertainment, telecom, e-commerce, targeted marketing, health care, ...
- Web: search engines, bots, ...
- Government: law enforcement, profiling tax cheaters, anti-terror(?)

Data Mining for Customer Modeling
Customer tasks:
- attrition prediction
- targeted marketing: cross-sell, customer acquisition
- credit risk
- fraud detection
Industries:
- banking, telecom, retail sales, ...

Customer Attrition: Case Study
Situation: The attrition rate for mobile phone customers is around 25-30% a year!
Task: Given customer information for the past N months, predict who is likely to attrite next month. Also, estimate the customer's value and the most cost-effective offer to make to this customer.

Customer Attrition Results
- Verizon Wireless built a customer data warehouse
- Identified potential attriters
- Developed multiple, regional models
- Targeted customers with high propensity to accept the offer
- Reduced attrition rate from over 2%/month to under 1.5%/month (huge impact, with >30 M subscribers)
(Reported in 2003)

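The size of that impact can be estimated from the quoted numbers. The sketch below treats the bounds on the slide (2%, 1.5%, 30 M) as exact values, which is an assumption; the real figures are only known to lie beyond those bounds.

```python
# Assumed point values taken from the bounds quoted above.
SUBSCRIBERS = 30_000_000
RATE_BEFORE = 0.02    # monthly attrition before the programme
RATE_AFTER = 0.015    # monthly attrition after the programme


def monthly_losses(subscribers, monthly_rate):
    """Expected number of customers lost in one month."""
    return subscribers * monthly_rate


def annualized(monthly_rate):
    """Yearly attrition implied by a constant monthly rate."""
    return 1 - (1 - monthly_rate) ** 12


retained = (monthly_losses(SUBSCRIBERS, RATE_BEFORE)
            - monthly_losses(SUBSCRIBERS, RATE_AFTER))
print(f"customers kept per month: {retained:,.0f}")
print(f"annual attrition before: {annualized(RATE_BEFORE):.1%}")
print(f"annual attrition after:  {annualized(RATE_AFTER):.1%}")
```

With these inputs, half a percentage point per month amounts to roughly 150,000 customers retained each month.
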
Assessing Credit Risk: Case Study
Situation: A person applies for a loan
Task: Should a bank approve the loan?
Note: People who have the best credit don't need the loans, and people with the worst credit are not likely to repay. A bank's best customers are in the middle.

Credit Risk - Results
- Banks develop credit models using a variety of machine learning methods.
- The proliferation of mortgages and credit cards is a result of being able to successfully predict whether a person is likely to default on a loan
- Widely deployed in many countries

Successful e-commerce Case Study
- A person buys a book (product) at Amazon.com.
- Task: Recommend other books (products) this person is likely to buy
- Amazon does clustering based on books bought: customers who bought "Advances in Knowledge Discovery and Data Mining" also bought "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations"
- The recommendation program is quite successful

Unsuccessful e-commerce case study (KDD-Cup 2000)
- Data: clickstream and purchase data from Gazelle.com, a legwear and legcare e-tailer
- Q: Characterize visitors who spend more than $12 on an average order at the site
- Dataset of 3,465 purchases, 1,831 customers
- Very interesting analysis by Cup participants: thousands of hours, $X,000,000 (millions) of consulting
- Total sales: $Y,000
- Obituary: Gazelle.com out of business, Aug 2000

Genomic Microarrays Case Study
Given microarray data for a number of samples (patients), can we
- accurately diagnose the disease?
- predict the outcome for a given treatment?
- recommend the best treatment?

Example: ALL/AML data
- 38 training cases, 34 test cases, ~7,000 genes
- 2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
- Use the training data to build a diagnostic model
[Figure: ALL vs. AML]
- Results on test data: 33/34 correct; the 1 error may be a mislabeled case

Security and Fraud Detection - Case Study
- Credit card fraud detection
- Detection of money laundering: FAIS (US Treasury)
- Securities fraud: NASDAQ KDD system
- Phone fraud: AT&T, Bell Atlantic, British Telecom/MCI
- Bio-terrorism detection at the Salt Lake Olympics 2002

Architecture of a Typical Data Mining System
[Figure: layered architecture; components as labelled, from top to bottom]
- Graphical user interface
- Pattern evaluation
- Data mining engine
- Database or data warehouse server
- Data cleaning & data integration; filtering
- Knowledge base
- Databases; data warehouse

Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - WWW

Data Mining Functionalities (1)
- Concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Association (correlation and causality)
  - Multi-dimensional vs. single-dimensional association
  - age(x, "20..29") ^ income(x, "20..29K") => buys(x, "PC") [support = 2%, confidence = 60%]
  - contains(t, "computer") => contains(t, "software") [1%, 75%]

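The bracketed numbers attached to the rules above are support and confidence. As a minimal sketch of how they are computed, the example below uses a small made-up transaction list (the items and counts are illustrative only and do not reproduce the 1% and 75% figures).

```python
# Hypothetical market-basket data; each transaction is a set of items.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "scanner"},
    {"software"},
    {"printer", "paper"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of its left-hand side."""
    joint = support(set(antecedent) | set(consequent), transactions)
    base = support(antecedent, transactions)
    return joint / base if base else 0.0


# Rule: contains(t, "computer") => contains(t, "software")
s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"support = {s:.0%}, confidence = {c:.0%}")  # support = 40%, confidence = 67%
```

Support measures how often the rule applies at all, while confidence measures how often the right-hand side holds when the left-hand side does.
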
Data Mining Functionalities (2)
- Classification and prediction
  - Finding models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Presentation: decision tree, classification rule, neural network
  - Prediction: predict some unknown or missing numerical values
- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Clustering based on the principle: maximizing the intra-class similarity and minimizing the inter-class similarity

Data Mining Functionalities (3)
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - It can be considered as noise or an exception, but is quite useful in fraud detection and rare events analysis
- Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses

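One simple way to flag objects that do not comply with the general behavior of the data is a robust z-score based on the median absolute deviation (MAD). This technique is not named on the slide; it is shown here only as one possible sketch, with made-up transaction amounts and the commonly used constants 0.6745 and 3.5.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median: nothing can be ranked
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical card transaction amounts; one is far from the rest.
amounts = [12, 15, 14, 13, 16, 15, 14, 250]
print(mad_outliers(amounts))  # prints [250]
```

Using the median instead of the mean keeps the single extreme value from masking itself, which matters for small samples like this one.
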
Components of a Data Mining Algorithm
- Knowledge representation model
- Evaluation criteria
- Search strategy

Knowledge representation
Using a logical language to describe mined patterns, e.g.:
- Logical formulas
- Decision trees
- Neural networks (!)

Search strategy
- Parameter searching
- Model searching

Are All the Discovered Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures:
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Can We Find All and Only Interesting Patterns?
- Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns?
  - Association vs. classification vs. clustering
- Search for only interesting patterns: optimization
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones.
    - Generate only the interesting patterns: mining query optimization

Major Data Mining Methods
- Classification: predicting an item class
- Clustering: finding clusters in data
- Associations: e.g., A & B & C occur frequently
- Visualization: to facilitate human discovery
- Summarization: describing a group
- Deviation detection: finding changes
- Estimation: predicting a continuous value
- Link analysis: finding relationships

Related techniques
- Neural networks
- Fuzzy sets
- Rough sets
- Time series analysis
- Bayesian networks
- Decision trees
- Evolutionary programming and GA
- Markov modelling

Example
[Figure: data points plotted on Debt and Income axes]

Linear classification
[Figure: Debt/Income plot with a linear class boundary; region labelled No Loan]

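A linear classifier separates the two classes with a straight line in the Debt/Income plane. The sketch below trains a perceptron, which is one of several ways to find such a line and is not necessarily the method behind the figure. The applicants are made up, with income and debt in arbitrary units.

```python
# Hypothetical applicants as (income, debt); +1 = loan, -1 = no loan.
points = [(6, 1), (7, 2), (8, 1), (9, 3), (2, 3), (3, 4), (1, 2), (4, 5)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]


def train_perceptron(points, labels, epochs=1000, lr=1.0):
    """Find weights w and bias b so that sign(w.x + b) matches the labels."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            score = w[0] * x1 + w[1] * x2 + b
            if y * score <= 0:          # misclassified: move the line
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
                mistakes += 1
        if mistakes == 0:               # every point is on the right side
            break
    return w, b


def predict(w, b, point):
    """Classify a point by the side of the line it falls on."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else -1


w, b = train_perceptron(points, labels)
print("boundary:", w, b)
print("training accuracy:",
      sum(predict(w, b, p) == y for p, y in zip(points, labels)) / len(points))
```

The perceptron only stops early when the data are linearly separable, as they are in this toy set; on overlapping classes it would run for all epochs without settling.
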
Linear regression
[Figure: Debt/Income plot with a regression line]

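A regression line like the one in the figure can be fitted by ordinary least squares. The sketch below uses the closed-form solution for a single predictor; the income and debt values are illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical (income, debt) observations in arbitrary units.
incomes = [1, 2, 3, 4, 5]
debts = [2, 4, 5, 4, 5]

slope, intercept = fit_line(incomes, debts)
print(f"debt = {slope:.2f} * income + {intercept:.2f}")  # debt = 0.60 * income + 2.20
```

Unlike the classifiers in the neighbouring slides, the output here is a continuous value (estimated debt) rather than a class label.
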
Clustering
[Figure: Debt/Income plot]

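Clustering groups the points without using any class labels. The sketch below uses k-means, which is one common choice and is not necessarily the method behind the figure. The points and the two starting centroids are made up.

```python
def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def mean(cluster):
    """Centroid of a non-empty list of 2-D points."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical (income, debt) points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
centroids, clusters = kmeans(points, [points[0], points[3]])
print(clusters)
```

The result depends on the starting centroids; here they are fixed so that the run is reproducible.
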
Single threshold (cut)
[Figure: Debt/Income plot split into No Loan and Loan regions by a threshold t on Income]

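A single-threshold classifier looks at one attribute only and picks the cut t that misclassifies the fewest training examples. The sketch below searches the midpoints between consecutive income values; the data are made up, and "loan if income > t" is an assumed reading of the figure.

```python
def best_threshold(values, labels):
    """Return (t, errors) for the rule 'predict True when value > t'."""
    candidates = sorted(set(values))
    best_t, best_err = None, len(values) + 1
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        errors = sum((v > t) != y for v, y in zip(values, labels))
        if errors < best_err:           # keep the first cut with fewest errors
            best_t, best_err = t, errors
    return best_t, best_err


# Hypothetical incomes and loan decisions (True = loan granted).
incomes = [2, 3, 4, 5, 6, 7, 8]
granted = [False, False, False, True, False, True, True]

t, errors = best_threshold(incomes, granted)
print(t, errors)  # prints "4.5 1"
```

One training example remains on the wrong side of the cut, which illustrates the limits of such a simple model.
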
Nonlinear classifier
[Figure: Debt/Income plot with No Loan and Loan regions]

Nearest neighbour
[Figure: Debt/Income plot with No Loan and Loan regions]

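A nearest-neighbour classifier stores the training examples and labels a new point with the class of the closest stored example. The sketch below uses Euclidean distance on made-up (income, debt) pairs; in practice the attributes would usually be rescaled first.

```python
import math

# Hypothetical labelled applicants: ((income, debt), decision).
examples = [
    ((6, 1), "loan"), ((7, 2), "loan"), ((8, 1), "loan"), ((9, 3), "loan"),
    ((2, 3), "no loan"), ((3, 4), "no loan"), ((1, 2), "no loan"), ((4, 5), "no loan"),
]


def nearest_neighbour(query, examples):
    """Return the label of the stored example closest to the query point."""
    _, label = min(examples, key=lambda ex: math.dist(query, ex[0]))
    return label


print(nearest_neighbour((7, 2.5), examples))  # prints "loan"
print(nearest_neighbour((2, 2.5), examples))  # prints "no loan"
```

There is no training step; all the work happens at query time, and the decision boundary can take any shape the data suggest.
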
KDD

Data Mining: a KDD process

Steps of a KDD Process
1. Learning the application domain: relevant prior knowledge and goals of the application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take 60% of the effort!)
4. Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
5. Choosing the functions of data mining: summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge

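The data-handling steps above can be pictured as a chain of small functions. The sketch below is a toy illustration only: the records, field names, and the 50% cut-off are all made up, and each function stands in for what is a substantial task in a real project.

```python
from collections import Counter

# Hypothetical raw customer records; one has a missing age.
raw = [
    {"age": 24, "income": 28, "buys_pc": True},
    {"age": 27, "income": 25, "buys_pc": True},
    {"age": None, "income": 40, "buys_pc": False},
    {"age": 45, "income": 60, "buys_pc": False},
    {"age": 22, "income": 21, "buys_pc": False},
    {"age": 51, "income": 75, "buys_pc": False},
]


def select(rows, fields):
    """Step 2: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in rows]


def clean(rows):
    """Step 3: drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def transform(rows):
    """Step 4: replace the exact age by its decade."""
    return [{"age_band": f"{r['age'] // 10 * 10}s", "buys_pc": r["buys_pc"]}
            for r in rows]


def mine(rows):
    """Step 7: compute the purchase rate for each age band."""
    totals, buyers = Counter(), Counter()
    for r in rows:
        totals[r["age_band"]] += 1
        buyers[r["age_band"]] += r["buys_pc"]
    return {band: buyers[band] / totals[band] for band in totals}


def evaluate(patterns, min_rate=0.5):
    """Step 8: keep only the patterns above the chosen threshold."""
    return {band: rate for band, rate in patterns.items() if rate >= min_rate}


patterns = evaluate(mine(transform(clean(select(raw, ["age", "buys_pc"])))))
print(patterns)
```

Steps 1, 5, 6, and 9 are decisions made by the analyst rather than code, which is why they do not appear as functions here.
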
The goals of Data Mining
- Prediction: to foresee a possible future situation on the basis of previous events.
  - Given sales records from previous years, can we predict what amount of goods we need to have in stock for the forthcoming season?
- Description: what is the reason that some events occur?
  - What are the reasons that the cars of one producer sell better than comparable products of other producers?
- Verification: we think that some relationship between entities occurs.
  - Can we check if (and how) the threat of cancer is related to environmental conditions?
- Exception detection: there may be situations (records) in our databases that correspond to something unusual.
  - Is it possible to identify credit card transactions that are in fact frauds?

Classification of Data Mining Systems
- General functionality
  - Descriptive data mining
  - Predictive data mining
- Different views, different classifications
  - Kinds of databases to be mined
  - Kinds of knowledge to be discovered
  - Kinds of techniques utilized
  - Kinds of applications adapted

A Multi-Dimensional View of Data Mining Classification
- Databases to be mined
  - Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
- Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

Data Mining and Business Intelligence
[Figure: pyramid; the potential to support business decisions increases towards the top. Levels from top to bottom:]
- Making decisions
- Data presentation: visualization techniques
- Data mining: information discovery
- Data exploration: statistical analysis, querying and reporting
- Data warehouses / data marts: OLAP, MDA
- Data sources: paper, files, information providers, database systems, OLTP
Roles shown alongside the levels, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Major Issues in Data Mining (1)
- Mining methodology and user interaction
  - Mining different kinds of knowledge in databases
  - Interactive mining of knowledge at multiple levels of abstraction
  - Incorporation of background knowledge
  - Data mining query languages and ad-hoc data mining
  - Expression and visualization of data mining results
  - Handling noise and incomplete data
  - Pattern evaluation: the interestingness problem
- Performance and scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed and incremental mining methods

Major Issues in Data Mining (2)
- Issues relating to the diversity of data types
  - Handling relational and complex types of data
  - Mining information from heterogeneous databases and global information systems (WWW)
- Issues related to applications and social impacts
  - Application of discovered knowledge
    - Domain-specific data mining tools
    - Intelligent query answering
    - Process control and decision making
  - Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
  - Protection of data security, integrity, and privacy

References
- Data Mining: Concepts and Techniques. J. Han and M. Kamber. Morgan Kaufmann, 2000.
- Knowledge Discovery in Databases. G. Piatetsky-Shapiro and W. J. Frawley. AAAI/MIT Press, 1991.
- Data Mining Techniques: For Marketing, Sales, and Customer Support. M. Berry and G. Linoff. Wiley.
- Advances in Knowledge Discovery and Data Mining. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. AAAI/MIT Press, 1996.
- Rough Sets in Knowledge Discovery I & II. L. Polkowski and A. Skowron. Springer.