Database and Data Mining Group of Politecnico di Torino
Data Warehouse: Introduction
Copyright. All rights reserved.

Decision support systems
- Huge operational databases are available in most companies
  - these databases may provide a large wealth of useful information
- Decision support systems provide means for
  - in-depth analysis of a company's business
  - faster and better decisions

Strategic decision support
- Demand evolution analysis and forecast
- Critical business areas identification
- Budgeting and management transparency
  - reporting, practices against frauds and money laundering
- Identification and implementation of winning strategies
  - cost reduction and profit increase

Business Intelligence
- BI supports strategic decision making in companies
- Objective: transforming company data into actionable information
  - at different detail levels
  - for analysis applications
- Users may have heterogeneous needs
- BI requires an appropriate hardware and software infrastructure

Applications
- Manufacturing companies: order management, client support
- Distribution: user profile, stock management
- Financial services: buyer behavior (credit cards)
- Insurance: claim analysis, fraud detection
- Telecommunication: call analysis, churning, fraud detection
- Public service: usage analysis
- Health: service analysis and evaluation
- ... and many more ...

Example
Bank clients with a loan
- bad clients: owing periodic payments to the bank after the due date
- good clients: respecting the periodic payment due dates
[Figure: scatter plot of bank clients, Loan Amount vs. Income]
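The loan example is resolved in these slides by a single threshold on income ("If Income < k then bad client"). A minimal sketch of that rule follows; the `Client` class, the sample clients and the value of `K` are hypothetical and only serve to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Client:
    income: float       # hypothetical yearly income
    loan_amount: float  # not used by the one-threshold rule, kept for context


def classify(client: Client, k: float) -> str:
    """Apply the rule 'if Income < k then bad client'."""
    return "bad" if client.income < k else "good"


# Hypothetical clients and threshold, for illustration only.
clients = [Client(18_000, 9_000), Client(45_000, 12_000), Client(30_000, 30_000)]
K = 25_000
labels = [classify(c, K) for c in clients]
print(labels)
```

In practice the threshold would be learned from previously labeled clients rather than chosen by hand.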
Example
[Figure: Loan Amount vs. Income scatter plot with a threshold k on the Income axis]
If Income < k then bad client

Data management
Traditional DBMS usage, characterized by
- detailed data, relational representation
- snapshot of current data state
- well-known, structured and repetitive operations
- read/write access to few records
- short transactions
- isolation, reliability and integrity (ACID) are critical
- database size 100MB-GB

Data analysis
Data processing for decision support, characterized by
- historical data
- consolidated and integrated data
- ad hoc applications
- read access to millions of records
- complex queries
- data consistency before and after periodical loads
- database size 100GB-TB

Data warehouse
Database devoted to decision support, which is kept separate from company operational databases
Data which is
- integrated
- time dependent, non volatile
- devoted to a specific subject
- used for decision support in a company
W. H. Inmon, Building the data warehouse, 1992

Why separate data?
- Performance
  - complex queries reduce performance of operational transaction management
  - different access methods at the physical level
- Data management
  - missing information (e.g., history)
  - data consolidation
  - data quality (inconsistency problems)
Data warehouse: architecture
[Figure: architecture diagram with (External) data sources, Back-end tools, Data warehouse, Data marts, Metadata, DW management, OLAP servers, Analysis tools, Data analysis]

Data warehouse and data mart
- Company data warehouse: it contains all the information on the company business
  - extensive functional modelling process
  - design and implementation require a long time
- Data mart: departmental information subset focused on a given subject
  - two architectures
    - dependent, fed by the company data warehouse
    - independent, fed directly by the sources
  - faster implementation
  - requires careful design, to avoid subsequent data mart integration problems

Back-end tools
Feed the data warehouse (ETL = Extraction, Transformation, Loading)
- data extraction from data sources
- data cleaning (errors, missing or duplicated data)
- format transformations and conversions
- data loading and periodical refresh

Multidimensional representation
- Data are represented as a (hyper)cube with three or more dimensions
- Measures on which analysis is performed: cells at dimension intersection
- Data warehouse for tracking sales in a supermarket chain:
  - dimensions: product, shop, time
  - measures: sold quantity, sold amount, ...
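The multidimensional representation above can be sketched as a mapping from dimension coordinates to measures. In this illustration each cell is keyed by (product, shop, date) and holds (sold quantity, sold amount); the data values and the `roll_up` helper are hypothetical, not part of the slides.

```python
from collections import defaultdict

# Each cell of the cube is addressed by (product, shop, date) and holds
# the measures (sold quantity, sold amount). All values are made up.
cube = {
    ("milk", "shop_A", "2000-03-02"): (3, 4.50),
    ("milk", "shop_B", "2000-03-02"): (5, 7.50),
    ("bread", "shop_A", "2000-03-02"): (2, 3.00),
    ("milk", "shop_A", "2000-03-03"): (4, 6.00),
}


def roll_up(cells, keep):
    """Sum the measures over every dimension not listed in `keep`.

    `keep` is a tuple of positions in the cell key
    (0 = product, 1 = shop, 2 = date).
    """
    totals = defaultdict(lambda: [0, 0.0])
    for key, (quantity, amount) in cells.items():
        reduced = tuple(key[i] for i in keep)
        totals[reduced][0] += quantity
        totals[reduced][1] += amount
    return {k: tuple(v) for k, v in totals.items()}


by_product = roll_up(cube, keep=(0,))
print(by_product)
```

Storing only the non-empty cells in a dictionary is one simple way to cope with sparse cubes; real servers use more elaborate structures.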
Multidimensional representation
[Figure: data cube; surviving labels: shop, product, cell value 3, SupShop, MilkTTT, 2-3-2000. From Golfarelli, Rizzi, Data warehouse, teoria e pratica della progettazione, McGraw Hill 2006]

Data analysis tools
- OLAP analysis: complex aggregate function computation
  - support to different types of aggregate functions (e.g., moving average, top ten)
- Data analysis by means of data mining techniques
  - various analysis types
  - significant algorithmic contribution
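The two aggregate functions named above (moving average, top ten) can be illustrated in a few lines. This is a plain-Python sketch with hypothetical data and helper names, not how an OLAP server computes them.

```python
def moving_average(values, window):
    """Return the average of every full run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def top_n(totals, n):
    """Return the n (key, total) pairs with the largest totals."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


daily_quantity = [10, 12, 8, 14, 16]  # hypothetical daily sold quantity
sales_by_product = {"milk": 120, "bread": 95, "pasta": 140, "rice": 60}

averages = moving_average(daily_quantity, window=3)
best = top_n(sales_by_product, n=2)
print(averages)
print(best)
```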
Data analysis tools
- Presentation
  - separate activity: data returned by a query may be rendered by means of different presentation tools
- Motivation search
  - data exploration by means of progressive, incremental refinements (e.g., drill down)

OLAP analysis
[Figure: slicing and dicing, and aggregation, on the sales cube; surviving labels: shop, product, year, city='Turin', SupShop, category='food products', year=2000, city, product category. From Golfarelli, Rizzi, Data warehouse, teoria e pratica della progettazione, McGraw Hill 2006]

Types of data mining activities
- Classification and regression: predictive model generation
  - requires a previously labeled data set
- Association rules: extraction of data correlations
- Clustering: data partitioned into homogeneous groups
  - requires the notion of distance between two elements

Example: classification
Age   Car type   Risk category
40    SW         low
65    sport      high
20    utility    high
25    sport      high
50    utility    low
[Figure: decision tree with test nodes "Age < 26" and "Car type = sport" and leaves high, high, low]

Example: association rules
Given a collection of counter transactions in a supermarket (receipts)
Association rule: diapers => beer
- 2% of transactions contain both elements
- 30% of transactions containing diapers also contain beer
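The two percentages quoted for the diapers and beer rule are usually called support (share of all transactions containing both items) and confidence (share of the transactions containing diapers that also contain beer). A minimal sketch of both measures, on a small hypothetical set of receipts:

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, body, head):
    """Fraction of transactions containing `body` that also contain `head`."""
    body, head = set(body), set(head)
    with_body = [t for t in transactions if body <= t]
    if not with_body:
        return 0.0
    return sum(1 for t in with_body if head <= t) / len(with_body)


# Hypothetical receipts, for illustration only.
receipts = [
    {"diapers", "beer", "milk"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
]

rule_support = support(receipts, {"diapers", "beer"})          # 2 of 5 receipts
rule_confidence = confidence(receipts, {"diapers"}, {"beer"})  # 2 of 3 with diapers
print(rule_support, rule_confidence)
```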
Servers for Data Warehouses
- ROLAP (Relational OLAP) server
  - extended relational DBMS
    - compact representation for sparse data
  - SQL extensions for aggregate computation
  - specialized access methods which implement efficient OLAP data access
- MOLAP (Multidimensional OLAP) server
  - data represented in proprietary (multidimensional) matrix format
    - sparse data require compression
  - special OLAP primitives
- HOLAP (Hybrid OLAP) server
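In the ROLAP approach listed above, the data stay in relational tables and aggregates are computed in SQL. The sketch below uses Python's standard `sqlite3` module as a stand-in for the relational engine; the `sale` table and its rows are hypothetical, and the query uses only plain `GROUP BY`, since SQLite does not provide the OLAP-specific SQL extensions the slide refers to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (product TEXT, shop TEXT, year INTEGER, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("milk", "shop_A", 2000, 3),
        ("milk", "shop_B", 2000, 5),
        ("bread", "shop_A", 2000, 2),
        ("milk", "shop_A", 2001, 4),
    ],
)

# Select one year (a slice) and aggregate the quantity over all shops.
rows = conn.execute(
    "SELECT product, SUM(quantity) FROM sale "
    "WHERE year = 2000 GROUP BY product ORDER BY product"
).fetchall()
print(rows)
conn.close()
```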
Relational representation: star model
- (Numerical) measures stored in the fact table
  - attribute domain is numeric
- Dimensions describe the context of each measure in the fact table
  - characterized by many descriptive attributes
Example: Data warehouse for tracking sales in a supermarket chain
[Figure: star schema with fact table Sale and dimension tables Shop, Product, Date]

Data warehouse size
- Time dimension: 2 years x 365 days
- Shop dimension: 300 shops
- Product dimension: 30,000 products, of which 3,000 sold every day in every shop
- Number of rows in the fact table: 730 x 300 x 3,000 = 657 million
- Size of the fact table: 21 GB

Metadata
Different types of metadata:
- for data transformation and loading: describe data sources and needed transformation operations
- for data management: describe the structure of the data in the data warehouse
  - also for materialized views
- for query management: data on query structure and execution
  - SQL code for the query
  - execution plan
  - memory and CPU usage

Textbooks
Data warehousing
- Golfarelli, Rizzi, Data warehouse: teoria e pratica della progettazione, McGraw-Hill 2006
- Kimball et al., textbooks on methodology and case studies, Wiley
Data mining
- Han, Kamber, Data mining: concepts and techniques, Morgan Kaufmann 2006
- Tan, Steinbach, Kumar, Introduction to data mining, Pearson 2006

Useful links
Data warehouse
- http://www.dwinfocenter.org
- http://www.dwreview.com
- http://kimballuniversity.com
Data mining
- http://www.kdnuggets.com/
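The fact-table estimate from the "Data warehouse size" slide can be reproduced with a few lines of arithmetic. The slides do not state the row width; the 32 bytes per row used below is an assumption, chosen because it is roughly what 21 GB divided by 657 million rows implies.

```python
DAYS = 2 * 365                 # time dimension: 2 years
SHOPS = 300                    # shop dimension
PRODUCTS_SOLD_PER_DAY = 3_000  # of the 30,000 products, those sold daily in each shop
BYTES_PER_ROW = 32             # assumption, not stated in the slides

# One fact row per (day, shop, product sold that day in that shop).
rows = DAYS * SHOPS * PRODUCTS_SOLD_PER_DAY
size_gb = rows * BYTES_PER_ROW / 10**9

print(rows)
print(round(size_gb, 1))
```

Counting only the 3,000 products actually sold, rather than all 30,000, is what keeps the estimate at hundreds of millions of rows instead of billions.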