Data Warehousing and elements of Data Mining




prof. e-mail: maurizio.pighin@uniud.it
Dipartimento di Matematica e Informatica, Università di Udine, Italy
Copyright 2008

Slide 2. Motivation: Necessity is the Mother of Invention
- Data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases and other information repositories.
- Data are difficult to analyze: complex queries, long analysis times.
- We are drowning in data, but starving for knowledge!
- Solution: data warehousing and data mining
  - Data warehousing and on-line analytical processing
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Slide 3. Evolution of Database Technology
- 1960s: data collection, database creation, IMS and network DBMS
- 1970s: relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s-2000s: data mining and data warehousing, multimedia databases, and Web databases

Slide 4. Evolution of data analysis
- 1960s: batch reports
  - Difficult to find and analyze data
  - Expensive: every request needs a new report (even today many systems offer only this kind of analysis)
- 1970s: first procedures to support the decision process
  - Usually very poor and not integrated with office automation tools
- 1980s: office automation tools
  - Query tools, spreadsheets, GUIs
  - Access to operational data (usually very complex)
- 1990s: data warehousing and data mining

Slide 5. Data Warehousing and Data Mining (outline)
- What is a data warehouse?
- A multi-dimensional data model
- Data warehouse architecture
- Data warehouse implementation
- OLAP analysis
- From data warehousing to data mining
- Principles of data mining

Slide 6. What is a Data Warehouse?
- Defined in many different ways, but not rigorously.
- A decision support database that is maintained separately from the organization's operational database.
- Supports information processing by providing a solid platform of consolidated, historical data for analysis.

Slide 7. What is a Data Warehouse?
- A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process. (W. H. Inmon, 1985)
- A single, complete and consistent data warehouse, obtained from different sources, available to end users to be immediately utilized. (IBM Systems Journal, 1990)
- Data warehousing: the process of constructing and using data warehouses.

Slide 8. Data Warehouse: Subject-Oriented
- Organized around major subjects, such as customer, product, sales.
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Slide 9. Data Warehouse: Integrated
- Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.
- Data cleaning and data integration techniques are applied.
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources (e.g., hotel price: currency, tax, breakfast covered, etc.).
  - When data is moved to the warehouse, it is converted.

Slide 10. Data Warehouse: Time-Variant
- The time horizon for the data warehouse is significantly longer than that of operational systems.
  - Operational database: current value data.
  - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).
- Every key structure in the data warehouse contains an element of time, explicitly or implicitly, whereas the key of operational data may or may not contain a time element.

Slide 11. Data Warehouse: Non-Volatile
- A physically separate store of data transformed from the operational environment.
- Operational update of data does not occur in the data warehouse environment.
  - Does not require transaction processing, recovery, and concurrency control mechanisms.
  - Requires only two operations in data accessing: initial loading of data and access of data.

Slide 12. Data Warehouse
Characteristics of a data analysis system: FASMI (OLAP Report, 1995)
- Fast
- Analytical
- Shared
- Multidimensional
- Informational

Slide 13. Why do we need all that?
Operational databases are for On-Line Transaction Processing (OLTP):
- automate day-to-day operations (purchasing, banking, etc.)
- transactions access (and modify!) a few records at a time
- database design is application (process) oriented
- metric: transactions/sec

Slide 14. Why do we need all that?
A data warehouse is for On-Line Analytical Processing (OLAP):
- complex queries that access millions of records
- need historical data for trend analysis
- long scans would interfere with normal operations
- synchronizing data-intensive queries among physically separated databases would be a nightmare!
- metric: query response time

Slide 15. Examples of OLAP
- Comparisons (this period vs. last period)
  - Show me the sales per region for this year and compare them to those of the previous year to identify discrepancies.
- Multidimensional ratios (percent to total)
  - Show me the contribution to weekly profit made by all items sold in the northeast stores between May 1 and May 7.

Slide 16. Examples of OLAP
- Ranking and statistical profiles (top N/bottom N)
  - Show me sales, profit and average call volume per day for my 10 most profitable salespeople.
- Custom consolidation (market segments, ad hoc groups)
  - Show me an abbreviated income statement by quarter for the last four quarters for my northeast region operations.
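The query types listed above (period comparison, percent to total, top N) can be sketched in a few lines of Python. This is only an illustration: the region names, salespeople and figures below are invented, and a real OLAP tool would run such queries against the warehouse rather than in-memory dictionaries.

```python
# Hypothetical sales per region for two periods (invented figures).
sales_this_year = {"North": 120.0, "South": 80.0, "East": 50.0}
sales_last_year = {"North": 100.0, "South": 90.0, "East": 50.0}

# Hypothetical profit per salesperson (invented figures).
profit = {"Ann": 500.0, "Bob": 300.0, "Cid": 200.0}


def compare_periods(current, previous):
    """Comparison: difference between this period and the last one, per key."""
    return {key: current[key] - previous[key] for key in current}


def percent_of_total(values):
    """Multidimensional ratio: share of each entry in the overall total."""
    total = sum(values.values())
    return {key: 100 * value / total for key, value in values.items()}


def top_n(values, n):
    """Ranking: the n keys with the largest values, best first."""
    return sorted(values, key=values.get, reverse=True)[:n]


changes = compare_periods(sales_this_year, sales_last_year)
shares = percent_of_total(profit)
best_two = top_n(profit, 2)
```

Each function corresponds to one bullet of the slides; custom consolidation would add a user-defined grouping step before the same aggregations.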

Slide 17. Data Warehouse vs. Heterogeneous DBMS
- Traditional heterogeneous DB integration:
  - Build wrappers/mediators on top of heterogeneous databases.
  - Query-driven approach: when a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set.
  - Complex information filtering; queries compete for resources.
- Data warehouse: update-driven, high performance
  - Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis.

Slide 18. Data Warehouse vs. Operational DBMS
- OLTP (on-line transaction processing)
  - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
- OLAP (on-line analytical processing)
  - Data analysis and decision making
- Distinct features (OLTP vs. OLAP):
  - System orientation: process vs. business subject
  - Data contents: current, detailed vs. historical, consolidated
  - Database design: ER + application vs. multidimensional + subject
  - View: current, local vs. evolutionary, integrated
  - Access patterns: update vs. read-only but complex queries

Slide 19. OLTP vs. OLAP

Feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day-to-day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response

Slide 20. Why a Separate Data Warehouse?
- High performance for both systems
  - DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
  - Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
- Different functions and different data:
  - Missing data: decision support requires historical data, which operational DBs do not typically maintain.
  - Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources.
  - Data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled.

Slide 22. Multidimensional model
- A data warehouse is based on a multidimensional data model, which views data in the form of a data cube (hypercube).
- A hypercube is a multidimensional array that represents a particular event.
- A fact is a point of this multidimensional array, obtained by crossing existing coordinates.
- Dimension: a coordinate of the fact.
- Measure: a numerical value characterizing the event.

Slide 23. Multidimensional model: example
- A data cube, such as sales, allows numerical data (measures) to be modeled and viewed in multiple dimensions.
  - Measures, such as transaction value (dollars_sold) or quantity (item_quantity)
  - Dimensions, such as item (item_name, brand, type), time (day, week, month, quarter, year), or customer (customer_name, city, region, state)

Slide 24. Measures
- Every fact can contain more than one measure.
- A measure may be:
  - stored in the data warehouse (effective measure)
  - evaluated at run time from effective measures
  - implicit (presence or absence of a fact)
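As a rough sketch of these definitions, a sparse data cube can be pictured as a mapping from dimension coordinates to stored measures. The coordinates and values below are invented, and `average_price` is a hypothetical derived measure chosen only to illustrate run-time evaluation.

```python
# Hypothetical sales cube: (item, time, customer) coordinates -> stored measures.
cube = {
    ("TV", "2008-Q1", "Smith"): {"dollars_sold": 1200.0, "item_quantity": 3},
    ("PC", "2008-Q1", "Smith"): {"dollars_sold": 900.0, "item_quantity": 1},
    ("TV", "2008-Q2", "Jones"): {"dollars_sold": 800.0, "item_quantity": 2},
}


def stored(coords, measure):
    """Stored (effective) measure of a fact, or None when the fact does not exist."""
    fact = cube.get(coords)
    return None if fact is None else fact[measure]


def average_price(coords):
    """Derived measure, evaluated at run time from two stored measures."""
    fact = cube[coords]
    return fact["dollars_sold"] / fact["item_quantity"]


def was_sold(coords):
    """Implicit measure: the presence or absence of the fact itself."""
    return coords in cube
```

Only the cells that actually occurred are stored, which is why the implicit measure reduces to a membership test.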

Slide 25. Fact aggregation
- Elementary facts can be aggregated to obtain synthetic facts.
- The measures of the synthetic facts are obtained with aggregation operators: sum, mean, max, min, ...
- For each measure-dimension pair it is possible to define a different aggregation operator.

Slide 26. Fact aggregation
Measures can be:
- Additive: can be aggregated by sum on every dimension (for instance, total income).
- Semi-additive: can be aggregated by sum on some dimensions but not on others (for instance, quantity can be summed on item but not on store, where different items are present).
- Non-additive: can never be summed; other operators (mean, median, max, min) must be used (for instance, unit price).
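The three kinds of measure can be illustrated with a small sketch. The facts below are invented, and the treatment of quantity follows one reading of the slide: quantities of the same item are summed across stores, while quantities of different items are not mixed.

```python
from statistics import mean

# Hypothetical elementary facts: (item, store, income, quantity, unit_price).
facts = [
    ("TV", "Udine", 1200.0, 3, 400.0),
    ("TV", "Trieste", 800.0, 2, 400.0),
    ("PC", "Udine", 900.0, 1, 900.0),
]

INCOME, QUANTITY, UNIT_PRICE = 2, 3, 4


def aggregate(rows, column, operator):
    """Apply an aggregation operator to one measure of a set of facts."""
    return operator([row[column] for row in rows])


# Additive: income can be summed over every dimension.
total_income = aggregate(facts, INCOME, sum)

# Semi-additive (as read here): quantity is summed only within a single item.
tv_rows = [row for row in facts if row[0] == "TV"]
tv_quantity = aggregate(tv_rows, QUANTITY, sum)

# Non-additive: unit prices are never summed; use mean, max, min instead.
mean_unit_price = aggregate(facts, UNIT_PRICE, mean)
max_unit_price = aggregate(facts, UNIT_PRICE, max)
```

Passing the operator as a parameter mirrors the idea that each measure-dimension pair may have its own aggregation operator.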

Slide 27. Dimension hierarchy
- Hierarchy: a set of dimensional attributes hierarchically linked to one dimension.
- Dimensional attributes:
  - are used to aggregate elementary facts
  - are uniquely determined by a dimension
  - represent a classification of the dimension

Slide 28. Example of dimension hierarchy
- all: all
- region: Europe, ..., North_America
- country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
- city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
- office: L. Chan, ..., M. Wind
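One way to picture such a hierarchy is as a chain of parent lookups, one per level. The sketch below uses the members of the example; the dictionary and function names are illustrative assumptions, and the office level is left out because the slide does not say which office belongs to which city.

```python
# Each mapping sends a member to its parent at the next level up.
city_to_country = {
    "Frankfurt": "Germany",
    "Vancouver": "Canada",
    "Toronto": "Canada",
}
country_to_region = {
    "Germany": "Europe",
    "Spain": "Europe",
    "Canada": "North_America",
    "Mexico": "North_America",
}


def classify(city):
    """Path from a city up to the top of the hierarchy."""
    country = city_to_country[city]
    region = country_to_region[country]
    return [city, country, region, "all"]


path = classify("Toronto")
```

Because every city has exactly one country and every country exactly one region, each dimensional attribute is uniquely determined by the member below it.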

Slide 29. View of Warehouses and Hierarchies (figure)

Slide 30. Multidimensional Data
- Sales volume as a function of Product, Location, and Time
- Dimensions: Product, Location, Time
- Hierarchical summarization paths:
  - Product: Industry > Category > Product (Item)
  - Location: Region > Country > City > Office
  - Time: Year > Quarter > Month, Week > Day

Slide 32. OLAP Server Architectures
Relational OLAP (ROLAP)
- Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support the missing pieces.
- Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services.
- Greater scalability.

Slide 33. OLAP Server Architectures
- Multidimensional OLAP (MOLAP)
  - Array-based multidimensional storage engine (sparse matrix techniques)
  - Fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP)
  - User flexibility, e.g., low level: relational, high level: array

Slide 34. Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions and measures on ROLAP systems
- Star schema: a fact table in the middle connected to a set of dimension tables.
- Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
- Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema.

Slide 35. Components of Star Schema
- Fact tables contain factual or quantitative data.
- 1:N relationship between dimension tables and fact tables.
- Dimension tables are denormalized to maximize performance.
- Dimension tables contain descriptions about the subjects of the business.
- Excellent for ad-hoc queries, but bad for online transaction processing.

Slide 36. Star Schema example (figure)
- The fact table provides statistics for sales broken down by the product, period and store dimensions.

Slide 37. Star Schema with sample data (figure)

Slide 38. Another example of Star Schema
- Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
- Dimension tables:
  - time: time_key, day, day_of_the_week, month, quarter, year
  - item: item_key, item_name, brand, type, supplier_type
  - branch: branch_key, branch_name, branch_type
  - location: location_key, street, city, province_or_street, country
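A cut-down version of this star schema can be tried out with Python's built-in sqlite3 module. The sketch keeps only two dimensions and a few columns, and the rows (including the brand name) are invented sample data, not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales (
    time_key INTEGER REFERENCES time(time_key),
    item_key INTEGER REFERENCES item(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time VALUES (1, 'Q1', 2008), (2, 'Q2', 2008);
INSERT INTO item VALUES (1, 'TV', 'Acme'), (2, 'PC', 'Acme');
INSERT INTO sales VALUES (1, 1, 3, 1200.0), (1, 2, 1, 900.0), (2, 1, 2, 800.0);
""")

# Typical star join: the fact table joined to its dimensions, then grouped.
rows = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN time AS t ON t.time_key = s.time_key
    JOIN item AS i ON i.item_key = s.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
```

The query touches the fact table once and each dimension table through a single key join, which is the access pattern a star schema is designed for.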

Slide 39. Example of Snowflake Schema
- Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
- Dimension tables:
  - time: time_key, day, day_of_the_week, month, quarter, year
  - item: item_key, item_name, brand, type, supplier_key
  - supplier: supplier_key, supplier_type
  - branch: branch_key, branch_name, branch_type
  - location: location_key, street, city_key
  - city: city_key, city, province_or_street, country

Slide 40. Example of Fact Constellation
- Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
- Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
- Dimension tables:
  - time: time_key, day, day_of_the_week, month, quarter, year
  - item: item_key, item_name, brand, type, supplier_type
  - branch: branch_key, branch_name, branch_type
  - location: location_key, street, city, province_or_street, country
  - shipper: shipper_key, shipper_name, location_key, shipper_type

Slide 41. Main Data Warehouse Architectures
- Generic two-level architecture
- Independent data mart
- Dependent data mart and operational data store (three-level architecture)
All involve some form of extraction, transformation and loading (ETL).

Slide 42. Generic Two-Level Data Warehousing Architecture (figure)
- One company-wide warehouse, fed through ETL
- Periodic extraction: data is not completely current in the warehouse

Slide 43. Independent Data Mart Data Warehousing Architecture (figure)
- Data marts: mini-warehouses, limited in scope
- Separate ETL for each independent data mart
- Data access complexity due to multiple data marts

Slide 44. Dependent Data Mart with Operational Data Store: Three-Level Architecture (figure)
- Single ETL for the enterprise data warehouse (EDW)
- Simpler data access
- Dependent data marts loaded from the EDW

Slide 45. General Architecture (figure)
- Data sources: operational DBs, other sources
- Data storage: extract, transform, load, refresh (monitor and integrator, metadata) into the data warehouse and data marts
- OLAP engine: OLAP server
- Front-end: analysis, query, reports, data mining

Slide 46. General Architecture
- Enterprise warehouse: collects all of the information about subjects spanning the entire organization.
- Data mart: a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
  - Independent vs. dependent (fed directly from the warehouse) data mart

Slide 47. ETL function
- Data extraction: get data from multiple, heterogeneous, and external sources.
- Data cleaning: detect errors in the data and rectify them when possible.
- Data transformation: convert data from legacy or host format to warehouse format.
- Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions.
- Refresh: propagate the updates from the data sources to the warehouse.
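The extraction, cleaning, transformation and load steps can be sketched as a tiny pipeline. Everything here is an assumption made for illustration: the two source formats, the field names, and the fixed exchange rate. Records that cannot be repaired are simply rejected, which is only one possible cleaning policy.

```python
USD_PER_EUR = 1.5  # assumed fixed exchange rate, for illustration only

# Two hypothetical sources with inconsistent naming, formats and currencies.
source_a = [{"customer": " smith ", "amount": "100.50", "currency": "EUR"}]
source_b = [
    {"cust_name": "JONES", "total_usd": "220"},
    {"cust_name": "BROWN", "total_usd": "n/a"},
]


def extract():
    """Extraction: read records from every source, tagged with their origin."""
    for record in source_a:
        yield "a", record
    for record in source_b:
        yield "b", record


def clean_and_transform(origin, record):
    """Cleaning and transformation: convert to one warehouse format."""
    try:
        if origin == "a":
            name = record["customer"]
            usd = float(record["amount"]) * USD_PER_EUR
        else:
            name = record["cust_name"]
            usd = float(record["total_usd"])
    except ValueError:
        return None  # cleaning: reject amounts that cannot be parsed
    return {"customer": name.strip().title(), "amount_usd": round(usd, 2)}


def load(records):
    """Load: drop rejected records and sort before storing."""
    return sorted((r for r in records if r is not None), key=lambda r: r["customer"])


warehouse = load(clean_and_transform(origin, rec) for origin, rec in extract())
```

A refresh step would rerun the same pipeline on the records that changed in the sources since the last load.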

Slide 49. Design of a Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse:
- Top-down view: allows selection of the relevant information necessary for the data warehouse.
- Data source view: exposes the information being captured, stored, and managed by operational systems.
- Data warehouse view: consists of fact tables and dimension tables.
- Business query view: sees the perspectives of data in the warehouse from the view of the end user.

Slide 50. Data Warehouse Design Process
- Top-down, bottom-up approaches or a combination of both
  - Top-down: starts with overall design and planning (mature)
  - Bottom-up: starts with experiments and prototypes (rapid)
- From a software engineering point of view
  - Waterfall: structured and systematic analysis at each step before proceeding to the next (top-down)
  - Spiral: rapid generation of increasingly functional systems, with short turnaround time (bottom-up)

Slide 51. Data Warehouse Design Process
Typical data warehouse design process with the bottom-up approach:
- Choose a business process to model, e.g., orders, invoices, etc.
- Choose the grain (atomic level of data) of the business process.
- Choose the dimensions that will apply to each fact table record.
- Choose the measures that will populate each fact table record.
- Design the architecture of the DW.
- Design the ETL.
- Install and test.
Advantages:
- Results in a short time
- Not too expensive
- Gives management a clear perspective of the OLAP world

Slide 53. Exploration of Data Cubes
- OLAP: interactive navigation through data
- Two models:
  - Hypothesis-driven: exploration by the user, driven by hypotheses formulated by the user.
  - Discovery-driven: pre-compute measures indicating exceptions and guide the user in the data analysis, at all levels of aggregation. Users then apply hypothesis-driven exploration.

Slide 54. A Sample Data Cube (figure)
- Product: TV, PC, VCR, sum
- Date: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
- Country: U.S.A., Canada, Mexico, sum
- Example cell: total annual sales of TV in U.S.A.

Slide 55. Typical OLAP Operations
- Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
- Drill down (roll down): the reverse of roll-up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.

Slide 56. Roll-up/Drill-down (figure: roll-up and drill-down paths between the Date, Country and Product dimensions and their All levels)
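Roll-up by dimension reduction can be sketched as summing out the dimensions that are dropped. The cells below are invented and `roll_up` is a hypothetical helper; drill-down is the reverse navigation and needs the finer cells to still be available.

```python
from collections import defaultdict

# Hypothetical base cells: (country, quarter, product) -> sales.
base = {
    ("U.S.A.", "1Qtr", "TV"): 10,
    ("U.S.A.", "2Qtr", "TV"): 20,
    ("Canada", "1Qtr", "TV"): 5,
    ("Canada", "1Qtr", "PC"): 7,
}


def roll_up(cells, keep):
    """Sum out every dimension whose position is not listed in keep."""
    summary = defaultdict(int)
    for coords, value in cells.items():
        summary[tuple(coords[i] for i in keep)] += value
    return dict(summary)


# Drop the Date dimension, then also drop Product.
by_country_product = roll_up(base, keep=(0, 2))
by_country = roll_up(base, keep=(0,))
```

Drilling down from `by_country` to `by_country_product` simply means going back to the less aggregated result, which is why OLAP servers keep or can recompute the detailed levels.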

Slides 57-59. OLAP Operations: drill-down (figures)

Slides 60-62. OLAP Operations: roll-up (figures)

Slide 63. OLAP Operations
- Slice and dice: select and project on one or more dimensions (figure: cube with dimensions country, date, product, sliced on customer = Smith)

Slide 64. Slice (figure: the Country x Product x Date cube sliced from 4 quarters down to 2 quarters)
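A minimal sketch of the two operations, using invented cells: a slice fixes one dimension to a single value, while a dice keeps a sub-cube by restricting several dimensions to sets of values. The helper names are hypothetical.

```python
# Hypothetical cells: (country, quarter, product) -> sales.
cells = {
    ("U.S.A.", "1Qtr", "TV"): 10,
    ("U.S.A.", "2Qtr", "PC"): 20,
    ("Canada", "1Qtr", "TV"): 5,
    ("Mexico", "3Qtr", "VCR"): 7,
}


def slice_cube(cube, dim, value):
    """Fix one dimension to a single value and drop it from the coordinates."""
    return {
        coords[:dim] + coords[dim + 1:]: v
        for coords, v in cube.items()
        if coords[dim] == value
    }


def dice(cube, allowed):
    """Keep the cells whose coordinates lie in the allowed set of each listed dimension."""
    return {
        coords: v
        for coords, v in cube.items()
        if all(coords[d] in values for d, values in allowed.items())
    }


first_quarter = slice_cube(cells, dim=1, value="1Qtr")
north_first_half = dice(cells, {0: {"U.S.A.", "Canada"}, 1: {"1Qtr", "2Qtr"}})
```

The slice returns a cube with one dimension fewer, whereas the dice keeps all three dimensions and only shrinks their ranges.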

Slides 65-66. OLAP Operations: slice-and-dice (figures)

Slide 67. OLAP Operations: slice-and-dice (figure)

Slide 68. OLAP Operations
- Pivot (rotate): reorient the cube visualization, e.g., from 3D to a series of 2D planes.
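On a single 2D plane, a pivot amounts to swapping the row and column axes. The sketch below uses an invented product-by-quarter table stored as nested dictionaries; `pivot` is a hypothetical helper.

```python
# Hypothetical 2D plane: rows are products, columns are quarters.
plane = {
    "TV": {"1Qtr": 10, "2Qtr": 20},
    "PC": {"1Qtr": 7, "2Qtr": 0},
}


def pivot(table):
    """Rotate a 2D table so that columns become rows and rows become columns."""
    rotated = {}
    for row, columns in table.items():
        for column, value in columns.items():
            rotated.setdefault(column, {})[row] = value
    return rotated


by_quarter = pivot(plane)
```

The values are untouched; only the orientation of the view changes, so pivoting twice gives back the original table.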

Slide 69. OLAP Operations: pivot (figure: the cube is rotated so that the Time, Store and Product axes change places)

Slides 70-72. OLAP Operations: pivoting (figures)

Slide 73. OLAP Operations
- Drill across: a query involving (across) more than one fact table.

Slide 74. OLAP Operations: drill-across (figure)

Slide 75. OLAP Operations: drill-across (figure)

Slide 76. Exploration of Data Cubes
- Hypothesis-driven: exploration by the user, huge search space.
- Discovery-driven:
  - Pre-compute measures indicating exceptions and guide the user in the data analysis, at all levels of aggregation.
  - Exception: a value significantly different from the value anticipated, based on a statistical model.
  - Visual cues such as background color are used to reflect the degree of exception of each cell.
  - Computation of the exception indicator can be overlapped with cube construction.
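The slides do not specify the statistical model behind the exception indicator. As a stand-in, the sketch below takes the mean of all cells as the anticipated value and flags cells whose z-score exceeds a threshold; the monthly figures and the threshold of 2.0 are assumptions made for illustration.

```python
from statistics import mean, pstdev

# Hypothetical monthly sales for one product (invented figures).
sales = {"Jan": 100, "Feb": 104, "Mar": 98, "Apr": 102, "May": 180, "Jun": 101}


def exceptions(cells, threshold=2.0):
    """Cells whose value lies more than threshold standard deviations from the mean."""
    values = list(cells.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {}  # all cells identical: nothing is exceptional
    return {
        key: (value - mu) / sigma
        for key, value in cells.items()
        if abs(value - mu) / sigma > threshold
    }


flagged = exceptions(sales)
```

A front-end tool could map the returned scores to background colors, so that the degree of exception of each cell is visible at a glance.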

Slide 77. Examples: Discovery-Driven Data Cubes (figure)

Slide 79. Data Warehouse Usage
- Three kinds of data warehouse applications:
  - Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs.
  - Analytical processing: multidimensional analysis of data warehouse data; supports basic OLAP operations, slice-dice, drilling, pivoting.
  - Data mining: knowledge discovery from hidden patterns; supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
- Differences among the three tasks

Slide 80. From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)
Why online analytical mining?
- High quality of data in data warehouses
  - The DW contains integrated, consistent, cleaned data.
- Available information processing structure surrounding data warehouses
  - ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
- OLAP-based exploratory data analysis
  - Mining with drilling, dicing, pivoting, etc.
- On-line selection of data mining functions

Slide 82. What Is Data Mining?
- Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
- Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Slide 83. What Is Data Mining? Other Definitions
- Non-trivial extraction of implicit, previously unknown and potentially useful information from data.
- Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

Slide 84. Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused:
  - Web data, e-commerce
  - Purchases at department stores
  - Bank/credit card transactions
- Computers have become cheaper and more powerful.
- Competitive pressure is strong:
  - Provide better, customized services (e.g., in Customer Relationship Management).

Slide 85. Mining Large Data Sets: Motivation
- There is often information hidden in the data that is not readily evident.
- Human analysts may take weeks to discover useful information.
- Much of the data is never analyzed at all.

Slide 86. Why Data Mining? Potential Applications
- Database analysis and decision support
  - Market analysis and management: target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (news groups, email, documents) and Web analysis

Slide 87. Market Analysis and Management
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of model customers who share the same characteristics: interests, income level, spending habits, etc.

Slide 88. Market Analysis and Management
- Determine customer purchasing patterns over time
  - Changes in customer habits with age
- Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information
- Customer profiling
  - Identifying what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - Identifying the best products for different customers
  - Using prediction to find what factors will attract new customers

Slide 89. Corporate Analysis and Risk Management
- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Cross-sectional and time series analysis (financial ratios, trend analysis, etc.)
- Resource planning
  - Summarize and compare the resources and spending
- Competition
  - Monitor competitors and market directions
  - Group customers into classes and define a class-based pricing procedure
  - Set pricing strategy in a highly competitive market

Slide 90. Fraud Detection and Management
- Applications: widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach: use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
- Examples:
  - Auto insurance: detect a group of people who stage accidents to collect on insurance.
  - Money laundering: detect suspicious money transactions.

Data Mining Tasks
- Prediction methods
  - Use some variables to predict unknown or future values of other variables.
- Description methods
  - Find human-interpretable patterns that describe the data.
Slide 91

Principal Data Mining Tasks
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]
Slide 92

Classification: Definition
- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- Methodology: a test set is used to determine the accuracy of the model. Usually, the given data set is randomly divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Slide 93

Classification Example
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: Training Set -> Learn Classifier -> Model; Test Set -> Model]
Slide 94
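
The hold-out methodology described in the classification definition can be sketched as follows. This is a minimal illustration, not the method used in the course: the function names (`holdout_split`, `train_one_rule`, `accuracy`), the 30% test fraction, the fixed seed, and the one-attribute majority-class rule are all assumptions chosen to keep the example short. The records are the ten labeled rows of the example table.

```python
import random
from collections import Counter

# Labeled records from the example table:
# (refund, marital_status, taxable_income_in_thousands, cheat)
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def holdout_split(records, test_fraction=0.3, seed=0):
    """Randomly divide labeled records into (training set, test set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_one_rule(train, attribute_index):
    """Learn a one-attribute rule: each attribute value maps to its majority class.

    This is a deliberately simple stand-in for a real classifier.
    """
    counts_by_value = {}
    for record in train:
        counts = counts_by_value.setdefault(record[attribute_index], Counter())
        counts[record[-1]] += 1
    # Fallback class for attribute values never seen during training.
    default = Counter(r[-1] for r in train).most_common(1)[0][0]
    rule = {v: c.most_common(1)[0][0] for v, c in counts_by_value.items()}
    return lambda record: rule.get(record[attribute_index], default)


def accuracy(model, records):
    """Fraction of records whose predicted class equals the known class."""
    return sum(model(r) == r[-1] for r in records) / len(records)


train, test = holdout_split(RECORDS)
model = train_one_rule(train, attribute_index=0)  # rule on the Refund attribute
print(f"train={len(train)} test={len(test)} accuracy={accuracy(model, test):.2f}")
```

With only ten records the accuracy estimate is very noisy; the point is the separation of roles, with the model built only from the training set and scored only on the test set.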

Classification: Application
Direct Marketing
- Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
- Approach:
  - Use the data for a similar product introduced before.
  - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
  - Collect various demographic, lifestyle, and company-interaction related information about all such customers: type of business, where they stay, how much they earn, etc.
  - Use this information as input attributes to learn a classifier model.
Slide 95

Clustering: Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.
- Similarity measures
  - Euclidean distance if attributes are continuous.
  - Other problem-specific measures.
Slide 96
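
The slides do not prescribe a clustering algorithm. As one possible illustration of the definition above with Euclidean distance as the similarity measure, here is a minimal k-means sketch. k-means itself, the function names, the sample points, and the choice of initial centroids are assumptions made for this example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members (unchanged if it has none)."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            mean = tuple(sum(coord) / len(members) for coord in zip(*members))
            new_centroids.append(mean)
        else:
            new_centroids.append(old)
    return new_centroids


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
        labels = assign(points, centroids)
    return labels, centroids


# Hypothetical 2-D data with two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, initial_centroids=[points[0], points[3]])
print(labels, centroids)
```

The result depends on the initial centroids; in practice they are usually chosen at random and the procedure is repeated several times.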

Illustrating Clustering
- Euclidean distance based clustering in 3-D space
  - Intracluster distances are minimized
  - Intercluster distances are maximized
Slide 97

Clustering: Application
Market Segmentation
- Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
- Approach:
  - Collect different attributes of customers based on their geographical and lifestyle related information.
  - Find clusters of similar customers.
  - Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters.
Slide 98
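
The idea that a good clustering keeps intracluster distances small and intercluster distances large can be checked numerically. The sketch below compares the mean pairwise Euclidean distance within clusters to the mean distance between clusters for points in 3-D space. The function name, the sample points, and the use of mean pairwise distance as the quality measure are assumptions for illustration.

```python
import math
from itertools import combinations


def mean_pairwise_distances(points, labels):
    """Return (mean intracluster distance, mean intercluster distance)."""
    intra, inter = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        distance = math.dist(p, q)  # Euclidean distance (Python 3.8+)
        if label_p == label_q:
            intra.append(distance)
        else:
            inter.append(distance)
    if not intra or not inter:
        raise ValueError("need at least one same-cluster and one cross-cluster pair")
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical 3-D points forming two well-separated clusters.
points = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
    (10.0, 10.0, 10.0), (11.0, 10.0, 10.0), (10.0, 11.0, 10.0),
]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = mean_pairwise_distances(points, labels)
print(f"intracluster={intra:.3f} intercluster={inter:.3f}")
```

For a well-separated clustering the first number is much smaller than the second; a poor assignment of labels brings the two values closer together.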

Association Rule Discovery: Definition
- Given a set of records, each of which contains some number of items from a given collection:
  - Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
Slide 99

Association Rule Discovery: Application 1
Marketing and Sales Promotion
- Let the rule discovered be {Bagels, ...} --> {Potato Chips}
- Potato Chips as consequent: can be used to determine what should be done to boost its sales.
- Bagels in the antecedent: can be used to see which products would be affected if the store discontinues selling bagels.
- Bagels in the antecedent and Potato Chips in the consequent: can be used to see what products should be sold with Bagels to promote the sale of Potato Chips.
Slide 100
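
The slides list the discovered rules without saying how a rule is scored. A common way, assumed here, is to use support (the fraction of transactions containing all items of the rule) and confidence (the fraction of transactions containing the antecedent that also contain the consequent). The sketch below computes both for the two rules over the five transactions in the table; the function names are illustrative.

```python
# The five transactions from the definition slide, one set of items per TID.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(itemset <= transaction for transaction in transactions)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent together."""
    return count(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction with the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


rules = [
    ({"Milk"}, {"Coke"}),
    ({"Diaper", "Milk"}, {"Beer"}),
]
for antecedent, consequent in rules:
    s = support(antecedent, consequent, TRANSACTIONS)
    c = confidence(antecedent, consequent, TRANSACTIONS)
    print(f"{sorted(antecedent)} --> {sorted(consequent)}: support={s:.2f} confidence={c:.2f}")
```

Rule discovery algorithms typically keep only rules whose support and confidence exceed user-chosen thresholds.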

Association Rule Discovery: Application 2
Supermarket shelf management
- Goal: to identify items that are bought together by sufficiently many customers.
- Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
- A classic rule:
  - If a customer buys diapers and milk, then he is very likely to buy beer.
  - So, don't be surprised if you find six-packs stacked next to diapers!
Slide 101

Regression
- To identify unknown values in a continuous domain
- Build trend functions by interpolating the known points (regression)
- Different models
  - Linear regression (two variables): $Y = q + m X$
  - Multi-linear regression (more variables): $Y = q + m_1 X_1 + m_2 X_2 + m_3 X_3$
  - Non-linear regression (polynomial, exponential, logarithmic, ...): $Y = q + m_1 X + m_2 X^2 + m_3 X^3$
Slide 102
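
For the two-variable linear model Y = q + m X, the coefficients can be estimated from known points. The sketch below uses ordinary least squares, which is an assumption since the slides do not name a fitting method; the function names and the sample data are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares estimates (q, m) for the linear model Y = q + m * X."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of X and sum of cross deviations of X and Y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    m = sxy / sxx
    q = mean_y - m * mean_x
    return q, m


def predict(q, m, x):
    """Value of Y predicted by the fitted line for a new X."""
    return q + m * x


# Hypothetical known points lying close to a straight line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

q, m = fit_line(xs, ys)
print(f"q={q:.2f} m={m:.2f} prediction at X=5: {predict(q, m, 5.0):.2f}")
```

The multi-linear and polynomial models follow the same principle, with one coefficient per term instead of a single slope.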

Regression Example
[Figure: regression example]
Slide 103

Deviation Detection
- The search for outliers
  - Outlier: an exception, an element out of range
- The search is based on the same principles as clustering
  - It concentrates on finding elements that are far from the others
- Search methods
  - Statistical: can be used if a statistical distribution can be estimated
  - Distance based: search for the elements that maximize the distance from the other elements of the set
  - Deviation based: search for the elements that maximize the deviance from the other elements of the set
- Example: fraud detection
Slide 104
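
The distance-based search can be sketched as follows: each element is scored by its mean Euclidean distance from all the other elements, and the element with the highest score is reported as the most likely outlier. The scoring function, the function names, and the sample data are assumptions for illustration; other distance-based definitions exist.

```python
import math


def distance_scores(points):
    """Mean Euclidean distance from each point to all the other points."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    scores = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        scores.append(sum(math.dist(p, q) for q in others) / len(others))
    return scores


def most_deviating(points):
    """Index of the element that maximizes the mean distance from the rest."""
    scores = distance_scores(points)
    return max(range(len(points)), key=scores.__getitem__)


# Hypothetical transactions as (amount, number of items);
# the last one lies far from the others.
points = [(20.0, 2.0), (22.0, 3.0), (19.0, 2.0), (21.0, 3.0), (500.0, 1.0)]

outlier_index = most_deviating(points)
print(outlier_index, points[outlier_index])
```

In a fraud detection setting the attributes would normally be rescaled first, so that one attribute with a large range (here the amount) does not dominate the distance.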

Challenges of Data Warehousing and Mining
- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data ownership and distribution
- Privacy preservation
- Streaming data
- Data quality
Slide 105

Data Quality
- The quality of a process measures its adherence to the users' targets
- The following tables list some aspects of quality (Wang-Wand (1999): quality dimensions)
Slide 106

Data Quality
[Tables: quality dimensions]
Slide 107

Main Competitors in DW Systems

Vendor                          Global Revenue 2006 (Millions USD)
Microsoft Corporation           1,801
Hyperion Solutions Corporation  1,077
Cognos                          735
Business Objects                416
MicroStrategy                   416
SAP AG                          330
Cartesis SA                     210
Applix                          205
Infor                           199
Oracle Corporation              159
Others                          152
Total                           5,700
Slide 108

Bibliography: Data warehousing
- Berson A., Smith S.J., Data warehousing, data mining and OLAP, McGraw-Hill, 1997
- Berthold M., Hand D.J., Intelligent data analysis: an introduction, Springer-Verlag, 1999
- Inmon W.H., Building the data warehouse, John Wiley & Sons, 1996
- Inmon W.H., Zachman J.A., Geiger G., Data stores, data warehousing and Zachman framework; managing enterprise knowledge, McGraw-Hill, 1997
- Kimball R., Ross M., The Data Warehouse Toolkit. Practical techniques for building dimensional Data Warehouses, 2nd ed., John Wiley, 2002
- Thomsen E., OLAP solutions: building multidimensional information systems, John Wiley & Sons, 1997
Slide 109

Bibliography: Data mining
- Bramer M., Principles of Data Mining, Springer, 2007
- Han J., Kamber M., Data Mining: Concepts and Techniques, Academic Press, 2001
- Parr Rud O., Data mining cookbook: Modeling data for marketing, risk and CRM, John Wiley & Sons, 2000
- Pyle D., Data preparation for data mining, Morgan Kaufmann, 1999
- Weiss S.M., Indurkhya N., Predictive Data Mining, Morgan Kaufmann, 1998
- Witten I.H., Frank E., Data mining: Practical Machine Learning Tools and Techniques, 2nd ed., Elsevier, 2005
Slide 110