Data Warehousing and Data Mining



Similar documents
Data Warehousing and Decision Support. Torben Bach Pedersen Department of Computer Science Aalborg University

Data Warehousing Systems: Foundations and Architectures

Advanced Data Management Technologies

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 28

IST722 Data Warehousing

Data Warehousing. Jens Teubner, TU Dortmund Winter 2015/16. Jens Teubner Data Warehousing Winter 2015/16 1

An Introduction to Data Warehousing. An organization manages information in two dominant forms: operational systems of

An Overview of Data Warehousing, Data mining, OLAP and OLTP Technologies

Overview. DW Source Integration, Tools, and Architecture. End User Applications (EUA) EUA Concepts. DW Front End Tools. Source Integration

Part 22. Data Warehousing

MDM and Data Warehousing Complement Each Other

14. Data Warehousing & Data Mining

Turkish Journal of Engineering, Science and Technology

Designing a Dimensional Model

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

Lection 3-4 WAREHOUSING

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Microsoft Data Warehouse in Depth

DATA WAREHOUSING AND OLAP TECHNOLOGY

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

Fluency With Information Technology CSE100/IMT100

Data Warehousing and Data Mining in Business Applications

Data Warehousing: A Technology Review and Update Vernon Hoffner, Ph.D., CCP EntreSoft Resouces, Inc.

MS 20467: Designing Business Intelligence Solutions with Microsoft SQL Server 2012

Bussiness Intelligence and Data Warehouse. Tomas Bartos CIS 764, Kansas State University

MIS636 AWS Data Warehousing and Business Intelligence Course Syllabus

Week 13: Data Warehousing. Warehousing

B.Sc (Computer Science) Database Management Systems UNIT-V

Extraction Transformation Loading ETL Get data out of sources and load into the DW

Week 3 lecture slides

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Data Warehouse: Introduction

Data Warehouse Overview. Srini Rengarajan

SQL SERVER BUSINESS INTELLIGENCE (BI) - INTRODUCTION

Understanding Data Warehousing. [by Alex Kriegel]

Course Design Document. IS417: Data Warehousing and Business Analytics

CHAPTER 5: BUSINESS ANALYTICS

Microsoft Services Exceed your business with Microsoft SharePoint Server 2010

What is Management Reporting from a Data Warehouse and What Does It Have to Do with Institutional Research?

Business Intelligence, Data Warehousing and Multidimensional Databases. Torben Bach Pedersen

OLAP Theory-English version

The Microsoft Business Intelligence 2010 Stack Course 50511A; 5 Days, Instructor-led

Business Intelligence Solutions. Cognos BI 8. by Adis Terzić

A Design and implementation of a data warehouse for research administration universities

A Critical Review of Data Warehouse

ETL-EXTRACT, TRANSFORM & LOAD TESTING

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

CHAPTER 4: BUSINESS ANALYTICS

OLAP and Data Warehousing! Introduction!

Building Cubes and Analyzing Data using Oracle OLAP 11g

When to consider OLAP?

1. OLAP is an acronym for a. Online Analytical Processing b. Online Analysis Process c. Online Arithmetic Processing d. Object Linking and Processing

Business Intelligence: Effective Decision Making

Presented by: Jose Chinchilla, MCITP

Data Warehousing and OLAP Technology for Knowledge Discovery

Dimensional Modeling for Data Warehouse

LEARNING SOLUTIONS website milner.com/learning phone

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES

Data Warehousing, OLAP, and Data Mining

Microsoft Business Intelligence

DATA WAREHOUSE CONCEPTS DATA WAREHOUSE DEFINITIONS

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

THE DATA WAREHOUSE ETL TOOLKIT CDT803 Three Days

Unlock your data for fast insights: dimensionless modeling with in-memory column store. By Vadim Orlov

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Data Warehousing and Data Mining Introduction

ETL Overview. Extract, Transform, Load (ETL) Refreshment Workflow. The ETL Process. General ETL issues. MS Integration Services

Data Warehousing. Outline. From OLTP to the Data Warehouse. Overview of data warehousing Dimensional Modeling Online Analytical Processing

The Role of Data Warehousing Concept for Improved Organizations Performance and Decision Making

2074 : Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000

SAS BI Course Content; Introduction to DWH / BI Concepts

Data Integration and ETL Process

<Insert Picture Here> Extending Hyperion BI with the Oracle BI Server

Introduction to Data Warehousing. Ms Swapnil Shrivastava

Budgeting and Planning with Microsoft Excel and Oracle OLAP

Dr. Osama E.Sheta Department of Mathematics (Computer Science) Faculty of Science, Zagazig University Zagazig, Elsharkia, Egypt

CHAPTER 3. Data Warehouses and OLAP

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

Turning your Warehouse Data into Business Intelligence: Reporting Trends and Visibility Michael Armanious; Vice President Sales and Marketing Datex,

IT0457 Data Warehousing. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

<Insert Picture Here> Oracle Retail Data Model Overview

The Data Warehouse ETL Toolkit

An Approach for Facilating Knowledge Data Warehouse

Data warehouse and Business Intelligence Collateral

Armanino McKenna LLP Welcomes You To Today s Webinar:

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

SQL Server 2012 End-to-End Business Intelligence Workshop

Outlines. Business Intelligence. What Is Business Intelligence? Data mining life cycle

Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence

Importance or the Role of Data Warehousing and Data Mining in Business Applications

Enterprise Data Warehouse (EDW) UC Berkeley Peter Cava Manager Data Warehouse Services October 5, 2006

Transcription:

Data Warehousing and Data Mining Part I: Data Warehousing Gao Cong gaocong@cs.aau.dk Slides adapted from Man Lung Yiu and Torben Bach Pedersen

Course Structure Business intelligence: Extract knowledge from large amounts of data collected in a modern enterprise Data warehousing Data mining Purpose Acquire theoretical background in lectures and literature studies Obtain practical experience on (industrial) tools in a mini-project Data warehousing: construction of a database with only data analysis purpose Business Intelligence (BI) Data Mining: find patterns automatically in databases 2

business analysis and data mining Amazon: BA: In the past five years, which product is the most profitable? DM: If a user buy A, what else would the user buy? 3

Contact Information Data Warehousing Teacher: Gao Cong Email: gaocong@cs.aau.dk Data Mining Teacher: Thomas D. Nielsen and Manfred Jaeger Email: tdn@cs.aau.dk, jaeger@cs.aau.dk Course homepage: http://www.cs.aau.dk/~tdn/itev/ Lecture slides, mini-project, Seminar dates Nov 1st, 2008 Data Warehousing Nov 22th, 2008 Data Warehousing, Data Mining December 6th, 2008 Data Mining 4

Literature for Data Warehousing No textbook Books (selected pages available in the class) The Data Warehouse Lifecycle Toolkit, Kimball et. al., Wiley 2008 Fundamentals of Data Warehousing, Jarke et. al., Springer Verlag 2003 Additional references/articles: To be given 5

Today s Agenda 09.00 10.00 10.00 11.00 11.00 12.00 12.00 12.45 12.45 13.45 13.45 15.00 15.00 15.45 15.45 16.00 Introduction (Mini-Project, and Exam) Multidimensional Model Multidimensional Model Lunch hour Multidimensional Model + Demo ETL ETL +Demo+Project Summary of the day 6

Overview Data Warehouse (DW) introduction DW Topics Multidimensional modeling Extract, Transformation and Load (ETL) Data Warehousing architecture Performance optimization 7

What is Business Intelligence (BI)? BI systems help the users make the right decisions, based on available data Combination of technologies Data Warehousing (DW) On-Line Analytical Processing (OLAP) Data Mining (DM) Customer Relationship Management (CRM) The Web makes BI more necessary Customers do not appear physically in the store Customers can change to other stores more easily 8

Definition of a DW R. Kimball s definition of a DW: A data warehouse is a system that extracts, cleans, conforms and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making Data Warehouse is the foundation for business intelligence, value add analystics Inmon s Definition of a Data Warehouse Subject oriented (versus function oriented) Integrated (logically and physically) Time variant (data can always be related to time) Stable (data not deleted) Supporting management decisions (different organization) A good DW is a prerequisite for successful BI Ad-hoc analysis and reports We will cover this soon Data mining: discovery of hidden patterns and trends 9

Subject-Oriented Data Collections Classical operation systems are organized around the applications of the company. Each type of company has its own unique set of subjects For an insurance company, For a manufacturer, the major subject areas might be product, order, vendor, bill of material, and raw goods. For a retailer, the major subject areas may be product, SKU, sale, vendor, and so forth. 10

Integrated Data Collections Integration is the important. Data is fed from multiple disparate sources into the data warehouse. As the data is fed it is converted, reformatted,. The result is that data once it resides in the data warehouse has a single physical corporate image. 11

Stable: Non-volatile Data Collections Data is updated in the operational environment as a regular matter of course Data warehouse data is loaded (usually en masse) and accessed, but it is often updated when data in the data warehouse is loaded, it is loaded in a snapshot, static format. When subsequent changes occur, a new snapshot record is written. 12

Time-variant Data Collections Time variancy implies that every unit of data in the data warehouse is accurate as of some one moment in time. a record is time stamped. a record has a date of transaction. there is some form of time marking to show the moment in time during which the record is accurate. the data warehouse contains much more history than any other environment. A 60-to-90-day time a 5-to-10-year time 13

DW Architecture Data as Materialized Views Existing databases and systems (OLTP) Appl. DB New databases and systems (OLAP) DM OLAP Appl. DB DM Data mining Appl. DB Trans. DW Appl. Appl. DB DB (Global) Data Warehouse (Local) Data Marts DM Visualization Analogy: (data) suppliers warehouse (data) consumers 14

Business Dimensional Lifecycle Technical Architecture Design Product Selection& Installation Project Planning Business Requirements Definition Dimensional Modeling Physical Design Data Staging Design & Development Deployment Maintenance & Growth End-User Application Specification End-User Application Development Project Management 15

Queries Hard/Infeasible for OLTP Business analysis Which public holiday we have the largest sales? Which week we have the largest sales? Does the sales of dairy products increase over time? Difficult to represent these queries by using SQL second query: extract the week value using a function But the user has to learn many transformation functions third query: use a special table to store IDs of all dairy products, in advance We have many other product types as well The need of multidimensional modeling 16

Multidimensional Modeling Example: sales of supermarkets Facts and measures Each sales record is a fact, and its sales value is a measure Dimensions Each sales record is associated with its values of Product, Store, Time Correlated attributes grouped into the same dimension easier for analysis tasks Product Type Category Store City County Day Month Year Sales Top Beer Beverage Trøjborg Århus DK 25 Maj 1997 5.75 Product Store Time 17

Multidimensional Modeling How do we model the Time dimension? A tree structure, with multiple levels Attributes, e.g., holiday, event Week T tid day week month year work day Year 1 1 1 1 2008 No Day Month 2 2 1 1 2008 Yes Advantage of this model? Easy for query (more about this later) Disadvantage? Data redundancy (controlled redundancy is acceptable) 18

OLTP vs. OLAP OLTP OLAP Target operational needs business analysis Data small, operational data large, historical data Model normalized denormalized/ multidimensional Query language SQL not unified Queries small large Updates frequent and small infrequent and batch Transactional recovery necessary not necessary Optimized for update operations query operations 19

Quick Review: Normalized Database Customer ID Product Category Price Date 3301 Beer Beverage 6.00 02-02-2008 3301 Rice Cereal 4.00 02-02-2008 3302 Beer Beverage 6.00 05-02-2008 3303 Wheat Cereal 5.00 07-02-2008 Customer ID ProductID Date 3301 013 02-02-2008 3301 052 02-02-2008 3302 013 05-02-2008 3303 067 07-02-2008 ProductID Product Category Price 013 Beer Beverage 6.00 052 Rice Cereal 4.00 067 Wheat Cereal 5.00 Normalized database avoids Redundant data Modification anomalies How to get the original table? (join them) No redundancy in OLTP, controlled redundancy in OLAP 20

OLAP Data Cube Data cube Useful data analysis tool in DW Generalized GROUP BY queries Aggregate facts based on chosen dimensions Product, store, time dimensions Sales measure of sale facts Why data cube? Good for visualization (i.e., text results hard to understand) Multidimensional, intuitive Support interactive OLAP operations How is it different from a spreadsheet? 21

On-Line Analytical Processing (OLAP) 102 On-Line Analytical Processing Interactive analysis Explorative discovery Fast response times required OLAP operations/queries Aggregation, e.g., SUM Starting level, (Year, City) Roll Up: Less detail Drill Down: More detail Slice/Dice: Selection, Year=2000 250 All Time 9 10 11 15 22

Extract, Transform, Load (ETL) Getting multidimensional data into the DW Problems ETL Data from different sources Data with different formats Handling of missing data and erroneous data Extract Transformations / cleaning Load The most time-consuming process in DW development 80% of development time spent on ETL 24

Data s Way To DW Extraction Extract from many heterogeneous systems Staging area Large, sequential bulk operations flat files best? Cleaning Data checked for missing parts and erroneous values Default values provided and out-of-range values marked Transformation Data transformed to decision-oriented format Data from several sources merged, optimize for querying Aggregation? Are individual business transactions needed in the DW? Loading into DW Large bulk loads rather than SQL INSERTs (Why?) Fast indexing (and pre-aggregation) required 25

Performance Optimization Performance optimization Fine tune performance for important queries Aggregates, indexing, other optimizations (environment, partitioning) Using aggregates How can aggregates improve performance? Choosing aggregates Which aggregates should we materialize? 26

Materialization Example Imagine 1 billion sales rows, 1000 products, 100 locations CREATE VIEW TotalSales (pid,locid,total) AS SELECT s.pid,s.locid,sum(s.sales) FROM Sales s GROUP BY s.pid,s.locid The materialized view has 100,000 rows Rewrite the query to use the view SELECT p.category,sum(s.sales) FROM Products p, Sales s WHERE p.pid=s.pid GROUP BY p.category can be rewritten to SELECT p.category,sum(t.total) FROM Products p, TotalSales t WHERE p.pid=t.pid GROUP BY p.category Query becomes 10,000 times faster! Sales tid pid locid sales 1 1 1 10 2 1 1 20 3 2 3 40 1 billion rows VIEW TotalSales pid locid sales 1 1 30 2 3 40 100,000 rows 27

Common DW Issues Metadata management Need to understand data = metadata needed Greater need in OLAP than in OLTP as raw data is used Need to know about: Data definitions, dataflow, transformations, versions, usage, security DW project management DW projects are large and different from ordinary SW projects 12-36 months and US$ 1+ million per project Data marts are smaller and safer (bottom up approach) Reasons for failure Lack of proper design methodologies High HW+SW cost Deployment problems (lack of training) Organizational change is hard (new processes, data ownership,..) Ethical issues (security, privacy, ) 28

Summary Data Warehouse (DW) introduction DW Topics Multidimensional modeling ETL Performance optimization BI provide many advantages to your organization A good DW is a prerequisite for BI Reading materials: Jarke chapter 1. Page 1-10 Optional: Kimball chapter 1. 29

DW Software DW part of the mini-project DW software Obtain from MSDNAA, and install MS SQL Server 2005 Standard Edition/ 2008 Enterprise MS Analysis Services, Integration Services, Reporting Services Checking after installation Open Component Services and check whether all four services above have been started Open SQL Server Management Studio and see whether you can connect to Database Engine Read the mini-project webpage for installation details 30

Study plan Reading materials and project each week Week Nov 1-7 (Data Warehousing) Week Nov 8-14 (Data Warehousing) Week Nov 15-21 (Data Warehousing) Week Nov 22-28 (Data Warehousing + Data Mining) Week Nov 29-Dec 5 (Data Mining) Week Dec 6-12 (Data Mining) Week Dec 13-19 (Data Mining) http://www.cs.aau.dk/~tdn/itev/index.php?id=46 We also remind you what to do and read every week by email. 31

Exam Exam Individual oral exam, for 30 minutes 15 minutes of DW questions 15 minutes of ML questions Danish 7-point grading scale Submitting project report is the prerequisite to take the exam Prepare Exam Mini-project report as the basis for discussion Exam also covers theoretical background in lectures and literature Question will be related to mini-projects 32

Outline of DW Mini-Project Selecting business process and data sources You are strongly encouraged to find problem and data from your work Dimensional data modeling Both real multidimensional and star/snowflake Consider advanced aspects such as SCDs Implementation of multidimensional DB Star/snowflake schema in SQL Server + cubes ETL: building an ETL flow to populate the DW The hardest part Building reports in Reporting Services Doing performance optimization Details given each week in study plan 33

Mini-Project (Data Warehousing Part) Topic and Data Business process and data from your work Fklub Mini-project Performed in groups of ~3 persons Documented in report of 10-20 pages Firm Deadline: January 9, 2009, at 12.00 (Soft Deadline: Dec 14) The mini project should be sent by email to Lene Even, even@cs.aau.dk 34

Project Supervision Lunch time and break in seminars Emails Each week, reminding what to read/do Phones One hour group meeting using skype each week can be arranged if requested For all students who want to be in (not just for a single group) 35

Demonstration on SQL server To be put on the Web 36

Today s Agenda 09.00 10.00 10.00 11.00 11.00 12.00 12.00 12.45 12.45 13.45 13.45 15.00 15.00 15.45 15.45 16.00 Introduction (Mini-Project, and Exam) Multidimensional Model Multidimensional Model Lunch hour Multidimensional Model + Demo ETL ETL +Demo+Project Summary of the day 37

Sharing your experience Why do you want to take the Data Warehousing course? Have you ever worked on something related to a Data Warehouse project? 38