Data warehouses. Data Mining. Abraham Otero.




Data warehouses

Agenda
- Why do I need a data warehouse?
- ETL systems
- Real-Time Data Warehousing
- Open problems

Why do I need a data warehouse?

Maybe you do not need one. If the volume of data is small and the data is static, a file can be enough. If we are going to work with multiple data sources, new data arrives continuously, and/or the volume of data is very high, a data warehouse will save us time in the long term.

OK, but wouldn't the company's database be enough? Usually not. Operational requirements differ greatly from analytical ones.

Case study

An international company wants to identify which products are selling best and worst in each country where it operates, in order to refine its marketing campaigns within each country. Does it have all the information it needs in its databases?

Database: EMPLOYEE, OFFICE, COUNTRY, SALE, DEPARTMENT, STORE, PRODUCT

No. The required information combines the company database (EMPLOYEE, OFFICE, COUNTRY, SALE, DEPARTMENT, STORE, PRODUCT) with external sources: a census database, geographical data, and climate data.

In addition, OLTP and OLAP systems have completely different purposes, which translates into different requirements and therefore a different design.

OLTP (On-Line Transactional Processing)
- Must meet the operational requirements of the company.
- Supports the operation of the organization's applications.

OLAP (On-Line Analytical Processing)
- Supports analytical processes that help in decision making.
- Typically, companies do not invest in OLAP until all their operational requirements are satisfied.

OLTP systems                          Data warehouses
------------------------------------  -----------------------------------------------------------
Support operational requirements      Support analytical requirements
Current data                          Historical data
Dynamic data                          Static data (only increases)
Short response time                   Long response time (killer queries)
Serve many users                      Serve few users
Large size                            Larger size
Contain data of the organization      Contain data relating to the organization and other sources
SQL                                   SQL and custom tools
Read and write operations             Read operations
Transactional operations              Non-transactional operations

On average, building and initially loading a data warehouse accounts for 50% of the work in a data mining process. Do not underestimate the time this task needs. It is also a critical task: if the data quality is low, the analysis will fail no matter how good the data mining technique is.
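To make the contrast between the two workloads concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and values are hypothetical, and a single in-memory database stands in for both systems; in practice the operational database and the data warehouse would be separate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (id INTEGER PRIMARY KEY, country TEXT, product TEXT, amount REAL)"
)

# OLTP-style work: a short read/write transaction that touches a few rows.
# The values below are illustrative only.
with conn:  # commits on success, rolls back if an exception is raised
    conn.executemany(
        "INSERT INTO sale (country, product, amount) VALUES (?, ?, ?)",
        [("ES", "P1", 10.0), ("ES", "P2", 4.0), ("FR", "P1", 7.5), ("ES", "P1", 2.5)],
    )

# OLAP-style work: a read-only query that scans and aggregates the whole history.
totals = conn.execute(
    "SELECT country, product, SUM(amount) FROM sale "
    "GROUP BY country, product ORDER BY country, product"
).fetchall()

for row in totals:
    print(row)
```

The write touches individual rows inside a transaction, while the analytical query reads every row and aggregates them, which is why the two workloads are usually served by differently designed systems.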

Data are often organized into "facts", or "instances":
- "Client X bought products P1, P2, and P3 at store T on February 20, 2008."
- "On May 25 the temperature was 78, the humidity was 75%, it was windy, and the game was not played."
- "The sepal is 5.1 cm long and 3.5 cm wide; the petal is 1.4 cm long and 0.2 cm wide."
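As a sketch of what a fact or instance looks like in code, the three examples can be written as flat records with a fixed set of attributes. The class and field names are assumptions made for illustration; only the values come from the examples.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Purchase:
    client: str
    day: date
    products: tuple
    store: str


@dataclass(frozen=True)
class WeatherObservation:
    month: int          # the example gives no year, so only month and day are stored
    day: int
    temperature: float
    humidity_pct: float
    windy: bool
    played: bool


@dataclass(frozen=True)
class IrisMeasurement:
    sepal_length_cm: float
    sepal_width_cm: float
    petal_length_cm: float
    petal_width_cm: float


purchase = Purchase("X", date(2008, 2, 20), ("P1", "P2", "P3"), "T")
weather = WeatherObservation(5, 25, 78, 75, windy=True, played=False)
iris = IrisMeasurement(5.1, 3.5, 1.4, 0.2)

print(purchase)
print(weather)
print(iris)
```

Each record is one row of a table whose columns are the attributes, which is the shape most data mining techniques expect as input.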

The weather problem

The most famous data mining test set:

The iris data:

It is often necessary to transform the data to organize it in this way. How could we learn the relationship "sister" from this data?

One possible representation is to list all possible pairs of people, indicating whether or not each pair fulfills the relationship:

Under the "closed world assumption" the table can be compressed. The closed world assumption treats every case that is not listed as negative.

However, from this table we cannot learn anything that lets us predict whether or not two people are sisters. We lack kinship information.
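The compression obtained from the closed world assumption can be sketched as follows. The people and the positive pairs are hypothetical; the point is only that storing the positive cases is enough to rebuild the full table of pairs.

```python
from itertools import permutations

# Hypothetical people and relationships, for illustration only.
people = ["Ann", "Beth", "Carl", "Dana"]

# Compressed table: only positive cases are stored.
# A pair (first, second) means "second is a sister of first".
sisters = {("Ann", "Beth"), ("Beth", "Ann"), ("Carl", "Dana")}


def is_sister_of(first, second):
    """Closed world assumption: any pair not listed is a negative case."""
    return (first, second) in sisters


# Expanding back to the full table: one row per ordered pair of distinct people.
full_table = {pair: is_sister_of(*pair) for pair in permutations(people, 2)}

for pair, value in full_table.items():
    print(pair, "yes" if value else "no")
```

With four people the full table has 12 rows while the compressed one has 3, and neither carries the kinship information needed to predict the relationship for a new pair.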

The following table contains all the information we need, expressed as facts:

The knowledge we need:

The examples we have seen so far are easy. For them, a plain text file is enough (we do not need a data warehouse). For more complex examples, a data warehouse is advisable.

How do we organize the information for the data warehouse? Also as facts or instances, but with more complex attributes that define the various dimensions of the fact. Each dimension has an internal hierarchical structure that defines different levels of aggregation.
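A minimal sketch of aggregation along a dimension hierarchy, here the time dimension rolled up from days to months, quarters, and years. The sale facts and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sale facts: (day of the sale, amount). Illustrative values only.
facts = [
    (date(2008, 1, 15), 100.0),
    (date(2008, 2, 20), 50.0),
    (date(2008, 5, 25), 25.0),
    (date(2009, 1, 3), 10.0),
]

# Each level of the time hierarchy maps a day to the group it belongs to.
levels = {
    "month": lambda d: (d.year, d.month),
    "quarter": lambda d: (d.year, (d.month - 1) // 3 + 1),
    "year": lambda d: d.year,
}


def roll_up(rows, level):
    """Sum the amounts of all facts that fall in the same group at the given level."""
    totals = defaultdict(float)
    for day, amount in rows:
        totals[levels[level](day)] += amount
    return dict(totals)


by_quarter = roll_up(facts, "quarter")
by_year = roll_up(facts, "year")
print(by_quarter)
print(by_year)
```

The same facts answer questions at every level of the hierarchy; only the grouping key changes.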

Hierarchy of different levels of aggregation:

Star schema

[Figure: star schema. The arrows between tables read "It is aggregated on".]
Fact table SALE: Amount, # of items, Client, Item, Store, Time
Location dimension: STORE (City, Address, Info regarding the area); CITY (State, Country, # of inhabitants, Climate); COUNTRY (Country, # of inhabitants, Climate); WORLD REGION (# of inhabitants, Climate)
Time dimension: QUARTER (Year); MONTH (Quarter); YEAR; DAY (Months, Week); HOUR (Date, morning/afternoon, Holiday/Work day)
Item dimension: ITEM (Wholesale, Price, Range); WHOLESALE (Country, City, Valuation)
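A star schema of this kind can be sketched with Python's built-in sqlite3 module: one fact table that references one table per dimension. The table and column names below are simplified assumptions loosely modeled on the SALE, STORE, ITEM, and time tables, and the rows are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (hypothetical, simplified columns).
CREATE TABLE store (store_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT, price_range TEXT);
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, day TEXT, month INTEGER, year INTEGER);

-- Fact table: one row per sale, pointing at each dimension.
CREATE TABLE sale (
    store_id INTEGER REFERENCES store(store_id),
    item_id INTEGER REFERENCES item(item_id),
    time_id INTEGER REFERENCES time_dim(time_id),
    amount REAL,
    n_items INTEGER
);

INSERT INTO store VALUES (1, 'Madrid', 'Spain'), (2, 'Paris', 'France');
INSERT INTO item VALUES (1, 'P1', 'low'), (2, 'P2', 'high');
INSERT INTO time_dim VALUES (1, '2008-02-20', 2, 2008), (2, '2008-05-25', 5, 2008);
INSERT INTO sale VALUES
    (1, 1, 1, 20.0, 2), (1, 2, 1, 90.0, 1), (2, 1, 2, 10.0, 1), (1, 1, 2, 30.0, 3);
""")

# Aggregate the facts at the country level of the location dimension.
sales_by_country_item = conn.execute("""
    SELECT s.country, i.name, SUM(f.amount), SUM(f.n_items)
    FROM sale AS f
    JOIN store AS s ON s.store_id = f.store_id
    JOIN item AS i ON i.item_id = f.item_id
    GROUP BY s.country, i.name
    ORDER BY s.country, i.name
""").fetchall()

for row in sales_by_country_item:
    print(row)
```

In this sketch the city and country are stored directly in the store table; a snowflake design would move each hierarchy level into its own table and link them with further foreign keys.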

Snowflake schema

[Figure: snowflake schema. The arrows between tables read "It is aggregated on".]
Fact table SALE: Amount, # of items, Client, Item, Store, Time
Location dimension: STORE (City, Address, Info regarding the area); CITY (State, Country, # of inhabitants, Climate); STATE (Country, # of inhabitants, Climate); COUNTRY (Country, # of inhabitants, Climate); WORLD REGION (# of inhabitants, Climate)
Time dimension: QUARTER (Year); MONTH (Quarter); YEAR; WEEK; DAY (Months, Week); HOUR (Date, morning/afternoon, Holiday/Work day)
Item dimension: ITEM (Wholesale, Price, Range); WHOLESALE (Country, City, Valuation); RANGE (Category, Year)

Is it possible to collect all the information into a single star or snowflake? No, more than one is usually needed. Each of these schemas is often called a data mart. We will usually have one for every aspect of the organization that we want to explore.

[Figure: several data mart schemas. Labels recovered: Time, Item, Sales, Supplier, Product, Location, Campaign, Team, Staff, Project.]

ETL systems

ETL (extraction, transformation, and load) systems have to be built by the data warehouse team. Their implementation is highly dependent on the application.

Certain patterns recur in the pre-processing of data before data mining:
- Integration and cleansing of data.
- Transformation of attributes.
- Numerization and discretization.

These are discussed in a separate section because they should be used whether or not we are using a data warehouse.
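As a rough illustration of the last pre-processing pattern, the sketch below numerizes a categorical attribute with integer codes and discretizes a numeric attribute with equal-width bins. Both choices are assumptions made for the example (other encodings and binning strategies exist), and the input values are hypothetical.

```python
def numerize(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes))
    return [codes[value] for value in values], codes


def discretize(values, n_bins):
    """Equal-width binning: return a bin index in 0..n_bins-1 for each number."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values are equal, so one bin is enough
    # The maximum would land in bin n_bins, so it is clamped into the last bin.
    return [min(int((value - low) / width), n_bins - 1) for value in values]


# Hypothetical attribute values, for illustration only.
outlook_codes, mapping = numerize(["sunny", "rainy", "sunny", "overcast"])
temperature_bins = discretize([64, 70, 78, 85], 3)

print(outlook_codes, mapping)
print(temperature_bins)
```

Integer codes impose an artificial order on the categories, so whether this numerization is acceptable depends on the data mining technique that consumes the result.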

Real-Time Data Warehousing

Real-time (active) data warehousing is the process of loading and providing data via a data warehouse as they become available.

Levels of data warehouses:
1. Reports what happened
2. Some analysis occurs
3. Provides prediction capabilities
4. Operationalization
5. Becomes capable of making events happen

Source: Teradata Corporation

The need for real-time data
- A business often cannot afford to wait a whole day for its operational data to load into the data warehouse for analysis.
- Provides incremental real-time data showing every state change and almost analogous patterns over time.
- Maintaining metadata in sync is possible.
- Reduces or eliminates the nightly batch processes.
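One common way to replace a nightly batch with frequent small loads is to remember the timestamp of the last loaded record (a watermark) and load only newer records on each run. The sketch below assumes that technique; it is not taken from the slides, and the events, names, and in-memory "warehouse" list are hypothetical.

```python
from datetime import datetime

# Hypothetical operational events: (timestamp, payload).
source_events = [
    (datetime(2008, 2, 20, 9, 0), "sale#1"),
    (datetime(2008, 2, 20, 9, 5), "sale#2"),
    (datetime(2008, 2, 20, 9, 7), "sale#3"),
]

warehouse = []            # stands in for the warehouse fact table
watermark = datetime.min  # timestamp of the newest event loaded so far


def load_increment(events, loaded, last_seen):
    """Append the events newer than last_seen and return the new watermark."""
    new_events = sorted(event for event in events if event[0] > last_seen)
    loaded.extend(new_events)
    return new_events[-1][0] if new_events else last_seen


# First run: only two events exist in the source so far.
watermark = load_increment(source_events[:2], warehouse, watermark)
# Second run: a third event has arrived, and only that one is loaded.
watermark = load_increment(source_events, warehouse, watermark)

print(warehouse)
print(watermark)
```

Running the load every few minutes instead of once per night keeps the warehouse close to the operational state, at the cost of more frequent, smaller loads.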

Open problems

- S. Rizzi. Open problems in data warehousing: 8 years later. 5th International Workshop on Design and Management of Data Warehouses, 2003 (keynote).
- S. Rizzi, A. Abelló, J. Lechtenbörger, J. Trujillo. Research in data warehouse modeling and design: dead or alive? 9th ACM International Workshop on Data Warehousing and OLAP, pp. 3-10, 2006.
- J. Widom. Research problems in data warehousing. International Conference on Information and Knowledge Management (CIKM95), ACM Press, 1995.
- B. Dinter, C. Sapia, G. Höfling, M. Blaschka. OLAP market and research: initiating the cooperation. Journal of Computer Science and Information Management, 2(3), 1999.

Data warehousing conferences:
- International Workshop on Data Warehousing and OLAP (DOLAP)
- International Conference on Data Warehousing and Knowledge Discovery (DaWaK)
- International Workshop on Data Warehouse and. (DWDM)

Journals:
- International Journal of Data Warehousing and Mining
- Data and Knowledge Engineering
- Information Sciences

Hot topics:
- How to integrate data arising from multiple sources.
- Queries: language, optimization, processing.
- Consistency and quality.
- Data warehouse design: conceptual models, design methodologies.
- ETL: loading, and recovery from failures during loading.
- Scheduling loads and refreshes.
- Maintenance of the data warehouse.
- Data cleaning and preprocessing.
- OLAP: division of tasks between the client and the server.
