Integrate multiple, heterogeneous data sources. Data cleaning and data integration techniques are applied




Data Warehousing and OLAP
Lecture 2/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id)
Faculty of Computer Science, University of Indonesia

Objectives
- Motivation: Why data warehouse?
- What is a data warehouse?
- Why separate DW?
- Conceptual modeling of DW
- Data Mart
- Data Warehousing Architectures
- Data Warehouse Development
- Data Warehouse Vendors
- Real-time DW

Motivation: Why data warehouse?
- Construction of data warehouses (DW) involves data cleaning and data integration, an important preprocessing step for data mining (DM).
- DW provide OLAP for the interactive analysis of multidimensional data, which facilitates effective DM.
- Data mining functions can be integrated with OLAP operations to enhance interactive mining of knowledge.
- DW provide an effective platform for DM.
- While DWs are not a requirement for doing DM, they store massive amounts of data that can be used for DM. [DO]

What is a data warehouse? [JH]
- Defined in many different ways, but not rigorously.
- A decision support database that is maintained separately from the organization's operational database (ODB).
- Supports information processing by providing a solid platform of consolidated, historical data for analysis.
- "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
- Case Study 2: Continental Airlines flies high with its real-time data warehouse

What is a data warehouse? [ET]
- Data warehouse: a physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.
- Characteristics:
  - Subject oriented, Integrated, Time Variant, Non-volatile
  - Web-based, Relational/multidimensional, Client/server, Real-time
  - Includes metadata
- Data warehousing: the process of constructing and using data warehouses. Requires data integration, data cleaning, and data consolidation.

Subject Oriented
- Organized around major subjects, such as customer, product, sales.
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

Integrated
- Integrates multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.
- Data cleaning and data integration techniques are applied.
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
  - E.g., hotel price: currency, tax, breakfast covered, etc.
- When data is moved to the warehouse, it is converted.

Time Variant
- The time horizon for the data warehouse is significantly longer than that of operational systems.
  - Operational database: current value data.
  - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).
- Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
- The key of operational data, however, may or may not contain a time element.
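The hotel price example from the Integrated slide can be made concrete with a small sketch. It is only an illustration under stated assumptions: the two source layouts, the field names, the exchange rate and the tax rate are all hypothetical and do not come from the lecture. It shows records with different currencies, tax treatment and breakfast encodings being converted to one format, with a load date added as the explicit time element.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical conversion constants, fixed here only for illustration.
EUR_TO_USD = 1.10
TAX_RATE = 0.10


@dataclass(frozen=True)
class HotelPrice:
    """Unified warehouse record: one currency, tax included, boolean breakfast."""
    hotel: str
    load_date: date          # explicit time element in the warehouse key
    price_usd: float
    breakfast_included: bool


def from_source_a(row: dict, load_date: date) -> HotelPrice:
    """Source A (hypothetical): EUR prices, tax included, breakfast as 'Y'/'N'."""
    return HotelPrice(
        hotel=row["name"].strip().title(),
        load_date=load_date,
        price_usd=round(row["price"] * EUR_TO_USD, 2),
        breakfast_included=row["bf"] == "Y",
    )


def from_source_b(row: dict, load_date: date) -> HotelPrice:
    """Source B (hypothetical): USD prices before tax, breakfast as a boolean."""
    return HotelPrice(
        hotel=row["hotel_name"].strip().title(),
        load_date=load_date,
        price_usd=round(row["net_price"] * (1 + TAX_RATE), 2),
        breakfast_included=bool(row["breakfast"]),
    )


load_day = date(2024, 1, 15)
warehouse = [
    from_source_a({"name": "grand hotel", "price": 100.0, "bf": "Y"}, load_day),
    from_source_b({"hotel_name": "GRAND HOTEL ", "net_price": 100.0, "breakfast": False}, load_day),
]
for record in warehouse:
    print(record)
```

After conversion both records use the same hotel name, currency and tax treatment, so they can be compared directly.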

Non-volatile
- A physically separate store of data transformed from the operational environment.
- Operational update of data does not occur in the data warehouse environment.
  - Does not require transaction processing, recovery, and concurrency control mechanisms.
  - Requires only two operations in data accessing: initial loading of data and access of data.

Data Warehouse vs. Heterogeneous DBMS
- Traditional heterogeneous DB integration: build wrappers/mediators on top of multiple, heterogeneous databases. Ex: IBM Data Joiner, Informix DataBlade.
- Query-driven approach:
  - When a query is posed to a client site, a metadata dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved.
  - These queries are then mapped and sent to local query processors.
  - The results returned from the different sites are integrated into a global answer set.
- Requires complex information filtering and integration processes, and competes for resources with processing at local sources.
- Inefficient and potentially expensive for frequent queries, especially for queries requiring aggregations.

Data Warehouse vs. Heterogeneous DBMS (2)
- DW uses an update-driven approach: information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis.
- Unlike OLTP databases, DW do not contain the most current information.
- DW bring high performance to the integrated heterogeneous DB system since data are copied, preprocessed, integrated, annotated, summarized, and restructured into one data store.
- Query processing in DW does not interfere with the processing at local sources.
- DW can store and integrate historical information and support complex multidimensional queries.

DW vs. ODB
- Major task of ODB is OLTP: day-to-day operations such as purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
- DW serve data analysis and decision making: OLAP.
- Distinct features (OLTP vs. OLAP):
  - User and system orientation: customer vs. market
  - Data contents: current, detailed vs. historical, consolidated
  - Database design: ER + application vs. star + subject
  - View: current, local vs. evolutionary, integrated
  - Access patterns: update vs. read-only but complex queries
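The difference between the query-driven and update-driven approaches described above can be sketched in a few lines. This is a toy illustration, not a real mediator or warehouse: the two sources, their field names and the hit counter are hypothetical. It shows that the query-driven path goes back to every local source on each query, while the update-driven path reads the sources once at load time and then answers from the integrated store.

```python
# Two hypothetical local sources with different schemas.
SOURCE_JAKARTA = [{"item": "TV", "qty": 3}, {"item": "PC", "qty": 5}]
SOURCE_BANDUNG = [{"product": "TV", "units": 4}, {"product": "VCR", "units": 1}]

# Counts how often each local source is touched.
source_hits = {"jakarta": 0, "bandung": 0}


def read_jakarta():
    """Wrapper that maps the Jakarta schema to (product, units) pairs."""
    source_hits["jakarta"] += 1
    return [(r["item"], r["qty"]) for r in SOURCE_JAKARTA]


def read_bandung():
    """Wrapper that maps the Bandung schema to (product, units) pairs."""
    source_hits["bandung"] += 1
    return [(r["product"], r["units"]) for r in SOURCE_BANDUNG]


def total_units_query_driven(product):
    """Query-driven: consult every source at query time, then merge the results."""
    rows = read_jakarta() + read_bandung()
    return sum(units for name, units in rows if name == product)


def build_warehouse():
    """Update-driven: integrate and summarize in advance, once per load."""
    totals = {}
    for name, units in read_jakarta() + read_bandung():
        totals[name] = totals.get(name, 0) + units
    return totals


# Query-driven: three queries, each one hits both sources.
answers_query_driven = [total_units_query_driven("TV") for _ in range(3)]
hits_query_driven = dict(source_hits)

# Update-driven: one load, then three queries answered from the warehouse.
source_hits.update(jakarta=0, bandung=0)
warehouse = build_warehouse()
answers_update_driven = [warehouse["TV"] for _ in range(3)]
hits_update_driven = dict(source_hits)

print(hits_query_driven, hits_update_driven)
```

The trade-off is the one stated on the slide: the warehouse answers without disturbing the local sources, but its contents are only as current as the last load.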

OLTP vs OLAP

                    OLTP                                  OLAP
users               clerk, IT professional                knowledge worker
function            day-to-day operations                 decision support
DB design           application-oriented                  subject-oriented
data                current, up-to-date, detailed,        historical, summarized, multidimensional,
                    flat relational, isolated             integrated, consolidated
usage               repetitive                            ad hoc
access              read/write,                           lots of scans
                    index/hash on primary key
unit of work        short, simple transaction             complex query
# records accessed  tens                                  millions
# users             thousands                             hundreds
DB size             100MB-GB                              100GB-TB
metric              transaction throughput                query throughput, response

Why Separate DW?
- High performance for both systems:
  - DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery.
  - Warehouse tuned for OLAP: complex OLAP queries, computation of large groups of data at summarized levels, multidimensional view, consolidation.
- Processing OLAP queries in operational databases would degrade the performance of operational tasks.
- In ODB, concurrency control and recovery mechanisms (locking, logging) are required to ensure the consistency and robustness of transactions.
- OLAP needs only read-only access, so there is no need for concurrency control and recovery.

Why Separate DW? (2)
- Different functions and different data:
  - Missing data: decision support requires historical data which operational DBs do not typically maintain. So, data in ODB is usually far from complete for decision making.
  - Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources. ODB contain detailed raw data (transactions) which need to be consolidated before analysis.
  - Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled.
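To make the "unit of work" row of the comparison concrete, the following sketch contrasts the two access patterns on one small table. It assumes an in-memory SQLite database and a hypothetical `orders` table; a real deployment would keep the two workloads in separate systems, as the slides argue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, product TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "TV", 2022, 300.0),
        (2, "PC", 2022, 900.0),
        (3, "TV", 2023, 350.0),
        (4, "TV", 2023, 320.0),
    ],
)

# OLTP-style unit of work: a short read/write transaction that touches a
# single record through its primary key.
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (310.0, 1))

# OLAP-style unit of work: a read-only query that scans many records and
# returns consolidated (summarized) values.
summary = conn.execute(
    "SELECT product, year, SUM(amount) FROM orders "
    "GROUP BY product, year ORDER BY product, year"
).fetchall()
print(summary)
```

The update touches one row and needs transactional guarantees; the summary reads every row and needs none, which is the reason given above for tuning the two systems differently.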
Conceptual Modeling of DW
- Data Cube: see TSBD Lecture Notes on Visualization of Data Cubes.
- Modeling data warehouses: dimensions & measurements
  - Star schema: a single object (fact table) in the middle connected to a number of objects (dimension tables, one for each dimension).
  - Snowflake schema: a refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
  - Fact constellations: multiple fact tables share dimension tables. Also known as galaxy schema.
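A star schema as defined above can be sketched with an in-memory SQLite database. The table and column names (`dim_date`, `dim_store`, `dim_product`, `fact_sales`) and the sample rows are assumptions made for this illustration; they are only loosely modeled on a sales example and are not the lecture's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables, one per dimension (hypothetical names).
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- Fact table in the middle: one foreign key per dimension plus the measures.
CREATE TABLE fact_sales (
    date_id      INTEGER REFERENCES dim_date(date_id),
    store_id     INTEGER REFERENCES dim_store(store_id),
    product_id   INTEGER REFERENCES dim_product(product_id),
    unit_sales   INTEGER,
    dollar_sales REAL
);

INSERT INTO dim_date    VALUES (1, 5, 1, 2024), (2, 6, 2, 2024);
INSERT INTO dim_store   VALUES (1, 'Bandung', 'Jawa Barat', 'Indonesia'),
                               (2, 'Bogor',   'Jawa Barat', 'Indonesia');
INSERT INTO dim_product VALUES (1, 'TV', 'Electronics'), (2, 'PC', 'Electronics');
INSERT INTO fact_sales  VALUES (1, 1, 1, 2, 600.0), (1, 2, 1, 1, 300.0), (2, 1, 2, 1, 900.0);
""")

# A typical star join: the fact table joined to the dimensions it needs,
# then grouped by dimension attributes.
rows = conn.execute("""
    SELECT s.state, p.name, SUM(f.unit_sales), SUM(f.dollar_sales)
    FROM fact_sales AS f
    JOIN dim_store   AS s ON s.store_id = f.store_id
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY s.state, p.name
    ORDER BY p.name
""").fetchall()
print(rows)
```

In a snowflake variant, `dim_store` would hold only a city key, with state and country moved into their own normalized tables.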

Example of Star Schema
- Sales Fact Table: Date, Product, Store, Customer; measurements: unit_sales, dollar_sales, Yen_sales
- Date: Day, Month, Year
- Store: StoreID, City, State, Country, Region
- Product: ProductNo, ProdName, ProdDesc, Category, QOH
- Cust: CustId, CustName, CustCity, CustCountry
- Potential redundancy: Bandung and Bogor are both in Jawa Barat (West Java).

Snowflake Schema
- Sales Fact Table: Date, Product, Store, Customer; measurements: unit_sales, dollar_sales, Yen_sales
- Date: Day, Month
- Month: Month, Year
- Year: Year
- Store: StoreID, City
- City: City, State
- State: State, Country
- Country: Country, Region
- Product: ProductNo, ProdName, ProdDesc, Category, QOH
- Cust: CustId, CustName, CustCity, CustCountry

View of Warehouses and Hierarchies
[Figure panels: Importing data, Table browsing, Dimension creation, Dimension browsing, Cube building, Cube browsing]

Data Cube
[Figure: sales data cube over products (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum) and Country (U.S.A., Canada, Mexico, sum); one labeled cell is the total annual sales of TV in U.S.A.]
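The "sum" members in the data cube figure can be computed from base cells as in the following sketch. The sales figures are made up for illustration, and the function `build_cube` is a hypothetical helper, not part of any OLAP product; it simply adds each base cell into every combination of "keep the value" or "replace by sum" across the three dimensions.

```python
from itertools import product

# Hypothetical base cells: (product, quarter, country) -> sales.
sales = {
    ("TV", "1Qtr", "U.S.A."): 10,
    ("TV", "2Qtr", "U.S.A."): 20,
    ("TV", "1Qtr", "Canada"): 5,
    ("PC", "1Qtr", "U.S.A."): 7,
}

ALL = "sum"  # label for the aggregated member of a dimension


def build_cube(cells):
    """Aggregate every base cell into all 2^n combinations of (value | sum)."""
    cube = {}
    for key, value in cells.items():
        for mask in product((False, True), repeat=len(key)):
            agg_key = tuple(ALL if hide else k for k, hide in zip(key, mask))
            cube[agg_key] = cube.get(agg_key, 0) + value
    return cube


cube = build_cube(sales)
tv_usa_annual = cube[("TV", ALL, "U.S.A.")]   # total annual sales of TV in U.S.A.
grand_total = cube[(ALL, ALL, ALL)]
print(tv_usa_annual, grand_total)
```

Precomputing every combination is only feasible for tiny examples like this one; it is shown here to make the meaning of the "sum" cells explicit.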

Data Cube
- Visualization
- OLAP capabilities
- Interactive manipulation

Typical OLAP Operations
- Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
- Drill down (roll down): reverse of roll-up, from higher level summary to lower level summary or detailed data, or introducing new dimensions.
- Slice and dice: project and select.
- Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.
- Other operations:
  - Drill across: involving (across) more than one fact table.
  - Drill through: through the bottom level of the cube to its back-end relational tables.
- More info: www.knowledgecenters.org, www.olapreport.com, www.olapcouncil.org

Data Mart
- DW collects information about subjects that span the entire organization, such as customers, products, sales, assets, and personnel. Its scope is enterprise-wide.
- For DW, fact constellation schema is commonly used since it can model multiple, interrelated subjects.
- A data mart is a subset of a DW that focuses on a particular subject. Its scope is department-wide.
- Typically, a data mart consists of a single subject area (e.g. marketing, operations).
- For data marts, star or snowflake schemas are commonly used since both are geared towards modeling single subjects, although the star schema is more popular.

Data Mart (2)
- A data mart can be either dependent or independent.
- A dependent data mart is a subset that is created directly from the DW.
  - Consistent data model
  - Provides quality data
  - DW must be constructed first
  - Ensures that the user is viewing the same version of the data that is accessed by all other data warehouse users
- An independent data mart is a small warehouse designed for a department, and its source is not an EDW.
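Returning to the typical OLAP operations listed earlier, the sketch below expresses roll-up, slice, dice and pivot as plain functions over a list of fact rows. The rows, the column names and the helpers `aggregate` and `select` are hypothetical; an OLAP server would run these operations against a cube rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, country, quarter, product, units).
FACTS = [
    ("Bandung", "Indonesia", "1Qtr", "TV", 4),
    ("Bogor", "Indonesia", "1Qtr", "PC", 2),
    ("Bandung", "Indonesia", "2Qtr", "TV", 3),
    ("Toronto", "Canada", "1Qtr", "TV", 5),
]
COLUMNS = ("city", "country", "quarter", "product")


def aggregate(rows, dims):
    """Group rows by the given dimensions and sum the measure (last field)."""
    idx = [COLUMNS.index(d) for d in dims]
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[i] for i in idx)] += row[-1]
    return dict(out)


def select(rows, **conditions):
    """Keep rows whose value for each named dimension is in the allowed set."""
    return [
        r for r in rows
        if all(r[COLUMNS.index(d)] in allowed for d, allowed in conditions.items())
    ]


# Detailed (drill-down) level: city x quarter.
by_city = aggregate(FACTS, ("city", "quarter"))

# Roll-up: climb the location hierarchy from city to country.
by_country = aggregate(FACTS, ("country", "quarter"))

# Slice: fix one dimension to a single value.
slice_q1 = aggregate(select(FACTS, quarter={"1Qtr"}), ("country", "product"))

# Dice: select on two or more dimensions.
dice = aggregate(select(FACTS, quarter={"1Qtr"}, product={"TV"}), ("city",))

# Pivot: swap the two axes of the country x quarter view.
pivoted = {(quarter, country): units for (country, quarter), units in by_country.items()}

print(by_country)
```

Drill-down is the reverse direction of the roll-up shown here: going from `by_country` back to `by_city` requires the more detailed data to still be available.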

Data Warehousing Process Overview
- The major components of a data warehousing process:
  - Data sources: legacy systems, external data providers (e.g. BPS), OLTP, ERP systems
  - Data extraction
  - Data loading
  - Comprehensive database
  - Metadata
  - Middleware tools
[Figure: Data Warehousing Process Overview]

Data Warehousing Architectures
[Figures: Data Warehousing Architectures]

Data Integration and the ETL Process
- Various integration technologies:
  - Enterprise Application Integration (EAI)
    - A technology that provides a vehicle for pushing data from source systems into a data warehouse.
    - Integrates application functionality and is focused on sharing functionality across systems.
    - Traditionally done through APIs; nowadays through SOA (web services).
  - Enterprise Information Integration (EII)
    - An evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases.
    - A mechanism for pulling data from source systems to satisfy a request for information.

Data Integration and the ETL Process (2)
- ETL takes 60-70% of the time in a data-centric project.
- Extraction: reading data from one or more databases.
- Transformation: converting the extracted data from its previous form into the form it needs to be in so that it can be placed into a DW.
- Load: putting the data into the DW.
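The three ETL steps can be sketched end to end with the standard library. Everything specific here is an assumption for illustration: the CSV extract stands in for an operational source, the `fact_orders` table stands in for the DW, and the cleaning rules are arbitrary examples of a transformation.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would be read from one or
# more operational databases.
SOURCE_CSV = """order_id,customer,amount,currency
1, alice ,100.50,usd
2,BOB,80.00,USD
3,carol,,usd
"""


def extract(text):
    """Extraction: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: clean rows and convert them into the warehouse's form."""
    cleaned = []
    for row in rows:
        if not row["amount"]:  # drop rows that fail a basic quality check
            continue
        cleaned.append((
            int(row["order_id"]),
            row["customer"].strip().title(),
            float(row["amount"]),
            row["currency"].strip().upper(),
        ))
    return cleaned


def load(conn, rows):
    """Load: put the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, currency TEXT)"
    )
    with conn:
        conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
loaded = warehouse.execute("SELECT * FROM fact_orders ORDER BY order_id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the slide's breakdown and makes the transformation rules, usually the most time-consuming part, easy to test on their own.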

Data Warehouse Development
- Direct benefits:
  - Allows end users to perform extensive analysis in numerous ways
  - A consolidated view of corporate data (i.e., a single version of the truth)
  - Better and more timely information
  - Enhanced system performance: DW frees production processing because some operational system reporting requirements are moved to DSS
  - Simplification of data access

Data Warehouse Development (2)
- Some best practices for implementing a DW (Weir, 2002):
  - Project must fit with corporate strategy and business objectives
  - There must be complete buy-in to the project by executives, managers, and users
  - It is important to manage user expectations about the completed project
  - The data warehouse must be built incrementally
  - Build in adaptability
  - The project must be managed by both IT and business professionals
  - Develop a business/supplier relationship
  - Only load data that have been cleansed and are of a quality understood by the organization
  - Do not overlook training requirements
  - Be politically aware

Data Warehouse Vendors
- Computer Associates, DataMirror, Data Advantage Group, Dell Computer, Embarcadero Technologies, Business Objects, HP, Hummingbird, Hyperion, IBM, Informatica, Microsoft, Oracle, SAS, Siemens, Sybase, Teradata
- Please visit: Data Warehousing Institute (tdwi.com), DM Review (dmreview.com)

Data Warehouse Vendors (2)
- Six guidelines to consider when developing a vendor list:
  1. Financial strength
  2. ERP linkages
  3. Qualified consultants
  4. Market share
  5. Industry experience
  6. Established partnerships

Real-time DW
- Traditionally, a DW is updated on a weekly basis, which is unsuitable for some businesses.
- Real-time (active) data warehousing: the process of loading and providing data via a data warehouse as they become available.
- Levels of data warehouses:
  1. Reports what happened
  2. Some analysis occurs
  3. Provides prediction capabilities
  4. Operationalization
  5. Becomes capable of making events happen
[Figures: Real-time DW]

From DW to DM [JH]
- Three kinds of data warehouse applications:
  - Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs.
  - Analytical processing: multidimensional analysis of data warehouse data; supports basic OLAP operations such as slice-dice, drilling, and pivoting.
  - Data mining: knowledge discovery from hidden patterns; supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.

References
[JH] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[ET] Efraim Turban et al., Decision Support and Business Intelligence Systems, Pearson, 2007.
[DO] David Olson and Yong Shi, Introduction to Business Data Mining, McGraw-Hill, 2007.