Data Warehousing and Data Mining Introduction



Similar documents
Advanced Data Management Technologies

Bussiness Intelligence and Data Warehouse. Tomas Bartos CIS 764, Kansas State University

Data Warehousing and Data Mining

Data Warehousing. Overview, Terminology, and Research Issues. Joachim Hammer. Joachim Hammer

Overview. DW Source Integration, Tools, and Architecture. End User Applications (EUA) EUA Concepts. DW Front End Tools. Source Integration

Enterprise Data Warehouse (EDW) UC Berkeley Peter Cava Manager Data Warehouse Services October 5, 2006

Lection 3-4 WAREHOUSING

MDM and Data Warehousing Complement Each Other

An Introduction to Data Warehousing. An organization manages information in two dominant forms: operational systems of

Data Warehouse: Introduction

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

Enterprise Solutions. Data Warehouse & Business Intelligence Chapter-8

Data Warehousing and Data Mining in Business Applications

Introduction to Datawarehousing

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

A Survey on Data Warehouse Architecture

Data Warehousing Systems: Foundations and Architectures

Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence

[callout: no organization can afford to deny itself the power of business intelligence ]

IST722 Data Warehousing

Data Warehousing and Decision Support. Torben Bach Pedersen Department of Computer Science Aalborg University

B.Sc (Computer Science) Database Management Systems UNIT-V

Business Intelligence: Effective Decision Making

Data Mart/Warehouse: Progress and Vision

CONCEPTUALIZING BUSINESS INTELLIGENCE ARCHITECTURE MOHAMMAD SHARIAT, Florida A&M University ROSCOE HIGHTOWER, JR., Florida A&M University

DATA WAREHOUSING AND OLAP TECHNOLOGY

Jagir Singh, Greeshma, P Singh University of Northern Virginia. Abstract

Turkish Journal of Engineering, Science and Technology

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 28

OLAP. Business Intelligence OLAP definition & application Multidimensional data representation

Data Mining for Successful Healthcare Organizations

Data Warehousing. Jens Teubner, TU Dortmund Winter 2015/16. Jens Teubner Data Warehousing Winter 2015/16 1

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

Master Data Management and Data Warehousing. Zahra Mansoori

Part 22. Data Warehousing

Data warehouse Architectures and processes

Whitepaper. Data Warehouse/BI Testing Offering YOUR SUCCESS IS OUR FOCUS. Published on: January 2009 Author: BIBA PRACTICE

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

CHAPTER SIX DATA. Business Intelligence The McGraw-Hill Companies, All Rights Reserved

Data Warehousing and OLAP

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

Data Warehouse Design

Understanding Data Warehousing. [by Alex Kriegel]

<Insert Picture Here> Extending Hyperion BI with the Oracle BI Server

8. Business Intelligence Reference Architectures and Patterns

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Outline Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications

Foundations of Business Intelligence: Databases and Information Management

OLAP Theory-English version

Fluency With Information Technology CSE100/IMT100

DATA WAREHOUSE CONCEPTS DATA WAREHOUSE DEFINITIONS

Week 13: Data Warehousing. Warehousing

An Overview of Data Warehousing, Data mining, OLAP and OLTP Technologies

Data Warehousing and OLAP Technology for Knowledge Discovery

Chapter 3 - Data Replication and Materialized Integration

A Comparative Study on Operational Database, Data Warehouse and Hadoop File System T.Jalaja 1, M.Shailaja 2

Data Warehousing: A Technology Review and Update Vernon Hoffner, Ph.D., CCP EntreSoft Resouces, Inc.

14. Data Warehousing & Data Mining

Indexing Techniques for Data Warehouses Queries. Abstract

Hybrid Support Systems: a Business Intelligence Approach

Week 3 lecture slides

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Data Warehousing & OLAP

Application Of Business Intelligence In Agriculture 2020 System to Improve Efficiency And Support Decision Making in Investments.

DSS based on Data Warehouse

Outline. Data Warehousing. What is a Warehouse? What is a Warehouse?

Dr. Osama E.Sheta Department of Mathematics (Computer Science) Faculty of Science, Zagazig University Zagazig, Elsharkia, Egypt

Dimensional Modeling for Data Warehouse

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING

Data warehouse and Business Intelligence Collateral

Introduction to Data Warehousing. Ms Swapnil Shrivastava

ETL and Advanced Modeling

Meta-data and Data Mart solutions for better understanding for data and information in E-government Monitoring

DATA MINING AND WAREHOUSING CONCEPTS

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April ISSN

SAS BI Course Content; Introduction to DWH / BI Concepts

ETL-EXTRACT, TRANSFORM & LOAD TESTING

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

Business Intelligence. 1. Introduction September, 2013.

Decision Support and Business Intelligence Systems. Chapter 1: Decision Support Systems and Business Intelligence

CONCEPTUAL FRAMEWORK OF BUSINESS INTELLIGENCE ANALYSIS IN ACADEMIC ENVIRONMENT USING BIRT

A Critical Review of Data Warehouse

A Design and implementation of a data warehouse for research administration universities

INTRODUCTION TO BUSINESS INTELLIGENCE What to consider implementing a Data Warehouse and Business Intelligence

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Importance or the Role of Data Warehousing and Data Mining in Business Applications

Moving Large Data at a Blinding Speed for Critical Business Intelligence. A competitive advantage

Introduction to Business Intelligence

Applied Business Intelligence. Iakovos Motakis, Ph.D. Director, DW & Decision Support Systems Intrasoft SA

A Knowledge Management Framework Using Business Intelligence Solutions

Welcome to online seminar on. Oracle Agile PLM BI. Presented by: Rapidflow Apps Inc. January, 2011

Data Warehouse Overview. Srini Rengarajan

Transcription:

Data Warehousing and Data Mining Introduction General introduction to DWDM Business intelligence OLTP vs. OLAP Data integration Methodological framework DW definition Acknowledgements: I am indebted to Michael Böhlen and Stefano Rizzi for providing me their slides, upon which these lecture notes are based. J. Gamper, Free University of Bolzano, DWDM 2012/13

The Big Picture of DWDM/1 What's important for researchers: Algorithms and Theory Systems J. Gamper, Free University of Bolzano, DWDM 2012/13 2

The Big Picture of DWDM/2 What's important for real world applications: Systems (and System Integration) Database Customer Algorithms J. Gamper, Free University of Bolzano, DWDM 2012/13 3

The Big Picture of DWDM/3 What's important for businesses: Algorithms $$$ Database Customer Systems J. Gamper, Free University of Bolzano, DWDM 2012/13 4

Computer Science and Decision Making An exponential increase in operational data has made computers the only tools suitable for providing data for decision-making performed by business managers. The massive use of techniques for analyzing enterprise data made information systems a key factor to achieve business goals. J. Gamper, Free University of Bolzano, DWDM 2012/13 5

Remarks about the DW part We learn how to design, build, and use a data warehouse. Relevance to the real world is an important guideline. Not only/mainly crisp algorithms, theorems, etc. We will look at a number of concrete and important case studies. A good way to prepare and learn the subject is to participate to lectures. J. Gamper, Free University of Bolzano, DWDM 2012/13 6

Content of the DW Part 1) Data warehousing: business intelligence, data integration, data warehouse, facts, dimensions, DW design 2) SQL OLAP extensions: analytical functions, crosstab, group by extensions, hierarchical cube, moving windows 3) Generalized multi-dimensional join: GMDJ, evaluation, subqueries, optimization rules, distributed evaluation 4) DW performance: pre-aggregation, lattice framework, view selection, view maintenance, bitmap indexing 5) ETL and advanced modeling: ETL process, handling changes in dimensions J. Gamper, Free University of Bolzano, DWDM 2012/13 7

What is Business Intelligence?/1 BI is a set of processes, tools, and technologies to transform business data into timely and accurate information to support decisional processes Data Warehousing (DW) On-Line Analytical Processing (OLAP) Data Mining (DM) and Data Visualization (VIS) Decision Analysis (what-if) Customer Relationship Management (CRM) BI systems are used by decision makers to get a comprehensive knowledge of the business and to define and support their business strategies. The goal is to enable data-based decisions aimed at gaining competitive advantage, improving operative performance, responding more quickly to changes, increasing profitability, and, in general, creating added value for the company. J. Gamper, Free University of Bolzano, DWDM 2012/13 8

What is Business Intelligence?/2 BI is the opposite of Artificial Intelligence (AI) AI systems make decisions for the users BI systems help users make the right decisions, based on the available data Many BI techniques have roots in AI, though. J. Gamper, Free University of Bolzano, DWDM 2012/13 9

The BI Pyramid decisions WHAT-IF ANALYSIS simulation models DATA MINING learning models knowledge INFORMATION EXPLORATION statistical techniques OLAP ANALYSIS data warehouse information OPERATIONAL APPLICATIONS data sources data J. Gamper, Free University of Bolzano, DWDM 2012/13 10

Example BI Queries Q1: On October 11, 2000, find the 5 top-selling products for each product subcategory that contributes more than 20% of the sales within its product category. Q2: As of March 15, 1995, determine shipping priority and potential gross revenue of the orders that have the 10 largest gross revenues among the orders that had not yet been shipped. Consider orders from the book market segment only. Regular database models and systems are not suitable for this type of queries. J. Gamper, Free University of Bolzano, DWDM 2012/13 11

BI is Crucial and Growing/1 Meta Group: DW alone = $15 Bio. in 2000 Palo Alto Management Group: BI = $113 Bio. in 2002 The Web made BI more necessary: Thus: Customers do not appear physically in the store Customers can change to other stores more easily You have to know your customers using data and BI. Web logs make is possible to analyze customer behavior in more detailed than before (what was not bought?) Combine web data with traditional customer data Wireless Internet adds further to this: Customers are always online Customer s position is known Combine position and knowledge about customer => very valuable J. Gamper, Free University of Bolzano, DWDM 2012/13 12

BI is Crucial and Growing/2 Gartner, 2009: Organizations will expect IT leaders in charge of BI and performance management initiatives to help transform and significantly improve their business Because of lack of information, processes, and tools, through 2012, more than 35% of the top 5,000 global companies will regularly fail to make insightful decisions about significant changes in their business and markets. By 2010, 20% of organizations will have an industryspecific analytic application delivered via software as a standard service of their business intelligence portfolio. In 2009, collaborative decision making will emerge as a new product category that combines social software with business intelligence platform capabilities. S. Chaudhuri, U. Dayal, V. Narasayya, CACM 2011: Today, it is difficult to find a successful enterprise that has not leveraged BI technology for their business. J. Gamper, Free University of Bolzano, DWDM 2012/13 13

BI: Key Problems 1) Complex and unusable models Many models are difficult to understand models do not focus on a single clear business purpose 2) Same data found in many different systems Example: customer data in many different systems The same concept is defined differently 3) Data is suited for operational systems Accounting, billing, etc. Do not support analysis across business functions 4) Data quality is bad Missing data, imprecise data, different use of systems 5) Data are volatile Data deleted in operational systems (6 months) Data change over time no historical information J. Gamper, Free University of Bolzano, DWDM 2012/13 14

BI: Solution A new analysis environment with a data warehouse at the core, where data is Integrated (logically and physically) Subject oriented (versus function oriented) Supporting management decisions (different organization) Stable (data is not deleted, several versions) Time variant (data can always be related to time) J. Gamper, Free University of Bolzano, DWDM 2012/13 15

Definition of a Data Warehouse/1 Barry Devlin, IBM Consultant A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context. J. Gamper, Free University of Bolzano, DWDM 2012/13 16

Definition of a Data Warehouse/2 W. H. Inmon, Building the Data Warehouse A data warehouse is a - subject-oriented, - integrated, - time-varying, - non-volatile collection of data that is used primarily in - organizational decision making. J. Gamper, Free University of Bolzano, DWDM 2012/13 17

DW Architecture/1 Basic elements of a Data Warehouse environment J. Gamper, Free University of Bolzano, DWDM 2012/13 18

DW Architecture/2 Source layer EXTRACTION, TRANSFORMATION, AND LOADING: Data marts Operational data External data Loading Reporting tools Operational Data Store ETL tools Data Warehouse OLAP tools Data mining tools What-if analysis tools Data staging Reconciled layer Data warehouse layer Analysis ETL processes extract data from sources, transform and clean them, and finally load them in the ODS and in the data warehouse OPERATIONAL DATA STORE: Operational data obtained after integrating and cleansing source data. As a result, those data are integrated, consistent, appropriate, current, and detailed DATA MART: A subset or an aggregation of the data stored to a primary data warehouse. It includes a set of information pieces relevant to a specific business area, corporate department, or category of users J. Gamper, Free University of Bolzano, DWDM 2012/13 19

Data Integration Problem: Different interfaces Different data representations Duplicated information Inconsistent information Integrated System Goal: Collect and combine information Provide an integrated view Provide a uniform user interface Support sharing of data J. Gamper, Free University of Bolzano, DWDM 2012/13 20

Query-Driven Data Integration Data is integrated on demand (lazy) PROS CONS Access to most up-to-date data (all source data directly available) No duplication of data Delay in query processing Slow (or currently unavailable) information sources Complex filtering and integration Inefficient and expensive for frequent queries Competes with local processing at sources Data loss at the sources (e.g., historical data) cannot be recovered Has not caught on in industry J. Gamper, Free University of Bolzano, DWDM 2012/13 21

Warehouse-Driven Data Integration Data is integrated in advance (eager) Data is stored in DW for querying and analysis PROS CONS High query performance Does not interfere with local processing at sources Assumes that data warehouse update is possible during downtime of local processing Complex queries are run at the data warehouse OLTP queries are run at the source systems Duplication of data The most current source data is not available Has caught on in industry J. Gamper, Free University of Bolzano, DWDM 2012/13 22

OLTP versus OLAP/1 On-Line Transaction Processing (OLTP) Many small queries on a small number of tuples from many tables that need to be joined Frequent updates The system is always available for both updates and reads Smaller data volume (few historical data) Complex data model (normalized) On-Line Analytical Processing (OLAP) Fewer, but bigger queries that typically need to scan a huge amount of records and doing some aggregation Frequent reads, in-frequent updates (daily, weekly) 2-phase operation: either reading or updating Larger data volumes (collection of historical data) Simple data model (multidimensional/de-normalized) J. Gamper, Free University of Bolzano, DWDM 2012/13 23

OLTP versus OLAP/2 A mix of analytical queries (OLAP) with transactional routine queries (OLTP) inevitably slows down the system, and this does not meet the needs of users of both types of queries. Separate OLAP from OLTP by creating a new repository that integrates data from various sources and then makes data available for analysis and evaluation aimed at decision-making processes J. Gamper, Free University of Bolzano, DWDM 2012/13 24

OLTP versus OLAP/3 Existing databases and systems (OLTP) New databases and systems (OLAP) DM OLAP DM Data mining Trans. DW Global Data Warehouse Data Marts DM Visualization J. Gamper, Free University of Bolzano, DWDM 2012/13 25

OLTP versus OLAP/4 Function-oriented Systems (OLTP) Subject-oriented Systems (OLAP) DM DM Trans. All subjects, integrated DW Selected subjects DM J. Gamper, Free University of Bolzano, DWDM 2012/13 26

OLTP Example: CS Dept/1 1, 2, 3, 4 hostgrpnms hostgrpnmid hostgrpnmn hostgrpnm_reason hstgrn_crtdato hstgrn_expdato uids uidid ugid_id_uid idcat_id_uid hostgrp_id_uid uidcrt_data uidexp_dato idcats idcatid idcat ugids ugidid ugid hostgrps hostgrpid fqdn_id_hgr hstgrnm_id_hgr users userid name_id_usr home disklimit userstat_id_usr uid_id_usr hostgrp_id_usr user_crtdato user_expdato userstats userstatid userstat names nameid osname name_crtdato name_expdato fqdns fqdnid fqdn fqdn_crtdato fqdn_expdato pgecos pgecoid person_id_pge user_id_pge personsgroups personsgroupid person_id_prs sgroup_id_prs semester_id_prs pstatus pstatusid prsnstatus persons personid name firstname homeaddress homeemail homephone person_crtdato pstatus_id_ps personwrkgroups personwrkgroupid person_id_prw wrkgroup_id_prw wrkgroups wrkgroupid wrkgroup persinfs persinfid persinf person_id_pinf persinf_crtdato persinf_expdato employees employeeid person_id_emp position_id_emp, initials empl_stdato empl_expdato uguests uguestid person_id_ugu ughost uguest_crtdato uguest_exp J. Gamper, Free University of Bolzano, DWDM 2012/13 27

OLTP Example: CS Dept/2 semesters semesterid semester sgroupsems sgroupsemid sgroup_id_sgs semester_id_sgs sgroups sgroupid sgroup sgroup_crtdato sgroup_expdato supervisor sgrplocs sgrplocid room_id_sgl sgr_id_sgl positioncats positioncatid positioncat emplocations emplocationid employee_id_elo room_id_elo rooms roomid roomname roomalias roomcat_id_rom roomcats roomcatid roomcat positions positionid position eng_position phonelocs phonelocid room_id_phl phonenr_id_phl employees employeeid person_id_emp position_id_emp, initials empl_stdato empl_expdato phones phoneid employee_id_pho phonenr_id_pho phonenrs phonenrid phonenr phonecat_id_phn owner_id_phn statusempls statusemplid statusempl persons personid name firstname homeaddress homeemail homephone person_crtdato pstatus_id_ps phonecats phonecatid phonecat owners ownerid owner J. Gamper, Free University of Bolzano, DWDM 2012/13 28

OLTP Example: OncoNet OncoNet is a system for the management of patients undergoing a cancer therapy > 200 tables Well-suited for daily management of patients But: statistical analysis are expensive takes up to 12 hours tables are locked for that time run queries over weekend A DW approach reduced the runtime of the same queries to a few seconds (BSc-thesis of A. Heinisch) J. Gamper, Free University of Bolzano, DWDM 2012/13 29

Methodological/Design Framework Building a DW is a very complex task, which requires an accurate planning aimed at devising satisfactory answers to organizational and architectural questions. A large number of organizations lack experience and skills that are required to meet the challenges involved in DW projects. The reports of DW project failures state that a major cause lies in the absence of a global view of the design process: in other terms, in the absence of a design methodology. Methodologies are created by closely studying similar experiences and minimizing the risks for failure by basing new approaches on a constructive analysis of the mistakes made previously. J. Gamper, Free University of Bolzano, DWDM 2012/13 30

Many Ways not to Do/1 Trans. Trans. DM DM App App Trans. DM App J. Gamper, Free University of Bolzano, DWDM 2012/13 31

Many Ways not to Do/2 Trans. DM D- DM Trans. DW DM DM J. Gamper, Free University of Bolzano, DWDM 2012/13 32

Top-down Approach Analyze global business needs, plan how to develop a data warehouse, design it, and implement it as a whole This procedure is promising: it is based on a global picture of the goal to achieve, and in principle it ensures consistent, well integrated data warehouses. High-cost estimates with long-term implementations discourage company managers from embarking on these kind of projects. Analyzing and integrating all relevant sources at the same time is a very difficult task, even because it is not very likely that they are all available and stable at the same time. It is extremely difficult to forecast the specific needs of every department involved in a project, which can result in the analysis process coming to a standstill. Since no working system is going to be delivered in the short term, users cannot check for this project to be useful, so they lose trust and interest in it. J. Gamper, Free University of Bolzano, DWDM 2012/13 33

Bottom-up Approach DWs are incrementally built and several data marts are iteratively created. Each data mart is based on a set of facts that are linked to a specific department and that can be interesting for a user group Leads to concrete results in a short time Does not require huge investments Enables designers to investigate one area at a time Gives managers a quick feedback about the actual benefits of the system being built Keeps the interest for the project constantly high May determine a partial vision of the business domain J. Gamper, Free University of Bolzano, DWDM 2012/13 34

Top-down vs. Bottom-up Approach DM DM Trans. DW Top-down: 1. Design of DW 2. Design of DMs In-between : 1. Design of DW for DM1 2. Design of DM2 and integration with DW 3. Design of DM3 and integration with DW 4.... J. Gamper, Free University of Bolzano, DWDM 2012/13 35 DM Bottom-up: 1. Design of DMs 2. Integration of DMs in DW 3. Maybe no physical DW

Top-down vs. Bottom-up Approach DM DM Trans. DW Top-down: 1. Design of DW 2. Design of DMs In-between : 1. Design of DW for DM1 2. Design of DM2 and integration with DW 3. Design of DM3 and integration with DW 4.... J. Gamper, Free University of Bolzano, DWDM 2012/13 36 DM Bottom-up: 1. Design of DMs 2. Integration of DMs in DW 3. Maybe no physical DW

The Life-cycle/1 Goal setting and planning Infrastructure design set system goals, borders, and size select an approach for design and implementation estimate costs and benefits analyze risks and expectations examine the skills of the working team Design and developm. of data marts J. Gamper, Free University of Bolzano, DWDM 2012/13 37

The Life-cycle/2 Goal setting and planning Infrastructure design Design and developm. of data marts analyze and compare the possible architectural solutions assess the available technologies and tools create a preliminary plan of the whole system J. Gamper, Free University of Bolzano, DWDM 2012/13 38

The Life-cycle/3 Goal setting and planning Infrastructure design Design and developm. of data marts Every iteration causes a new data mart and new applications to be created and progressively added to the DW system J. Gamper, Free University of Bolzano, DWDM 2012/13 39

Data mart design phases Source analysis and integration db administrator Requirement analysis business user Conceptual design Workload and data volume Logical design designer ETL design Physical design J. Gamper, Free University of Bolzano, DWDM 2012/13 40

The First Data Mart is the one playing the most strategic role for the enterprise should be a backbone for the whole DW should lean on available and consistent data sources DM5 DM2 DM1 DM3 DM4 Source 3 Source 1 Source 2 J. Gamper, Free University of Bolzano, DWDM 2012/13 41

Summary BI is well-recognized and is a combination of a number of techniques to support decision making. DW is at the core of BI that provides a complete, consistent, subject-oriented and time-varying collection of the data; allows to separte OLTP from OLAP. Applications that use the DW include OLAP, data mining, visualization BI can provide many advantages to an organization Creates added value by transforming data into information Provides comprehensive knowledge about your business A good DW is a prerequisite for BI But, a DW is a means rather than a goal it is only a success if it is heavily used Following a clear design methodology is important. J. Gamper, Free University of Bolzano, DWDM 2012/13 42