Data warehouses

Agenda
- Why do I need a data warehouse?
- ETL systems
- Real-Time Data Warehousing
- Open problems
Why do I need a data warehouse?

Maybe you do not need one. If the volume of data is small and the data is static, a file can be enough. If we are going to work with multiple data sources, new data arrives continuously, and/or the volume of data is very high, a data warehouse will save us time in the long term.

OK, but wouldn't the company's database be enough? Usually not. Operational requirements differ greatly from analytical ones.

Case study

An international company wants to identify which products sell best and worst in each country where it operates, in order to refine its marketing campaigns in each country. Does it have all the information it needs in its databases?

Company database tables: EMPLOYEE, OFFICE, COUNTRY, SALE, DEPARTMENT, STORE, PRODUCT
No. The required information combines the company database (EMPLOYEE, OFFICE, COUNTRY, SALE, DEPARTMENT, STORE, PRODUCT) with other sources: the census database, geographical data, and climate data.

On the other hand, OLTP and OLAP systems have completely different purposes, which translates into different requirements and therefore a different design.

OLTP (On-Line Transactional Processing)
Must meet the operational requirements of the company. It supports the operation of the organization's applications.

OLAP (On-Line Analytical Processing)
Supports analytical processes that help in decision making. Typically, companies do not invest in these systems until all their operational requirements are satisfied.
OLTP systems                        Data warehouses
--------------------------------    ---------------------------------------------------------
Support operational requirements    Support analytical requirements
Current data                        Historical data
Dynamic data                        Static data (only increases)
Small response time                 Large response time (killer queries)
Serve many users                    Serve few users
Large size                          Larger size
Contain data of the organization    Contain data relating to the organization and other sources
SQL                                 SQL and custom tools
Read and write operations           Read operations
Transactional operations            Non-transactional operations

On average, the construction and initial load of a data warehouse account for 50% of the work of the data mining process. Do not underestimate the time this task needs. The task is also very important: if the data quality is low, the analysis will fail no matter how good the data mining technique is.
Data are often organized into "facts", or "instances":

- "Client X bought products P1, P2, and P3 at store T on February 20, 2008."
- "On May 25 the temperature was 78, the humidity was 75%, it was windy, and the game was not played."
- "The sepal is 5.1 cm long and 3.5 cm wide; the petal is 1.4 cm long and 0.2 cm wide."
The weather problem

The most famous data mining test set:
The iris data:

It is often necessary to transform the data to organize it in this way. How could we learn the relationship "sister" from this data?
One possible representation would be to list all possible pairs, indicating whether or not each pair fulfills the relationship:

Under the "closed world assumption" the table can be compressed: the closed world assumption considers all cases not listed to be negative.

However, from this table we cannot learn anything that allows us to predict whether or not two people are sisters. We lack kinship information.
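The compression under the closed world assumption can be sketched in a few lines of Python. This is only an illustration: the people and the "sister of" facts are invented here and are not the ones in the original table.

```python
from itertools import permutations

# Hypothetical people and facts, invented for illustration only.
# A pair (x, y) means "y is a sister of x".
people = ["Ann", "Beth", "Carl", "Dan"]
positive_pairs = {
    ("Ann", "Beth"),
    ("Beth", "Ann"),
    ("Carl", "Ann"),
    ("Carl", "Beth"),
}

# Full representation: every ordered pair is labelled yes (True) or no (False).
full_table = {pair: pair in positive_pairs for pair in permutations(people, 2)}

# Compressed representation: only the positive cases are stored.
compressed_table = sorted(positive_pairs)


def is_sister(first, second):
    """Closed world assumption: any pair that is not listed counts as negative."""
    return (first, second) in positive_pairs


print(len(full_table), "rows in the full table")
print(len(compressed_table), "rows in the compressed table")
```

Both tables carry the same information, but neither contains attributes (such as parents or gender) from which the relationship could be learned.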
The following table contains all the information we need (the knowledge we need), expressed as facts:

The examples we have seen so far are simple. For them, a plain text file would be enough (we do not need a data warehouse). For more complex examples, a data warehouse is advisable.

How do we organize the information for the data warehouse? Also as facts or instances, but with more complex attributes that define the various dimensions of the fact. Each dimension has an internal hierarchical structure that defines different levels of aggregation.
Hierarchy of different levels of aggregation:

Star schema (the arrows in the diagram read "it is aggregated on")

Fact table
- SALE: Amount, # of items, Client, Item, Store, Time

Location dimension
- STORE: City, Address, Info regarding the area
- CITY: State, Country, # of inhabitants, Climate
- COUNTRY: Country, # of inhabitants, Climate
- WORLD REGION: # of inhabitants, Climate

Time dimension
- HOUR: Date, morning/afternoon, Holiday/Work day
- DAY: Months, Week
- MONTH: Quarter
- QUARTER: Year
- YEAR

Item dimension
- ITEM: Wholesale, Price, Range
- WHOLESALE: Country, City, Valuation
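As a rough illustration of how a dimension hierarchy defines levels of aggregation, the sketch below rolls hypothetical SALE amounts up the time dimension (day, month, quarter, year). The facts and the helper names (`roll_up`, `month_of`, `quarter_of`, `year_of`) are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical SALE facts: (day of the sale, amount).
sales = [
    (date(2008, 2, 20), 100.0),
    (date(2008, 2, 21), 50.0),
    (date(2008, 5, 25), 70.0),
    (date(2009, 1, 3), 30.0),
]


def month_of(day):
    """A day is aggregated on its month, identified as (year, month)."""
    return (day.year, day.month)


def quarter_of(month):
    """A month is aggregated on its quarter, identified as (year, quarter)."""
    year, month_number = month
    return (year, (month_number - 1) // 3 + 1)


def year_of(quarter):
    """A quarter is aggregated on its year."""
    return quarter[0]


def roll_up(totals, parent_of):
    """Sum the totals of one level into the totals of the level above it."""
    result = defaultdict(float)
    for key, amount in totals.items():
        result[parent_of(key)] += amount
    return dict(result)


# Lowest level: total amount per day.
by_day = defaultdict(float)
for day, amount in sales:
    by_day[day] += amount
by_day = dict(by_day)

by_month = roll_up(by_day, month_of)
by_quarter = roll_up(by_month, quarter_of)
by_year = roll_up(by_quarter, year_of)

print(by_month)
print(by_quarter)
print(by_year)
```

The same `roll_up` step would work for the location or item dimension, given a function that maps each level to its parent level.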
Snowflake schema (the arrows in the diagram read "it is aggregated on")

Fact table
- SALE: Amount, # of items, Client, Item, Store, Time

Location dimension
- STORE: City, Address, Info regarding the area
- CITY: State, Country, # of inhabitants, Climate
- STATE: Country, # of inhabitants, Climate
- COUNTRY: Country, # of inhabitants, Climate
- WORLD REGION: # of inhabitants, Climate

Time dimension
- HOUR: Date, morning/afternoon, Holiday/Work day
- DAY: Months, Week
- WEEK
- MONTH: Quarter
- QUARTER: Year
- YEAR

Item dimension
- ITEM: Wholesale, Price, Range
- WHOLESALE: Country, City, Valuation
- RANGE: Category, Year

Is it possible to collect all the information into a single star or snowflake? No, more than one is usually needed. Each of these schemas is often called a data mart. Usually we will have one for every aspect of the organization that we want to explore.

[Figure: example data marts for Sales, Campaign, and Staff, with dimensions such as Time, Item, Supplier, Product, Location, Team, and Project]
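To make the schemas above more concrete, here is a minimal sketch of part of the location dimension as separate tables, using Python's built-in `sqlite3` module. The table and column names loosely follow the diagram, but the keys, the sample rows, and the query are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each level of the location dimension is its own table and points to the
# level it is aggregated on. The SALE fact table points to the lowest level.
conn.executescript("""
CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT,
                   country_id INTEGER REFERENCES country(id));
CREATE TABLE store (id INTEGER PRIMARY KEY, address TEXT,
                    city_id INTEGER REFERENCES city(id));
CREATE TABLE sale (id INTEGER PRIMARY KEY,
                   store_id INTEGER REFERENCES store(id),
                   amount REAL, items INTEGER);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO country VALUES (?, ?)",
                 [(1, "Spain"), (2, "France")])
conn.executemany("INSERT INTO city VALUES (?, ?, ?)",
                 [(1, "Madrid", 1), (2, "Lyon", 2)])
conn.executemany("INSERT INTO store VALUES (?, ?, ?)",
                 [(1, "Main St 1", 1), (2, "High St 2", 1), (3, "Rue A 3", 2)])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)",
                 [(1, 1, 100.0, 2), (2, 2, 50.0, 1),
                  (3, 3, 70.0, 3), (4, 1, 30.0, 1)])

# Analytical query: total amount and number of items sold per country.
rows = conn.execute("""
SELECT country.name, SUM(sale.amount), SUM(sale.items)
FROM sale
JOIN store ON sale.store_id = store.id
JOIN city ON store.city_id = city.id
JOIN country ON city.country_id = country.id
GROUP BY country.name
ORDER BY country.name
""").fetchall()

for name, amount, items in rows:
    print(name, amount, items)
```

Keeping each level in its own table avoids repeating country data for every store, at the cost of more joins in each analytical query.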
ETL systems

ETL (extraction, transformation, and load) systems have to be built by the data warehouse team. Their implementation is highly dependent on the application.
There are certain patterns in the pre-processing of data before data mining:
- Integration and cleansing of data.
- Transformation of attributes.
- Numerization and discretization.

These will be discussed in a separate section, because they should be applied whether or not we are using a data warehouse.
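The pre-processing patterns listed above can be sketched on a toy example. Everything here is hypothetical: the two sources, the attribute names, the alphabetical coding used for numerization, and the cut points (70 and 80) used for discretization are assumptions chosen only to show the shape of each step.

```python
# Hypothetical records from two sources, invented for illustration.
source_a = [
    {"id": 1, "outlook": "sunny", "temperature": 85},
    {"id": 2, "outlook": "rainy", "temperature": None},
]
source_b = [
    {"id": 2, "outlook": "rainy", "temperature": 65},
    {"id": 3, "outlook": "overcast", "temperature": 72},
]


def integrate(*sources):
    """Integration: merge records by id; later sources fill in missing values."""
    merged = {}
    for source in sources:
        for record in source:
            current = merged.setdefault(record["id"], {})
            for key, value in record.items():
                if current.get(key) is None:
                    current[key] = value
    return [merged[key] for key in sorted(merged)]


def cleanse(records):
    """Cleansing: drop records that still contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def numerize(records, attribute):
    """Numerization: replace a categorical attribute with an integer code."""
    values = sorted({r[attribute] for r in records})
    codes = {value: index for index, value in enumerate(values)}
    return [{**r, attribute: codes[r[attribute]]} for r in records], codes


def discretize(value, low=70, high=80):
    """Discretization: map a numeric value to a label using assumed cut points."""
    if value < low:
        return "cool"
    if value < high:
        return "mild"
    return "hot"


records = cleanse(integrate(source_a, source_b))
records, outlook_codes = numerize(records, "outlook")
records = [{**r, "temperature": discretize(r["temperature"])} for r in records]

for record in records:
    print(record)
```

In a real ETL system each step would depend on the application, as noted above; for instance, missing values might be imputed rather than dropped.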
Real-Time Data Warehousing

Real-time (active) data warehousing: the process of loading and providing data via a data warehouse as they become available.

Levels of data warehouses:
1. Reports what happened
2. Some analysis occurs
3. Provides prediction capabilities
4. Operationalization
5. Becomes capable of making events happen

(Figure source: Teradata Corporation)
The need for real-time data:
- A business often cannot afford to wait a whole day for its operational data to load into the data warehouse for analysis.
- Real-time loading provides incremental real-time data showing every state change and almost analogous patterns over time.
- Maintaining metadata in sync is possible.
- It reduces or eliminates the nightly batch processes.
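One common way to replace a nightly batch with incremental loads is to remember how far the previous load got and to copy only the newer rows. The sketch below assumes this "watermark" approach, which the slides do not prescribe; the `Warehouse` class and the row layout are hypothetical.

```python
# Hypothetical operational rows: (sequence number, product, amount).
operational_log = [(1, "P1", 10.0), (2, "P2", 5.0)]


class Warehouse:
    """Toy warehouse that loads only rows it has not seen before."""

    def __init__(self):
        self.rows = []
        self.last_loaded = 0  # watermark: highest sequence number loaded so far

    def load_increment(self, source):
        """Copy rows newer than the watermark and return how many were loaded."""
        new_rows = [row for row in source if row[0] > self.last_loaded]
        self.rows.extend(new_rows)
        if new_rows:
            self.last_loaded = max(row[0] for row in new_rows)
        return len(new_rows)


warehouse = Warehouse()
first = warehouse.load_increment(operational_log)

# A new operational event arrives and is loaded shortly afterwards,
# instead of waiting for a nightly batch.
operational_log.append((3, "P3", 7.5))
second = warehouse.load_increment(operational_log)

# Running the load again with no new data copies nothing.
third = warehouse.load_increment(operational_log)

print(first, second, third)
```

Because each load touches only the new rows, it can run as often as the business requires.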
Open problems

- S. Rizzi. Open problems in data warehousing: 8 years later. 5th International Workshop on Design and Management of Data Warehouses, 2003 (keynote).
- S. Rizzi, A. Abelló, J. Lechtenbörger, J. Trujillo. Research in data warehouse modeling and design: dead or alive? 9th ACM International Workshop on Data Warehousing and OLAP, pp. 3-10, 2006.
- J. Widom. Research problems in data warehousing. International Conference on Information and Knowledge Management (CIKM95), ACM Press, 1995.
- B. Dinter, C. Sapia, G. Höfling, M. Blaschka. OLAP market and research: initiating the cooperation. Journal of Computer Science and Information Management, 2(3), 1999.

Data warehousing conferences:
- International Workshop on Data Warehousing and OLAP (DOLAP)
- International Conference on Data Warehousing and Knowledge Discovery (DaWaK)
- International Workshop on Data Warehouse and. (DWDM)
Journals:
- International Journal of Data Warehousing and Mining
- Data and Knowledge Engineering
- Information Sciences

Hot topics:
- How to integrate data arising from multiple sources.
- Queries: language, optimization, processing.
- Consistency and quality.
- Data warehouse design: conceptual models, design methodologies.
- ETL: loading and recovery from failures during loading.
- Planning loads and refreshes.
- Maintenance of the data warehouse.
- Data cleaning and preprocessing.
- OLAP: division of tasks between the client and the server.