Enterprise Solutions: Data Warehouse & Business Intelligence (Chapter 8)
Learning Objectives
- Concepts of Data Warehouse
- Business Intelligence, Analytics & Big Data
- Tools for DWH & BI
Concepts of Data Warehouse
A Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process. That is:
- Subject-oriented: it is built to analyse a particular subject area, or a subject area within a domain. For example, "sales" within retail banking would be a particular subject area.
- Integrated: a Data Warehouse integrates data from multiple data sources across the organization.
- Time-variant: the data in a DWH is historical and hence time-variant.
- Non-volatile: because the data is historical, it is also permanent and should never be deleted or modified.
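The time-variant, non-volatile properties can be sketched as an append-only load pattern (a minimal Python illustration; the table and field names are hypothetical, not from any particular product):

```python
from datetime import date

# A non-volatile warehouse table: rows are appended with a load date,
# never updated or deleted, so every historical state is preserved.
warehouse_sales = []

def load_snapshot(snapshot_date, rows):
    """Append one day's sales snapshot; existing history is untouched."""
    for row in rows:
        warehouse_sales.append({"load_date": snapshot_date, **row})

load_snapshot(date(2024, 1, 1), [{"product": "A", "amount": 100}])
load_snapshot(date(2024, 1, 2), [{"product": "A", "amount": 120}])

# Both historical values remain queryable (time-variant).
history = [r["amount"] for r in warehouse_sales if r["product"] == "A"]
```

Contrast this with a transactional system, which would typically update the row in place and lose the earlier value.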
Data Warehouse Architecture
Enterprise Data Warehouses are extremely complex systems, and there is no single generic architecture. However, all DWHs have the following components or layers:
Data Warehouse Architecture
- Data Source Layer: represents the different data sources that feed data into the DWH. These sources may be of any type and format.
- Data Extraction Layer: where data gets pulled from the data sources into the DWH, with some data cleansing but without any major transformation.
- Staging Area: where the data sits prior to being cleansed and transformed to DWH or Data Mart quality.
- ETL Layer: where data gains its "intelligence", as logic is applied to transform the data from a transactional to an analytical nature.
- Data Storage Layer: the final storage for the cleansed and transformed data.
- Data Logic Layer: where business rules are stored.
- Data Presentation Layer: where reporting tools are used; this layer gives shape to the format in which information is presented to the user.
- Metadata Layer: metadata is data about the data stored in the DWH system.
- System Operations Layer: includes information on how the DWH system operates, such as ETL job status, access history and system performance.
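The flow from source through staging and transformation to storage can be sketched in miniature (an illustrative Python sketch; real DWHs use dedicated ETL tools, and all names here are made up):

```python
# Source layer: raw rows as they arrive, inconsistent case and string amounts.
source_rows = [
    {"cust": " alice ", "amt": "10.5"},
    {"cust": "BOB", "amt": "20"},
]

# Extraction layer -> staging area: pull raw copies in, no major transformation.
staging = [dict(r) for r in source_rows]

# ETL layer: apply logic to turn transactional rows into analytical form.
def transform(row):
    return {
        "customer": row["cust"].strip().title(),  # normalize names
        "amount": float(row["amt"]),              # type the measure
    }

# Storage layer: the cleansed, transformed rows ready for reporting.
warehouse = [transform(r) for r in staging]
```

Each list here stands in for one layer; in practice each layer is a separate system (files or queues, a staging database, an ETL engine, the warehouse database).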
Components of a Data Warehouse
- Data Integrity: refers to the validity of data, i.e. data that is consistent and correct.
- Conceptual Data Model: identifies the highest-level relationships between the different entities.
- Logical Data Model: describes the data in as much detail as possible, without regard to how it will be physically implemented in the database.
- Physical Data Model: represents how the model will be built in the database.
- Data Mart: data marts are small slices of the data warehouse; they have a more limited audience and/or data content.
- Fact Table: a table that stores quantitative information for analysis.
- Dimensional Model: dimensional data modelling is used for creating summary information, for example summarization of sales by day, week, month and year, or by region.
- Aggregation: a summary of detail data that is stored with, or referred to by, a cube.
- ETL: Extraction, Transformation and Loading is the process of extracting data from different types of systems, transforming it into a structure more appropriate for reporting and analysis, and finally loading it into the database and/or cube(s).
- OLAP: On-Line Analytical Processing provides end users a quick way of slicing and dicing the data.
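The relationship between a fact table, a dimension table and an aggregation can be sketched as follows (a hypothetical Python illustration of the dimensional model; the keys and values are invented for the example):

```python
from collections import defaultdict

# Dimension table: descriptive attributes used for roll-ups (day -> month).
dim_date = {
    "2024-01-05": {"month": "2024-01"},
    "2024-01-20": {"month": "2024-01"},
    "2024-02-03": {"month": "2024-02"},
}

# Fact table: quantitative measures, keyed into the dimension table.
fact_sales = [
    {"date_key": "2024-01-05", "amount": 100},
    {"date_key": "2024-01-20", "amount": 150},
    {"date_key": "2024-02-03", "amount": 200},
]

# Aggregation: summarize the facts by month via the date dimension,
# i.e. the "Sales by Month" summary a cube would precompute.
sales_by_month = defaultdict(int)
for fact in fact_sales:
    month = dim_date[fact["date_key"]]["month"]
    sales_by_month[month] += fact["amount"]
```

An OLAP tool effectively performs this kind of join-and-group operation interactively, letting the user slice by any dimension attribute.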
Data Cleansing
Data cleansing is a key process in DWH. It is the process of altering data in a given store to ensure that it is accurate and correct, for example by deleting old, incomplete or duplicated data. There are many ways to pursue data cleansing across software and data storage architectures, mostly focused on careful review of data sets and the protocols associated with a particular data storage technology. Data cleansing issues are similar to problems that archivists, database administrators and others face.
What is dirty data? Dirty data consists of anomalies that create wrong outputs; it is created when reality differs from what is captured and stored.
Typical steps in data cleansing:
- Data analysis
- Defining transformation workflows and mapping rules: depends on the number of data sources, their degree of heterogeneity and the dirtiness of the data
- Verification of the correctness and effectiveness of a transformation
- Transformation: execution of the transformation steps, for example by running the ETL
- Backflow of cleaned data: after (single-source) errors are removed, the cleaned data should also replace the dirty data in the original sources
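The cleansing steps above can be sketched in code (a minimal Python illustration; the rules, field names and sample records are assumed purely for the example):

```python
# Dirty input: a duplicate id and an incomplete record.
raw = [
    {"id": 1, "email": "A@X.COM"},
    {"id": 1, "email": "a@x.com"},   # duplicate of id 1
    {"id": 2, "email": None},        # incomplete record
    {"id": 3, "email": "b@y.com"},
]

def cleanse(rows):
    """Drop incomplete rows, normalize values, and remove duplicates."""
    seen, clean = set(), []
    for row in rows:
        if row["email"] is None:      # delete incomplete data
            continue
        normalized = {"id": row["id"], "email": row["email"].lower()}
        if normalized["id"] in seen:  # delete duplicated data
            continue
        seen.add(normalized["id"])
        clean.append(normalized)
    return clean

cleaned = cleanse(raw)
```

In the backflow step, `cleaned` would then replace the dirty rows in the original source system as well, not only in the warehouse.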
Business Intelligence & Big Data
BI is a set of sophisticated analytical techniques and tools used to identify, extract and analyse 'hard' business data, such as sales revenue by product or department, or the associated costs and incomes. Objectives of a BI initiative include: understanding a firm's internal and external strengths and weaknesses, understanding the relationships between different data for better decision making, detecting opportunities for innovation and cost reduction, and optimal deployment of resources. BI is accomplished through the use of specialized software and helps companies organize and analyse data, internal or from external sources, to make better decisions. BI is therefore not one piece of software; a BI stack generally includes data mining tools, operational dashboards, reporting tools, search and query tools, analytics processing software and content viewers. ISVs in the BI space include SAP, Oracle, IBM, Microsoft, Information Builders, MicroStrategy and SAS. Some of the smaller niche players are Actuate Corporation, Alteryx, Logi Analytics, QlikTech and Tableau.
Business Intelligence & Big Data
Big Data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. Big Data is defined in terms of Volume, Velocity, Variety, Complexity and Variability.
Why Big Data? The hopeful vision is that organizations will be able to take data from any source, harness the relevant data and analyse it to find answers to key issues such as product development strategy and cost rationalization. By combining big data with high-powered analytics, it may be possible, for example, to recalculate entire risk portfolios in minutes or to identify the root causes of failures.
DWH & BI Products
DWH & BI products are categorized as: Data Modelling, Data Mining, OLAP Tools, ETL Tools, BI Tools and Reporting Tools. The major companies in the DWH/BI space are IBM, SAS, Oracle, TIBCO, MicroStrategy, SAP, Information Builders, etc. There are also smaller and niche product companies. Some of the popular tools in the market are:
- Data Modelling: Erwin, Rational, PowerDesigner, Oracle Designer
- Data Mining: IBM Cognos, IBM SPSS, SAS Enterprise Miner, TIBCO, etc.
- OLAP: BO, Cognos, MicroStrategy, Hyperion
- ETL: Informatica, Cognos, BO, WebSphere
- BI: Cognos, NetWeaver, BO, Siebel
It is to be remembered that the DWH & BI space is replete with tools from a slew of companies, and most large organizations use multiple (seemingly) redundant tool-sets in their operations.