Evolving Data Warehouse Architectures In the Age of Big Data Philip Russom April 15, 2014
TDWI would like to thank the following companies for sponsoring the 2014 TDWI Best Practices research report: Evolving Data Warehouse Architectures This presentation is based on the findings of that report. STAY TUNED At the end of this webinar, learn how to download a free copy of the report.
Agenda Definitions of Data Warehouse Architectures Drivers of Change Benefits & Barriers From EDWs to DWEs Role of Hadoop Analytics versus Reporting Trends among Architectural Components and Practices Top Ten Priorities PLEASE TWEET @prussom, #TDWI, #EDW, #DataWarehouse, #DataArchitecture, #Analytics, #Hadoop
Upcoming Points There isn't one, single architecture for all data warehouses (DWs). Each org is different. Expect multiple architectures. A well-designed DW has multiple architectural layers. Architectural approaches get mixed together into hybrids. A DW architecture interacts with architectures for data integration, reporting, analytics, operational applications, etc. The warehouse is still vital, even central, but it's evolving into a multiple platform environment. Architecture is more important than ever, but now as a logical design that's deployed over multiple physical platforms. Please don't ask me to draw a Reference Architecture for DWs. Given the current diversity, there isn't just one. But I'll describe many.
What do you think data warehouse architecture is? Select all that apply. Source: TDWI survey run in late 2013. Based on 1197 responses from 538 respondents. 2.2 responses per respondent, on average.
Logical versus Physical DW Architectures And Other Architectural Components that Coexist Today's Focus Logical architecture: mostly about data models and their relationships, with a focus on how these represent organizational entities and processes. Data standards: including standards for data modeling, data quality metrics, interfaces for data integration, programming style, format standards, etc. Physical architecture: mostly a plan for deploying data and data structures based on the workload and platform requirements of each. System architecture: a topology of hardware servers and software servers, plus the interfaces and networks that tie them together.
Drivers of Change Does your primary enterprise data warehouse have an architectural design? Yes 79% No 18% Don't know 3% Is the architecture of your data warehouse environment evolving? Yes moderately 54% Yes dramatically 22% No except with DW updates 22% Don't know 2% What technical issues or practices are driving change in your DW architecture? Advanced analytics 57% Increasing data volumes 56% Real-time operations 41% Business performance mgt 38% OLAP 30% Non-relational data 25% Virtualization of data 23% Cloud adoption 21% Streaming data 15% What business issues or practices are driving change in your DW architecture? Competitiveness 45% Fast-paced business processes 43% Compliance 29% Funding 29% Sponsorship 26% Reorganizations 25% Centralizing business control 30% Departmental power struggles 19% Mergers and acquisitions 18% Source: TDWI survey run in late 2013. Based on 538 respondents.
Benefits of Multi-Platform Architecture In priority order, based on survey responses All data analytics, in general (61%). Many new platforms are built for analytics: DW appliances, columnar databases, NoSQL databases, Hadoop. With a multi-platform portfolio, users can match an analytic workload to the best platform. A diverse platform portfolio can handle a diverse range of data types. This is key to embracing the unstructured and schema-free data types found in most big data. Enables broad data exploration and discovery (43%). A more diverse platform portfolio can aid a business. Additional platforms are key to addressing new business requirements (36%), especially data-oriented ones like analytics (61%), more numerous business insights (34%), and business optimization (30%). Handling data in real time usually requires an additional purpose-built system. Traditional relational databases and batch-oriented Hadoop systems were not built for real-time operations (33%), though many organizations need faster business processes (26%). Adding low-cost platforms to a DW environ makes big data more affordable. DW appliances, columnar RDBMSs, Hadoop & NoSQL all lower cost for data staging for data warehousing (20%) and data archiving (16%). Source: TDWI survey run in late 2013. Based on 538 respondents.
Barriers to Multi-Platform Architecture In priority order, based on survey responses Inadequate staffing or skills (47%) is the most prominent barrier. Immaturity with new data types and sources (23%) plus new technologies for Hadoop, event processing, and so on make teams unprepared for the complexity of multi-platform designs (25%). As usual, organizational and business issues should be settled first: data ownership and other politics (43%), a lack of business sponsorship (38%), a lack of a compelling business case (25%). A number of data management issues should be addressed: data integration complexity (36%), poor data quality (34%), lack of data architecture (29%), and data security, privacy, and governance issues (25%). As with any new IT initiative, proper funding is key. Account for the cost of acquiring multiple platforms (25%) and the cost of administering multiple platforms (27%). Source: TDWI survey run in late 2013. Based on 538 respondents.
WHY CAN'T A DATA WAREHOUSE DO EVERYTHING? Square Peg Workloads may not fit Round Hole DW Architectures Most data warehouses were designed and optimized for common deliverables and methods: standard reports, dashboards, performance mgt, online analytic processing (OLAP). This is a design and architectural decision made by users, not a failing of vendor platforms. Can/should all DW & analytic workloads run on your EDW? If your EDW can handle multiple mixed concurrent workloads with acceptable performance and without impeding other workloads, then run all workloads (including analytics) on the EDW, for simplicity's sake. If not, you may need additional data platforms for some workloads.
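The can/should test on this slide can be sketched as a small decision rule. This is only an illustration: the `WorkloadTestResult` record, its two boolean fields, and the sample workloads (including "streaming ingest") are hypothetical names introduced here, not something from the survey or the report.

```python
from dataclasses import dataclass


@dataclass
class WorkloadTestResult:
    """Hypothetical summary of running one workload on the EDW alongside the others."""
    name: str
    meets_performance_target: bool  # did this workload itself run fast enough?
    impedes_other_workloads: bool   # did it slow down the other workloads?


def workloads_to_offload(results):
    """Return the names of workloads that fail either half of the slide's test."""
    return [
        r.name
        for r in results
        if not r.meets_performance_target or r.impedes_other_workloads
    ]


# Illustrative (made-up) test results for a mixed concurrent workload run.
results = [
    WorkloadTestResult("standard reports", True, False),
    WorkloadTestResult("OLAP", True, False),
    WorkloadTestResult("advanced analytics", True, True),
    WorkloadTestResult("streaming ingest", False, False),
]

offload = workloads_to_offload(results)
if offload:
    print("Consider additional data platforms for:", ", ".join(offload))
else:
    print("Run all workloads on the EDW, for simplicity's sake")
```

An empty result means the EDW passes the test for every workload, which is the case where the slide recommends keeping everything on the EDW.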
Multi-Platform Data Warehouse Environments Many enterprise data warehouses (EDWs) are evolving into multi-platform data warehouse environments (DWEs). Users continue to add standalone data platforms to their warehouse tool and platform portfolio. The new platforms don't replace the core warehouse, because it is still the best platform for the data that goes into standard reports, dashboards, performance management, and OLAP. Instead, the new platforms complement the warehouse, because they are optimized for workloads that manage, process, and analyze new forms of big data, non-structured data, and real-time data.
Ramifications of a Multi-Platform DW Environ Workload-centric DW architecture Assumes that some workloads and their data are best offloaded from the core DW and taken to a platform more suited to them Workloads and data for advanced analytics (not OLAP), SQL-based analytics, unstructured data, massive big data, real time Distributed DW architecture This simply means that data and data structures (as defined in a logical architectural layer) are distributed across multiple physical data platforms Again, the logical layer is the big picture needed with many platforms A distributed DW architecture is both good and bad Good if it serves the unique requirements of multiple workloads and the users that depend on them Bad if platforms proliferate like the dreaded data marts of yore
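One way to picture a workload-centric, distributed architecture is a single logical catalog whose data sets are placed on physical platforms according to workload. The sketch below is a minimal illustration under stated assumptions: the placement rules, platform labels, and data set names are all hypothetical, and anything without a rule stays on the core DW.

```python
from collections import defaultdict

CORE_DW = "core DW"

# Assumed placement rules (not from the report): workload type -> physical platform.
PLACEMENT = {
    "advanced analytics": "analytic DBMS",
    "unstructured data": "Hadoop",
    "massive big data": "Hadoop",
    "real time": "real-time ODS",
}

# Logical layer: each entry is (data set, workload type). Names are hypothetical.
LOGICAL_LAYER = [
    ("sales facts", "standard reports"),
    ("finance cubes", "OLAP"),
    ("clickstream logs", "massive big data"),
    ("support emails", "unstructured data"),
    ("churn model inputs", "advanced analytics"),
    ("order events", "real time"),
]


def place(logical_layer, placement):
    """Distribute logical data sets across platforms; unmatched workloads stay on the core DW."""
    by_platform = defaultdict(list)
    for dataset, workload in logical_layer:
        by_platform[placement.get(workload, CORE_DW)].append(dataset)
    return dict(by_platform)


layout = place(LOGICAL_LAYER, PLACEMENT)
for platform, datasets in layout.items():
    print(f"{platform}: {', '.join(datasets)}")
```

Keeping the rules in one table makes the logical layer the single place where the big picture lives, and the number of keys in `layout` is a quick check on whether platforms are starting to proliferate.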
Growing Complexity in DW System Architectures The technology stack for DW, BI, analytics, and data integration has always been a multi-platform environment. What's new? The trend toward a portfolio of many data platforms has accelerated. [Figure: the DW platform portfolio growing over the passage of time. Components shown: data warehouse (star or snowflake schema, multidimensional data models), data staging areas, detailed source data, metrics for performance mgt, OLAP cubes, OLAP DBMSs, federated data marts, customer mart or ODS, real-time ODS, DW from a merger, columnar DBMS, MapReduce, complex event processing, analytic sandbox, data federation & virtualization, DW appliances, Hadoop Distributed File System, NoSQL database, streaming data tools.]
Which of the following best describes your extended data warehouse environment today? Pure, central, monolithic EDWs are relatively rare (15%, far left). Likewise, environments without a DW are equally rare (15%, far right). EDWs mix well in hybrid environments (68%, middle three). Responses, along a continuum from EDW (far left) to DWE (far right): Central monolithic EDW with no other data platforms 15% Central EDW with a few additional data platforms 37% Central EDW with many additional data platforms 16% Many workload-specific data platforms; EDW is present but not the center 15% No true EDW, but many workload-specific data platforms instead 15% Other 2% Source: TDWI survey run in late 2013. Based on 538 respondents.
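The figures on this slide can be checked with a few lines of arithmetic. This sketch assumes the five percentages read left to right along the EDW-to-DWE continuum in the order listed below; that pairing of labels to figures is inferred from the "far left", "far right", and "middle three" call-outs rather than stated outright.

```python
# Assumed left-to-right order of the chart; percentages are those quoted on the slide.
shares = [
    ("Central monolithic EDW with no other data platforms", 15),
    ("Central EDW with a few additional data platforms", 37),
    ("Central EDW with many additional data platforms", 16),
    ("Many workload-specific data platforms; EDW is present but not the center", 15),
    ("No true EDW, but many workload-specific data platforms instead", 15),
    ("Other", 2),
]

total = sum(pct for _, pct in shares)         # all responses
hybrid = sum(pct for _, pct in shares[1:4])   # the "middle three" hybrid environments

print(f"total = {total}%, hybrid = {hybrid}%")  # total = 100%, hybrid = 68%
```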
Which of the following best describes your organization's strategy for evolving your DW environment and its architecture, relative to big data? The largest share of survey respondents plan to extend an existing DW (41%, far left). Fewer will deploy new data platforms (25%). 29% have no strategy for DW evolution or addressing big data. Responses: Extend existing core DW to accommodate big data and other new requirements 41% Deploy new data management systems specifically for big data, analytics, real time, etc. 25% No strategy for DW architecture, though we need one 23% No strategy for DW architecture, because we don't need one 6% Other 5% Source: TDWI survey run in late 2013. Based on 538 respondents.
Hadoop is a Useful Addition to DW Architectures IT COMPLEMENTS AND EXTENDS DATA WAREHOUSES HDFS extends DW Architectures Managing multi-structured data Repository for detailed source data Processing big data for analytics Advanced forms of algorithmic analytics Data staging on steroids ELT push-down processing Inexpensive compared to average DW Hadoop also contributes outside DWs Imagine HDFS as shared infrastructure, similar to SAN & NAS Imagine a huge, live archive Imagine content mgt on steroids
Reporting and Analytics have Different Requirements for Data and DW Architecture Reporting is mostly about entities and facts you know well, represented by highly polished data that you know well. Carefully modeled and cleansed data with rich metadata and master data that's managed in a data warehouse. Most users designed their DWs first and foremost as a repository for reporting and similar practices such as OLAP, performance management, dashboards, and operational BI. Advanced analytics enables the discovery of new facts you didn't know, based on the exploration and analysis of data that's probably new to you. Unlike the pristine data that reports operate on, advanced analytics works best with detailed source data in its original (even messy) form, using discovery-oriented technologies, such as ad hoc queries, search, mining, statistics, predictive algorithms, and natural language processing.
Commitment & Growth Components relative to DW Architecture Some components are poised for aggressive adoption by users. Analytics is driving most adoption of new platforms & features. In-memory analytics (36%), analytic sandboxes (29%) Managing non-relational big data is also a pressing need for many organizations. HDFS (34%), open-source MapReduce (32%), vendor-built MapReduce (25%), NoSQL databases (24%) Real-time is just as important as analytics and big data. In-memory database (34%), in-database analytics (29%), solid-state drives (25%), real-time data (24%) Relational technology is more relevant than ever, but in updated forms. Columnar DBMSs (27%), DW appliances (23%)
Top Ten Priorities for DW Architecture These are recommendations, requirements, or rules that can guide you.
1. Recognize that successful data warehouse architectures have integrated logical and physical layers, plus other components.
2. Determine the business and technical drivers in your organization, and let those determine the evolution of your DW architecture.
3. Beware that the leading barrier to successful DW architecture is inadequate staffing and skills.
4. Address other barriers for sponsorship, funding, and improvements to data management infrastructure.
5. Turn on unused features in existing platforms.
6. Establish DW architectures and standards, but be open to exceptions.
7. Be open to hybrids and alternate standards.
8. Consider Hadoop as a DW complement.
9. Remember that analytics and reporting have different data and DW architectural requirements.
10. Don't expect the new stuff to replace the old stuff.
Download a free copy of the report that this Webinar is based on EVOLVING DATA WAREHOUSE ARCHITECTURES IN THE AGE OF BIG DATA Download the report in a PDF file at: tdwi.org/bpreports Feel free to distribute the PDF file of any TDWI Best Practices Report
Q & A Philip Russom Research Director for Data Mgt TDWI prussom@tdwi.org www.bit.ly/philiprussom @prussom on Twitter linkedin.com/in/philiprussom