TDWI RESEARCH
TDWI CHECKLIST REPORT
SATISFYING NEW REQUIREMENTS FOR DATA INTEGRATION
By David Loshin
Sponsored by Oracle
tdwi.org
JUNE 2012

TDWI CHECKLIST REPORT
SATISFYING NEW REQUIREMENTS FOR DATA INTEGRATION
By David Loshin

TABLE OF CONTENTS
FOREWORD
NUMBER ONE  Increase performance and efficiency.
NUMBER TWO  Integrate the cloud.
NUMBER THREE  Protect information in the integration layer.
NUMBER FOUR  Embed master data services.
NUMBER FIVE  Process big data and enterprise data.
NUMBER SIX  Satisfy real-time demands.
NUMBER SEVEN  Develop data quality and data governance policies and practices.
ABOUT OUR SPONSOR
ABOUT THE TDWI CHECKLIST REPORT SERIES
ABOUT THE AUTHOR
ABOUT TDWI RESEARCH

1201 Monster Road SW, Suite 250, Renton, WA 98057
T 425.277.9126  F 425.687.2842  E info@tdwi.org
tdwi.org

© 2012 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to info@tdwi.org. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.
FOREWORD

New trends in the industry are making data integration more important than ever. Essentially, the concept of data integration is being redefined; it is no longer limited to extracting data sets from internal sources and loading them into a data warehouse, but instead focuses on effectively delivering information to the right places within the appropriate time. Data integration goes beyond extract, transform, and load (ETL); data replication; and changed data capture, although these remain key components of the integration fabric.

It is also hard to ignore the big data revolution as organizations seek to adapt their information management environments to accommodate massive data volumes coming from a large variety of sources (especially internal ones!) at accelerating speeds. Some of the challenges include absorbing numerous data feeds (both internal and external), moving data to analytical appliances designed for big data analytics, coupling the results with existing data warehouse and business intelligence (BI) architectures, and delivering results to a variety of downstream knowledge workers and information consumers. When coupled with the migration of storage, infrastructure, and business applications to the cloud, these trends require moving large amounts of data to many different places in real time without allowing data replicas to fall out of sync with each other.

Mounds of structured data, unstructured data, big data, and advancements in cloud technology are imposing new requirements for data integration. This TDWI Checklist Report will explore some of the key drivers these new requirements are intended to address. Whether you are looking to support the performance needs of big data applications, filter concepts from unstructured data, monitor hundreds of data feeds for unexpected behavior, export data across enterprise boundaries, or provide real-time reporting and analysis, there is a rapidly expanding need for data integration competency that extends well beyond traditional ETL.

NUMBER ONE
INCREASE PERFORMANCE AND EFFICIENCY.

People are increasingly recognizing that repurposing transactional data for analytical purposes yields significant value for improving many aspects of the business. Part of this epiphany is attributable to data volume growth, improved data visualization methods, and the lowered barrier to entry for business intelligence and analytics. Yet some BI initiatives are at risk of succumbing to the perils of their own success: a growing user community, increased system demand, the need to make optimal use of high-performance platforms and programming models, and the need for simultaneous and rapid access to large amounts of data.

Failed expectations for increased performance are linked to the familiar bottlenecks of data-access latency, coupled with the general need for increased performance in information delivery for both traditional and emerging techniques:

1. Platforms: Analytical appliances provide in-memory computation, yet are throttled by the need to stream data in and flow results out.
2. Demands: Today's environments must satisfy a mixture of workloads, including sophisticated queries with many-way joins that require numerous data exchanges to evaluate the join conditions.
3. Execution: There is a need to optimize data access for query processing on appliances that support online transaction processing (OLTP) and data warehousing, especially when it comes to intermediate data transformations.
4. Big data analytics: Hadoop requires data integration and exchange support for the phase transitions between the map and reduce phases of the MapReduce programming model, especially as the sizes of the inputs grow and the complexity of the algorithms increases.

Each of these demands requires innovation in optimizing the use of both new and existing data integration techniques to reduce the data-access bottleneck. Some approaches include using knowledge of canned reports to optimize data distribution and layouts; employing alternate data alignments (such as column orientation, or realignment of column order to improve hardware cache utilization); aggressive use of data compression; use of high-speed network technology (such as InfiniBand); and data replication, changed data capture, and data federation techniques, all for improved performance.
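To make the map-to-reduce phase transition concrete, the following minimal Python sketch simulates the map, shuffle, and reduce steps on a single machine. It is illustrative only: the record format and keys are invented for the example, and no Hadoop APIs are used. On a real cluster, the shuffle step between map and reduce is where data is exchanged across the network, which is exactly where the integration and data-movement costs described above concentrate.

```python
from collections import defaultdict
from typing import Iterable, Tuple

# Illustrative record stream; in practice these would be split across many nodes.
records = [
    "error timeout db01",
    "ok login web01",
    "error timeout db02",
]

def map_phase(record: str) -> Iterable[Tuple[str, int]]:
    """Map: emit (key, value) pairs -- here, a count per status token."""
    status = record.split()[0]
    yield (status, 1)

def shuffle(pairs: Iterable[Tuple[str, int]]) -> dict:
    """Shuffle: group values by key. On a cluster this phase transition moves
    data across the network and is the main integration bottleneck."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key: str, values: list) -> Tuple[str, int]:
    """Reduce: aggregate the grouped values for each key."""
    return (key, sum(values))

mapped = [pair for record in records for pair in map_phase(record)]
grouped = shuffle(mapped)
results = [reduce_phase(k, v) for k, v in grouped.items()]
print(results)  # [('error', 2), ('ok', 1)]
```

As input sizes grow, the volume of intermediate pairs flowing through the shuffle grows with them, which is why compression, data placement, and high-speed interconnects matter so much in this style of processing.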
NUMBER TWO
INTEGRATE THE CLOUD.

The evolution toward simultaneously exploiting both on-premises and cloud-based environments implies a need for greater agility in data acquisition, integration, and absorption. Moving data beyond the corporate firewall goes beyond solely extracting data from existing internal legacy data sources. The challenge becomes more acute as you look at the different scenarios in which data instances and data sets cross the enterprise boundary. Some examples:

• Incorporating analytical results with cloud-based systems (such as SaaS-based sales, marketing, and contact management applications)
• Incorporating external data into an analytical platform (such as social network graph analysis that examines continuous data streams)
• Exposing reporting services to customers (such as account statements and analysis in the financial industry)
• Collaborative data analysis in which data sources are pooled from different enterprises (such as comparative effectiveness research in the healthcare industry)

These examples suggest a need to smooth out the differences in structure and semantics in a bidirectional manner without introducing additional latencies and access bottlenecks. Sharing data across enterprise boundaries has implications for data exchange and integration. Here are some key characteristics of data exchange and sharing capabilities in the big data world:

• The ability to seamlessly access a variety of large data sets inside the organization (such as transaction data from different operational systems, or data in an enterprise data warehouse)
• The ability to seamlessly access a variety of large data sets outside the organization (including licensed data sets, public domain data sets, data in the cloud, data feeds, and streamed data)
• The ability to harmonize your data to unify the meaning and structure of your data elements for all consumers
• The ability to share data with applications and environments hosted outside the organization (again, with a focus on interfacing with cloud-based systems and applications)

Extra-enterprise data integration must enable these types of exchanges by providing data standards, canonical models, exchange schemas, and high-performance methods for cloud-based data access. It must also provide integration as services, which allows for information availability with different levels of transparency and reduces the effort required for application development.

NUMBER THREE
PROTECT INFORMATION IN THE INTEGRATION LAYER.

The need for extra-enterprise and cloud-oriented data integration is clear. But any scenario in which sensitive data is destined to flow outside the corporate boundary is bound to raise eyebrows in the compliance department. Despite organizational motivations to migrate applications and data to cloud-based systems, ensuring the security of protected data (especially in financial and healthcare environments) is not only good practice; it is also the law. Most regulated industries have legal requirements for protecting personally identifiable data. And although many organizations feel secure that their IT security program effectively protects private data, the number of security breaches and the scale of the data that is exposed tell a different story. Whether that is a byproduct of deliberate hacking or social engineering, exposure of private data is on the rise.
In other words, even in the face of existing IT security frameworks, there are situations in which those barriers are breached, exposing supposedly secure data. Data integration is a particular point of potential security weakness. Any time data is exchanged (either within or outside the enterprise), there is a risk of exposure. This means that another new requirement for data integration is incorporating different aspects of information protection (especially in the cloud), including identity management, authentication, and authorization for any data exchange, whether it is delivering information to individuals or in machine-to-machine transfers. Because sharing data in the clear poses an exposure risk, another mitigation strategy is the use of encryption for any data exchange as part of the integration layer. This addresses concerns about a breach of the typical IT security layer: in the event of a security failure, the encrypted data is still unusable.
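As a rough illustration of encrypting payloads within the integration layer, the Python sketch below uses the widely available cryptography package's Fernet interface. The record fields and key handling are simplified assumptions for the example; in practice, keys would live in a key management service and the choice of cipher and envelope format would follow the organization's security standards.

```python
# Sketch: encrypting a payload before it leaves the integration layer.
# Assumes the third-party "cryptography" package (pip install cryptography);
# real deployments would manage keys in a vault/KMS rather than in code.
import json
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, retrieved from a key store
cipher = Fernet(key)

record = {"customer_id": 1001, "ssn": "xxx-xx-xxxx", "balance": 2500.00}

# Encrypt before the record crosses the enterprise boundary.
token = cipher.encrypt(json.dumps(record).encode("utf-8"))

# Even if the transport or a staging area is breached, the token is unusable
# without the key. The receiving integration endpoint decrypts it.
restored = json.loads(cipher.decrypt(token).decode("utf-8"))
assert restored == record
```

The point of the sketch is where the encryption happens: inside the exchange itself, so that a failure in the surrounding IT security layer does not expose usable data.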
NUMBER FOUR
EMBED MASTER DATA SERVICES.

Many organizations are putting significant emphasis on the use of master data repositories and master data management (MDM), which is intended to provide universal access to a unified presentation of information about uniquely identifiable entities. Typically, these entities are represented in different ways in multiple data sets across the environment. The typical approaches to MDM are simplistic: build a relatively static master data hub with simple data extraction and transformations prior to loading. But the data integration challenge becomes more complex for a number of reasons, such as:

• The increased number and granularity of types for master domains
• The greater emphasis on master domain usability
• Broad differences in size, format, structure, and meaning among the different (both static and dynamic) data sources
• Different requirements for use of the shared master information
• A growing number of domain-specific applications to align
• The need to resolve references to individual entities from a variety of semi-structured and unstructured data streams

Entity identities are embedded within unstructured data in different ways, and the variety of unstructured formats makes it difficult to create a single set of standardization and transformation rules that can be universally applied. Plus, applications using master data will need coherent views of master domains (that is, their data instance values are consistent, timely, and up to date) to ensure consistency of business process operation. Master data management will require increasingly sophisticated methods for seamless integration of data into master data sets, indicating that the data integration framework should directly embed master data services such as:

• Text analysis, automated tokenization, and semantic resolution
• Automated data standardization
• Entity extraction
• Data validation
• Integrated hierarchical mapping
• Searching, matching, and linking entity information in real time
• Real-time synchronization and coherence among the consumers of master data

NUMBER FIVE
PROCESS BIG DATA AND ENTERPRISE DATA.

Big data analytics applications are often intended to absorb many large structured and unstructured data sets and then calculate results, providing better predictive models and enhancing customer profiles. The analytical results add value when they are recombined with information persisted in existing data warehouses. This presents a data dependency between the two analytics styles:

• The big data analytics applications must be adept at parsing out and resolving entity concepts from unstructured data and linking those entities with recognized entities (such as customers or products) that are accessed from the data warehouse.
• The important results (such as customer profile enhancements) from the big data analytics applications must be appended to the persistent data warehouse models to support reporting, queries, dimensional analysis, and so on.

For example, a big data analytics program might scan many simultaneous social media streams, parse out individual identities and corresponding product preferences, access the individuals' profiles from the data warehouse, recalculate product affinity scores, then update the profiles in the warehouse. Leveraging this integrated analysis implies two data integration requirements:

1. Unified data integration tooling: The data integration tools and runtimes must be complete and provide unified support for enterprise data (including both structured and unstructured data), along with a variety of data access methods, such as text files, traditional SQL, Hadoop, and other NoSQL frameworks.
2. Integrated analytics: The data integration layer should embed capabilities for parsing the relevant content in unstructured data. This combines newer techniques such as text analytics and entity extraction/identification with techniques associated with ETL and data cleansing, such as parsing, standardization, and identity resolution.

More to the point: emerging big data technologies will need data integration tooling that enables integration of business analytics and data warehouses.
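To ground the social media example above, here is a deliberately naive Python sketch of that flow: extract an identity and product mentions from posts, resolve the identity against profiles standing in for the data warehouse, and update affinity scores. The handle normalization, product vocabulary, and scoring rule are invented for illustration; real pipelines would rely on proper entity extraction, identity resolution, and warehouse load services.

```python
# Sketch of the flow described above: parse identities and product mentions
# from a social stream, link them to warehouse profiles, and update affinity
# scores. Field names and matching rules are illustrative only.
import re

# Stand-in for customer profiles persisted in the data warehouse.
warehouse_profiles = {
    "jane.doe": {"customer_id": 42, "affinity": {"camera": 0.2}},
}

social_posts = [
    {"handle": "@Jane.Doe", "text": "Loving my new camera!"},
    {"handle": "@unknown_user", "text": "This camera is great"},
]

def extract_entities(post):
    """Very naive entity extraction: normalize the handle and pull product words."""
    identity = post["handle"].lstrip("@").lower()
    products = re.findall(r"\b(camera|phone|laptop)\b", post["text"].lower())
    return identity, products

for post in social_posts:
    identity, products = extract_entities(post)
    profile = warehouse_profiles.get(identity)   # identity resolution step
    if profile is None:
        continue                                 # unmatched mentions are skipped here
    for product in products:
        current = profile["affinity"].get(product, 0.0)
        profile["affinity"][product] = round(current + 0.1, 2)  # toy affinity update

print(warehouse_profiles)  # jane.doe's camera affinity rises to 0.3
```

Even in this toy form, the two requirements above are visible: one set of tooling has to read both the unstructured stream and the structured profile store, and the parsing/matching logic has to sit alongside conventional ETL-style transformation.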
NUMBER SIX
SATISFY REAL-TIME DEMANDS.

The need for real-time data integration emanates from two directions: the first is to reduce the latency between transaction events and the time those events can be incorporated into reports and analyses, and the second is to use real-time technology for continuous availability solutions (i.e., active-active) or live standby systems.

Broader adoption of business intelligence in a pervasive manner across the organization enables a wide community of decision makers to be informed with actionable knowledge. A side effect of this broad adoption is heightened expectations for availability of that actionable knowledge, especially in relation to business intelligence embedded within operational processes. The results of BI and analytics need to be integrated directly into production operational applications to address real-time process needs, and transaction information must be fed in real time to the analytical applications. In other words, there is a need for real-time data acquisition as well as real-time delivery of actionable information to a variety of channels and devices across different locations and networks.

The concept of real-time data integration seems like a natural fit for all aspects of pervasive business intelligence, and offers the opportunity for an amazing upside. One might even hypothesize that everything should be real-time enabled for all BI and data warehouses. Yet the logistics of real-time data synchronization can impose significant performance demands, especially in environments with many simultaneously executing production transaction and operational applications that are informed by different BI applications.

Satisfying real-time data demands puts the focus on adapting time-proven mechanisms for continuous availability to the world of information, including high-availability methods for systemic fault tolerance, live standby, synchronous replication, asynchronous replication, and changed data capture to maintain coherence across replicas. And let's not forget the big data integration scenario either. The desire to analyze high-volume, high-velocity data streams and push the results to the right consumers creates another intersection point that benefits from real-time data integration.

NUMBER SEVEN
DEVELOP DATA QUALITY AND DATA GOVERNANCE POLICIES AND PRACTICES.

Data reuse and repurposing creates a dilemma in attempting to verify the suitability, usability, and quality of data. Data governance is a set of policies and practices intended to institute control over data quality and usability, as well as standardize business terms, data element definitions, and conformance to defined business rules. Requiring data governance policies and protocols for data integration within and across different organizations is not strictly a product requirement; it is a systemic one in which observation of data policies must be supported within the data integration fabric.

The need to demonstrate how data complies with business policies can be satisfied by integrating data quality services within the data integration fabric. For example, there are various complexities of customer and party data when it comes to compliance with laws, especially regarding protection of private information, customer awareness (such as Know Your Customer in the financial sector), or identification of individuals on government watch lists.
The preferred method of implementation is using data quality services that can be tightly coupled with the data integration layer instead of relying on jury-rigged solutions cobbled together from different tools. With increased data exchange, sharing, and repurposing, introducing policies and practices will guide the definition of standards and controls. This will also provide a means for monitoring compliance with agreed-to standards, which can help reduce the uncertainty associated with data reuse and sharing among collaborative partners.

The first step involves recognizing the need for data governance within organizational boundaries and establishing the policies and procedures for ensuring data quality within the firewall. The more mature phase involves data governance that extends to collaborative communities (extending beyond enterprise boundaries), which requires sharing common metadata and documenting, tracking, and managing data lineage, policy management, and data quality service-level agreements. In addition, improved alignment with metadata management capabilities helps trace the introduction of data errors and enables impact analysis and scoping of changes to eliminate their root causes.
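As a small illustration of data quality services coupled to the integration layer, the Python sketch below applies a handful of declarative validation rules to incoming records and separates passing records from those flagged for remediation. The rule set and field names are assumptions made up for the example; in a governed environment, the rules would come from agreed-to standards and the violation counts would feed compliance monitoring.

```python
# Sketch of data quality rules embedded in an integration flow. The rules and
# field names are illustrative; a governance program would define them as
# agreed-to standards and monitor conformance over time.
import re

RULES = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "country":     lambda v: v in {"US", "CA", "GB"},           # value-domain rule
    "email":       lambda v: re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", v or "") is not None,
}

def validate(record):
    """Return the list of rule violations for one record."""
    return [field for field, rule in RULES.items() if not rule(record.get(field))]

incoming = [
    {"customer_id": 7, "country": "US", "email": "a@example.com"},
    {"customer_id": -1, "country": "XX", "email": "not-an-email"},
]

clean, rejected = [], []
for record in incoming:
    violations = validate(record)
    (clean if not violations else rejected).append((record, violations))

print(len(clean), "records pass;", len(rejected), "flagged for remediation")
```

Because the checks run inside the integration flow rather than in a separate tool, every exchange produces evidence of conformance (or non-conformance) that governance processes can track over time.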
ABOUT OUR SPONSOR

Oracle Data Integration provides a fully unified set of products for building, deploying, and managing data-centric architectures for operational and analytical data integration across the enterprise. Oracle's products combine to provide all the elements of data integration (real-time data movement, transformation, big data processing, data synchronization, data quality, data management, and data services) to ensure that information is timely, accurate, and consistent across complex systems. Oracle (NASDAQ: ORCL) is the world's most complete, open, and integrated business software and hardware systems company. Visit www.oracle.com.

ABOUT THE AUTHOR

David Loshin, president of Knowledge Integrity, Inc. (www.knowledge-integrity.com), is a recognized thought leader, TDWI instructor, and expert consultant in the areas of data management and business intelligence. David is a prolific author on business intelligence best practices, having written numerous books and papers on data management, including The Practitioner's Guide to Data Quality Improvement, with additional content provided at www.dataqualitybook.com. David is a frequent invited speaker at conferences, Web seminars, and sponsored Web sites and channels, including www.b-eye-network.com. His bestselling book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at www.mdmbook.com. David can be reached at loshin@knowledge-integrity.com.

ABOUT THE TDWI CHECKLIST REPORT SERIES

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.

ABOUT TDWI RESEARCH

TDWI Research provides research and advice for business intelligence and data warehousing professionals worldwide. TDWI Research focuses exclusively on BI/DW issues and teams up with industry thought leaders and practitioners to deliver both broad and deep understanding of the business and technical challenges surrounding the deployment and use of business intelligence and data warehousing solutions. TDWI Research offers in-depth research reports, commentary, inquiry services, and topical conferences, as well as strategic planning services to user and vendor organizations.