Managing Data in Motion
Data Integration Best Practice Techniques and Technologies

April Reeve

ELSEVIER
AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO
Morgan Kaufmann is an imprint of Elsevier
Contents

Foreword xv
Acknowledgements xvii
Biography xix
Introduction xxi

PART 1 INTRODUCTION TO DATA INTEGRATION

Chapter 1 The Importance of Data Integration 3
  The natural complexity of data interfaces 3
  The rise of purchased vendor packages 4
  Key enablement of big data and virtualization 5

Chapter 2 What Is Data Integration? 7
  Data in motion 7
  Integrating into a common format: transforming data 7
  Migrating data from one system to another 8
  Moving data around the organization 9
  Pulling information from unstructured data 11
  Moving process to data 12

Chapter 3 Types and Complexity of Data Integration 15
  The differences and similarities in managing data in motion and persistent data 15
  Batch data integration 16
  Real-time data integration 16
  Big data integration 17
  Data virtualization 17

Chapter 4 The Process of Data Integration Development 19
  The data integration development life cycle 19
  Inclusion of business knowledge and expertise 20

PART 2 BATCH DATA INTEGRATION

Chapter 5 Introduction to Batch Data Integration 25
  What is batch data integration? 25
  Batch data integration life cycle 26

Chapter 6 Extract, Transform, and Load 29
  What is ETL? 29
  Profiling 30
  Extract 30
  Staging 31
  Access layers 32
  Transform 33
  Simple mapping 33
  Lookups 33
  Aggregation and normalization 33
  Calculation 34
  Load 34

Chapter 7 Data Warehousing 37
  What is data warehousing? 37
  Layers in an enterprise data warehouse architecture 38
  Operational application layer 38
  External data 38
  Data staging areas coming into a data warehouse 39
  Data warehouse data structure 40
  Staging from data warehouse to data mart or business intelligence 40
  Business intelligence layer 40
  Types of data to load in a data warehouse 41
  Master data in a data warehouse 41
  Balance and snapshot data in a data warehouse 42
  Transactional data in a data warehouse 43
  Events 43
  Reconciliation 43
  Interview with an expert: Krish Krishnan on data warehousing and data integration 44

Chapter 8 Data Conversion 51
  What is data conversion? 51
  Data conversion life cycle 51
  Data conversion analysis 52
  Best practice data loading 52
  Improving source data quality 53
  Mapping to target 53
  Configuration data 54
  Testing and dependencies 55
  Private data 55
  Proving 56
  Environments 56

Chapter 9 Data Archiving 59
  What is data archiving? 59
  Selecting data to archive 60
  Can the archived data be retrieved? 60
  Conforming data structures in the archiving environment 61
  Flexible data structures 61
  Interview with an expert: John Anderson on data archiving and data integration 62

Chapter 10 Batch Data Integration Architecture and Metadata 67
  What is batch data integration architecture? 67
  Profiling tool 67
  Modeling tool 68
  Metadata repository 69
  Data movement 69
  Transformation 70
  Scheduling 71
  Interview with an expert: Adrienne Tannenbaum on metadata and data integration 73

PART 3 REAL-TIME DATA INTEGRATION

Chapter 11 Introduction to Real-Time Data Integration 77
  Why real-time data integration? 77
  Why two sets of technologies? 78

Chapter 12 Data Integration Patterns 79
  Interaction patterns 79
  Loose coupling 79
  Hub and spoke 80
  Synchronous and asynchronous interaction 83
  Request and reply 83
  Publish and subscribe 84
  Two-phase commit 84
  Integrating interaction types 85

Chapter 13 Core Real-Time Data Integration Technologies 87
  Confusing terminology 87
  Enterprise service bus (ESB) 88
  Interview with an expert: David S. Linthicum on ESB and data integration 89
  Service-oriented architecture (SOA) 90
  Extensible markup language (XML) 92
  Interview with an expert: M. David Allen on XML and data integration 92
  Data replication and change data capture 95
  Enterprise application integration (EAI) 97
  Enterprise information integration (EII) 97

Chapter 14 Data Integration Modeling 99
  Canonical modeling 99
  Interview with an expert: Dagna Gaythorpe on canonical modeling and data integration 100
  Message modeling 103

Chapter 15 Master Data Management 105
  Introduction to master data management 105
  Reasons for a master data management solution 105
  Purchased packages and master data 106
  Reference data 107
  Masters and slaves 107
  External data 110
  Master data management functionality 110
  Types of master data management solutions: registry and data hub 111

Chapter 16 Data Warehousing with Real-Time Updates 113
  Corporate information factory 113
  Operational data store 113
  Master data moving to the data warehouse 116
  Interview with an expert: Krish Krishnan on real-time data warehousing updates 116

Chapter 17 Real-Time Data Integration Architecture and Metadata 119
  What is real-time data integration metadata? 119
  Modeling 120
  Profiling 120
  Metadata repository 120
  Enterprise service bus data transformation and orchestration 121
  Technical mediation 122
  Business content 122
  Data movement and middleware 123
  External interaction 123

PART 4 BIG, CLOUD, VIRTUAL DATA

Chapter 18 Introduction to Big Data Integration 127
  Data integration and unstructured data 127
  Big data, cloud data, and data virtualization 127

Chapter 19 Cloud Architecture and Data Integration 129
  Why is data integration important in the cloud? 129
  Public cloud 129
  Cloud security 130
  Cloud latency 131
  Cloud redundancy 132

Chapter 20 Data Virtualization 135
  A technology whose time has come 135
  Business uses of data virtualization 137
  Business intelligence solutions 137
  Integrating different types of data 137
  Quickly add or prototype adding data to a data warehouse 137
  Present physically disparate data together 138
  Leverage various data and models: triggering transactions 138
  Data virtualization architecture 138
  Sources and adapters 138
  Mappings and models and views 138
  Transformation and presentation 139

Chapter 21 Big Data Integration 141
  What is big data? 142
  Big data dimension: volume 142
  Massive parallel processing: moving process to data 142
  Hadoop and MapReduce 143
  Integrating with external data 144
  Visualization 144
  Big data dimension: variety 145
  Types of data 145
  Integrating different types of data 145
  Interview with an expert: William McKnight on Hadoop and data integration 145
  Big data dimension: velocity 146
  Streaming data 147
  Sensor and GPS data 147
  Social media data 147
  Traditional big data use cases 147
  More big data use cases 148
  Health care 148
  Logistics 148
  National security 149
  Leveraging the power of big data: real-time decision support 149
  Triggering action 149
  Speed of data retrieval from memory versus disk 150
  From data analytics to models, from streaming data to decisions 150
  Big data architecture 151
  Operational systems and data sources 151
  Intermediate data hubs 151
  Business intelligence tools 152
  Data virtualization server 153
  Batch and real-time data integration tools 153
  Analytic sandbox 153
  Risk response systems/recommendation engines 153
  Interview with an expert: John Haddad on Big Data and data integration 154

Chapter 22 Conclusion to Managing Data in Motion 157
  Data integration architecture 157
  Why data integration architecture? 157
  Data integration life cycle and expertise 158
  Security and privacy 158
  Data integration engines 160
  Operational continuity 160
  ETL engine 160
  Enterprise service bus 161
  Data virtualization server 161
  Data movement 162
  Data integration hubs 162
  Master data 163
  Data warehouse and operational data store 164
  Enterprise content management 164
  Data archive 164
  Metadata management 164
  Data discovery 165
  Data profiling 165
  Data modeling 165
  Data flow modeling 165
  Metadata repository 166
  The end 166

References 167
Index 169