The Data Warehouse Challenge
Taming Data Chaos
Michael H. Brackett
WILEY COMPUTER PUBLISHING
John Wiley & Sons, Inc.
New York Chichester Brisbane Toronto Singapore
Contents
About the Author v
Foreword by William H. Inmon vii
Acknowledgments ix
Preface xi
Chapter 1 Data Crisis 1
  Information Demand 2
  Dynamic Environment 2
  Business Changes 3
  Business Information Demand 4
  Data Situation 4
  Disparate Data 5
  Disparate Data Cycle 7
  Data Dilemma 8
  Technology Trends 9
  Client/Server Architecture 10
  Data Warehouse Systems 10
  Geographic Information Systems 11
  Other Trends 12
  Metadata Demand 13
  Summary 14
  Questions 15
Chapter 2 Data Challenge 17
  The Realities 18
  Basic Problem 18
  Data Awareness 18
  Data Understanding 19
  Data Variability 20
  Data Redundancy 21
  Data Access 22
  Tools 23
  Standards 24
  Hidden Resource 25
  Disparate Data Shock 25
  Meeting the Challenge 26
  Data Resource Initiative 26
  Data Resource Strategies 27
  Identify Data 27
  Understand Data 27
  Integrate Data 28
  Aggregate Data 28
  Deploy Data 28
  Opportunity for Change 29
  Approaches 29
  Justification 30
  Summary 32
  Questions 33
Chapter 3 Data Vision 35
  Integrated Data Resource 36
  Principles 36
  Subject-Oriented 37
  Business Survival-Oriented 38
  Real World Perspective 38
  Robust Resource 40
  Sharable Resource 41
  Development 42
  A Formal Data Resource 43
  Data Resource Library 44
  Information Engineering Support 44
  Data 46
  Data Engineering 48
  Summary 49
  Questions 50
Chapter 4 Data Architecture 51
  Formal Architecture 52
  Information Technology Infrastructure 52
  Data Resource Framework 55
  Data Architecture 55
  Common Data Architecture 57
  Formal Approach 60
  Data Architecture Perspective 60
  Data Model Perspective 61
  Data Unit Perspective 62
  Objects and Events 62
  Features 63
  Existences and Occurrences 64
  Coded Data Values 65
  Data Megatype Perspective 65
  Summary 67
  Questions 68
Chapter 5 Data Description 69
  Data Names 70
  Data Naming Conventions 71
  Data Naming Taxonomy 72
  A Structural Taxonomy 72
  Original Taxonomy Components 74
  Enhanced Taxonomy Components 75
  Data Naming Vocabulary 76
  Aligning Naming Conventions 78
  Forming Data Names 78
  Data Site Names 79
  Data Occurrence Selections Names 80
  Data Subject Names 80
  Data Code Set Names 81
  Data Characteristic Names 82
  Data Characteristic Variation Names 86
  Data Characteristic Substitution Names 89
  Data Code Names 90
  Data Version Name 91
  Data Name Abbreviations 92
  Short Data Names 93
  Defining Data 94
  Data Definition Criteria 94
  Data Definition Common Words 98
  Summary 98
  Questions 99
Chapter 6 Data Structure 101
  Data Structure Concept 102
  Common Data Structure 102
  Data Sets 103
  Data Relations 104
  Common Notation 104
  Data Relation Types 105
  Data Relation Diagrams 108
  Entity Relation Diagrams 108
  Subject Relation Diagrams 112
  File Relation Diagrams 114
  Multiple Perspectives 115
  Data Subject Hierarchy 116
  Presenting Ideas 119
  Data Keys 123
  Primary Keys 123
  Multiple Primary Keys 125
  Primary Key Intelligence 126
  Dual Primary Keys 128
  Foreign Keys 128
  Subject Structure Chart 129
  Coded Data 131
  Code Tables 131
  Data Code Set 134
  Coded Data Trends 134
  Data Group Trends 135
  Data Classification 135
  Data Classification Scheme 136
  Data Themes 139
  Data Segments 139
  Data Clusters 140
  Summary 141
  Questions 142
Chapter 7 Data Quality 143
  Disparate Data Quality 144
  Data Integrity 145
  Data Value Integrity 146
  Conditional Data Value Integrity 147
  Data Domains 150
  Default Data Values 151
  Data Structure Integrity 152
  Conditional Data Structure Integrity 154
  Referential Integrity 156
  Data Retention Integrity 157
  Data Derivation Integrity 158
  Derived Data 158
  Redundant Data 163
  Replicated Data 164
  Data Accuracy 164
  Scope 165
  Data Currentness 165
  Data Lineage and Heritage 167
  Temporal Data 170
  Data Versions 172
  Multiple Source Updates 173
  Proactive and Retroactive Updates 174
  Data Completeness 175
  Managing Data Quality 176
  Data Quality Improvement 177
  Data Quality Criteria 177
  Data Quality Techniques 178
  Data Quality Process 179
  Realizing Disparate Data Quality 179
  Understanding Existing Data Quality 179
  Determine Desired Data Quality 179
  Adjusting Data Quality 179
  Tracking Data Quality 180
  Summary 181
  Questions 183
Chapter 8 Metadata 185
  Metadata Situation 186
  Disparate Metadata 186
  Disparate Metadata Cycle 187
  Metadata Dilemma 188
  Metadata Shock 188
  A New Perspective 189
  Metadata Types 189
  Common Metadata 191
  Metadata Warehouse 193
  Metadata Warehouse Concept 194
  Metadata Warehouse Architecture 195
  Metadata Warehouse Components 195
  Data Naming Lexicon 197
  Data Dictionary 199
  Data Structure 202
  Data Integrity 203
  Data Thesaurus 205
  Data Glossary 208
  Data Product Reference 209
  Data Directory 211
  Data Translation Schemes 212
  Data Clearinghouse 213
  Managing Metadata 216
  Metadata Quality 216
  Metadata Versions 218
  Summary 220
  Questions 221
Chapter 9 Data Refining 223
  Data Refining Concept 224
  Data Refining Approach 224
  Data Product Concept 225
  Data Product Names 227
  Data Naming Taxonomy 227
  Data Products 228
  Data Product Groups 228
  Data Product Units 229
  Data Product Codes 230
  Data Product Definitions 231
  Data Product Structure 232
  File Relation Diagram 232
  File Structure Chart 233
  Entity Relation Diagram 234
  Entity Structure Chart 235
  Data Product Quality 236
  Data Product Integrity 236
  Data Product Accuracy 237
  Data Cross-Reference 238
  Data Cross-Reference Approach 239
  Data Product Group 240
  Data Product Unit 240
  Data Product Code 244
  Data Product Inventory 246
  Data Variability 247
  Primary Key Variability 247
  Data Subject Variability 247
  Data Characteristic Variability 247
  Data Code Value Variability 249
  Official Data Variations 251
  Official Primary Key 252
  Official Data Characteristic Variations 252
  Official Data Domains 254
  Official Data Codes 254
  Data Translation Schemes 255
  Data Characteristic Translation 255
  Data Code Translation 257
  Disparate Data Integration 258
  Integration Scope 258
  Official Data Source 259
  Integration Table 260
  Physical Integration 261
  Summary 262
  Questions 263
Chapter 10 Evaluational Data 265
  Data Warehouse System Concept 266
  Decision Support 266
  Data Resource Support 267
  Data Warehouse System Definition 268
  Dual Database Concept 269
  A New Perspective 270
  Evaluation Data 270
  Data Architecture 272
  Data Dimensions 273
  Evaluation Data Perspective 274
  Evaluation Data Description 274
  Data Subjects 275
  Data Subject Names 276
  Data Characteristic Names 277
  Data Selection 278
  Data Versions 279
  Data Definitions 279
  Evaluation Data Structure 280
  Primary Keys 280
  Subject Relation Diagram 281
  Summary Data Subject Matrices 283
  Evaluation Data Integrity 285
  Data Relations 285
  Data Normalization 286
  Data Summarization 288
  Data Summarization Levels 290
  Maintaining Evaluation Data 291
  Data Addition 292
  Data Removal 293
  Data Rederivation 295
  Data Version 296
  Data Perspectives 297
  Metadata 298
  Data Exploration and Mining 301
  Summary 302
  Questions 303
Chapter 11 Data Transformation 305
  Data Transformation Concept 306
  Data Transformation Perspective 306
  Data Transformation Routes 310
  Data Transformation Matrix 311
  Data Transformation Steps 311
  Identify Target Data 312
  Identify Source Data 313
  Extract Source Data 314
  Reconstruct Historical Data 315
  Translate Data 316
  Recast Data 317
  Restructure Data 319
  Summarize Data 320
  Load Data 321
  Review Data 321
  Summary 322
  Questions 323
Chapter 12 Spatial Data 325
  A Data Perspective 326
  Decision Support 326
  Data Situation 327
  Common Data Architecture 328
  Spatial Data Definitions 329
  Spatial Data Description 331
  Data Layers 331
  Spatial Data Layer Names 335
  Spatial Data Definition 338
  Spatial Data Structure 339
  Data Relations 339
  Primary Keys 342
  Spatial Data Quality 344
  Datums 344
  Linear Referencing Systems 345
  Linear Addressing Systems 347
  Geographic Areas 348
  Linear Object Segmentation 349
  Metadata 350
  Managing Spatial Data 351
  Spatial Data Tiers 351
  Spatial Data Themes 353
  Seen Areas 354
  Duplicate Data Layers 355
  Data Layer Extents 356
  Time-Variant Spatial Data 356
  Data Layer Aggregation 357
  Three-Dimensional Aggregation 360
  Spatial Data Scale 361
  Integrating Tabular and Spatial Data 362
  Spatial Data Referencing 363
  Descriptive Spatial Referencing 364
  Nondescriptive Georeferencing 366
  Indirect Spatial Referencing 367
  Summary 369
  Questions 370
Chapter 13 Distributing Data 373
  Data Distribution Concept 374
  Data Distribution 374
  Data Distribution Dilemma 375
  Common Data Architecture 376
  Official Data 377
  Replicating Data 378
  Distributed Data Description 379
  Distributed Data Names 379
  Distributed Data Definitions 381
  Distributed Data Structure 381
  Logical Data Structure 382
  Distributed Data Structure 382
  Physical Data Structure 384
  Distributed Data Diagram 386
  Data Partitioning 389
  Data Subject Partitioning 390
  Data Occurrence Partitioning 391
  Data Characteristic Partitioning 392
  Dual Data Partitioning 393
  Distributing Data 393
  Data Distribution Driver 394
  Distributing Tabular Data 394
  Distributing Evaluational Data 395
  Distributing Spatial Data 396
  Distributing Metadata 397
  Data Marts 398
  Redistributing Data 399
  Distributed Data Quality 400
  Data Origination 401
  Data Tracking 401
  Data Concurrency 403
  Distributed Data Quality Principles 405
  Summary 406
  Questions 407
Chapter 14 Common Data Model 409
  The Data Schema Concept 410
  Two-Schema Concept 410
  Three-Schema Concept 411
  Four-Schema Concept 412
  Five-Schema Concept 414
  Abstract Schema Concept 415
  Framework for Information Systems 416
  Five-Schema and the Framework 417
  Common Data Modeling 418
  Data Modeling Perceptions 419
  Data Modeling Problems 420
  Common Data Architecture 422
  Common Data Modeling Concept 424
  Forward Data Modeling 424
  Reverse Data Modeling 426
  Vertical Data Modeling 427
  Common Data Modeling Method 428
  Basic Data Modeling Components 428
  An Integrated Data Resource 430
  Modeling Logical Schema 431
  Developing New Data 431
  Refining Disparate Data 432
  Developing Evaluational Data 433
  Distributing Data 433
  Changing Operating Environments 434
  Integrating Data 435
  Data Model Interfaces 436
  Data Subject Hierarchies 437
  Common Person 439
  Grouped Code Tables 441
  Archive and History Data 442
  Summary 444
  Questions 446
Chapter 15 Resolving the Dilemma 447
  Data Issues 448
  Increasing Data Disparity 448
  Knowledge Loss 449
  Millennium Data Problem 450
  Client Data Access 451
  Acquired Applications 453
  Conflicting Data Standards 454
  Standards and Guidelines 455
  Rapid Development 456
  Multiple Common Data Architectures 457
  Legacy Systems 457
  Stabilizing Variables 458
  Business Improvement 460
  Resolution Initiative 461
  Recognition 461
  Vision 462
  Orientation 463
  Strategy 465
  Evaluation 466
  Summary 466
  Questions 468
Glossary 469
Appendix A Common Words 523
  Common Data Site Words 523
  Common Data Subject Words 523
  Common Data Characteristic Words 525
  Common Data Characteristic Variation Words 528
  Common Data Version Words 529
  Common Data Definition Words 529
Appendix B Short Data Names 531
  Parent Elimination Notation 531
  Subordinate Inclusion Notation 532
  Subordinate Substitution Notation 532
  Parent Substitution Notation 533
  Summary Data Subject Notation 533
  Program Name Notation 533
Appendix C Data Definition Examples 535
  Data Sites 535
  Data Occurrence Groups 535
  Data Subjects 536
  Data Characteristics 537
  Data Characteristic Variations 538
  Data Codes 539
  Data Versions 539
Appendix D Metadata Explanation 541
Appendix E Cross-Reference Example 545
  Original Data Definitions 545
  Data Quality Information 545
  Cross-References 551
  Cross-References by Common Data Name 551
  Cross-References by Product Data Name 552
  Subject Relation Diagram 553
  Data Definitions 553
  Geospatial Dataset 554
  Geospatial Dataset Attribute Accuracy 554
  Geospatial Dataset Horizontal Accuracy 554
  Geospatial Dataset Process 555
  Geospatial Dataset Source 555
  Geospatial Dataset Vertical Accuracy 556
Appendix F Evaluation Data Example 557
  Operational Subject Relation Diagram 558
  Evaluation Subject Relation Diagram 559
  Primary Key Matrix 560
  Data Characteristic Matrix 562
Bibliography 565
Index 567