Integrated Data Management: Discovering what you may not know Eric Naiburg ericnaiburg@us.ibm.com
Agenda: Discovering existing data assets is hard; What is Discovery?; Discovery and archiving; Discovery, test data management and data privacy; Discovery and application consolidation and retirement; Summary
Data management must drive competitive advantage. 75% of CIOs believe they can strengthen their competitive advantage by better using and managing enterprise data. 78% of CIOs want to improve the way they use and manage their data, but only 15% believe that their data is currently comprehensively well managed. Source: Accenture CIO Data Management Survey 2007, n=167 CIOs. "Through 2009, IT leaders and information architects must develop a vision for their future information architecture for technologies related to data management."* *Source: Gartner Research, The Gartner Data Management and Integration Vendor Guide, 2009; Regina Casonato, Mark A. Beyer, Ted Friedman; April 24, 2009.
Innovation comes through integration. Information is related across the enterprise. [Diagram: columns labeled Channels, Business Units and Data Systems. Entries include Providers, Health Plans, Patient / Member, Employers and Partners; Finance, Administration, Sales & Marketing, Contact Centers, Internet, Care Management, Ancillary Services and New Business Development; and DB, CRM, App, DW, ERP, ODS, CIF and Core Systems.]
IBM Solutions for Integrated Data Management An integrated, modular environment to manage enterprise application data and optimize data-driven applications, from requirements to retirement across heterogeneous environments
Optim is a Platform for Integrated Data Management, spanning test and development databases and production databases.
IBM InfoSphere Discovery. Value: automates analysis of data and data relationships for complete understanding of data assets. Define the business objects for archiving and subsetting; identify all instances of private data so that they can be fully protected; discover undocumented business rules used to transform data from existing systems; prototype and test new transformations for the target system.
IBM Optim Test Data Management Solution. Value: speed application delivery. Create realistic and manageable test environments; speed application delivery; improve test coverage; improve quality.
IBM Optim Data Privacy Solution. Value: risk management. Protect PII data; apply a single data masking solution; leverage realistic data.
IBM Optim Application Retirement Solution. Value: reduce infrastructure cost and improve compliance. Decommission redundant or obsolete applications; retain access to historical data.
IBM Optim Data Growth Solution. Value: improve application performance, reduce infrastructure costs and improve compliance. Retain only needed data and move the rest to archives; deploy tiered storage strategies; retain data according to value; simplify infrastructure.
Supporting enterprise environments: Discovery, Test Data Management, Data Privacy, Data Growth, Application Retirement. Organizational environments are diverse yet interrelated; therefore, whatever you use to manage the data MUST provide support across your entire environment.
You can't manage what you don't understand. Distributed data landscape: highly distributed over multiple applications, databases and platforms; complex, poorly documented data relationships. Which clients are eligible for the new sales promotion? Which version of the data should we use for the ERP consolidation? Relationships are not understood because: corporate memory is poor; documentation is poor or nonexistent; logical relationships (enforced through application logic or business rules) are hidden.
Impact of NOT understanding core information assets. Scrap and rework: 83% of data integration projects either overrun or fail. Increased costs: undetected defects will cost 10 to 100 times as much to fix upstream. Lack of consumer confidence: inaccurate or incomplete data is a leading cause of failure in business-intelligence and CRM projects; 25% of time is spent clarifying bad data. Lost opportunities: low data quality costs companies $611 billion annually.
Understand your distributed data landscape. IBM InfoSphere Discovery automates analysis of data and data relationships for complete understanding of data assets. It identifies the relationships that link data elements into a business object within a source (for example, customer, counterparty, invoice), and it identifies the complex logic that relates business objects across multiple sources.
Automation accelerates time to deployment. Discovery is the first phase of information-centric projects.
Data Growth Management: automates discovery of referential integrity and business objects.
Data Consolidation, Integration & Migration: discovers transformation and business logic between data sources; prototypes empty targets from the combination of many data sources.
Data Privacy: discovers hidden sensitive data.
What is unique: analyzes data values and patterns and produces actionable results; discovers complex relationships within and between data sources.
[Diagram: the Discovery phase feeds Data Growth, Consolidate, Transformation Rule Discovery and Data Privacy.]
InfoSphere Discovery. Accelerate project deployment by automating discovery of your distributed data landscape. Requirements: define business objects for archival and test data applications; discover data transformation rules and heterogeneous relationships; identify hidden sensitive data for privacy. Benefits: automation of manual activities accelerates time to value; business insight into data relationships reduces project risk; provides consistency across information agenda projects.
Re-use shareable business objects. Group related tables into logical business objects. Single click to create a consistent sample set across business objects. Re-use as shared objects in InfoSphere Data Architect and Optim. [Diagram: business objects shared across enterprise projects: Test Data Generation, Application Consolidation, Data De-identification, Data Quality, Data Integration, Data Archival, Master Data Management, Data Warehousing.]
Discovery for Data Archiving
Uncontrolled data growth impacts cost. [Diagram: Production and each of its copies (Training, Unit Test, System Test, Integration, UAT) occupy 500 GB, for a total of 3 TB.]
Optim Data Growth Solution mitigates cost. [Diagram: current Production is reduced to 200 GB, so each copy (Training, Unit Test, System Test, Integration, UAT) is also 200 GB, for a total of 1.2 TB.] Storage reduced by 60%.
Complete business objects are critical for data archiving. A complete business object (for example, Payments): represents an application data record (payment, invoice, customer); is a referentially intact subset of data across related tables and applications, including metadata; provides a historical reference snapshot of business activity; supports federated extracts across enterprise data stores.
Complete business object: the challenge. Where are they? What are they? How do I find them?
Complete business object: automated discovery solution. Automated discovery of primary-foreign keys.
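One common way to find primary-foreign key candidates from data values alone is to look for unique columns and then check which other columns draw all of their values from them. The following is a minimal sketch of that general idea, not a description of how InfoSphere Discovery works internally; the table names, column names and the `min_inclusion` threshold are hypothetical.

```python
def candidate_primary_keys(tables):
    """Return (table, column) pairs whose values are non-null and unique."""
    keys = []
    for table, columns in tables.items():
        for column, values in columns.items():
            if values and None not in values and len(set(values)) == len(values):
                keys.append((table, column))
    return keys


def candidate_foreign_keys(tables, min_inclusion=1.0):
    """Return (child_table, child_column, parent_table, parent_column) tuples
    where the share of child values found in the parent key column is at
    least min_inclusion."""
    found = []
    for pk_table, pk_column in candidate_primary_keys(tables):
        parent_values = set(tables[pk_table][pk_column])
        for table, columns in tables.items():
            if table == pk_table:
                continue
            for column, values in columns.items():
                child = [v for v in values if v is not None]
                if not child:
                    continue
                inclusion = sum(v in parent_values for v in child) / len(child)
                if inclusion >= min_inclusion:
                    found.append((table, column, pk_table, pk_column))
    return found


# Hypothetical sample data: each table maps column names to value lists.
tables = {
    "customer": {"cust_id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]},
    "invoice": {
        "inv_id": [10, 11, 12, 13],
        "cust_id": [1, 1, 3, 2],
        "amount": [50.0, 75.5, 20.0, 20.0],
    },
}

for relationship in candidate_foreign_keys(tables):
    print(relationship)
```

Value inclusion only produces candidates. Small integer columns often overlap by coincidence, so results like these would still need review.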
Complete business object: automated discovery solution. Automated grouping of tables into business entities (for example, Payments). Optim will automatically generate service definitions and requests based on these entities.
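Grouping tables into business entities can be pictured as finding connected groups in the graph of key relationships. The sketch below illustrates that idea under simplifying assumptions; it is not the product's algorithm, and the table names are hypothetical. In practice, shared reference tables would link almost everything together and would need special handling.

```python
from collections import defaultdict


def group_business_objects(relationships):
    """Group tables that are linked by key relationships.

    relationships: iterable of (child_table, parent_table) pairs.
    Returns a list of groups, each a sorted list of table names.
    """
    graph = defaultdict(set)
    for child, parent in relationships:
        graph[child].add(parent)
        graph[parent].add(child)

    seen = set()
    groups = []
    for table in sorted(graph):
        if table in seen:
            continue
        # Depth-first walk collects every table reachable from this one.
        stack = [table]
        seen.add(table)
        group = []
        while stack:
            current = stack.pop()
            group.append(current)
            for neighbor in graph[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(sorted(group))
    return groups


# Hypothetical relationships, for example the output of a key-discovery step.
relationships = [
    ("invoice", "customer"),
    ("payment", "invoice"),
    ("shipment_item", "shipment"),
]

for group in group_business_objects(relationships):
    print(group)
```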
InfoSphere Discovery for data archiving projects: analyze one or more data sources simultaneously; perform column analysis; identify primary-foreign keys; identify business objects; export business objects to Optim for archiving. Other: generate referentially consistent sample sets; identify critical data elements and overlaps across data sources.
Discovery for Data Privacy and Test Data Management
Uncontrolled data growth impacts cost. [Diagram: Production and each of its copies (Training, Unit Test, System Test, Integration, UAT) occupy 500 GB, for a total of 3 TB.]
Optim Data Growth Solution mitigates cost. [Diagram: current Production is reduced to 200 GB, so each copy (Training, Unit Test, System Test, Integration, UAT) is also 200 GB, for a total of 1.2 TB.] Storage reduced by 60%.
Optim Test Data Management mitigates cost. [Diagram: current Production at 200 GB, with test copies (Training, Unit Test, System Test, Integration, UAT) of either 200 GB or 25 GB each, for a total of 500 GB.] Infrastructure reduced by 83%. Creating right-sized, targeted test environments saves storage costs and speeds testing.
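The percentages on these three slides follow directly from the environment sizes. A short worked check, assuming six environments in total and assuming that the 500 GB slide has two environments at 200 GB and four at 25 GB (the split that matches its stated total):

```python
def reduction_percent(before_gb, after_gb):
    """Percentage of storage saved when going from before_gb to after_gb."""
    return (before_gb - after_gb) / before_gb * 100


# Sizes in GB for production plus five copies.
baseline = [500] * 6                          # uncontrolled growth
after_archiving = [200] * 6                   # production archived to 200 GB
after_subsetting = [200, 200, 25, 25, 25, 25] # assumed split, totals 500 GB

baseline_total = sum(baseline)
print(baseline_total)
print(round(reduction_percent(baseline_total, sum(after_archiving))))
print(round(reduction_percent(baseline_total, sum(after_subsetting))))
```

The totals come to 3,000 GB, 1,200 GB and 500 GB, which gives the 60% and 83% reductions quoted on the slides.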
Rendering data unusable to protect privacy: masking. Masking means removing, masking or transforming elements that could be used to identify an individual: name, address, telephone, SSN / national identity number, credit card number. Masked data must be appropriate to the context: within the permissible range of values, and application-aware. Some other names you may see for masking: obfuscation, scrambling, data de-identification, privacy. [Example: before masking, credit card 4212 5454 6565 7780, GOOD THRU 12/09, EUGENE V. WHEATLEY; after masking, credit card 4536 6382 9896 5200, GOOD THRU 12/09, SANFORD P. BRIGGS.]
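To make "appropriate to the context" concrete, here is a minimal sketch of one possible masking routine for card numbers. It is an illustration only and not Optim's masking algorithm: it assumes a keyed hash (HMAC-SHA256) to derive replacement digits, keeps the first digit and the original spacing, and recomputes the Luhn check digit so the masked value still passes basic card-number validation. The function names and the secret are hypothetical.

```python
import hashlib
import hmac


def luhn_check_digit(payload):
    """Return the Luhn check digit for a digit string that lacks one."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:  # digits next to the check digit, then every second one
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number):
    """True if the last digit of number is the correct Luhn check digit."""
    return luhn_check_digit(number[:-1]) == number[-1]


def mask_card_number(card, secret):
    """Replace the digits of card with keyed, repeatable substitutes.

    Keeps the first digit, the length and any separators; the last digit
    is recomputed so the result passes the Luhn check.
    """
    digits = "".join(c for c in card if c.isdigit())
    mac = hmac.new(secret, digits.encode(), hashlib.sha256).digest()
    body = digits[0] + "".join(str(b % 10) for b in mac[: len(digits) - 2])
    masked = iter(body + luhn_check_digit(body))
    return "".join(next(masked) if c.isdigit() else c for c in card)


print(mask_card_number("4212 5454 6565 7780", b"demo-secret"))
```

Because the substitute is derived from the input and a secret, the same card number masks to the same value every time, which keeps masked data consistent across tables and test refreshes.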
Optim Test Data Management & Data Privacy solutions. [Diagram: Production data (PeopleSoft / DB2, Siebel / Oracle, Custom App / any DBMS) is subset, masked and propagated to matching Test environments, then validated and compared.] Automate creation of a complete test environment; de-identify for privacy protection; deploy multiple masking algorithms; substitute real data with fictionalized yet contextually accurate data; provide consistency across environments and iterations; no value to hackers; enable off-shore testing; compare results to identify defects early.
Using discovery to identify confidential data. Some instances of sensitive data are easy to recognize, but others are hidden: compounded with other data elements in a row; broken apart and spread into multiple columns; buried within comment or text fields. Hidden instances of private data represent a potential compliance risk.
Sensitive data discovery. [Diagram: known sensitive data is compared with a sensitive data repository.]
Row      Member     SS #         Age  Phone           Sex
1        595846226  123-45-6789  15   (123) 456-7890  M
2        567472596  138-27-1604  8    (138) 271-6037  F
3        540450091  154-86-4196  22   (154) 864-1961  M
4        514714372  173-44-7900  55   (173) 447-8996  F
5        490204164  194-26-1648  4    (194) 261-6476  F
6        466861109  217-57-3046  66   (217) 573-0453  M
...
987,623  444629628  243-68-1812  25   (243) 681-8107  F
987,624  423456789  272-92-3629  87   (272) 923-6280  M
Finding sensitive data elements (SDEs) in each system can take days. Whole and partial SDEs can be found in hundreds of tables and fields.
InfoSphere Discovery for sensitive data. Analyze multiple data sources simultaneously. Discover sensitive data by comparing known sensitive data with data in a wide variety of systems at the push of a button. Identified sensitive data elements (SDEs) are exported to Optim for masking.
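Comparing known sensitive values with the columns of other systems can be sketched as a simple overlap score per column. This is an illustrative assumption about the general technique, not the product's implementation; the column names and the 0.8 threshold are hypothetical, and the SSN values are the sample ones shown in the table earlier in this deck.

```python
def normalize(value):
    """Lower-case a value and drop punctuation and spaces before comparing."""
    return "".join(ch for ch in str(value) if ch.isalnum()).lower()


def sensitive_hit_rate(known_values, column_values):
    """Fraction of non-empty column values that match a known sensitive value."""
    known = {normalize(v) for v in known_values}
    candidates = [normalize(v) for v in column_values if v]
    if not candidates:
        return 0.0
    return sum(1 for v in candidates if v in known) / len(candidates)


def flag_sensitive_columns(known_values, columns, threshold=0.8):
    """Return the names of columns whose hit rate reaches the threshold."""
    return [
        name
        for name, values in columns.items()
        if sensitive_hit_rate(known_values, values) >= threshold
    ]


known_ssns = ["123-45-6789", "138-27-1604", "154-86-4196"]
columns = {
    "claims.member_ssn": ["123456789", "138271604", "154864196"],
    "claims.age": ["15", "8", "22"],
}

print(flag_sensitive_columns(known_ssns, columns))
```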
InfoSphere Discovery for hidden sensitive data. Automates discovery of complex business rules between data sources. Finds sensitive data hidden within longer fields (e.g. SSN hidden in a 46 digit routing number). Finds sensitive data that has been divided up across multiple columns (e.g. SSN divided into three separate columns). Finds sensitive data that has been transformed (i.e. items converted into codes).
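Two of the hidden cases above, a value embedded in a longer field and a value split across columns, can be illustrated with plain string checks. This is a minimal sketch under stated assumptions, not the product's method: SSNs are treated as nine-digit strings, and the field values and column names are hypothetical.

```python
def digits_only(value):
    """Keep only the digit characters of a value."""
    return "".join(ch for ch in str(value) if ch.isdigit())


def hidden_in_field(known_ssns, field_values, width=9):
    """Return field values whose digits contain a known SSN as a substring."""
    known = {digits_only(s) for s in known_ssns}
    found = []
    for value in field_values:
        digits = digits_only(value)
        # Slide a window of SSN length across the digits of the field.
        for start in range(len(digits) - width + 1):
            if digits[start:start + width] in known:
                found.append(value)
                break
    return found


def split_across_columns(known_ssns, rows, columns):
    """Return rows where the named columns, joined in order, form a known SSN."""
    known = {digits_only(s) for s in known_ssns}
    return [
        row for row in rows
        if "".join(digits_only(row[c]) for c in columns) in known
    ]


known_ssns = ["123-45-6789"]

routing_numbers = ["00001234567897777", "999999999999"]
print(hidden_in_field(known_ssns, routing_numbers))

rows = [
    {"part_a": "123", "part_b": "45", "part_c": "6789"},
    {"part_a": "111", "part_b": "22", "part_c": "3333"},
]
print(split_across_columns(known_ssns, rows, ["part_a", "part_b", "part_c"]))
```

Stripping non-digits before matching can join unrelated numbers in free text, so hits from comment fields would need review. The third case, data transformed into codes, needs rule discovery rather than string matching and is not covered here.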
Discovery for Application Retirement and Data Migration
Keep data available. Consolidate multiple applications into a single instance and retire unused applications:
Move from a home-grown to a packaged system (for example, a custom-built General Ledger to PeopleSoft Financials).
Consolidate similar systems due to mergers and acquisitions.
Consolidate an independent business process with others: move automation capabilities into a single system and retire the independent application.
Move an application from an old to a new architecture: not all data is relevant for the move, but it must be retained.
Shut down a legacy system without a replacement.
In almost ALL cases, access to legacy data MUST be retained while the application and database are eliminated.
Before application retirement and consolidation, you must know: What are the business objects and data structures which are needed for intelligent archiving? How does the legacy data map to the new application data structures? How do other related applications map to the new application? [Diagram: legacy application data flows to an archive and to the new application; data from other applications also flows to the new application.]
Discover the business objects. Discovery automates the identification of referential integrity and business objects to accelerate time to deployment for archiving. [Diagram: legacy application data flows to an archive and to the new application; data from other applications also flows to the new application.]
Map the legacy data to the consolidated application. What are the business objects and data structures which are used for archiving? How does the legacy data map to the new application data structures? How do other related applications map to the new application? [Diagram: legacy application data flows to an archive and to the new application; data from other applications also flows to the new application.]
Data migration & consolidation is extremely difficult. What is in each data source? What are the matching keys used to align the rows? Which sources do you trust? How do you combine the columns together for the new application?
InfoSphere Discovery for unified schema prototypes. Prototype migration of one or more sources into a new target application. Align columns: map sources to the new schema. Align rows: analyze matching keys. Match and merge: analyze conflict detection and resolution rules, identify trusted sources, generate matched and merged prototypes. Generates actionable rules for migrating data to the new application (SQL & FastTrack).
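The align-rows and match-and-merge steps can be pictured with a small example: rows from several sources are aligned on a matching key, the most trusted source wins when values disagree, and each disagreement is recorded for review. This is a sketch of the general idea only; the source names, columns and trust order are hypothetical, and real resolution rules are usually set per column.

```python
def match_and_merge(sources, key, trust_order):
    """Merge rows from several sources on a shared key.

    sources: dict of source name -> list of row dicts.
    key: name of the matching key column present in every row.
    trust_order: source names, most trusted first; the first non-null
    value seen for a column is kept.
    Returns (merged rows by key, list of conflicts).
    """
    merged = {}
    conflicts = []
    for name in trust_order:
        for row in sources[name]:
            target = merged.setdefault(row[key], {})
            for column, value in row.items():
                if value is None:
                    continue
                if column not in target:
                    target[column] = value
                elif target[column] != value:
                    # Keep the trusted value, but record the disagreement.
                    conflicts.append((row[key], column, target[column], value))
    return merged, conflicts


sources = {
    "crm": [{"id": 1, "name": "Ann Lee", "phone": None}],
    "billing": [
        {"id": 1, "name": "A. Lee", "phone": "555-0100"},
        {"id": 2, "name": "Bo Chen", "phone": "555-0101"},
    ],
}

merged, conflicts = match_and_merge(sources, "id", ["crm", "billing"])
print(merged)
print(conflicts)
```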
Map other applications to the new application. What are the business objects and data structures which are used for archiving? How does the legacy data map to the new application data structures? How do other related applications map to the new application? [Diagram: legacy application data flows to an archive and to the new application; data from other applications also flows to the new application.]
Mapping data is very difficult. How will we get data from our other applications into the new application? How do I know I have the same transaction across applications? What is the matching key that will align the rows across applications? What happens if the data formats and structures are different? What is the transformation logic we need to map the new application to existing applications?
InfoSphere Discovery transformation analyzer automates data mapping across distributed enterprise structured data.
Example of a discovered rule for the target column Demo1: if age<18 and Sex=M then 0; if age<18 and Sex=F then 1; if age>=18 and Sex=M then 2; if age>=18 and Sex=F then 3.
What is unique: discovers cross-system business rules, transformations and data exceptions by examining data values.
Transformation Analyzer automates discovery of cross-system business rules and transformations, and of data inconsistencies. It provides detailed data mapping between two data sources, discrepancy discovery, and a cross-source troubleshooting workbench.
Applicability: map legacy applications to newly deployed applications; discover cross-source rules for data consolidation.
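The Demo1 rule above can be checked against data in a few lines, which also shows what a data exception looks like. This is a minimal sketch, not the Transformation Analyzer itself: it assumes the rule is already known and only measures how well it holds. The member, age and sex values come from the sample table earlier in this deck; the target `demo1` values are hypothetical, with one deliberate mismatch.

```python
def demo1_rule(age, sex):
    """The rule from the slide: combine an age band and sex into one code."""
    minor = age < 18
    if sex == "M":
        return 0 if minor else 2
    if sex == "F":
        return 1 if minor else 3
    return None


def check_rule(source_rows, target_rows, rule):
    """Compare rule output with the target column, matching rows on member.

    Returns (hit rate, list of (member, expected, actual) exceptions).
    """
    target_by_member = {row["member"]: row["demo1"] for row in target_rows}
    matches = 0
    exceptions = []
    for row in source_rows:
        member = row["member"]
        if member not in target_by_member:
            continue
        expected = rule(row["age"], row["sex"])
        actual = target_by_member[member]
        if expected == actual:
            matches += 1
        else:
            exceptions.append((member, expected, actual))
    compared = matches + len(exceptions)
    hit_rate = matches / compared if compared else 0.0
    return hit_rate, exceptions


source_rows = [
    {"member": "595846226", "age": 15, "sex": "M"},
    {"member": "567472596", "age": 8, "sex": "F"},
    {"member": "540450091", "age": 22, "sex": "M"},
    {"member": "514714372", "age": 55, "sex": "F"},
]
target_rows = [
    {"member": "595846226", "demo1": 0},
    {"member": "567472596", "demo1": 1},
    {"member": "540450091", "demo1": 2},
    {"member": "514714372", "demo1": 2},  # hypothetical data exception
]

hit_rate, exceptions = check_rule(source_rows, target_rows, demo1_rule)
print(hit_rate, exceptions)
```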
IBM solutions manage costs, speed success and reduce risk. Headline results: 10-20x time savings identifying data objects; 30-40% storage savings; 40%-75% performance boost; 96% time savings; 2x the data protected.
InfoSphere Discovery: automates analysis of data and data relationships for complete understanding of data assets, to identify the relationships that link data elements into a business object within a source and to discover sensitive data.
Optim Data Growth Solution: reduces the size of production databases, improving application performance, reducing hardware and software costs, and maintaining adherence to data governance regulations and policies.
Optim Test Data Management Solution: creates right-sized test environments to reduce data propagation and related hardware and software costs, while increasing team efficiency by significantly speeding the creation of test environments.
Optim Data Privacy Solution: protects the confidentiality of data in non-production environments such as test through intelligent de-identification (i.e., masking), making data worthless if lost or stolen.
Summary. You don't know what you don't know, and that is usually what will hurt you. Data-centric projects require extensive knowledge of existing systems, and the most cost- and time-effective way of achieving that is through automation. IBM InfoSphere Discovery automates analysis of data and data relationships for complete understanding of data assets to speed time to project success.