The Reservoir: Architecture, Best Practices and Governance Jo Ramos, Distinguished Engineer, Chief Architect for Information Integration & Governance 2015 IBM Corporation
Agenda for Today How to become a data driven analytics organization Modern Architecture Considerations The Reservoir & Governance 1
Three major shifts in our industry is becoming the world s new natural resource Social, mobile and access to data are changing how individuals are understood and engaged The emergence of cloud is transforming IT and business processes into digital services 500 million DVDs worth of data is generated daily 80% of individuals are willing to trade their information for a personalized offering 85% of new software is being built for cloud 1 trillion connected objects and devices by 2015 84% of millennials say social and user-generated content has an influence on what they buy 25% of the world's applications will be available in the cloud by 2016 80% of the world s data is unstructured 5 minutes: response time users expect once they have contacted a company via social media 72% of developers say cloud-based services or APIs are central to the applications they are designing
Monetization Drivers volumes variety Real-time execution Cost of data storage BI & Analytics visibility Increasing value of data Underutilized resource
driven analytics platform capabilities move from information to insights and from reporting to action Cognitive: What did I learn, what s best? Real-Time Decision Management: What action should I take? What is the Next Best Action? High Prediction / Simulation: What will happen? Uses data to forecast based on complex algorithms or rules; forecasting of customers buying behavior. Technological complexity Evaluation: Why did it happen? Thoroughly analyzes data to support important decisions / understand root causes for unusual observations; evaluation of campaign responses to understand changes in customer behavior. Mining: Why did it happen? Looks for patterns in the data to explain a not yet understood observation; understand why certain customers show a certain behavioral pattern. Low Low Business value / impact High Monitoring: What is happening now?looks for trigger information in the data indicating need for action; monitoring live campaigns and making optimization decisions as necessary. Reporting: What happened? Slices and dices data to create transparency on campaign performance and financial or quality outcomes.
Multi-structured Mashups provide the Greatest Enterprise Value Systems of Record Structured data from operational systems 20% of all data generated Systems of Insight Diverse data types that combine structured and unstructured data for business insight Systems of Engagement that connects companies with their customers, partners and employees 80% of all data generated Structured Small Clearly formatted Quantitative Objective Logical Puzzle Repeatable linear Warehouses Transaction data ERP Electronic Health Records Mainframe OLTP System Advanced Analytics Context Accumulation Enterprise Integration Hadoop, Streams, Spark Audio Documents Images RFID Emails Sensors Social Video Unstructured Big Language based Qualitative Subjective Intuitive Mystery Exploratory dynamic Web Logs Traditional Sources New Sources
DATA DRIVEN ARCHITECTURE
A growing data demand and organizational tensions Lines of Business IT Organization Agility Access Freedom Any kinds of data Powerful Analysis & Visualization Knowledge Worker CDO Security Privacy Regulatory Compliance Standards Application Developer Scientists seeking data for new analytics models. Marketer seeking data for new campaigns. Fraud investigator seeking data to understand the details of suspicious activity.
Modern Architectures: What to aim for We need IT solutions that are like Legos Agile, flexible, adaptive, efficient Able to give fast responses to business needs Taking advantage of cloud, mobile, social, localization, Big IT should not be like pouring cement Rigid, hardcoded business processes Monolithic Expensive to change Out of sync with business needs
Modern Architecture What to aim for Simplification of the IT environment Eliminate redundancies Reduce cost Decouple systems Decouple transactions Fast, easy reuse Faster timeto-market Explore opportunities React to problems Generate insights from information in real time Use insights to improve customer experience, anticipate facts 9
Modern Architecture Enables better Customer Engagement Social Profiles Preferences Activities Sensors Geolocation Movement Events Internet of Things Transactional Customer info Transaction history Product rules Analytics Next Best Action Expert systems Future trends Big Processes Tasks & milestones Integration Monitoring & SLAs Smart Channels Mobile & Web SoLoMo Context-enriched Context-Aware Experiences Personalized services Real-time Intelligence Moments of Truth Opportunistic offers 10
Analytics Lifecycle Search & Survey (understand what data is available) Online Collaboration Workspace Analytics Discovery (application development) Deploy & Consume (application & workflow) IT - Ope rati ons
The Reservoir Built to extract value from the data. Managed, Trusted and Governed
Big Lakes or Swamps? As we bring data together, are we creating a data swamp? No one is sure of the origin or purity of data. No one can find the data they need. No one knows what data is present and if it is being adequately protected. How do we build trust in big data? Need trust both to share and to consume data. Need understanding of quality, origin and ownership of data. Need classification of data to govern and protect it. Need timely, reliable data feeds and results. All built on secure and reliable infrastructure.
IBM s Lake Lake Services Lake Repositories Information Management and Governance Fabric Lake IBM s Lake = Efficient Management, Governance, Protection and Access.
Users supported by the Lake Analytics Teams Information Curator Governance, Risk and Compliance Team Enterprise IT Systems of Record Systems of Engagement Lake Services Lake Repositories Line of Business Teams New Sources Systems of Automation Other Lakes Information Management and Governance Fabric Lake (System of Insight) Lake Operations
The Lake Subsystems (Services) Analytics Teams Information Curator Governance, Risk and Compliance Team Enterprise IT Systems of Record Self-Service Access Catalogue Systems of Engagement New Sources Enterprise IT Exchange Lake Repositories Self- Service Access Line of Business Teams Systems of Automation Other Lakes Information Management and Governance Fabric Lake (System of Insight) Lake Operations
Considerations for a well-managed and governed data lake 9 Access to raw data to develop new production analytics. 5 Curation of all data to define meaning and classifications Business-led 6 information governance and management 4 Effective interchange of data and insight with other systems. 2 Catalog of data, ownership, meaning and permitted usage 10 Moderated, viewbased self-service access to data and analytics for line of business. 3 No direct access to repositories 8 -centric Security 7 Active monitoring and management of data 1 Multiple repositories organized based on source and usage; hosted on appropriate data platforms for workload.
View from the user community - fraud Detect and prevent fraud Develop new fraud models Conform to regulations Investigate Fraud Case
repositories support multiple zones Raw Interaction Catalog Interfaces Descriptive Enterprise IT Interaction Deposited Historical Harvested Context Lake Repositories View-based Interaction Published Information Integration & Governance Lake
Big data needs a variety of repositories for cost, access and performance reasons Raw Interaction Catalog Interfaces Descriptive CATALOG SEARCH INDEX INFORMATION VIEWS Enterprise IT Interaction Deposited Historical Harvested Context OPERATIONAL LOG AUDIT HISTORY DATA DATA INFORMATION WAREHOUSE CONTENT ASSET HUB HUB DEEP DATA ACTIVITY CODE HUB HUB Lake Repositories View-based Interaction Published EXPORT AREA OBJECT CACHE DATA MARTS Information Integration & Governance Lake
lake logical architecture Decision Model Analytics Tools Management Information Curator Governance, Risk and Compliance Team Enterprise IT System of Record Applications Systems of Engagement New Sources Third Party Feeds Third Party APIs Internal Sources Systems of Automation Other Systems Of Other Insight Lakes Enterprise Service Bus Deploy Real-time Decision Models Events to Evaluate Notifications Information Service Calls Out In Deploy Real-time Decision Models Enterprise IT Interaction APIs Continuous Analytics STREAMING ANALYTICS EVENT CORRELATION Publishing Feeds Ingestion INFORMATION BROKER BROKER CODE HUB Access SAND BOXES Information Service Calls Descriptive Deposited Historical Harvested Context Published STAGINGAREAS Deposit Secure Access CATALOG OPERATIONAL HISTORY INFORMATION WAREHOUSE CONTENT HUB OPERATIONAL GOVERNANCE HUB EXPORT AREA Deploy Decision Models Raw Interaction SEARCH INDEX LOG DATA ASSET HUB OBJECT CACHE Lake INFORMATION VIEWS DEEP DATA ACTIVITY OFFLINE ARCHIVE HUB AUDIT DATA DATA MARTS Information Integration & Governance Understand Information Sources Secure Access CODE HUB MONITOR Lake Repositories Advertise Information Source Catalog Interfaces View-based Interaction Secure Access Search SAND WORKFLOW Feedback BOXES Access Refine GUARDS Understand Compliance Understand Information Sources Search Requests Curation Interaction Information Service Calls Deposit Access Report Requests Management Report Compliance Line of Business Applications Analytical Insight Applications Simple, ad hoc Discovery and Analysis Reporting Consumers of Insight Lake Operations
Differing user perspectives Provision Sand Boxes. Sand Box Search for, locate and download data and related artifacts. Define governance policies, rules and classifications. Monitor compliance. Information Governance Catalogue View lineage (business and technical) and perform impact analysis. Add additional insight into data sources through automated analysis. Develop data management models and implementations. Curation of Metadata about Stores, Models, Definitions Stores Stores Stores
Governance Rules Defined for each classification for each situation Sensitive information masked here Personal information masked here
Integrated Metadata Lineage (Traceability) Where does this data come from? Why is this data incorrect? Why is this data incomplete? Can I trust this value? Impact Analysis Where is this element used? What happens if I change this? Optimization Where is the redundancy? How can I make this run more efficiently? Understanding What does this mean? How is this used? Control Why is this parameter set to this value? Who made this change? I can change this to meet new business requirements
Secure access to the data lake s data The data lake s security is assured with this combination of business processes and technical mechanisms. curation for security and protection access approval by subject area owner Well defined access points centric security access Isolated repositories Access monitoring and logging Security analytics and investigation Security audit and review ADMINISTRATION AT RUNTIME REVIEW
Building a data lake The first step in creating the lake is to establish the following: The information integration and governance components, The staging areas for integration, The catalog, The common data standards. The build out of the lake then proceeds iteratively based on the following processes: Governance of a data lake subject area. Managing an information source. Managing an information view. Enabling analytics. Maintaining the data lake infrastructure.
z zz z z z z Questions?
Thank You