Getting Real
Real Time Data Integration Patterns and Architectures
  Nelson Petracek, Senior Director, Enterprise Technology Architecture, Informatica
  Digital Government Institute's Enterprise Architecture Conference, May 1, 2014, Washington, DC
The World has Changed
User Expectations
  MORE AGILITY, RIGHT Time, Immediate Response Times, All Data, INSTANT TRUST, PROACTIVE vs. REACTIVE, 100% Uptime, One Place, Self-Service, Fresh Information
Representative Use Cases
  Sensor Monitoring
  Customer Interaction
  Security
  Asset Optimization
Changing Perspectives on Data
  It is no longer sufficient to view information after the fact.
  Business demands information sooner, with more accuracy, in order to meet competitive and regulatory demands.
  Business needs to respond to threats and opportunities sooner.
  Reduce decision latency.
  Proactive alerts and notifications.
  Improve TTA (time to answer).
Traditional Data Management Approaches
  [Diagram labels: Store, Analyze, Act; Data Integration, EDW, BI]
  Valuable for: Reporting, Historical Activity, Strategic Analysis
The Challenges with Traditional Approaches
  [Diagram labels: Store, Analyze, Act]
  Takes too long to deliver what is needed.
  Lots of wait and waste in the process.
  No common and trusted data access.
  Information is missing or is stale / delayed.
  Too much decision latency.
Next Generation Data Integration Real-Time Design Patterns and Architectural Approaches
A Shift in Thinking is Needed
  Shift from building large, monolithic applications to smaller sets of distributed micro-applications based on the principles of Reactive Applications*: Resilient, Scalable, Event Driven, Responsive.
  Move away from a store-first approach; provide the ability to process event data as it arrives.
  Focus on hybrid architectures that facilitate both batch and real-time processing.
  * See: http://www.reactivemanifesto.org/#the-need-to-go-reactive
Reactive Applications: Characteristics
  Resilient: Able to recover at all levels. Utilize fine-grained resilience at the component level. Bulkhead pattern.
  Scalable: Avoid contention on shared resources. Scale out or up as needed (without rewrites). Maintain the programming model as the system is scaled.
  Event-Driven: Systems communicate via events. Loosely coupled, asynchronous, Amdahl's Law. Efficient use of resources.
  Responsive: Honor response time guarantees regardless of load. Provide users with a rich, interactive experience. Observable models, event streams, stateful clients.
  * See: http://www.reactivemanifesto.org/#the-need-to-go-reactive
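The bulkhead pattern named under "Resilient" can be illustrated with a minimal Python sketch. It assumes a bulkhead is modeled as a fixed pool of slots per dependency, so that one slow or failing component cannot consume the capacity of the others. The names `Bulkhead` and `BulkheadFullError` are hypothetical and are not taken from the slides or from any product.

```python
import threading


class BulkheadFullError(RuntimeError):
    """Raised when a bulkhead has no free slots (hypothetical name)."""


class Bulkhead:
    """Limits concurrent calls into one component (illustrative sketch)."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast instead of queueing, so callers are never blocked
        # behind a saturated component.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

Each downstream dependency would get its own `Bulkhead` instance; rejecting a call immediately when the pool is full is one design choice among several (queueing with a timeout is another).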
Sample Architectural Approach: Reactive Applications
  [Diagram: layered architecture from source systems up to consuming applications. Labels as extracted:]
  Consumers: Data Warehouse, Hadoop / NoSQL, Analytics, Event Based Applications
  Event Processing / Streaming Analytics: RulePoint
  Ultra Real Time Stream Transport / Delivery, Messaging: Ultra Messaging
  Stream Transformation: B2B Data Transformation
  CDC / Data Access, Data Integration, Streaming Collection: PWX CDC, PowerCenter, Vibe Data Stream, Power Exchange
  Sources: Various Source Applications / Technologies; Operational Data (Field Devices, Applications, Clickstream, IoT, logs, etc.)
Resulting Activity Based Intelligence Process
  Proactive actions instead of reactive.
  [Diagram labels: EVENTS, DATA, OI System, ALERTS, Action]
  Allows the end-user to define conditions and rules through self-service capabilities.
  Users are pushed the information they need, when they need it, in the system that they need it.
Sample Big Data Reference Architectures
  Real-Time Component
  * Source: http://hortonworks.com/hdp/
  * Source: http://www.cloudera.com/content/cloudera/en/products-and-services/
Hybrid Architecture: Batch Plus Real-Time
  Data Sources (Devices, Apps, Clickstream, IoT, logs, etc.)
  Historical Batch Computation: Batch Map / Reduce, YARN; Data Analytics. Long-term persistence, high latency. E.g. purchase history analysis.
  Distributed Real-Time Computation: Real-time continuous computations; Streaming Analytics / Event Processing. Incremental, low latency. E.g. sensor / infrastructure monitoring.
  Data Targets (Dashboards, BI, Mobile, etc.)
  Big Data Supply Chain
Stream Collection
  Separate from batch or bulk data loading.
  Involves the collection of event data ("streams") as events occur, from various endpoints, systems, and people.
  Multiple options available:
    Micro-batch or near real-time data integration.
    Data integration hub pattern.
    Real-time collection.
    Data replication, etc.
  A number of factors need to be weighed when determining the right pattern to utilize.
Stream Collection: Replication
  Utilize replication beyond the copying of data from one data store to another.
  Event-enable back-end data stores.
  Non-intrusively detect changes in data, publish data changes to one or more targets.
  Real-time delivery of the latest data changes to target systems.
  [Diagram labels: Source System, High Speed Extraction, EXTRACT, Checkpoint, Intermediate Files, Committed SQL Apply, Merge Apply, Audit Apply, APPLY, Checkpoint, Target System, High Speed Parallel Apply, Server Manager, Console]
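The apply side of replication, with its checkpoint, can be sketched in Python as follows. This is only an illustration under stated assumptions: change events are assumed to carry a monotonically increasing sequence number, the target is modeled as an in-memory dict, and the names `Change` and `Applier` are hypothetical. It does not describe how any particular replication product extracts or applies changes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One captured change (hypothetical shape, assumed for illustration)."""
    seq: int                     # position in the source change stream
    op: str                      # "insert", "update", or "delete"
    key: str                     # primary key of the affected row
    row: Optional[dict] = None   # new row image; None for deletes


class Applier:
    """Applies changes to a target and remembers how far it got."""

    def __init__(self):
        self.target = {}      # stands in for the target data store
        self.checkpoint = 0   # highest sequence number already applied

    def apply(self, changes):
        applied = 0
        for change in sorted(changes, key=lambda c: c.seq):
            if change.seq <= self.checkpoint:
                continue  # already applied; makes replays harmless
            if change.op == "delete":
                self.target.pop(change.key, None)
            else:
                self.target[change.key] = dict(change.row or {})
            self.checkpoint = change.seq
            applied += 1
        return applied
```

Because the checkpoint advances with each applied change, re-sending a batch after a restart applies nothing twice. A real system would persist the checkpoint together with the target update.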
Stream Collection: Data Integration Hub Pattern
  Eliminate point-to-point collection / delivery interfaces.
  Provide a location-independent mechanism for data producers (and consumers) to talk to one another.
    Publish and Subscribe
  Manage data delivery impedance mismatches.
  Provide self-service capabilities.
  Centralize data quality, masking, transformation logic.
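A minimal Python sketch of the hub idea: producers publish to a named topic, consumers subscribe to it, and masking or transformation logic lives once in the hub rather than in each interface. The class `DataHub`, the topic name, and the `mask_ssn` helper are hypothetical and serve only to illustrate publish and subscribe with centralized transformation.

```python
from collections import defaultdict


class DataHub:
    """In-memory publish/subscribe hub (illustrative sketch only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> callbacks
        self._transforms = defaultdict(list)   # topic -> transformation functions

    def add_transform(self, topic, fn):
        self._transforms[topic].append(fn)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, record):
        # Centralized logic runs once, before any consumer sees the record.
        for fn in self._transforms[topic]:
            record = fn(record)
        for callback in self._subscribers[topic]:
            callback(dict(record))  # each consumer gets its own copy
        return len(self._subscribers[topic])


def mask_ssn(record):
    """Hypothetical masking rule: keep only the last four digits."""
    return {**record, "ssn": "***-**-" + record["ssn"][-4:]}


hub = DataHub()
hub.add_transform("customers", mask_ssn)
warehouse, dashboard = [], []
hub.subscribe("customers", warehouse.append)
hub.subscribe("customers", dashboard.append)
delivered = hub.publish("customers", {"id": 7, "ssn": "123-45-6789"})
```

The producer never references `warehouse` or `dashboard`, which is the point of removing point-to-point interfaces. A real hub would also persist published data so that consumers can pick it up at their own pace.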
Data Integration Hubs: Beyond Collection
Stream Collection: Distributed Agents
  Distribute collection across thousands of endpoints.
  Perform filtering, transformation, etc. close to the source.
  Focus on daemon-less or broker-less designs for improved performance and scalability.
  Provide varying qualities of service: streaming, guaranteed, etc.
  Allow for dynamic configuration.
  [Diagram: Sources, a mesh of Stream Nodes, Targets]
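Filtering and transformation close to the source, plus dynamic configuration, might look like the following Python sketch. The `StreamAgent` class, its parameters, and the log-event fields are hypothetical; the sketch sends events through a plain callable and does not model transport or qualities of service.

```python
class StreamAgent:
    """Edge agent that filters and reshapes events before sending them on."""

    def __init__(self, predicate, transform, send):
        self._predicate = predicate  # decides which events leave the endpoint
        self._transform = transform  # reshapes an event before delivery
        self._send = send            # callable that delivers to the next hop
        self.seen = 0
        self.forwarded = 0

    def configure(self, predicate=None, transform=None):
        # Dynamic configuration: rules can be swapped while the agent runs.
        if predicate is not None:
            self._predicate = predicate
        if transform is not None:
            self._transform = transform

    def collect(self, event):
        self.seen += 1
        if not self._predicate(event):
            return False  # dropped at the edge, never crosses the network
        self._send(self._transform(event))
        self.forwarded += 1
        return True


delivered = []
agent = StreamAgent(
    predicate=lambda e: e["level"] == "ERROR",
    transform=lambda e: {"host": e["host"], "msg": e["msg"]},
    send=delivered.append,
)
agent.collect({"host": "web-1", "level": "INFO", "msg": "ok"})
agent.collect({"host": "web-1", "level": "ERROR", "msg": "disk full"})
agent.configure(predicate=lambda e: e["level"] in ("WARN", "ERROR"))
agent.collect({"host": "web-2", "level": "WARN", "msg": "slow"})
```

Dropping uninteresting events at the endpoint reduces the volume the rest of the pipeline has to carry, which matters when collection is spread across thousands of sources.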
Stream Collection: Distributed Agents with Collectors
  [Diagram: Agents perform edge data filtering and processing and streaming data collection, feeding a Local Hub, Regional Hub, and Central Hub. Other labels: Data Transfer, Event Processing, Real Time Actions, Data Integration, EDW, HDFS]
Event Streaming Analytics
  Execute logic against real-time streams.
  Utilize streaming language constructs.
  Logic may be executed at a point-in-time, or over time. Temporal reasoning.
  Join or merge multiple streams together for real-time pattern recognition, correlation, etc. across data sources.
  Timely and contextual. Augment real-time streams with historical context.
  [Diagram labels: Distributed Real-Time Computation, RulePoint]
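Logic executed "over time" can be sketched as a sliding-window rule in Python. The rule below fires when a given number of readings exceed a threshold within a time window. The class name `ThresholdOverTime` and its parameters are hypothetical, timestamps are assumed to arrive in non-decreasing order, and this is plain Python rather than the syntax of any streaming rule language.

```python
from collections import deque


class ThresholdOverTime:
    """Fires when `count` readings exceed `threshold` within `window_seconds`."""

    def __init__(self, threshold, count, window_seconds):
        self.threshold = threshold
        self.count = count
        self.window_seconds = window_seconds
        self._hits = deque()  # timestamps of readings above the threshold

    def on_event(self, timestamp, value):
        if value > self.threshold:
            self._hits.append(timestamp)
        # Evict hits that have fallen out of the time window.
        while self._hits and timestamp - self._hits[0] > self.window_seconds:
            self._hits.popleft()
        return len(self._hits) >= self.count
```

A point-in-time rule would look only at `value`; the deque of recent hits is what adds the temporal dimension.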
Event Delivery
  Data Integration Hub: Allow data consumers to subscribe to data previously pushed to the hub. Batch + near real-time feed.
  Data Integration: Feed content into back-end systems through application interfaces. Batch + near real-time feed.
  Streaming Delivery: Push content to end applications, dashboards, etc. Content may consist of derived or raw events. Near real-time + real-time feed.
Lambda Architecture
  Source: http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting
  Data is distributed to both a Batch Layer and Speed Layer for processing.
  Batch Layer manages the append-only master set of raw data.
  Serving Layer indexes batch views for low-latency queries.
  Speed Layer covers recent data not in the Batch Layer.
  Queries merge results from the batch and real-time views.
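The way a query merges the batch and real-time views can be shown with a small, single-threaded Python sketch that counts events per key. The class `LambdaCounts` and its method names are hypothetical, and all layers are collapsed into one in-memory object purely for illustration; real deployments place each layer on separate infrastructure.

```python
from collections import Counter


class LambdaCounts:
    """Event counts per key, served from a batch view plus a speed view."""

    def __init__(self):
        self.master = []             # append-only raw events (batch layer)
        self.batch_view = Counter()  # precomputed view (serving layer)
        self.speed_view = Counter()  # counts for events newer than the batch view

    def ingest(self, key):
        # New data goes to both layers.
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Recompute the view from all raw data; the speed view then only
        # needs to cover whatever arrives after this point.
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, key):
        # Merge the batch and real-time views.
        return self.batch_view[key] + self.speed_view[key]
```

Between batch runs, queries stay current because the speed view fills the gap; after each run the batch view absorbs that data and the speed view starts over.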
Data Security with Data Integration
Architectural Implications
Questions?
  www.operationalintelligence.me