Luncheon Webinar Series, May 13, 2013
InfoSphere DataStage is Big Data Integration
Presented by: Tony Curcio, InfoSphere Product Management
InfoSphere DataStage is Big Data Integration

Questions and suggestions regarding presentation topics? Send them to editor@dsxchange.net.

Downloading the presentation: click Presentation, then YES on the poll question. A replay will be available within one day; an email will follow with details.

Bonus offer: free premium membership for your DataStage management! Submit your manager's email address and we will offer him/her access on your behalf. Email Info@dsxchange.net with the subject line "Managers special".

Join us all on LinkedIn: http://tinyurl.com/dsxmembers. DSXchange will sponsor trial membership for new requests at the LinkedIn DSX members site.
© 2013 IBM Corporation

InfoSphere DataStage is Big Data Integration
Tony Curcio, InfoSphere Product Management
Bigger Data Integration Challenges

New types of data stores: Big Data introduces additional data stores that need to be integrated, both Hadoop-based and NoSQL-based. These data stores don't easily lend themselves to conventional methods for data movement.

New data types and formats: unstructured data; poly-structured data stores; JSON, Avro, and more to come; video, documents, web logs.

Larger volumes: solutions need to move, transform, cleanse and otherwise prepare huge data volumes. Big Data requires data scalability.
Benefits of InfoSphere DataStage

- Speeds productivity: graphical design is easier to use than hand coding
- Simplifies heterogeneity: a common method for diverse data sources
- Shortens project cycles: pre-built components reduce cost and timelines
- Promotes object reuse: build once, share, and run anywhere (ETL/ELT/real-time)
- Reduces operational cost: provides a robust framework to manage data integration
- Protects from changes: isolation from underlying technologies as they continue to evolve
Big Data Is Part of the Information Supply Chain

[Diagram: transactional & collaborative applications and external information sources (data, content, streaming information) feed a Manage / Integrate / Analyze / Govern layer spanning master data, content, Big Data, cubes, streams, data warehouses and business analytics applications, governed for quality, lifecycle, security & privacy and standards.]

Gartner Magic Quadrant: "IBM is the only DBMS vendor that can offer an information architecture across the entire organization, covering information on all systems."
4 Key Analytical Use Cases for Big Data

- Big Data Exploration: find, visualize and understand all big data to improve decision making
- Data Warehouse Augmentation: integrate big data and data warehouse capabilities to increase operational efficiency
- Enhanced 360° View of the Customer: extend existing customer views by incorporating additional information sources
- Operations Analysis: analyze a variety of machine data for improved business results
Data Warehouse Augmentation: integrate big data and data warehouse capabilities to increase operational efficiency.

Challenges:
- Leveraging structured, unstructured, and streaming data sources for deep analysis
- Low latency requirements
- Query access to data
- Optimizing the warehouse for big data volumes
- Metadata management to support impact analysis and data lineage

Required capabilities:
- Data Integration Hub Processing: high-speed, massively scalable read from and write to big data sources and new data
- Big Data Expert: automatically build MapReduce logic through simple data flow design, and coordinate workflow across traditional and big data platforms
Data Integration Hub Processing
Connectivity Hub: InfoSphere DataStage effectively handles the complexity of enterprise information sources and types, with a common design paradigm across a heterogeneous landscape and a high-speed, scalable solution that speeds the delivery of analytics.
InfoSphere DataStage is Big Data Integration

Source Data > Transform > Cleanse > Enrich > EDW

Dynamic: instantly get better performance as hardware resources are added to any topology: sequential on a uniprocessor, 4-way parallel on an SMP system with shared memory, 64-way parallel on an MPP clustered system.

Extendable: add a new server to scale out through a simple text file edit (or, in a grid configuration, automatically via integration with grid management software).

Data partitioned: in true MPP fashion (like Hadoop), data persisted in the data integration platform is stored in parallel to scale out the I/O.

Hadoop integrated: push all or parts of the process out to Hadoop to take advantage of its scalability, in ELT fashion.
Big Data Source Types

- Hadoop Distributed File System: massively scalable and resilient storage
- NoSQL (not-only SQL): record storage optimized for read (or write)
- InfoSphere Streams: massive real-time analytics
Blazing Fast HDFS

Available since v8.7 in 2011. Extends the simple flat file paradigm: just add your Hadoop server name and port number. Parallelization techniques pipe data in and out at massive scale. A performance study ran up to 5.2 TB/hr before the HDFS disks were completely saturated (5-node Hadoop cluster).
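To make the "server name and port number" idea concrete, here is an illustrative sketch (not DataStage itself) of reaching an HDFS file over the WebHDFS REST API, which likewise needs only a host and port. The host name, port, path, and user below are hypothetical placeholders.

```python
# Hedged sketch: building the WebHDFS URL that streams back a file's
# contents.  DataStage's own connector is internal; this only illustrates
# the host-plus-port connectivity model the slide describes.

def webhdfs_open_url(host, port, path, user="etl"):
    """Build the WebHDFS OPEN URL for reading a file."""
    return (f"http://{host}:{port}/webhdfs/v1{path}"
            f"?op=OPEN&user.name={user}")

url = webhdfs_open_url("namenode.example.com", 50070, "/landing/clicks.dat")
print(url)
# An actual read would be a single HTTP GET against that URL, e.g.:
#   import urllib.request
#   data = urllib.request.urlopen(url).read()   # needs a live cluster
```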
Simple Data Flow Design for HDFS

- Read from an HDFS file in parallel
- Transform/restructure the data
- Join two HDFS files
- Create a new HDFS file, fully parallelized
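The Join step above can be sketched in plain Python: an inner join of two record streams on a shared key, which is what a join stage does regardless of where the files live. The field names and sample rows are invented for illustration.

```python
# Minimal sketch of a key-based inner join of two record streams.
from collections import defaultdict

def inner_join(left, right, key):
    index = defaultdict(list)            # hash one input on the join key
    for row in right:
        index[row[key]].append(row)
    for row in left:                     # stream the other input past it
        for match in index[row[key]]:
            yield {**row, **match}

customers = [{"cust_id": 1, "name": "Acme"}, {"cust_id": 2, "name": "Zenith"}]
orders    = [{"cust_id": 1, "total": 250.0}]
print(list(inner_join(orders, customers, "cust_id")))
# → [{'cust_id': 1, 'total': 250.0, 'name': 'Acme'}]
```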
Agile Connector Accelerators for NoSQL

New connectors, available on developerWorks, plug into InfoSphere DataStage and operate just like any other stage. They include features to exploit specific data sources, and the code is open.
Sample Job with MongoDB and Hive

- Select which HDFS data to send downstream
- Accept specific MongoDB directives
- Write data to MongoDB
- Write data to Hive
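As a rough stand-in for the MongoDB leg of this flow, the sketch below reshapes delimited HDFS-style records into JSON-style documents of the kind a MongoDB stage would write. The field names, delimiter, and collection are hypothetical; the actual insert (commented out) would require a running MongoDB server and the pymongo driver.

```python
# Hedged sketch: delimited records -> MongoDB-ready documents.
import json

def record_to_doc(line, fieldnames, delim="|"):
    """Turn one delimited record into a dict ready for a document store."""
    return dict(zip(fieldnames, line.rstrip("\n").split(delim)))

lines = ["1001|2013-05-13|renewal", "1002|2013-05-14|upgrade"]
docs = [record_to_doc(l, ["account", "date", "event"]) for l in lines]
print(json.dumps(docs[0]))
# The write step, roughly (needs a live server; names are invented):
#   from pymongo import MongoClient
#   MongoClient("mongodb://host:27017").crm.events.insert_many(docs)
```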
Parse and Compose JSON (beta)

Parsing and composing of the JSON data format, built on the advanced transformation framework already provided for XML. A beta is available for InfoSphere DataStage 9.1 FP1.
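A small stand-in for the parse step described above: flattening a nested JSON document into flat column names (the reverse direction, composing, would rebuild the nesting). The sample document is invented for illustration.

```python
# Hedged sketch: flatten nested JSON into dotted column names.
import json

def flatten(obj, prefix=""):
    out = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))   # recurse into sub-objects
        else:
            out[key] = v
    return out

doc = json.loads('{"id": 7, "customer": {"name": "Acme", "tier": "gold"}}')
flat = flatten(doc)
print(flat)
# → {'id': 7, 'customer.name': 'Acme', 'customer.tier': 'gold'}
```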
Big Data Expert
Big Data Expert: InfoSphere DataStage automatically pushes transformational processing close to where the data resides (SQL for a DBMS, MapReduce for Hadoop), leveraging the same simple data flow design process and coordinating workflow across all platforms.
Automated MapReduce Job Generation

New in 9.1: leverage the same UI and the same stages to build MapReduce. Drag and drop stages onto the canvas to create a job, rather than having to learn MapReduce programming. Push the processing to Hadoop for patterns where you don't want to transport the data over the network.
Automated MapReduce Job Generation

Build integration jobs with the same data flow tool and stages; the MapReduce code is created automatically.
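The generated code itself is internal to DataStage, but the pattern it emits is the standard map/shuffle/reduce one. The sketch below illustrates it in miniature with an invented sum-of-sales-per-region aggregation.

```python
# Hedged illustration of the map -> shuffle/sort -> reduce pattern.
from itertools import groupby
from operator import itemgetter

records = [("east", 10), ("west", 5), ("east", 7), ("west", 3)]

mapped = [(region, amount) for region, amount in records]   # map phase
mapped.sort(key=itemgetter(0))                              # shuffle/sort
reduced = {k: sum(v for _, v in grp)                        # reduce phase
           for k, grp in groupby(mapped, key=itemgetter(0))}
print(reduced)
# → {'east': 17, 'west': 8}
```

In a real cluster each phase runs in parallel across nodes; the point here is only the shape of the logic the tool generates from a drag-and-drop flow.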
Automated MapReduce Job Generation

When a job includes another database on a separate system, DataStage recognizes what processing can run natively in Hadoop and what requires the DataStage engine to move the data.
Architecture for the Warehouse Landing Zone Use Case

Requirements for a data warehouse landing zone:
- Large scale: large data volumes; scale-out requires an open MPP platform
- Low cost: low-cost storage, compute and commodity hardware
- Many data types: coverage of un-/semi-structured and social data types
- Many access patterns: exploratory, iterative and discovery oriented

[Diagram: all sources (clickstream, sensors, transactions, content) flow through Information Server (ETL, lineage, quality, replication) into a BigInsights/Hadoop landing zone (JAQL, Hive, HBase, custom MapReduce), then on to the analytics warehouse zone and the operational warehouse zone, with Guardium monitoring and Optim masking applied.]
Combined Workflows for Big Data

Oozie integration: the same design paradigm for workflows as for job design. Directly call an Oozie activity that invokes custom MapReduce code.

End-to-end workflows: sequence right alongside other data integration and analytics activities. Users can have data sourcing, ETL, analytics and delivery of information all controlled through a single process, and monitor all stages through the Operations Console's web-based interface.
Cross-Tool Impact Analysis and Traceability

- Understand how traditional and big data sources are being used
- Assess the impact of change and mitigate risks
- Show the impact on downstream applications and BI reports
- Navigate through impacted areas and drill down
Wrap-up
The IBM Big Data Platform

New analytic applications drive the requirements for a big data platform:
- Integrate and manage the full variety, velocity and volume of data
- Apply advanced analytics to information in its native form
- Visualize all available data for ad hoc analysis
- Development environment for building new analytic applications
- Workload optimization and scheduling
- Security and governance
- Systems management

[Diagram: the Big Data Platform spans application development, accelerators and systems management over Hadoop systems, stream computing, discovery and data warehouse engines, all on an Information Integration & Governance foundation covering data, media, content, machine and social sources.]
Information Integration & Governance for Big Data

Integrate & link big data: big data as a source and as a target; data transformations; data movement; integration with existing enterprise data; lineage & impact analysis; metadata integration with analytics; real-time & data federation.

Cleanse and validate big data: accuracy and entity matching with social data; de-duplication and standardization of machine data; in-line cleansing with integration; trusted-data dashboard and reporting on data quality.

Protect big data: activity monitoring; data masking; data encryption; on-demand/in-place protection; in-line protection (with ETL etc.); active detection & alerting.

Audit & archive big data: queryable archive, structured and semi-structured; optimized connectors to existing apps; hot-restorable on the fly; immutable and secure access; automated legal-hold capability for data freeze.

Master big data: big data as a supplier and as a consumer; links between big data and trusted golden records; leverage master data in big data analytics; entity resolution at extreme scale-out levels; probabilistic entity matching.
Where to Go to Learn More

If you'd like to explore this topic further, contact your IBM account team or your preferred IBM Partner.

To explore more about InfoSphere DataStage and the Information Server platform: http://www-01.ibm.com/software/data/integration/info_server/

If you're looking for an enterprise-level Hadoop distribution, see InfoSphere BigInsights: http://www-01.ibm.com/software/data/infosphere/biginsights/
Thanks