Luncheon Webinar Series May 13, 2013

Transcription:

Luncheon Webinar Series, May 13, 2013
InfoSphere DataStage is Big Data Integration
Presented by: Tony Curcio, InfoSphere Product Management

InfoSphere DataStage is Big Data Integration
Questions and suggestions regarding presentation topics? Send them to editor@dsxchange.net.
Downloading the presentation: click "Presentation YES" on the poll question.
A replay will be available within one day; details will follow by email.
Bonus offer: free premium membership for your DataStage management! Submit your management's email address and we will offer him/her access on your behalf. Email Info@dsxchange.net with the subject line "Managers special".
Join us all at LinkedIn: http://tinyurl.com/dsxmembers
ISXchange will sponsor trial membership for new requests at the LinkedIn DSX members site.

InfoSphere DataStage is Big Data Integration
Tony Curcio, InfoSphere Product Management
© 2013 IBM Corporation

Bigger Data Integration Challenges
New types of data stores: Big Data introduces additional data stores that need to be integrated, both Hadoop-based and NoSQL-based. These data stores don't easily lend themselves to conventional methods for data movement.
New data types and formats: unstructured data; poly-structured data stores; JSON, Avro, and more to come; video, docs, web logs, etc.
Larger volumes: solutions need to move, transform, cleanse, and otherwise prepare huge data volumes. Big Data requires data scalability.

Benefits of InfoSphere DataStage
Speeds productivity: graphical design is easier to use than hand coding.
Simplifies heterogeneity: a common method for diverse data sources.
Shortens project cycles: pre-built components reduce cost and timelines.
Promotes object reuse: build once, share, and run anywhere (ETL/ELT/real-time).
Reduces operational cost: provides a robust framework to manage data integration.
Protects from changes: isolation from underlying technologies as they continue to evolve.

Big Data is part of the Information Supply Chain
[Diagram labels: Transactional & Collaborative Applications; External Information Sources; Data, Content, Streaming Information; Manage, Integrate, Analyze, Govern; Master Data, Content, Big Data, Data Warehouses, Cubes, Streams; Business Analytics Applications; Information Governance: Quality, Lifecycle, Security & Privacy, Standards.]
Gartner Magic Quadrant: IBM is the only DBMS vendor that can offer an information architecture across the entire organization, covering information on all systems.

4 Key Analytical Use Cases for Big Data
Big Data Exploration: find, visualize, and understand all big data to improve decision making.
Data Warehouse Augmentation: integrate big data and data warehouse capabilities to increase operational efficiency.
Enhanced 360° View of the Customer: extend existing customer views by incorporating additional information sources.
Operations Analysis: analyze a variety of machine data for improved business results.

Data Warehouse Augmentation
Integrate big data and data warehouse capabilities to increase operational efficiency.
Challenges:
Leveraging structured, unstructured, and streaming data sources for deep analysis.
Low latency requirements.
Query access to data.
Optimizing the warehouse for big data volumes.
Metadata management to support impact analysis and data lineage.
Required capabilities:
Data Integration Hub Processing: high-speed, massively scalable read from and write to big data sources and new data.
Big Data Expert: automatically build MapReduce logic through simple data flow design and coordinate workflow across traditional and big data platforms.

Data Integration Hub Processing

Connectivity Hub: InfoSphere DataStage
Effectively handle the complexity of enterprise information sources and types with a common design paradigm across a heterogeneous landscape, using a high-speed, scalable solution to speed the delivery of analytics.

InfoSphere DataStage is Big Data Integration
[Diagram: Source Data, Transform, Cleanse, Enrich, EDW, shown at three scales: sequential (uniprocessor), 4-way parallel (SMP system), and 64-way parallel (MPP clustered system).]
Dynamic: instantly get better performance as hardware resources are added to any topology.
Extendable: add a new server to scale out through a simple text file edit (or, in a grid configuration, automatically via integration with grid management software).
Data Partitioned: in true MPP fashion (like Hadoop), data persisted in the data integration platform is stored in parallel to scale out the I/O.
Hadoop Integrated: push all or parts of the process out to Hadoop to take advantage of its scalability in ELT fashion.
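The data-partitioning idea on this slide can be illustrated with a minimal Python sketch. This is not DataStage code: the function names (`partition_by_key`, `transform`, `run_partitioned`) and the sample records are hypothetical, chosen only to show that the same logic can run 1-way or 4-way once records are hash-partitioned on a key.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

def partition_by_key(records, key, n_partitions):
    """Hash-partition records so equal keys land in the same partition."""
    partitions = [[] for _ in range(n_partitions)]
    for rec in records:
        idx = crc32(str(rec[key]).encode("utf-8")) % n_partitions
        partitions[idx].append(rec)
    return partitions

def transform(partition):
    """Hypothetical per-partition step: upper-case the name field."""
    return [{**rec, "name": rec["name"].upper()} for rec in partition]

def run_partitioned(records, key, n_partitions):
    """Run the same transform on every partition concurrently."""
    partitions = partition_by_key(records, key, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = list(pool.map(transform, partitions))
    return [rec for part in results for rec in part]

# Sample data is made up for illustration.
records = [{"id": i, "name": f"customer{i}"} for i in range(8)]
out_1way = run_partitioned(records, "id", 1)
out_4way = run_partitioned(records, "id", 4)
```

Only the degree of parallelism changes between the two calls; the transform itself is untouched, which is the property the slide attributes to scaling from a uniprocessor to an MPP cluster.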

Big Data Source Types
Hadoop Distributed File System: massively scalable and resilient storage.
NoSQL (not-only SQL): record storage optimized for read (or write).
InfoSphere Streams: massive real-time analytics.

Blazing Fast HDFS
Available since v8.7 in 2011.
Extends the simple flat file paradigm: just add your Hadoop server name and port number.
Parallelization techniques pipe data in and out at massive scale.
A performance study ran at up to 5.2 TB/hr before the HDFS disks were completely saturated (5-node Hadoop cluster).
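To put the 5.2 TB/hr figure in more familiar units, the short calculation below converts it to an aggregate and per-node rate. It assumes decimal terabytes (10^12 bytes) and an even spread across the 5 nodes; the slide states neither, so treat the result as a rough estimate.

```python
TB = 10**12            # assumption: decimal terabytes, not TiB
rate_tb_per_hr = 5.2   # from the slide
nodes = 5              # from the slide

bytes_per_sec = rate_tb_per_hr * TB / 3600
per_node_mb_per_sec = bytes_per_sec / nodes / 10**6  # assumes an even spread

print(f"aggregate: {bytes_per_sec / 10**9:.2f} GB/s")  # aggregate: 1.44 GB/s
print(f"per node:  {per_node_mb_per_sec:.0f} MB/s")    # per node:  289 MB/s
```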

Simple data flow design for HDFS
[Diagram callouts:]
Transform/restructure the data.
Read from an HDFS file in parallel.
Create a new HDFS file, fully parallelized.
Join two HDFS files.
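The flow in this diagram (read two files, join them, restructure the rows, write a new file) can be sketched in plain Python. This is an illustration only: in-memory `io.StringIO` buffers stand in for HDFS files, and the column names, sample rows, and helper names (`read_rows`, `join_on`, `restructure`) are hypothetical.

```python
import csv
import io

# Stand-ins for two HDFS files; contents and column names are hypothetical.
customers_file = io.StringIO("cust_id,name\n1,Ann\n2,Bob\n")
orders_file = io.StringIO(
    "order_id,cust_id,amount\n10,1,25.00\n11,2,40.50\n12,1,9.50\n"
)

def read_rows(handle):
    """Read a delimited file into a list of dict rows."""
    return list(csv.DictReader(handle))

def join_on(left, right, key):
    """Inner join: index the left rows by key, then match each right row."""
    index = {row[key]: row for row in left}
    return [{**index[r[key]], **r} for r in right if r[key] in index]

def restructure(row):
    """Keep only the output columns and convert the amount to a number."""
    return {
        "customer": row["name"],
        "order_id": row["order_id"],
        "amount": float(row["amount"]),
    }

joined = join_on(read_rows(customers_file), read_rows(orders_file), "cust_id")
output_rows = [restructure(r) for r in joined]

# Write the result to a new (in-memory) target file.
target = io.StringIO()
writer = csv.DictWriter(target, fieldnames=["customer", "order_id", "amount"])
writer.writeheader()
writer.writerows(output_rows)
```

In the graphical tool each of these functions would correspond to a stage on the canvas, and the engine, rather than the author, would decide how to run them in parallel.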

Agile Connector Accelerators for NoSQL
New connectors are available on developerWorks.
Each connector plugs into InfoSphere DataStage and operates just like any other stage.
Includes features to exploit specific data sources.
Open code.

Sample Job with MongoDB and Hive
[Diagram callouts:]
Selects what HDFS data to send downstream.
Accepts specific MongoDB directives.
Writing data to MongoDB.
Writing data to Hive.

Parse and Compose JSON (beta)
Parsing and composing of the JSON data format.
Includes the advanced transformation framework already provided for XML.
Beta available on InfoSphere DataStage 9.1 FP1.
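As a rough illustration of what "parse" and "compose" mean here, the sketch below uses Python's standard `json` module rather than the DataStage stage itself. Parsing flattens a nested document into one row per item; composing rebuilds the nested document from those rows. The document shape and the function names `parse_order` and `compose_order` are hypothetical.

```python
import json

# Hypothetical nested input document.
raw = (
    '{"order": {"id": 42, "items": ['
    '{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}}'
)

def parse_order(text):
    """Parse: flatten the nested document into one row per item."""
    order = json.loads(text)["order"]
    return [
        {"order_id": order["id"], "sku": item["sku"], "qty": item["qty"]}
        for item in order["items"]
    ]

def compose_order(rows):
    """Compose: rebuild the nested document from the flat rows of one order."""
    items = [{"sku": r["sku"], "qty": r["qty"]} for r in rows]
    doc = {"order": {"id": rows[0]["order_id"], "items": items}}
    return json.dumps(doc, sort_keys=True)

rows = parse_order(raw)
round_trip = compose_order(rows)
```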

Big Data Expert

Big Data Expert: InfoSphere DataStage
Automatically push transformational processing close to where the data resides, using SQL for a DBMS and MapReduce for Hadoop, with the same simple data flow design process, and coordinate workflow across all platforms.

Automated MapReduce Job Generation
New in 9.1: use the same UI and the same stages to build MapReduce.
Drag and drop stages onto the canvas to create a job, rather than having to learn MapReduce programming.
Push the processing to Hadoop for patterns where you don't want to transport the data over the network.

Automated MapReduce Job Generation
Build integration jobs with the same data flow tool and stages.
Automatically creates MapReduce code.
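For readers unfamiliar with the programming model being generated, here is a minimal single-process Python sketch of the map, shuffle, and reduce phases. It is not the code DataStage produces (the slides do not show that); the sample records and function names are hypothetical and only illustrate the kind of logic a designer would otherwise have to write by hand.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs; here, one pair per sale record."""
    yield record["region"], record["amount"]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)

# Sample data is made up for illustration.
sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]
pairs = [pair for rec in sales for pair in map_phase(rec)]
totals = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the map and reduce functions run on many nodes and the shuffle moves data between them; the aggregation logic is the same.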

Automated MapReduce Job Generation
The job includes another database on a separate system.
DataStage recognizes what processing can run natively in Hadoop and what requires the DataStage engine to move the data.
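The split described on this slide can be pictured with a toy planner. The rule below (a step is pushed down only when its data already lives in Hadoop) is an assumption made for illustration, not the product's documented algorithm, and the job description, step names, and `plan` function are all hypothetical.

```python
# Hypothetical job: each step names the system that holds its input data.
job = [
    {"step": "read_weblogs", "system": "hadoop"},
    {"step": "filter_errors", "system": "hadoop"},
    {"step": "lookup_customer", "system": "warehouse_db"},
    {"step": "aggregate", "system": "hadoop"},
]

def plan(steps, target="hadoop"):
    """Assumed rule: push a step down only if its data lives in the target."""
    return [
        (s["step"], "pushdown" if s["system"] == target else "engine")
        for s in steps
    ]

placement = plan(job)
```

Here the lookup against the separate database is the only step assigned to the engine, mirroring the slide's example of a job that includes a database on another system.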

Architecture for Warehouse Landing Zone Use Case
Requirements: Data Warehouse Landing Zone
Large scale: large data volumes; scale-out requires an open MPP platform.
Low cost: low-cost storage, compute, and commodity hardware.
Many data types: un/semi-structured and social data type coverage.
Many access patterns: exploratory, iterative, and discovery-oriented.
[Architecture diagram labels: clickstream, sensors, transactions, content, all sources; ETL, Lineage, Quality, Replication; Information Server; JAQL, Hive, HBase, Custom MR; BigInsights / Hadoop Landing Zone; Guardium; Optim; Masking; Analytics Warehouse Zone; Operational Warehouse Zone.]

Combined Workflows for Big Data
Oozie integration:
Same design paradigm for workflows as for job design.
Directly call an Oozie activity that invokes custom MapReduce code.
End-to-end workflows:
Sequence Oozie activities right alongside other data integration and analytics activities.
Allows users to have data sourcing, ETL, analytics, and delivery of information all controlled through a single process.
Monitor all stages through the Operations Console's web-based interface.

Cross Tool Impact Analysis and Traceability
Understand how traditional and big data sources are being used.
Assess the impact of change and mitigate risks.
Show the impact on downstream applications and BI reports.
Navigate through impacted areas and drill down.

Wrap-up

The IBM Big Data Platform
New analytic applications drive the requirements for a big data platform:
Integrate and manage the full variety, velocity, and volume of data.
Apply advanced analytics to information in its native form.
Visualize all available data for ad hoc analysis.
Development environment for building new analytic applications.
Workload optimization and scheduling.
Security and governance.
[Platform diagram labels: Big Data Platform; Systems Management; Application Development; Discovery; Accelerators; Hadoop System; Stream Computing; Data Warehouse; Information Integration & Governance; data types: Data, Media, Content, Machine, Social.]

Information Integration & Governance for Big Data
Integrate & Link Big Data: Big Data as a source; Big Data as a target; data transformations; data movement; integrate with existing enterprise data; lineage and impact analysis; metadata integration with analytics; real-time and data federation.
Cleanse and Validate Big Data: accuracy and entity matching with social data; de-duplication and standardization of machine data; in-line cleansing with integration; trusted data dashboard and reporting on data quality.
Protect Big Data: activity monitoring; data masking; data encryption; on-demand / in-place protection; in-line protection (with ETL etc.); active detection and alerting.
Audit & Archive Big Data: queryable archive, structured and semi-structured; optimized connectors to existing apps; hot-restorable on the fly; immutable and secure access; automated legal hold capability for data freeze.
Master Big Data: Big Data as a supplier; Big Data as a consumer; links between Big Data and trusted golden records; leverage master data in Big Data analytics; entity resolution at extreme scale-out levels; probabilistic entity matching.
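To make "standardization", "de-duplication", and "entity matching" concrete, here is a small Python sketch. It swaps in a simple string-similarity heuristic (`difflib.SequenceMatcher` from the standard library) for the probabilistic matching the slide refers to, so it only shows the general shape of the task. The names, the 0.85 threshold, and the helper functions are hypothetical.

```python
from difflib import SequenceMatcher

def standardize(name):
    """Standardize: lower-case and collapse runs of whitespace."""
    return " ".join(name.lower().split())

def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for a probabilistic score."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()

def dedupe(names, threshold=0.85):
    """Keep the first record of each group of near-identical names."""
    kept = []
    for name in names:
        if not any(similarity(name, k) >= threshold for k in kept):
            kept.append(name)
    return kept

# Sample names are made up; the threshold is an arbitrary assumption.
names = ["Jon Smith", "john  smith", "Mary Jones", "JON SMITH"]
unique = dedupe(names)
```

A production matcher would weigh several fields and calibrate its threshold against known matches; this pairwise loop is also quadratic, which is why the slide lists entity resolution at scale as a separate capability.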

Where to go to learn more
If you'd like to explore this topic further: contact your IBM account team or your preferred IBM Partner.
If you'd like to explore more about InfoSphere DataStage and the Information Server platform: http://www-01.ibm.com/software/data/integration/info_server/
If you're looking for an enterprise-level Hadoop distribution, InfoSphere BigInsights: http://www-01.ibm.com/software/data/infosphere/biginsights/

Thanks