Automated Data Ingestion. Bernhard Disselhoff, Enterprise Sales Engineer




Agenda: Pentaho Overview, Templated Dynamic ETL Workflows, Pentaho Data Integration (PDI), Use Cases

Pentaho Overview

Overview: What we will address today: automated self-service solutions, templated dynamic ETL workflows, and managing the data pipeline at enterprise scale.

Pentaho Product Components
- ETL, Job Orchestration & Big Data: Pentaho Data Integration (PDI)
- Data Science: Weka & R
- Data Modeling: Pentaho Metadata & Mondrian
- Data Discovery: Pentaho Analyzer
- Operational Reports: Pentaho Report Designer & Interactive Reports
- Dashboards: Pentaho Dashboard Designer & CTools
Pentaho provides a complete platform for end-to-end data and analytics solutions.

Templated Dynamic ETL Workflows: ETL Metadata Injection

Traditional ETL: Hardcoded Metadata. Metadata details (fields, datatypes, etc.) are required for various steps within a transformation: sources, targets, and/or transformation steps. Legacy ETL tools require you to hardcode this metadata at development time. [Diagram: Extract (Source, Step 1) → Transform (Step 2, Step 3) → Load (Target), with hardcoded metadata attached to the steps.]

Dynamic ETL: ETL Metadata Injection lets you inject the metadata into a template at runtime. [Diagram: the same Extract → Transform → Load pipeline, now a template whose metadata is blank until runtime.]
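To make the injection idea concrete, here is a minimal sketch in plain Python (not the PDI Metadata Injection step itself) of a reusable template whose field metadata is supplied at runtime instead of being hardcoded; the file names and field list are illustrative assumptions.

```python
import csv

def run_template(source_path, target_path, fields):
    """Generic ETL template: extract rows, keep and cast the injected fields, load.
    The field metadata (names, types) is injected at runtime, not hardcoded."""
    casts = {"int": int, "float": float, "str": str}
    with open(source_path, newline="") as src, open(target_path, "w", newline="") as tgt:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(tgt, fieldnames=[f["name"] for f in fields])
        writer.writeheader()
        for row in reader:
            # Keep only the injected fields and cast them to the injected types.
            writer.writerow({f["name"]: casts[f["type"]](row[f["name"]]) for f in fields})

# Metadata injected at runtime: it could come from a table, a web form, or auto-discovery.
metadata = [{"name": "customer_id", "type": "int"},
            {"name": "revenue", "type": "float"}]
run_template("orders.csv", "orders_clean.csv", metadata)
```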

Use Case 1: Scalability / Reuse. Same workflow, many different files, tables, etc. Maintain the metadata in a list or table and reuse a single workflow template. Example: migrate 1,500 tables. [Diagram: the same blank-metadata template pipeline.]
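A sketch of the driver side of this use case, again outside the PDI API: the loop below iterates over a hypothetical migration list and invokes the same template for every entry, so only the metadata grows, not the number of workflows.

```python
# Hypothetical migration list; in practice this would live in a database table
# or spreadsheet that feeds the single workflow template.
tables_to_migrate = [
    {"source": "crm.customers", "target": "dwh.dim_customer"},
    {"source": "crm.orders",    "target": "dwh.fact_order"},
    # ... up to 1,500 entries, all served by the one template
]

def migrate(entry):
    """Placeholder for executing the shared workflow template for one table."""
    print(f"Migrating {entry['source']} -> {entry['target']}")

for entry in tables_to_migrate:
    migrate(entry)
```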

Use Case 2: Self-Service. Allow a user or customer to enter the metadata in a simple web form. Example: select fields for a template that pulls data from Hadoop and builds an on-demand data mart. [Diagram: the same blank-metadata template pipeline.]

Use Case 3: Auto-Discovery. Parse out the metadata dynamically at runtime. Example: dynamically parse messages of varying formats. [Diagram: the same blank-metadata template pipeline.]
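A minimal sketch of the auto-discovery idea: derive the field metadata from the incoming file itself at runtime (here by reading a CSV header and sampling values), then hand it to the same template as before. The type-inference rule is an illustrative assumption, not Pentaho's parsing logic.

```python
import csv

def discover_metadata(path, sample_size=100):
    """Infer field names and rough types from the incoming file at runtime."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        fields = {name: "int" for name in reader.fieldnames}
        for i, row in enumerate(reader):
            if i >= sample_size:
                break
            for name, value in row.items():
                try:
                    int(value)          # stays "int" if every sampled value parses
                except ValueError:
                    fields[name] = "str"
        return [{"name": name, "type": ftype} for name, ftype in fields.items()]

# The discovered metadata can now be injected into the same template as in Use Case 1.
print(discover_metadata("incoming_feed.csv"))
```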

DRY Principle: Don't Repeat Yourself, Use a Templated Approach. Use cases:
- Scalability: simplified data onboarding & management (Large Oil & Gas Co.)
- Auto-Discovery: dynamic parsing of log files for cybersecurity
- Self-service: customer on-boarding
- Scalability: large data migration (Major Professional Services Firm)

Pentaho Data Integration (PDI)

Big Data Challenges Pentaho Addresses
- The EDW is too rigid, too slow, and too expensive for big data
- Steep learning curves and scarcity of talent
- Parsing, extracting, and processing semi-structured and unstructured data; data quality
- Blending big data with traditional data for a 360° view
- Data access, governance, and provisioning
[Diagram: sources such as network, NoSQL, location, web, social media, customer data, and billing feeding a Hadoop cluster, EDW, and data marts, which in turn serve BI tools, real-time applications, and data science.]

Connectivity: the broadest data connectivity and a robust data integration engine, spanning relational databases, big data, applications, and much more.

Integrate ALL Data in an Intuitive Way: a user-friendly graphical interface to build complete data pipelines. 100+ transformation steps; drag & drop development; 100% GUI-based configuration; model, analyze, and visualize as you go.

Concept: Data Transformations. INPUT(S) → PROCESS(ES) → OUTPUT(S)

Concept: Jobs (orchestration). START → CHECK → WATCH → EXECUTE → NOTIFY → FINISH
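As a rough illustration of that CHECK → WATCH → EXECUTE → NOTIFY sequence (not PDI's job engine), a short Python sketch; the trigger file, directory, and command are hypothetical placeholders.

```python
import os
import subprocess
import time

TRIGGER_FILE = "/data/inbox/ready.flag"        # hypothetical trigger file
JOB_COMMAND = ["echo", "run transformation"]   # stand-in for the real ETL step

def check():
    """CHECK: verify that required resources are available."""
    return os.path.isdir("/data/inbox")

def watch(timeout_s=3600, poll_s=30):
    """WATCH: wait until the trigger file appears or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if os.path.exists(TRIGGER_FILE):
            return True
        time.sleep(poll_s)
    return False

def notify(message):
    """NOTIFY: stand-in for mail or chat alerting."""
    print(f"[notify] {message}")

if check() and watch():
    result = subprocess.run(JOB_COMMAND)        # EXECUTE
    notify("Job finished" if result.returncode == 0 else "Job failed")
else:
    notify("Preconditions not met; job skipped")
```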

Job Orchestration Toolkit

Integrate ALL Data in an Intuitive Way: apply familiar ETL techniques to new data and technologies.
- Data Profiling and Data Quality: validate, cleanse, de-duplicate, filter
- Transform: sort, aggregate and group; normalize & de-normalize; calculate, rank and score; in-flight encryption & compression
- Data Blending: join disparate data sources; data caching; output to multiple targets
- Control Structures: split & re-join data streams; dynamic variables with multiple scoping levels; define serial & parallel execution workflows; unlimited levels of job nesting

Go Beyond Standard ETL Operations: flexible capabilities to provide data services and deliver analytics.
- Data Virtualization & Application Integration: PDI JDBC & web services to extract, transform, report, automate, and feed data science (sketched after this list)
- Open Architecture: pluggable architecture; active community ecosystem and marketplace; create your own data connectors and transformation steps; call your own code (Java, JavaScript, shell scripts & SQL stored procedures)
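One way to picture the web-service path is a client that calls a transformation exposed over HTTP and consumes its JSON output. The sketch below assumes a Carte-style endpoint, default credentials, and a transformation that streams JSON back to the servlet; host, port, path, and parameters would all need to match your own DI server setup.

```python
import requests

# Hypothetical endpoint exposing a PDI transformation over HTTP/S; adjust host,
# port, path, credentials, and parameters to your own DI server configuration.
URL = "https://di-server.example.com:8080/kettle/executeTrans/"
params = {"trans": "/etl/templates/customer_report.ktr", "level": "Basic"}

response = requests.get(URL, params=params, auth=("cluster", "cluster"), timeout=60)
response.raise_for_status()

# The slide notes that a transformation can return JSON, XML, text, etc.;
# here we assume it emits JSON rows to the HTTP response.
for row in response.json():
    print(row)
```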

Manage the Data Pipeline at Enterprise Scale: architected for scalable performance. Scale up, PDI clustering, visual MapReduce, and YARN.

Manage the Data Pipeline at Enterprise Scale: enterprise-grade control and security.
- Job Orchestration: check resource availability, watch for files, etc.; execute transformations, nested jobs, and shell scripts; logging, error handling, and notifications
- Administration: enterprise scheduler; real-time performance monitoring; restart jobs at checkpoints on failure; bundled operations mart & reports to audit usage and access
- Security: Active Directory / LDAP integration; access controls; version control

PDI Infrastructure Components: Loosely Coupled Components
- Pentaho Data Integration (PDI) Server: J2EE (Tomcat, JBoss), moving data from data sources to data targets
- Data virtualization via the PDI JDBC interface; application integration via HTTP/S web service calls, with transformations returning JSON, XML, text, etc.
- Enterprise scheduler initiates ETL via the CLI over SSH (sketched after this list)
- Repository database (ETL artifacts are published here from the workstation): ETL versioning, security, DB connections, PDI cluster configuration, partitioning schemes, logging, scheduling; Oracle, SQL Server, PostgreSQL, and MySQL are supported
- Developer workstation: ETL development, monitoring, administration, scheduling, local execution; Mac, Windows, and Linux are supported
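The "enterprise scheduler initiates ETL via CLI" path can be pictured with a thin wrapper around PDI's Kitchen command-line runner for jobs (Pan is the equivalent for single transformations); the install path, job file, and parameter name below are assumptions for your environment.

```python
import subprocess

# Hypothetical paths and parameter values; kitchen.sh runs a PDI job from the CLI.
cmd = [
    "/opt/pentaho/data-integration/kitchen.sh",
    "-file=/etl/jobs/nightly_ingest.kjb",
    "-level=Basic",
    "-param:LOAD_DATE=2015-06-30",
]

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    raise RuntimeError(f"Kitchen exited with {result.returncode}: {result.stderr}")
```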

Summary

Putting It All Together: Data → Data Engineering → Data Preparation → Analytics. Managing and automating the pipeline: administration, security, lifecycle management, data provenance, dynamic data pipeline, monitoring, and automation.

Pentaho & Hitachi Solutions: social innovation, IoT, smart cities, & vertical solutions; turnkey BI and big data solutions; embedded solutions for enterprise data governance and SaaS/cloud. Big data since 2009; traditional DI & BI since 2004. SSO & Java Spring; DB per group/tenant; row-level, object, and UI multi-tenancy; scale-out architecture; Unified Compute Platform (UCP); Hyper Scale-Out Platform (HSP).

Summary: What we addressed today: automated self-service solutions, templated dynamic ETL workflows, and managing the data pipeline at enterprise scale.

Thank You! Visit us at the Hitachi demo booth in the foyer.