Automated Data Ingestion Bernhard Disselhoff Enterprise Sales Engineer
Agenda
- Pentaho Overview
- Templated Dynamic ETL Workflows
- Pentaho Data Integration (PDI)
- Use Cases
Pentaho Overview
Overview: What We Will Address Today
- Automated self-service solutions
- Templated dynamic ETL workflows
- Managing the data pipeline at enterprise scale
Pentaho Product Components
- ETL, job orchestration & big data: Pentaho Data Integration (PDI)
- Data science: Weka & R
- Data modeling: Pentaho Metadata & Mondrian
- Data discovery: Pentaho Analyzer
- Operational reports: Pentaho Report Designer & Interactive Reports
- Dashboards: Pentaho Dashboard Designer & CTools
Pentaho provides a complete platform for end-to-end data and analytics solutions.
Templated Dynamic ETL Workflows ETL Metadata Injection
Traditional ETL: Hardcoded Metadata
Metadata details (fields, datatypes, etc.) are required for various steps within a transformation: sources, targets, and/or transformation steps. Legacy ETL tools require you to hardcode this metadata at development time.
[Diagram: Extract (Source) → Step 1 → Step 2 → Step 3 (Transform) → Load (Target), with the metadata fixed at development time]
Dynamic ETL
ETL Metadata Injection lets you inject the metadata into a template at runtime.
[Diagram: the same Extract → Transform → Load flow, but the template's metadata is blank until injected at runtime]
Use Case 1: Scalability / Reuse
Same workflow, many different files, tables, etc. Maintain the metadata in a list or table and reuse a single workflow template. Example: migrate 1,500 tables with one template.
[Diagram: blank-metadata template, Extract (Source) → Steps 1-3 (Transform) → Load (Target)]
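The injection pattern above can be sketched as a small Python analogy. Everything here is illustrative (the `run_template` function, field names, and in-memory "sources"); in PDI this is done with the ETL Metadata Injection step against a blank `.ktr` template.

```python
# Sketch of runtime metadata injection: one blank template, many tables.
# All names (run_template, the metadata list, the rows) are hypothetical.

def run_template(metadata, rows):
    """A generic extract -> transform -> load template, configured
    entirely by the metadata injected at runtime."""
    fields = metadata["fields"]   # injected field list
    casts = metadata["types"]     # injected datatypes
    out = []
    for row in rows:
        record = {f: cast(v) for f, cast, v in zip(fields, casts, row)}
        out.append(record)        # the "load" step stands in for a DB write
    return metadata["target"], out

# Metadata maintained in a list/table: the same template is reused per table.
metadata_table = [
    {"target": "dim_customer", "fields": ["id", "name"], "types": [int, str]},
    {"target": "fact_orders", "fields": ["id", "amount"], "types": [int, float]},
]
sources = {"dim_customer": [("1", "Ann")], "fact_orders": [("7", "19.99")]}

for md in metadata_table:
    target, loaded = run_template(md, sources[md["target"]])
    print(target, loaded)
```

Migrating 1,500 tables then means adding 1,500 rows to the metadata table, not building 1,500 transformations.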
Use Case 2: Self-Service
Allow a user or customer to enter the metadata in a simple web form. Example: the user selects fields for a template that pulls data from Hadoop and builds an on-demand data mart.
[Diagram: blank-metadata template, Extract (Source) → Steps 1-3 (Transform) → Load (Target)]
Use Case 3: Auto-Discovery
Parse out the metadata dynamically at runtime. Example: dynamically parse messages of varying formats.
[Diagram: blank-metadata template, Extract (Source) → Steps 1-3 (Transform) → Load (Target)]
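Auto-discovery can be sketched as runtime type inference over an incoming message. The `discover_metadata` helper below is hypothetical; the idea is that its output would be fed into a metadata-injection template instead of being hardcoded.

```python
def discover_metadata(message, delimiter="|"):
    """Infer field count and datatypes from a raw delimited message at runtime."""
    values = message.split(delimiter)
    types = []
    for v in values:
        try:
            int(v)
            types.append("integer")
        except ValueError:
            try:
                float(v)
                types.append("number")
            except ValueError:
                types.append("string")
    return {"field_count": len(values), "types": types}

print(discover_metadata("42|3.14|hello"))
# {'field_count': 3, 'types': ['integer', 'number', 'string']}
```

Messages of varying formats yield different metadata per message, so the same template can process all of them.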
DRY Principle: Don't Repeat Yourself. Use a Templated Approach.
Customer use cases:
- Scalability: simplified data onboarding & management (Large Oil & Gas Co.)
- Auto-Discovery: dynamic parsing of log files for cybersecurity
- Self-service: customer on-boarding
- Scalability: large data migration (Major Professional Services Firm)
Pentaho Data Integration (PDI)
Big Data Challenges Pentaho Addresses
- The EDW is too rigid, too slow, and too expensive for big data
- Steep learning curves and scarcity of talent
- Parsing, extraction, processing, and data quality for semi-structured and unstructured data
- Blending big data with traditional data for a 360° view
- Data access, governance, and provisioning
[Diagram: sources (Web, Social Media, Customer, Billing, Location, Network) feeding a Hadoop cluster, NoSQL stores, the EDW, and data marts, consumed in real time by BI tools and data science]
Connectivity
The broadest data connectivity and a robust data integration engine: relational databases, big data stores, applications, and much more.
Integrate ALL Data in an Intuitive Way
A user-friendly graphical interface to build complete data pipelines:
- 100+ transformation steps
- Drag & drop development
- 100% GUI-based configuration
- Model, analyze, and visualize as you go
Concept: Data Transformations
INPUT(S) → PROCESS(ES) → OUTPUT(S)
Concept: Jobs (Orchestration)
START → CHECK → WATCH → EXECUTE → NOTIFY → FINISH
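The job lifecycle above (check a resource, watch for a file, execute transformations, notify) can be sketched in Python. All names here are illustrative; in PDI these are job entries wired together in the GUI, and the watch step would poll rather than check once.

```python
import os
import tempfile

def run_job(watch_path, transformations, notify):
    """Minimal orchestration loop: CHECK a resource, WATCH for a
    trigger file, EXECUTE steps in order, then NOTIFY."""
    # CHECK: resource availability (here: the watched directory exists)
    if not os.path.isdir(os.path.dirname(watch_path) or "."):
        notify("check failed")
        return False
    # WATCH: wait for a trigger file (a single poll in this sketch)
    if not os.path.exists(watch_path):
        notify("file not found")
        return False
    # EXECUTE: run each transformation, aborting on the first error
    for t in transformations:
        try:
            t()
        except Exception as e:
            notify(f"error: {e}")
            return False
    notify("finished")
    return True

events = []
with tempfile.TemporaryDirectory() as d:
    trigger = os.path.join(d, "ready.flag")
    open(trigger, "w").close()          # simulate the upstream file arriving
    run_job(trigger, [lambda: None], events.append)
print(events)  # ['finished']
```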
Job Orchestration Toolkit
Integrate ALL Data in an Intuitive Way
Apply familiar ETL techniques to new data and technologies.
Data Profiling and Data Quality:
- Validate, cleanse, de-duplicate, filter
Transform:
- Sort, aggregate, and group
- Normalize & de-normalize
- Calculate, rank, and score
- In-flight encryption & compression
Data Blending:
- Join disparate data sources
- Data caching
- Output to multiple targets
Control Structures:
- Split & re-join data streams
- Dynamic variables with multiple scoping levels
- Define serial & parallel execution workflows
- Unlimited levels of job nesting
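A few of the operations listed above (validate/filter, de-duplicate, sort, aggregate) sketched in plain Python on hypothetical rows; in PDI each of these is a drag-and-drop step in the transformation canvas.

```python
from itertools import groupby

rows = [
    {"region": "EMEA", "amount": 120}, {"region": "EMEA", "amount": 120},  # duplicate
    {"region": "APAC", "amount": 80},  {"region": "EMEA", "amount": -5},   # invalid
]

# Validate/filter: drop rows that fail a business rule
valid = [r for r in rows if r["amount"] > 0]

# De-duplicate: keep the first occurrence of each identical row
seen, deduped = set(), []
for r in valid:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Sort, then aggregate and group: sum amounts per region
deduped.sort(key=lambda r: r["region"])
totals = {k: sum(r["amount"] for r in g)
          for k, g in groupby(deduped, key=lambda r: r["region"])}
print(totals)  # {'APAC': 80, 'EMEA': 120}
```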
Go Beyond Standard ETL Operations
Flexible capabilities to provide data services and deliver analytics.
Data Virtualization & Application Integration:
- PDI JDBC & web services
- Extract, transform, report, automate
- Data science
Open Architecture:
- Pluggable architecture
- Active community ecosystem and marketplace
- Create your own data connectors and transformation steps
- Call your own code: Java, JavaScript, shell scripts, & SQL stored procedures
Manage the Data Pipeline at Enterprise Scale
Architected for scalable performance: scale up, PDI clustering, visual MapReduce, YARN.
Manage the Data Pipeline at Enterprise Scale
Enterprise-grade control and security.
Job Orchestration:
- Check resource availability, watch for files, etc.
- Execute transformations, nested jobs, and shell scripts
- Logging, error handling, and notifications
Administration:
- Enterprise scheduler
- Real-time performance monitoring
- Restart jobs at checkpoints on failure
- Bundled operations mart & reports to audit usage and access
Security:
- Active Directory / LDAP integration
- Access controls
- Version control
PDI Infrastructure Components: Loosely Coupled Components
PDI Server (J2EE: Tomcat, JBoss):
- Reads from data sources and publishes to data targets
- Data virtualization via the PDI JDBC interface
- Application integration via HTTP/S web service calls; a transformation returns JSON, XML, text, etc.
- An enterprise scheduler initiates ETL via the CLI over SSH
Repository database (Oracle, SQL Server, PostgreSQL, and MySQL are supported):
- ETL versioning, security, DB connections
- PDI cluster configuration, partitioning schemes
- Logging, scheduling
Developer workstation (Mac, Windows, and Linux supported):
- ETL development, local execution
- Monitoring, administration, scheduling
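The "enterprise scheduler initiates ETL via the CLI over SSH" path might look like the following sketch. The host, paths, job file, and parameter names are all assumptions; `kitchen.sh` is PDI's command-line job runner (its transformation counterpart is `pan.sh`), but check the exact options against your PDI version.

```shell
# Hypothetical cron/scheduler entry: run a PDI job on the server over SSH.
# All paths and parameter names below are illustrative.
ssh etl@pdi-server '/opt/pentaho/data-integration/kitchen.sh \
    -file=/opt/etl/jobs/nightly_load.kjb \
    -param:LOAD_DATE=2024-01-31 \
    -level=Basic' \
  && echo "nightly load completed"
```

The scheduler stays outside PDI; PDI reports success or failure through the CLI exit code, which the scheduler uses for alerting and retries.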
Summary
Putting It All Together
Pipeline: Data → Data Engineering → Data Preparation → Analytics
Managing and automating the pipeline: administration, security, lifecycle management, data provenance, dynamic data pipeline, monitoring, automation.
Pentaho & Hitachi Solutions
- Traditional DI & BI since 2004; big data since 2009
- Social innovation, IoT, smart cities, & vertical solutions
- Turnkey BI and big data solutions
- Embedded solutions for enterprise data governance and SaaS/cloud: SSO & Java Spring; DB per group/tenant; row-level, object, and UI multi-tenancy
- Scale-out architecture: Unified Compute Platform (UCP), Hyper Scale-out Platform (HSP)
Summary: What We Addressed Today
- Automated self-service solutions
- Templated dynamic ETL workflows
- Managing the data pipeline at enterprise scale
Thank You! Visit us at the Hitachi demo booth in the foyer.