Taming the Elephant with Big Data Management Deep Dive
Big Data Management Introduction
Safe Harbor: The information being provided today is for informational purposes only. The development, release, and timing of any Informatica product or functionality described today remain at the sole discretion of Informatica and should not be relied upon in making a purchasing decision. Statements made today are based on currently available information, which is subject to change. Such statements should not be relied upon as a representation, warranty, or commitment to deliver specific products or functionality in the future.
Overview of Data Integration Solutions
- PowerCenter (Traditional Workloads): Data Warehousing, Agile BI, Real-time DI, Data Migration, Apps Integration (on-prem)
- Big Data Management (Next-Gen Workloads): DW Offloading/Optimization, Data Lakes, Big Data Analytics, NoSQL Integration
- Cloud Data Integration (Cloud & SaaS Workloads): Apps Integration (Hybrid), Cloud & Hybrid DI, DW & Analytics (Cloud DBs)
Informatica's Big Data Journey: 2012
- 1st release of Informatica Big Data Edition
- 1st data integration platform to natively execute on Hadoop
- Support for MapReduce
- Support for HDFS/Hive/HBase
- Profile natively on Hadoop
[Diagram: Hadoop 1.0 stack - MapReduce (processing & resource management) over HDFS (distributed storage)]
Informatica's Big Data Journey: 2016
- Polyglot computing: MapReduce, Blaze, Tez, Spark
- Informatica Big Data Management Smart Executor
- Multi-distribution support, both on-prem and cloud
- End-to-end big data management solutions
[Diagram: INFA engine choices - Hive on MapReduce, Hive on Tez, Hive on Spark, Spark, and Blaze - all running on YARN over HDFS]
Big data modes of execution
- Native: runs on Informatica node(s); connects to both Hadoop and non-Hadoop sources/targets
- Hadoop Pushdown: runs on the Hadoop cluster; connects to Hadoop sources/targets
Why Informatica BDM? Informatica mappings capture the business logic once; the Big Data Management solution then executes them either natively with SQL pushdown or with Hadoop pushdown (MapReduce, Tez, Spark, Blaze): polyglot computing.
Big Data Challenges (Source: Gartner)
- 36%: Obtaining the skills and capabilities needed
- 33%: Security, privacy & data quality
- 26%: Integrating multiple data sources
- 26%: Integrating big data technology with existing infrastructure
How Informatica addresses them: mapping-based development, PowerCenter reuse, SQL-to-mapping conversion, Kerberos support, Sentry/Ranger support, data masking, OS profiles, DQ and profiling on Hadoop, PowerExchange, Data Processor, SQOOP, on-prem and cloud distribution support
3 pillars of Informatica Big Data Management Single, Comprehensive and Integrated Platform for End-to-End Big Data Management Data Integration Data Quality & Governance Data Security
Universal connectivity: WebSphere MQ JMS MSMQ SAP NetWeaver XI Oracle DB2 UDB DB2/400 SQL Server Sybase ADABAS Datacom DB2 IDMS IMS Word, Excel PDF StarOffice WordPerfect Email (POP, IMAP) HTTP Pivotal Vertica Netezza Web Services TIBCO webMethods Informix Teradata Netezza ODBC JDBC VSAM C-ISAM Binary Flat Files Tape Formats Flat files ASCII reports HTML RPG ANSI LDAP Teradata Aster JD Edwards Lotus Notes Oracle E-Business PeopleSoft Salesforce CRM Force.com RightNow NetSuite EDI X12 EDIFACT RosettaNet HL7 HIPAA XML LegalXML IFX cXML Facebook Twitter LinkedIn Kapow ADP Hewitt SAP By Design Oracle OnDemand AST FIX SWIFT Cargo IMP MVR. 100+ pre-built parsers, 200+ pre-built connectors, out-of-the-box business rules and data standardization.
Pre-Built Parsers for Industry Standards
- Data storage & transport formats: XML, JSON, AVRO, Parquet, delimited files
- Industry standard formats: financial services, healthcare, EDI
- Organizational formats: PDF, Word, Excel
[Diagram: parsers developed in the Informatica IDE and executed on the Hadoop cluster]
SQOOP
- JDBC-based universal connectivity to many sources
- No need to install database clients on the Hadoop cluster to read/write data
- Seamless integration into Informatica mappings
- Integration at both the connection and data object level
- Works similarly to external loaders in PowerCenter
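Sqoop parallelizes a JDBC import by partitioning the range of a numeric split column among parallel mappers. A minimal Python sketch of that split logic (the function name and values are illustrative, not Sqoop's or Informatica's implementation):

```python
def split_ranges(min_val, max_val, num_mappers):
    """Divide the key range [min_val, max_val] into roughly equal
    sub-ranges, one per mapper, the way Sqoop's --split-by option
    partitions a numeric column for a parallel import."""
    size = (max_val - min_val + 1) / num_mappers
    ranges = []
    lo = min_val
    for i in range(num_mappers):
        # last mapper absorbs any rounding remainder
        hi = max_val if i == num_mappers - 1 else int(min_val + size * (i + 1)) - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# e.g. primary keys 1..100 split across 4 parallel mappers
print(split_ranges(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each mapper then issues its own bounded query (e.g. `WHERE id BETWEEN lo AND hi`), which is why no database client is needed on the cluster nodes themselves.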
Profiling on Hadoop
- Statistics to identify anomalies
- Value & pattern analysis
- Drill-down analysis
- Multi-tenancy
Run from the Analyst tool, in Informatica Native or Hadoop Pushdown mode
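Value and pattern analysis can be illustrated with a small Python sketch: count distinct values and abstract each value into a character pattern (digit -> 9, letter -> X), which is the general technique column profilers use to surface anomalies. The function and sample data below are hypothetical:

```python
from collections import Counter
import re

def profile_column(values):
    """Compute value frequencies, abstract character patterns
    (9 = digit, X = letter), and a null count for one column."""
    def pattern(v):
        return re.sub(r"[A-Za-z]", "X", re.sub(r"\d", "9", v))
    return {
        "values": Counter(values),
        "patterns": Counter(pattern(v) for v in values),
        "nulls": sum(1 for v in values if not v),
    }

flights = ["AA101", "UA72", "AA102", ""]
prof = profile_column(flights)
print(prof["patterns"])  # XX999 x2, XX99 x1, '' x1
print(prof["nulls"])     # 1
```

A value whose pattern deviates from the dominant one (here `XX999`) is a drill-down candidate.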
Data Quality on Hadoop
- Address validation
- Parsing
- Matching
- Standardization
Run in Informatica Native or Hadoop Pushdown mode
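As a simple stand-in for a standardization transform, the hypothetical sketch below normalizes US phone numbers to a single canonical format; a real DQ rule would cover many more cases, but the idea (strip noise, validate, emit a canonical form) is the same:

```python
import re

def standardize_phone(raw):
    """Normalize a US phone number to (NNN) NNN-NNNN.
    Returns None when the input cannot be standardized."""
    digits = re.sub(r"\D", "", raw)          # strip everything but digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                  # drop leading country code
    if len(digits) != 10:
        return None                          # fails validation
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

print(standardize_phone("415.555.0100"))     # (415) 555-0100
print(standardize_phone("+1 415-555-0100"))  # (415) 555-0100
print(standardize_phone("12345"))            # None
```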
Security has many aspects
- Application: multi-tenancy
- Infrastructure: data encryption, data masking
- Plus: authentication, authorization, auditing, monitoring
http://blogs.informatica.com/2015/07/24/bigdatasecurity-2/
Authentication: Kerberos
Industry-standard authentication for Hadoop clusters. Informatica BDM supports:
- Kerberos authentication in INFA domains
- Connecting to Kerberos-enabled Hadoop clusters
- 360° support: client & server, metadata access & data access
- Polyglot engines: Hive, Blaze & Spark modes
Blaze Security Integration with Ranger/Sentry
[Architecture diagram: the Blaze executor, runtime, and containers run the mapping (source, transforms, target) in memory on the Hadoop cluster; authorization is enforced via an optimizer call from the Hive Metastore service / HiveServer2 to Ranger/Sentry, and again on the HDFS data files themselves. The Informatica node sits outside the cluster.]
Informatica Monitoring
[Screenshots: annotated walkthrough of the monitoring console]
Data Masking
- Supports persistent data masking
- 16 different techniques supported, including SSN, credit card, first & last names, emails
- Masks sensitive data while ingesting and processing
- Polyglot engine: supported in Native, Hive, and Blaze modes
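The defining property of persistent masking is that the same input always yields the same masked output while the format survives, so masked data stays joinable and usable for testing. A hypothetical Python sketch of that property (not Informatica's algorithm):

```python
import hashlib

def mask_ssn(ssn, seed="demo-seed"):
    """Deterministically mask an SSN: each digit is replaced by a
    hash-derived digit, dashes are kept so the format survives.
    Same input + seed always produces the same mask (persistence)."""
    out = []
    for i, ch in enumerate(ssn):
        if ch.isdigit():
            h = hashlib.sha256(f"{seed}:{i}:{ch}".encode()).digest()
            out.append(str(h[0] % 10))  # position-dependent substitution
        else:
            out.append(ch)              # preserve separators
    return "".join(out)

masked = mask_ssn("123-45-6789")
print(masked)  # e.g. a 9-digit value in NNN-NN-NNNN form
```

Because the mapping is seeded and deterministic, re-running ingestion produces identical masked values, which is what keeps referential integrity across masked tables.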
Multi-tenancy
- Application binding: bind multiple Informatica users to one or more system accounts (OS / Hadoop accounts); primarily used in batch use cases and mappings
- User binding (also known as pass-through security): bind individual Informatica users to their corresponding OS / Hadoop accounts; primarily used in BI use cases and data profiling
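The two binding models above reduce to a simple account-resolution rule; the sketch below is a hypothetical illustration (the binding table, account names, and function are invented for the example):

```python
# Hypothetical binding table: Informatica users -> shared system accounts.
APP_BINDINGS = {  # application binding: many users share one account
    "etl_dev1": "hadoop_batch",
    "etl_dev2": "hadoop_batch",
}

def resolve_account(infa_user, passthrough=False):
    """User binding (pass-through) runs the job as the user's own
    OS/Hadoop account; application binding maps users onto shared
    system accounts, falling back to a default."""
    if passthrough:
        return infa_user                      # user binding: 1-to-1
    return APP_BINDINGS.get(infa_user, "hadoop_default")

print(resolve_account("etl_dev1"))                    # hadoop_batch
print(resolve_account("analyst1", passthrough=True))  # analyst1
```

Pass-through matters for profiling and BI because HDFS and Ranger/Sentry permissions are then evaluated against the individual, not a shared batch identity.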
3 pillars of Informatica Big Data Management: Single, Comprehensive and Integrated Platform for End-to-End Big Data Management
- Data Integration: SQOOP, Blaze, DI on Spark
- Data Quality & Governance: SQOOP for profiling, Blaze for profiling, JDBC for reference data*
- Data Security: Kerberos, Sentry / Ranger, data masking
Deep Dive: Hands-on
DEMO use case
Industry: Airlines
Use case: DWH optimization
Scenario: INFA Air receives information from multiple airports on the expected and actual schedules of various flights. It needs to consolidate this information into a Hadoop environment to perform analytics such as flight on-time analysis.
Challenges:
- Data is collected in various formats at various intervals: some arrives as flat files and some is staged in Oracle tables
- All of this data is ingested into a Hive table for cleansing and analysis
- The data from the Hive table is subsequently sent to an alerting system that sends individual alerts to travelers
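Once the feeds are consolidated, the on-time analytic itself is a simple comparison of scheduled versus actual times. A hypothetical Python sketch of that downstream logic (function name, timestamp format, and 15-minute grace period are assumptions for the example):

```python
from datetime import datetime

def on_time_status(scheduled, actual, grace_minutes=15):
    """Classify a flight as on time or delayed by comparing the
    scheduled and actual times from the consolidated feed."""
    fmt = "%Y-%m-%d %H:%M"
    delay = (datetime.strptime(actual, fmt)
             - datetime.strptime(scheduled, fmt)).total_seconds() / 60
    return "ON_TIME" if delay <= grace_minutes else f"DELAYED {int(delay)}m"

print(on_time_status("2016-05-23 09:00", "2016-05-23 09:10"))  # ON_TIME
print(on_time_status("2016-05-23 09:00", "2016-05-23 09:40"))  # DELAYED 40m
```

In the demo pipeline this classification would run over the cleansed Hive table, and the `DELAYED` rows would feed the traveler alerting system.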
Lab environment Private Network Hadoop Cluster Informatica Server Hadoop Node 1 Hadoop Node 2 Informatica Client
Login credentials
Lab access: https://informatica.instructorled.training
Access code: 34762748xx

  Machine         Host name             Username       Password
  Hadoop Node 1   psvrl65iw2016hdp001   iw2016         iw2016
  Hadoop Node 2   psvrl65iw2016hdp002   iw2016         iw2016
  INFA Server     psvrl65iw2016i1001    iw2016         iw2016
  INFA Client     psvw7iw2016i1001      Administrator  iw2016
  Admin/Monitoring consoles             Administrator  Administrator

Desktop tools:
Logging into the lab
Overview of labs
- Lab 1: High-speed ingestion in pushdown mode (read from a flat file, read from Oracle, union the data, write to Hive)
- Lab 2: Extraction with schema-on-read (read from Hive, write to a flat file, dynamically update the schema, use Blaze)
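The schema-on-read idea in Lab 2 is that raw data stays untyped at rest and a schema is applied only at read time, so it can change per consumer without rewriting the data. A hypothetical Python sketch (the sample rows and function are invented for illustration):

```python
import csv
import io

# Raw, schema-less bytes as they might sit in a landing file
RAW = "AA101,SFO,JFK,2016-05-23\nUA72,LAX,ORD,2016-05-24\n"

def read_with_schema(raw, schema):
    """Schema-on-read: the stored data carries no column names;
    a schema is bound to the rows only when they are read."""
    rows = csv.reader(io.StringIO(raw))
    return [dict(zip(schema, r)) for r in rows]

# Two consumers can apply different schemas to the same raw bytes
print(read_with_schema(RAW, ["flight", "origin", "dest", "date"])[0])
# {'flight': 'AA101', 'origin': 'SFO', 'dest': 'JFK', 'date': '2016-05-23'}
```

Dynamically updating the schema in the lab amounts to changing the list passed at read time; the stored file never changes.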
Questions?
User Groups
Informatica User Groups are a great way to invest in your professional development and learn about new Informatica offerings. Local chapter leaders manage each IUG online and via in-person meetings.
- Network and socialize
- Find and share content, best practices & tips
- Learn about the latest technologies and solutions from Informatica
- Discover how colleagues and peers use Informatica
https://network.informatica.com/welcome/
LEARN MORE AT IW16: Go to the Solutions Expo, Informatica Pavilion / Ecosystem & Innovation Area: talk to regional user group leaders, learn about meeting plans, and join your regional user group.
When: Monday 6:00pm-8:30pm; Tuesday 10:45am-2:15pm; Wednesday 10:30am-1:45pm
Where: Moscone West, Hall Level One