TE's Analytics on Hadoop and SAP HANA Using SAP Vora
Naveen Narra, Senior Manager, TE Connectivity
Santha Kumar Rajendran, Enterprise Data Architect, TE
Balaji Krishna, Director, SAP HANA Product Mgmt., SAP
Session ID# A 4963
TE CONNECTIVITY (FORMERLY TYCO ELECTRONICS) COMPANY PROFILE
FY14 sales worldwide: $13.9B
Regions (as labeled): Americas, China, EMEA, Asia* (excluding China)
Regional sales (as listed): $4.4B, $4.8B, $2.3B, $2.4B
Design Centers: 10, 5, 3, 3
Manufacturing Sites: 38, 29, 8, 15
Engineers: 2,570, 2,020, 810, 2,100
*Including India
TE DATA ANALYTICS VISION
Critical Capabilities
- Speed: from source to data to publish insights
- Self-Service BI: different solutions for different skills
- Enterprise Data Platform: complete, secured, understood, trusted
- Data Governance: data definitions, data lineage, data visibility
- BU Data Labs: for data discovery and investigative analysis
TE ANALYTICS CONCEPTUAL ARCHITECTURE
Optimized to service the right data workload and analytical use
[Architecture diagram; labels as shown]
Data Sources: ERP sources, non-ERP sources, emerging sources, external sources
Data Platforms (loaded via ETL): Enterprise Data Warehouse (data to run the business), Enterprise Data Hub (data to change the business)
Data Presentation:
- Analytics to run the business: Business Objects, Guided Analytics, Standard Reporting
- Analytics to change the business: Data Discovery / Visualization, Predictive Analytics, Machine Learning
Data Governance / Data Security: manage data as an asset
TE ANALYTICS LOGICAL ARCHITECTURE
[Architecture diagram; labels as shown]
Structured Data: SAP sources; other ERP / TED / Salesforce / Eloqua etc.
Volume labels: 4 to 5 billion per month; 3 to 4 billion per month
Semi-structured & Unstructured Data: machine data, social media, geospatial etc.
SAP Data Services (ETL)
EDW - Enterprise Data Warehouse: HANA, Sybase IQ (Hot, Warm)
EDH - Enterprise Data Hub (Hadoop)
Consumption: HANA; Guided Analytics; Standard Reporting; Data Discovery / Visualization; ODBC Reporting; Predictive Analytics; Machine Learning
TE ANALYTICS LOGICAL ARCHITECTURE WITH VORA
[Architecture diagram; labels as shown]
Structured Data: SAP sources; other ERP / TED / Salesforce / Eloqua etc.
Volume labels: 4 to 5 billion per month; 3 to 4 billion per month
Semi-structured & Unstructured Data: machine data, social media, geospatial etc.
EDW - Enterprise Data Warehouse: HANA, Sybase IQ (Hot, Warm)
Vora
EDH - Enterprise Data Hub (Hadoop)
Consumption: HANA; Guided Analytics; Standard Reporting; Data Discovery / Visualization; ODBC Reporting; Predictive Analytics; Machine Learning
DATA TIERING USING HANA DYNAMIC TIERING
[Diagram: SAP HANA]
HOT STORE (in-memory): Active Tables / Data Marts, Reporting Layer
WARM STORE (on-disk): PSA, Staging, Historic, Snapshots
- Dynamic Tiering is a warm storage option for HANA and an integral part of the HANA architecture.
- Dynamic Tiering helps reduce the in-memory footprint by pushing infrequently used data from memory to disk.
- With BW, Dynamic Tiering helps push staging and write-optimized DSOs and change log tables to disk.
- Using Dynamic Tiering, we can push all snapshots and historic data to disk.
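The hot/warm split above can be pictured as a simple placement rule. The sketch below is only an illustration: `TableStats`, `reads_last_30_days`, and the `min_reads` threshold are hypothetical names and values, not part of HANA Dynamic Tiering, which is configured per table in the database rather than by a Python function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    """Hypothetical usage summary for one table (not a HANA structure)."""
    name: str
    reads_last_30_days: int
    used_for_reporting: bool


def choose_store(stats: TableStats, min_reads: int = 100) -> str:
    """Return 'hot' for frequently read reporting tables, otherwise 'warm'.

    The threshold is an assumed example value; a real decision would also
    weigh data volume, retention rules, and load patterns.
    """
    if stats.used_for_reporting and stats.reads_last_30_days >= min_reads:
        return "hot"
    return "warm"


if __name__ == "__main__":
    for table in (
        TableStats("sales_data_mart", 5000, True),
        TableStats("psa_landing", 40, False),
        TableStats("historic_snapshots", 12, True),
    ):
        print(table.name, "->", choose_store(table))
```

Under this rule the data mart stays in memory, while the PSA table and the rarely read snapshots go to the warm store, matching the placement shown on the slide.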
DYNAMIC TIERING USE CASES
[Diagram labels: PSA (No Reporting), Staging (No Reporting), Change Log (No Reporting), DSO (No Reporting), Historic, Snapshots, Reporting Data Marts; SAP HANA; WARM STORAGE (disk-based storage)]
- Use Case 1: PSA. The persistent staging area is the temporary landing zone for any data loaded into BW. We currently retain 3 to 15 days of data in the PSA, which constitutes 8% of our HANA usage.
- Use Case 2: Staging tables. We use staging tables in HANA to store data for processing, lookups, etc. They constitute 10% of our HANA usage.
- Use Case 3: BW change log tables. These are used to determine delta records in BW. We retain 3 to 15 days of data in them, which constitutes 8% of our HANA usage.
- Use Case 4: BW L1 staging DSOs. We use these as staging tables. They hold all the data loaded into BW at the most granular level, retain massive volumes of data, and constitute about 25% of our HANA usage.
- Use Cases 5 & 6: We plan to store all our historic data (data older than 3 years) and snapshots in DT for slow reporting.
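Adding up the shares quoted in use cases 1 to 4 gives a rough upper bound on what could leave memory. The sketch below assumes the four percentages are non-overlapping shares of the same total. The 1000 GB system size is a hypothetical figure for illustration, not a number from the session.

```python
# Shares of HANA usage quoted in use cases 1-4, in percent.
WARM_CANDIDATE_SHARES = {
    "PSA": 8,
    "Staging tables": 10,
    "BW change log tables": 8,
    "BW L1 staging DSOs": 25,  # quoted as "about 25%"
}


def warm_candidate_share(shares):
    """Combined percentage of HANA usage that is a candidate for warm storage."""
    return sum(shares.values())


def hot_footprint_after_tiering(total_gb, shares):
    """Estimated in-memory footprint (GB) once the listed objects move to disk."""
    return total_gb * (100 - warm_candidate_share(shares)) / 100


if __name__ == "__main__":
    print(warm_candidate_share(WARM_CANDIDATE_SHARES))               # 51
    print(hot_footprint_after_tiering(1000, WARM_CANDIDATE_SHARES))  # 490.0
```

On these assumptions roughly half of the in-memory footprint (about 51%) is staging-type data with no reporting need, before counting the historic data and snapshots of use cases 5 and 6.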
HADOOP OVERVIEW
APACHE SPARK
- Apache Spark is a fast and general engine for large-scale data processing.
- Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
- Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
- The Data Sources API provides a pluggable mechanism for accessing structured data through Spark SQL.
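The idea behind a DAG execution engine is that transformations are only recorded, and nothing runs until a result is requested. The toy class below illustrates that behavior in plain Python. `LazyDataset` is a hypothetical stand-in, not Spark's API, and its lineage is a simple linear chain, the most basic case of a DAG.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's RDD API)."""

    def __init__(self, source, steps=()):
        self._source = source        # a re-iterable collection of input records
        self._steps = tuple(steps)   # recorded lineage: (kind, function) pairs

    def map(self, fn):
        # Transformation: record the step, do not execute it yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: record the step, do not execute it yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def lineage(self):
        """Names of the recorded steps, in execution order."""
        return [kind for kind, _ in self._steps]

    def collect(self):
        """Action: run every recorded step over each record in one pass."""
        results = []
        for record in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                results.append(record)
        return results


if __name__ == "__main__":
    dataset = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
    print(dataset.lineage())  # ['map', 'filter']
    print(dataset.collect())  # [6, 8]
```

Because the whole chain is known before execution, an engine can plan it as a unit, for example by pipelining the map and filter over each record instead of materializing an intermediate result.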
HADOOP/SPARK INTEGRATION WITH HANA
- 2012, SP06: Hive added as a remote source; ODBC-based communication
- SP07: Query optimization such as remote caching and join relocation
- SP09: Reading HDFS directly; MapReduce job execution
- SP10: Spark SQL added as a new remote source; Ambari launcher tile in HANA Cockpit
[Diagram labels: Ambari, Hive, Mahout, Pig, YARN/MR, Spark, HBase, HDFS]
THE JOURNEY SO FAR: HANA & HADOOP INTEGRATION
HANA & Hadoop Integration
- SQL on Hadoop via SDA (virtual tables): Hive (SPS06)
- Remote caching with Hive (SPS07)
- Connectivity to Apache Spark using ODBC
- Execution of MR jobs via HANA (virtual functions) and direct access to HDFS (SPS09)
- Spark SQL adapter via SDA (SPS10)
- Join relocation to Hadoop through Spark RDD
- Unified admin through Ambari integration for Hortonworks
Key Benefits
- Deep integration for storage & processing
- Optimized data access between HANA & Hadoop
- Data tiering to Hadoop for cold storage
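The virtual-table idea, one SQL statement that joins local data with data held in another store, can be illustrated by analogy. The sketch below uses SQLite's `ATTACH DATABASE` from Python's standard library as a stand-in for a remote source. It is not SAP HANA SDA syntax, and the schema name `edh` and the table and column names are hypothetical.

```python
import sqlite3


def build_demo_connection():
    """Create a 'warehouse' database with a second, attached 'remote' database."""
    conn = sqlite3.connect(":memory:")                 # plays the warehouse
    conn.execute("ATTACH DATABASE ':memory:' AS edh")  # plays the remote source
    conn.execute(
        "CREATE TABLE main.products (product_id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.execute(
        "CREATE TABLE edh.sensor_readings (product_id INTEGER, reading REAL)"
    )
    conn.executemany(
        "INSERT INTO main.products VALUES (?, ?)",
        [(1, "connector"), (2, "relay")],
    )
    conn.executemany(
        "INSERT INTO edh.sensor_readings VALUES (?, ?)",
        [(1, 0.5), (1, 1.5), (2, 4.0)],
    )
    return conn


def average_reading_per_product(conn):
    """Join warehouse master data with 'remote' readings in a single query."""
    cursor = conn.execute(
        """
        SELECT p.name, AVG(r.reading)
        FROM main.products AS p
        JOIN edh.sensor_readings AS r ON r.product_id = p.product_id
        GROUP BY p.name
        ORDER BY p.name
        """
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = build_demo_connection()
    print(average_reading_per_product(connection))
    # [('connector', 1.0), ('relay', 4.0)]
```

In this analogy the caller writes ordinary SQL and does not need to know where each table lives. Optimizations listed above, such as remote caching and join relocation, concern where and how such a join is actually executed.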
SAP HANA VORA: THE POWER OF CONTEXT, IN-MEMORY
A massively distributed in-memory computing system that scales to 1000s of nodes, both on-prem and in the cloud, and simplifies big data processing for the business.
Any Hadoop. For Cloud.
[Diagram labels: Data Hierarchies, Semantic Analysis & Optimization, Query Acceleration, Metadata Catalog; Apache Spark, Other]
MOTIVATION - (SOME) EXISTING SOLUTIONS
- Hadoop: MLlib, GraphX, HiveSQL, SparkSQL
- Distributed SQL Databases: SAP HANA, Google F1, Facebook Presto, Amazon Redshift
- No-SQL Databases: MongoDB (document store), Neo4j (graph store), Berkeley DB (key-value store), IBM Informix (time series store), Apache Lucene (text search)
Holistic, enterprise-ready, and massive scale-out solution?
IN-MEMORY DATA FABRIC FOR ENTERPRISE + DISTRIBUTED COMPUTE
ALL IN-MEMORY
[Diagram; layers: CONSUME, COMPUTE, STORE]
Enterprise Compute: HANA; OLTP + OLAP; Scale Up + Scale Out + Tiering; Appliance, TDI
Federated Queries & Programming Model
Distributed Compute: multiple Vora nodes; Massive Scale Out
Store: Distributed File System, Network Storage, Cloud Persistence
Any Hardware
SAP HANA VORA: STRATEGIC POINT OF VIEW
SAP HANA
- Add functionality for enterprise applications
- Hierarchies
- OLAP modeling
- Boost SQL performance
- Federate access across HANA and Hadoop
- Integrate tooling
SAP HANA VORA: EXTENDING THE SAP HANA PLATFORM
[Platform diagram; labels as shown]
Applications: S/4HANA / HANA Live, SAP Business Warehouse, Industry Applications, Partner Applications
SAP HANA Platform: Application Services, Database Services, Integration Services
Vora: Data Hierarchies, Query Acceleration, Semantic Analysis & Optimization, Metadata Catalog
Other labels: Spark, Other, Unstructured Data
SAP HANA VORA ROADMAP
Columns: Today, Planned Innovations, Future Direction
Deliver Enterprise Analytics and HANA-Spark integration
- Enable OLAP-style analytics on Hadoop data
- Support for HDFS, Parquet, ORC and S3 data formats
- Hierarchies on Hadoop data
- Integration to SAP HANA through Apache Spark Data Source API
- LLVM technology to translate SQL code to C programs for faster performance
Deeper Integration with SAP HANA
- Modeler for Vora: simple web interface to model, build and query cubes
- More OLAP features like UoM conversion, currency conversion
- Extend HANA Data Lifecycle Management to Hadoop through Vora integration
- Extend engine support for time series, graph, document store and disk-based processing
- Enhanced Kerberos support
SAP ILM integration and beyond
- Extend HANA integration to support ERP ILM scenarios for archived data in Hadoop
- Distributed query processing using native Vora processing engine
- Cluster integration for different distributions (monitoring and admin)
- HANA integration: HANA shell as an integrated part of Vora delivery
- Security for Vora tables
SAP HANA Vora 1.1
This is the current state of planning and may be changed by SAP at any time.
FOLLOW US
Thank you for your time
Follow us at @ASUG365