Lofan Abrams
Data Services for Big Data
Session #2987
Big Data: Are you ready for blast-off?
Big Data, for better or worse: 90% of the world's data was generated over the last two years. ScienceDaily, May 22, 2013.
Barriers to operational effectiveness
- Scattered information
- Heterogeneous / complex sources
- Data explosion
- Trustworthiness of information
- Handling unstructured content
Stored data: 15% structured, 85% unstructured (SAP, 2008)
SAP Solutions for Enterprise Information Management: Information Ready for Action
Diagram: governed EIM capabilities feeding analytics and business processes: Data Quality Management, Data Integration, Master Data Management, Big Data & IoT, Data Discovery, Information Lifecycle Management, Content Management, Compliance.
SAP Solutions for Trusted Data: Proven Leader in Every Category
Data Services and Big Data Sources: Hadoop, MongoDB, Google BigQuery
Big Data in SAP Data Services
Value proposition
- Use one single ETL tool to move data (structured and unstructured) to big data stores and data warehouses.
- Simple to use, with the same dataflow designer for all types of sources and targets, and data preview capabilities to enhance developers' productivity.
Use cases
- Extract data (with the right filters pushed down to the source) from MongoDB or Hadoop into a DWH for analytics (HANA, Teradata, Google BigQuery, ...).
- ETL experts don't have knowledge of languages like Pig script or MongoDB syntax and need a code-free UI.
How it works
- Native datastores for MongoDB, Google BigQuery, and Hadoop (HDFS, Hive), plus an adapter SDK open to partners to build more adapters.
- Data preview and data profiling for Hadoop sources built into the Designer user interface.
Hadoop Datastore
Certified with the top two Hadoop distributions:
- Hortonworks 2.2 (source, target)
- Cloudera CDH 5.3 (source, target)
Hadoop/Hive Support since Data Services 4.1
Diagram: Data Services moving data between files, databases, web and other sources, Hadoop, and target systems (HANA, IQ, others).
- Same familiar, easy-to-use UI design paradigm for Hadoop/Hive as for other database systems, with specific behind-the-scenes extensions to leverage the power, scale and unique functionality of Hadoop
- High-performance reading from and loading into both Hadoop (HDFS) and Hive
- Makes use of Hadoop capabilities by delegating operations to Hadoop/Hive systems (T-E-L)
- Extended optimizer is fully HiveQL and Pig aware and generates optimized scripts for Hive and Hadoop
Hive Support
- Full metadata support via JDBC: browse and explore Hive tables
- DS generates HiveQL and pushes operations down to Hive: joins, sorting, filters, and functions including aggregation functions
- High-performance, scalable reading from Hive: multi-threaded, parallel reading of Hive results (not JDBC); all types of column partitioning are supported
- High-performance loading into Hive: support for inserts and updates, both static and dynamic partitioning, and multi-threaded (parallel) loading
Diagram: Data Services reading/loading Hive metadata via JDBC and data via HDFS files.
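Purely as an illustration (not Data Services itself), a minimal sketch of the kind of HiveQL such pushdown produces, run here through the open-source PyHive client; the host, tables, and columns are assumptions.

```python
# Sketch: the join/filter/aggregation work runs inside Hive, and only the
# aggregated result rows come back to the client. Assumes a reachable
# HiveServer2 and hypothetical tables sales and customers.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, username="ds_user")
cursor = conn.cursor()

# Equivalent of pushing a join, filter, and aggregation down to Hive
cursor.execute("""
    SELECT c.region, SUM(s.amount) AS total_amount
    FROM sales s
    JOIN customers c ON s.customer_id = c.customer_id
    WHERE s.order_date >= '2015-01-01'
    GROUP BY c.region
""")

for region, total_amount in cursor.fetchall():
    print(region, total_amount)

cursor.close()
conn.close()
```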
HDFS Support
- Access to metadata and structure of files in HDFS
- DS generates Pig and pushes operations down to Hadoop; operations include joins, sorting, filters and projections, and functions including aggregation functions
- Reading from Hadoop: high-performance, parallel reading of the files produced by the generated Pig script
- Ability to invoke pre-defined or custom Pig scripts
- High-performance, file-based loading into Hadoop
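Again only as a hedged illustration outside Data Services, a sketch of parallel reads of HDFS part files using the open-source hdfs (WebHDFS) Python client; the NameNode URL, directory, and worker count are assumptions.

```python
# Sketch: list the part files of a dataset in HDFS and read them in
# parallel threads. Assumes the `hdfs` PyPI package and a WebHDFS
# endpoint at the hypothetical URL below.
from concurrent.futures import ThreadPoolExecutor
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.com:50070", user="ds_user")
base_dir = "/data/orders"  # hypothetical output directory of a Pig job

def read_part(file_name):
    with client.read(f"{base_dir}/{file_name}", encoding="utf-8") as reader:
        return reader.read()  # in practice, stream instead of loading fully

part_files = [f for f in client.list(base_dir) if f.startswith("part-")]
with ThreadPoolExecutor(max_workers=4) as pool:
    for content in pool.map(read_part, part_files):
        print(len(content), "characters read")
```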
Data Preview for Hadoop Hive Tables
Hive table preview includes:
- Data preview
- Profile preview
- Column profile preview
- Filtering
Data Preview for Hadoop HDFS Files
Offers View Data (no profiling) for Hadoop HDFS files:
- In the datastore
- When used as source or target in a dataflow
- Including filtering and sorting pushed down to HDFS
Enable SSL Certificate for Hadoop
To enable SSL in the HIVE adapter, set SSL Enabled = yes.
- SSL Trusted Store and Password: the name of the trust store used to verify credentials and store certificates, and its associated password. The trust store holds certificates from the third parties your Java application communicates with, or certificates signed by certificate authorities (such as Verisign, Thawte, or GeoTrust) that can be used to identify those third parties.
- Additional Properties: specifies any additional connection properties. Property/value pairs must be separated by a semicolon.
Support for the SQL() function for Hadoop
- Added support for HIVE datastores
- Used for Data Definition Language (DDL) and Data Manipulation Language (DML) statements on HIVE databases
- Useful for managing database objects as a precursor to DS code execution
- Can also be used for post-process database information retrieval
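As a hedged illustration of the kind of DDL/DML a job might run as a precursor step, a sketch using PyHive rather than the DS SQL() function itself; the connection details and table are hypothetical.

```python
# Sketch: run DDL/DML against Hive before the main load, e.g. (re)create
# a staging table, then retrieve information afterwards. Assumes a
# reachable HiveServer2; all names are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, username="ds_user")
cursor = conn.cursor()

# DDL precursor: make sure the staging table exists with the expected layout
cursor.execute("DROP TABLE IF EXISTS stg_orders")
cursor.execute("""
    CREATE TABLE stg_orders (
        order_id   BIGINT,
        customer   STRING,
        amount     DOUBLE
    )
    STORED AS ORC
""")

# Post-process retrieval: check how many rows the load produced
cursor.execute("SELECT COUNT(*) FROM stg_orders")
print("rows in staging table:", cursor.fetchone()[0])

cursor.close()
conn.close()
```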
Support for the SQL Transform for Hadoop
- The SQL Transform supports a single SELECT statement only
- Used for standard SQL selects taken from existing scripts outside DS
- SELECT statements can be parameterized
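A minimal sketch of a parameterized single SELECT, again with PyHive as a stand-in client; the pyformat parameter style shown is PyHive's, not the DS transform's own syntax, and the table is hypothetical.

```python
# Sketch: one parameterized SELECT, with the cutoff date bound at run time.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, username="ds_user")
cursor = conn.cursor()

# PyHive uses the DB-API "pyformat" parameter style
cursor.execute(
    "SELECT order_id, amount FROM orders WHERE order_date >= %(start_date)s",
    {"start_date": "2015-01-01"},
)
print(len(cursor.fetchall()), "rows selected")

cursor.close()
conn.close()
```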
Support for join pushdown operations for Hadoop
Why pushdown? Pushing transforms and functions down to the source or target database leverages the database's power instead of performing these operations in the Data Services engine. Especially when the source and target tables are in the same database, this gives the best performance, since no data is extracted from the database.
- Join pushdown operations are supported for Hadoop, e.g. using a Data Transfer transform to stage data from a non-Hive source into HIVE
MongoDB Datastore
What is MongoDB?
MongoDB is a popular document-oriented (open source) database. It is a NoSQL database with dynamic schemas that stores data in a (nested) JSON-like format. MongoDB ranked #4 among the most popular databases in February 2015 (http://db-engines.com/en/ranking).
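For a concrete picture of such nested, JSON-like documents, a small sketch with the pymongo driver; the database, collection, and fields are invented for illustration.

```python
# Sketch: insert and read back one nested document. Assumes a local
# MongoDB instance; all names below are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["sales"]["orders"]

# Documents are nested and schema-less: two documents in the same
# collection may carry different fields.
orders.insert_one({
    "order_id": 1001,
    "customer": {"name": "ACME Corp", "country": "US"},
    "items": [
        {"sku": "A-100", "qty": 2, "price": 9.99},
        {"sku": "B-200", "qty": 1, "price": 24.50},
    ],
    "status": "A",
})

print(orders.find_one({"order_id": 1001}))
```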
MongoDB use case for Data Services
Enable our customers to extract data from MongoDB as a source (co-innovation with a US customer) and load it into a target for analytics.
MongoDB adapter in the Management Console
- Implemented as a new adapter leveraging the Data Services adapter SDK.
- The adapter needs to be added and started in the Management Console before it can be used in a datastore.
MongoDB datastore
The MongoDB adapter supports:
- Single (primary) and replica set (secondary) deployments
- Sharded clusters: sharding is the process of storing data across multiple machines; MongoDB uses this approach to support large data set deployments and high-throughput operations
- MongoDB credential, LDAP and Kerberos authentication
- SSL certificates
Since MongoDB does not have a schema definition, Data Services scans a sample set of documents ("Rows to scan") in the collection and creates a schema based on the superset of all fields (sketched below).
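A hedged sketch of that sampling idea in plain Python with pymongo: scan the first N documents and take the superset of their fields. The connection string, collection, and sample size are assumptions; the adapter's actual implementation is not shown here.

```python
# Sketch: infer a flat "schema" for a schema-less collection by scanning
# a sample of documents and taking the superset of all top-level fields.
from pymongo import MongoClient

ROWS_TO_SCAN = 100  # analogous to the datastore's "Rows to scan" option

client = MongoClient("mongodb://mongo1.example.com:27017,mongo2.example.com:27017/?replicaSet=rs0")
collection = client["sales"]["orders"]

fields = set()
for doc in collection.find().limit(ROWS_TO_SCAN):
    fields.update(doc.keys())  # nested fields would need a recursive walk

print("inferred columns:", sorted(fields))
```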
MongoDB documents as source in a dataflow
- Collections are imported as documents in the repository.
- The nested structure is preserved; with XML_Map in Data Services you can manipulate the data.
- Filters defined in the WHERE clause are pushed down to the database.
- More advanced filter conditions can be defined in the adapter parameter "Query criteria" using the MongoDB syntax (example below).
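To show what filter conditions in MongoDB syntax look like, a minimal pymongo sketch; the collection and criteria are made up, and both filters are evaluated by the database rather than the client.

```python
# Sketch: a simple equality filter (what a WHERE clause maps to) and a
# richer query-criteria document in MongoDB syntax. Both run inside
# MongoDB; only matching documents are returned to the client.
from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017")["sales"]["orders"]

# Simple filter, roughly WHERE status = 'A'
for doc in orders.find({"status": "A"}).limit(5):
    print(doc)

# More advanced criteria: nested field match plus an $or condition
criteria = {
    "customer.country": "US",
    "$or": [{"status": "A"}, {"qty": {"$lt": 30}}],
}
for doc in orders.find(criteria).limit(5):
    print(doc)
```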
Google BigQuery Datastore
What is Google BigQuery?
Google BigQuery uses Google's data storage in the cloud for fast, interactive analysis of huge amounts of data. It enables super-fast, SQL-like queries against append-only tables, using the processing power of Google's infrastructure.
The main use case for Data Services is to load data into BigQuery for analytics.
Note: in this release, BigQuery can be used as a target only, not as a source.
Google BigQuery Datastore
- Native Google BigQuery datastore (in the Applications category)
- Certificate-based login: import the private key file and provide the password
- The private key is generated from the Google BigQuery account page
- When exporting a GBQ datastore, the private key is NOT exported and needs to be imported again in the target repository (with the correct passphrase)
Google BigQuery as target in a dataflow
- Browse metadata and import tables; tables can contain nested data
- Google BigQuery can be used as a target table only
- Note: template tables are not supported, but from a query you can generate a JSON structure which can be used to create the target via the BigQuery web console
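Outside Data Services, a hedged sketch of loading rows (one with a nested record) into an existing BigQuery table with the google-cloud-bigquery client; the project, dataset, and table IDs are placeholders.

```python
# Sketch: stream a few JSON rows, one of them with a nested record, into
# an existing BigQuery table. Assumes application default credentials and
# a hypothetical table my-project.analytics.orders.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.analytics.orders")

rows = [
    {"order_id": 1001, "amount": 44.48,
     "customer": {"name": "ACME Corp", "country": "US"}},  # nested record
    {"order_id": 1002, "amount": 15.00,
     "customer": {"name": "Globex", "country": "DE"}},
]

errors = client.insert_rows_json(table, rows)
print("insert errors:", errors)  # an empty list means all rows were accepted
```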
Roadmap: Current, Planned Innovations and Future Direction
SAP Data Services product road map overview: key themes and capabilities

Today (Release 4.2 SP5)
Simple
- Enhanced runtime troubleshooting process by introducing the Bypass dataflows and workflows feature
- Enabled Switch repositories capability in Designer
Big Data
- Enhanced support for IQ, HANA, and other Big Data sources
- Simplified real-time CDC with SAP Replication Server
- New connectivity for OData, JSON, REST, MongoDB, Google BigQuery and JDBC
- Certified Hadoop Cloudera and Hortonworks distributions
- Support for DDL and DML and data preview for Hadoop
- Added Sharded Cluster support for MongoDB
- Security enhancements for Hadoop & MongoDB: SSL certificates and role-based authentication (LDAP, Kerberos)
Enterprise Support
- Support for pattern variance in the Data Masking transform
- Added Secured Remote File Adapter
- Built-in functions for file transfer (SFTP) and file manipulation

Planned Innovations
Simple
- Simplify Data Services software upgrades
- Improve Substitution Parameters management
- Add preview and select capability for importing objects into the DS repository from a file
- Merge DS Workbench capabilities into DS Designer
Big Data
- Support Hadoop on the Windows platform
- Enhance existing connectivity (source/target)
- Token-based security
Enterprise Support
- Integrate comprehensive runtime stats of DS batch/real-time jobs with SAP Solution Manager
- Native integration with SAP NetWeaver CTS+ to deliver a single transport tool for DS, SAP and other applications
- Integrate TA 5.x to enhance the Text Data Processing engine
- Data Quality global expansion in Asia Pacific

Future Direction
Simple
- Show a graphical dataflow monitor and identify bottlenecks
- Self-guiding user interfaces to enhance user experience
Big Data
- Expanding support for new sources/targets based on market traction
- Data Model advisor for the HANA database
- Tight integration with Big Data solutions (Spark, YARN, ...)
Enterprise Support
- Resource Advisor to provide clarity of system usage
- Data Services components health monitor with proactive job alerts and analysis
- Data Services datastore as a service
Demo
Why SAP?
SAP Solutions for Enterprise Information Management: Proven and trusted
- 12,000+ SAP EIM customers worldwide
- Winners: Swiss Re, Kraft Foods Inc. and Lexmark Intl. are winners of Gartner MDM Excellence Awards
- Leader in every EIM category: Master Data Management, Data Quality, Data Integration, Enterprise Architecture, Enterprise Content Management, Enterprise Data Virtualization
- 90% customer satisfaction rating
- #2 market share for data integration and data quality
Stay Informed
Follow the ASUGNews team:
- Tom Wailgum: @twailgum
- Chris Kanaracus: @chriskanaracus
- Craig Powers: @Powers_ASUG
SESSION CODE 2987