Presenters: Luke Dougherty & Steve Crabb
About Keylink
Keylink Technology is Syncsort's partner for Australia & New Zealand.
Our Customers: (customer logos shown on slide)
www.keylink.net.au

"ETL is THE best use case for Hadoop."
Shanti Subramanyam, CEO, Orzota
Orzota provides Big Data services to customers like: (customer logos shown on slide)

"ETL is THE best use case for Hadoop. However, with the huge hype about data science and analytics, many customers are now confused and seem to think that Big Data means insights. It is never that easy. You need to crawl before you walk. The best thing we in the Big Data community can do is to get away from the hype and encourage enterprises to adopt Big Data with the most straightforward use cases and ETL is certainly the first one."

The Impact of ELT & Dormant Data on the EDW
- ELT drives up to 80% of database capacity!
- Dormant, rarely used data wastes premium storage (data tiers: Hot, Warm, Cold).
- Transformations (ELT) of unused data: ETL/ELT processes on dormant data waste premium CPU cycles.

A Complete Solution to Harness the Power of Hadoop MapReduce
Syncsort Contributions to Apache Foundation

Native Sort:
- Modular
- Extensible
- Configurable through use of external sorters on MapReduce nodes

JIRA   Description
4807   Allow MapOutputBuffer to be pluggable
4808   Allow Reduce-side merge to be pluggable
4809   Make classes required for 2454 public
4812   Create reduce input merger plug-in
4842   Shuffle race can hang reducer
2461   HDFS file name globbing in libhdfs
4482   Backport of 2454 to MapReduce 1 & 1.2

And more, even Mainframe support!
SQOOP-1272: Support importing mainframe datasets

Case Study: Optimizing the EDW at Bank of America
- Offload ELT processing from data warehouse into Hadoop using DMX-h
- Implement flexible architecture for staging and change data capture
- Ability to pull data directly from Mainframe
- No coding. Easier to maintain & reuse
- Enable developers with a broader set of skills to build complex, large-volume data pre-processing and transformation workflows

Elapsed time (minutes): HiveQL 217 min; DMX-h 9 min
Development effort: DMX-h 5 man-days; HiveQL 15 man-days

Benefits:
- Cut development time by 2/3
- Reduced complexity. From 47 HiveQL scripts to 4 DMX-h graphical jobs
- Eliminated need for Java user-defined functions
- 24x faster!

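The headline ratios on this slide follow directly from the chart figures. A minimal sketch of that arithmetic, using only the numbers quoted above (the function and variable names are illustrative, not part of any Syncsort tooling):

```python
def speedup(baseline: float, improved: float) -> float:
    """How many times faster the improved run is than the baseline."""
    return baseline / improved


def fraction_saved(baseline: float, improved: float) -> float:
    """Share of the baseline that was eliminated (0.0 to 1.0)."""
    return (baseline - improved) / baseline


# Figures quoted on the slide.
HIVEQL_MINUTES, DMXH_MINUTES = 217, 9   # elapsed time
HIVEQL_DAYS, DMXH_DAYS = 15, 5          # development effort, man-days
HIVEQL_SCRIPTS, DMXH_JOBS = 47, 4       # artifacts to maintain

elapsed_speedup = speedup(HIVEQL_MINUTES, DMXH_MINUTES)
effort_saved = fraction_saved(HIVEQL_DAYS, DMXH_DAYS)
artifacts_saved = fraction_saved(HIVEQL_SCRIPTS, DMXH_JOBS)

print(f"Elapsed-time speedup: {elapsed_speedup:.1f}x")
print(f"Development effort saved: {effort_saved:.0%}")
print(f"Fewer scripts/jobs to maintain: {artifacts_saved:.0%}")
```

The 217-to-9 minute drop works out to roughly 24x, and 15 versus 5 man-days is the two-thirds cut in development time.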
Case Study: Offloading Teradata at HealthCore
- Support growing healthcare research data
- Offload ELT workloads to Hadoop
- Free up & optimize valuable Teradata capacity
- Accelerate Hadoop initiative: quick ramp-up, no need for specialized skills
- Empower existing IT staff with the use of point & click graphical user interface
- No manual coding, no tuning

$1.4M projected TCO savings over 3 years
TCO, ELT on Teradata: $1.8M
TCO, ETL on Hadoop: $390k

Benefits:
- Projected TCO savings of $1.4M over 3 years
- Eliminated need of additional $300k TD expense
- Helped build a modern architecture to support growing data volumes and next-generation analytics

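The projected savings figure can be reproduced from the two TCO bars. This is a small sketch under the assumption that both bars cover the same 3-year period; the names are illustrative only:

```python
# 3-year total cost of ownership, as quoted on the slide (US dollars).
TCO_ELT_ON_TERADATA = 1_800_000
TCO_ETL_ON_HADOOP = 390_000

# Projected savings is simply the difference between the two options.
projected_savings = TCO_ELT_ON_TERADATA - TCO_ETL_ON_HADOOP
savings_share = projected_savings / TCO_ELT_ON_TERADATA

print(f"Projected savings: ${projected_savings / 1e6:.2f}M")
print(f"Share of the Teradata TCO avoided: {savings_share:.0%}")
```

The difference is $1.41M, which the slide rounds to $1.4M.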
Case Study: Improving Customer Service & Reducing Costs at comScore
Diagram labels: Web Log Data, Panel Data, Census Data; Integrate & Shift Data to Hadoop; Pre-process & Analysis; Pre-process; Analytics; EDW; Delivery

- Company collects over 1.7 trillion records per month
- Hadoop cluster with 290+ nodes; 9,200+ total cores; 19.5 TB of memory; 6 PB of space

Challenges:
- Increase SLAs for digital services & products to increase competitiveness
- Reduce storage requirements
- Manage over 72x data growth in 2 years!

70% Improved Processing: 3.5 Billion Input Rows/Day
- Pig & Java UDF: 550 lines of code; 34 mins
- Syncsort DMX-h: 8 reusable tasks; 11 mins

Benefits:
- Deliver data faster to customers by increasing throughput per node by up to 70%
- Save 1 Petabyte of data every 6 months
- Reduce capital and ongoing operational expenses
- Accelerate development & democratize access to Hadoop with point-and-click interface

Coding on Hadoop vs. Syncsort Graphical Design Approach
Break Free from Hadoop Complexity
Design Once, Deploy Anywhere!
Diagram labels: Intelligent Execution Layer; Windows, Linux, Unix; Hadoop; Cloud

- Visually design data transformations once, and run anywhere
- No changes or tuning required
- Combine new and legacy sources for bigger insights
- Intelligent Execution Layer dynamically optimizes the job for each platform: Hadoop, Windows, Unix, Linux or Cloud
- Future-proof your applications!

One-step Access to All Your Data
Build Your Enterprise Data Hub
Sources shown: Avro, Oracle, Cassandra, JSON, Files, HBase, Teradata, Parquet, MongoDB, Vertica, Cloud, Mainframe, Netezza; all feeding Hadoop + DMX-h

- Collect virtually any data from mainframe to Big Data and NoSQL sources
- Load data directly into Avro & Parquet. No staging required
- Access & translate mainframe data using Sqoop and Spark
- Let DMX-h dynamically split the data and load it to HDFS in parallel

Make Data Available to Business Analysts
Achieve the Fastest Path from Raw Data to Insight
Diagram labels: NoSQL; Hadoop + DMX-h

- Create Tableau & QlikView files with one click
- Achieve the fastest data loads without tuning hassles:
  - Fastest parallel loads to Greenplum, Netezza, Teradata & Vertica
  - High-performance connectivity to Big Data & NoSQL databases such as Cassandra, HBase & MongoDB

SILQ Helps You Fast-track Your EDW Offload Projects

What?
- Web-based utility helps you shift ELT processing from the data warehouse into Hadoop
- Provides integrated analysis of ELT SQL jobs

How?
- Reads BTEQ, NZ SQL, PL/SQL. ANSI SQL-92 compliant
- Generates graphical data flow
- Provides best practices to develop DMX-h jobs
- Resulting DMX-h jobs run natively on Hadoop

Syncsort is the only Big Data company with a SQL Offload utility!

Test Drive Syncsort DMX-h
www.syncsort.com/try