GO BIG WITH DATA PLATFORMS: HADOOP AND TERADATA
Betsy C. Huntingdon, Product Marketing Manager
May 13, 2014, Columbus, OH
Spring Teradata User Group Meetings
AGENDA
> UDA and the Data Platform
> Teradata Appliance for Hadoop
> Integrated Big Data Platform 1700
> Q&A
Copyright Teradata
TERADATA UNIFIED DATA ARCHITECTURE: System Conceptual View
[Diagram: SOURCES on the left, the platform family in the center under MOVE, MANAGE, and ACCESS, and ANALYTIC TOOLS & APPS and USERS on the right]
SOURCES
> ERP, SCM, CRM, Images, Audio and Video, Machine Logs, Text, Web and Social
PLATFORM FAMILY
> INTEGRATED DATA WAREHOUSE, DATA PLATFORM, INTEGRATED DISCOVERY PLATFORM
> INTEGRATED BIG DATA PLATFORM, APPLIANCE FOR HADOOP, BIG ANALYTICS APPLIANCE
ANALYTIC TOOLS & APPS
> Marketing, Applications, Business Intelligence, Data Mining, Math and Stats, Languages
USERS
> Marketing, Executives, Operational Systems, Customers, Partners, Frontline Workers, Business Analysts, Data Scientists, Engineers
Teradata and Hadoop Positioning
Teradata: Characteristics
> High performance analytics and complex joins
> High concurrency
> SQL (ANSI and ACID compliant)
> Advanced workload mgmt.
> High Availability
> Data Governance
> Emerging Late Binding
> Fine Grain Security
> One-stop support
Use Cases
> Low $/TB Long-Term Raw Data Storage
> ETL
> Reporting
> Deep Analytics
Hadoop: Characteristics
> Fast Data Landing and Staging
> MapReduce, Hive, Pig
> Emerging SQL/SQL-like interfaces
> Batch-oriented processing
> Low workload concurrency
> Multi-structured and file based data
> Late Binding
> Open Source Community
Hadoop Data Platform
Data Lake
> Single source of raw data
> Drag-the-Lake for new insights
> Co-location versus line of business data marts
ETL
> Transforms
> Data set creation
> Data manipulation
> ETL new data
The Data Lake
A Data Lake is a massive repository enabled by low cost technologies that improves the capture, refinement, and exploration of raw data within an enterprise.
Single source of raw, historical, operational data
Cost effectively explore data sets
> Unknown, underappreciated, or unrecognized value
Consolidate data environments
> Reduces costs and analytical discrepancies
Co-location of files enables light, on-the-fly integration
[Diagram labels: Web Logs, Mobile, Sensors, Files, IDW]
ETL on Hadoop or ELT in Teradata
Where Hadoop will Shine
> CPU intensive calculations
> Scans of data
> Complex logic
> Fast ingest
Where Hadoop will be Challenged
> I/O intensive calculations
> Seeks of data
> Complex joins
> Service level agreements
Archival using Teradata Hadoop
Situation
> Large pharmacy healthcare provider has a variety of data with different value
> Some data not useful in data warehouse
Problem
> Long term storage data cannot be queried
> No analysis can be performed on the archived data
> Losing out on business value from this data
Solution
> Teradata Hadoop nodes store weblogs, medical data, JSON files
> Hadoop enrichment layer enhances data for analytics consumption
> Use UDA platforms for easy movement and access
Impact
> Reduced storage costs for data variety
> Perform ad hoc analytics on the multiple versions of data
> Retrieve data in minutes (vs. days with tape archives)
> Reduced load and improved performance of DW/Databases
Telematics in Insurance: Geospatial analytics for better risk management
Situation
> Insurer needs accurate risk scores to adjust premiums for corporate auto fleets
> Data collected: vehicle data, driver behavior, GPS, weather, traffic
> Current custom application limits scoring effectiveness
Problem
> Limited storage capacity/infrastructure for huge volumes of real time data
> No ad hoc reporting or analytic systems
Solution
> Teradata Appliance for Hadoop to ingest telematics data
> Combine with other data sources to perform risk analysis
Impact
> Quickly analyze data plus ad hoc reporting
> Streamlined process to calculate vehicle and fleet scores
> Cost effectively quantify, adjust and manage risk premiums
Telematics Use Case: Data Architecture
[Diagram labels]
> Telematics Service Provider (TSP) streaming and transforming
> Apache Storm: streaming TSP data (sources, formats)
> Standard Format
> VIN data, Enhanced GPS, Vehicle Accelerometer data
> Sessionize, Trip files
> Vehicle scores
> Apache Hive for ad-hoc querying and reporting
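The "Sessionize" step that produces trip files can be pictured with a minimal Python sketch. It is an illustration only, not the insurer's pipeline. The record layout (a VIN plus a timestamp) and the 5-minute gap threshold are assumptions, and `sessionize` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(pings, max_gap=timedelta(minutes=5)):
    """Group (vin, timestamp) pings into trips per vehicle.

    A new trip starts whenever the gap since the vehicle's previous ping
    exceeds max_gap. The 5-minute default is an illustrative assumption.
    """
    by_vin = defaultdict(list)
    for vin, ts in pings:
        by_vin[vin].append(ts)

    trips = {}
    for vin, stamps in by_vin.items():
        stamps.sort()
        vin_trips = [[stamps[0]]]
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev > max_gap:
                vin_trips.append([cur])      # gap too long: start a new trip
            else:
                vin_trips[-1].append(cur)    # same trip continues
        trips[vin] = vin_trips
    return trips


# Made-up sample pings for two vehicles.
sample = [
    ("VIN1", datetime(2014, 5, 13, 8, 0)),
    ("VIN1", datetime(2014, 5, 13, 8, 2)),
    ("VIN1", datetime(2014, 5, 13, 9, 30)),
    ("VIN2", datetime(2014, 5, 13, 8, 1)),
]
trips = sessionize(sample)
```

In a streaming setting the same gap rule would be applied per vehicle as pings arrive rather than over a sorted batch.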
TERADATA APPLIANCE FOR HADOOP
Why Teradata Appliance for Hadoop?
Building a Hadoop Cluster
> Multiple vendors
> DIY set up, install
> DIY SW/HW updates
> Integration test deploy
> Multiple consoles
Teradata
> Easy 1 vendor acquisition
> Quick set up, Plug n play
> Eliminate integration complexity
> Single pane of glass management
What is the Teradata Appliance for Hadoop?
Appliance Solution
> Purpose-built integrated hardware / software solution
> Optimized hardware for Hadoop, software, storage, and networking in a single rack
> Delivered ready to run at a competitive price point
Enterprise Ready
> Integrated with Teradata Analytical Ecosystem to expand analytical capabilities
> Support for major business intelligence, visualization, and ETL tools
> Management tools for monitoring system health
Data Staging
> Loading, storing, and refining data in preparation for analytics
Active Archiving
> Powerful solution for Unified Data Architecture for data archiving
Teradata Appliance for Hadoop Highlights
> Teradata Vital Infrastructure
> Aster and Teradata QueryGrid
> Teradata Studio with Smart Loader
> Value Added Software from Partners
> Teradata Viewpoint
> Teradata Connector for Hadoop (TDCH)
> Intelligent Start and Stop
> NameNode Failover
> Teradata Open Distribution for Hadoop
> Optimized hardware for Hadoop
> BYNET V5 40 Gb/s InfiniBand interconnect
Teradata Hadoop Enhancements: Simplifying Hadoop for Enterprise Readiness
Installation
> HadoopBuilder: systems arrive out of the box ready to run
Cluster Management (with Teradata Hadoop Tools)
> Intelligent Start/Stop: all Hadoop services are coordinated to begin/end automatically
> Single Drive Replace: simplified the hardware procedure
> Add/Replace Data node: automated the process for bare node hardware setup
Monitoring
> Viewpoint: single GUI-based view of all systems in UDA
> TVI: alerts and service dispatches for proactive issue monitoring
Availability
> Easy NameNode Failover: JobTracker and NameNode high availability works out of the box
> Full Master node HA
Hadoop + Viewpoint
System management
> Hadoop services
> System health
> Alert viewer
> Node monitor
> Space usage
> Metrics analysis
> Metrics graph
> Capacity heatmap
Studio and Smart Loader for Hadoop
Hadoop view
> Browse Hadoop tables
> Bi-directional table copying
Drag and drop interface
Maps data types between Hadoop and Teradata tables
Hadoop Table Properties
Benefits
> Simplifies Hadoop browsing
> Ad hoc data movement
> No scripting required
> Point and click
Teradata Vital Infrastructure for Hadoop
Enterprise class Hadoop support
> Hadoop hardware and software
> Proactive problem detection and fixes
Reliability, availability, manageability
Virtualized server management
> System monitoring
> Cabinet Management Interface Controller (CMIC)
> Service Work Station (SWS)
> Automatically installed on base/first cabinet
62-70% of incidents fixed proactively
Teradata 15.0: Teradata QueryGrid
[Diagram: business users (IDW) and data scientists (Discovery) work through the TERADATA DATABASE and the TERADATA ASTER DATABASE, which connect to:]
> HADOOP: remote, push-down processing in Hadoop
> TERADATA ASTER DATABASE: SQL, SQL-MR, SQL-GR
> TERADATA DATABASE: Teradata Systems
> OTHER DATABASES: remote data
> LANGUAGES: SAS, Perl, Python, R, Ruby, etc.
When fully implemented, the Teradata Database or the Teradata Aster Database will be able to intelligently use the functionality and data of multiple heterogeneous processing engines.
Teradata QueryGrid
Built with Hortonworks
> Donated to Apache
Business user query with favorite BI tools
Join Hadoop data to
> Teradata Data Warehouse
> Aster Discovery Platform
Teradata 15.0
> Bi-directional SQL
> Push down filters to Hive
Fast, secure, reliable
[Diagram labels: Teradata Systems, SQL-H, Data, Data Filtering, HCatalog, Hadoop MR, Hive, Pig, Hadoop Layer: HDFS]
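Why pushing filters down to the Hadoop side matters can be shown with a small conceptual sketch in Python. It does not use or imitate the QueryGrid or SQL-H interfaces. `remote_scan`, the sample rows, and the predicate are all hypothetical stand-ins, and the only point is how many rows cross the wire in each case.

```python
def remote_scan(rows, predicate=None):
    """Stand-in for the remote (Hadoop) side of a query.

    Yields every row, or only rows matching `predicate` when a filter
    has been pushed down to the remote system.
    """
    for row in rows:
        if predicate is None or predicate(row):
            yield row


def is_us(row):
    """Example filter; the column name is an illustrative assumption."""
    return row["country"] == "US"


# Made-up remote data: 1000 rows, one in four from the US.
hadoop_rows = [
    {"user": i, "country": "US" if i % 4 == 0 else "DE"} for i in range(1000)
]

# Without push-down: every row is transferred, then filtered locally.
moved_all = list(remote_scan(hadoop_rows))
local_result = [row for row in moved_all if is_us(row)]

# With push-down: the filter runs remotely, so only matches are transferred.
moved_filtered = list(remote_scan(hadoop_rows, is_us))

print(len(moved_all), len(moved_filtered))
```

Both approaches return the same rows. The difference is the volume moved between systems before the join or report runs.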
TERADATA INTEGRATED BIG DATA PLATFORM 1700
Integrated Big Data Platform
Contextual Analytics
> Deep analytics
> Data Labs
> Data refinery
> Hadoop integration
Resource Flexibility
> Ad hoc projects
> Peak workload assist
Always On
> Disaster recovery
> High availability
Corporate Memory
> Archive reporting & retrieval
> Audit and compliance
One Platform, Many Uses
Contextual Analytics, Resource Flexibility, Always On, Corporate Memory
[Diagram labels: Unrefined multi-structured data, Unrefined structured data, Raw data, Current data, Archival data, IDW data years 1-5, IDW data years 5-10]
Contextual Analytics: Deep Analytics
xdr analytics
> Analyze xdr, and smart phone logs
> Calling patterns, fraud, usage patterns
Consumer sentiment analytics
> Brand and products likes/dislikes
Clickstream analytics
> Optimize website, digital spend, web site design
Sensor/machine analytics
> Proactive maintenance, provisioning
> Healthcare, telematics
> Utilities (water, electricity, etc.)
Location based analytics
> Manage operations where they occur
Contextual Analytics: Data Refinery
Consider 1700 when offloading ELT
Benefits
> Lower cost system
> Little to no ETL rewrite
> Continue using favorite transformation tools and scripts
> Reference data available for transformations
> Preserve security and access rights
> Teradata Unity automates data sync
Considerations
> SLAs for data availability on IDW
> System-to-system dependencies
> Available CPU resources on IDW
[Diagram labels: ELT offload, X, Integrated Big Data Platform, Hadoop]
Handling Multi-structured Data with SQL
Store data objects in database
> Weblogs, JSON, XML, CSV, etc.
> VarChar, CLOB, or BLOB
Built-in functions
> Name value pair functions
> String handlers, REGEX
> JSONpath operators
> XML and XQuery
Table Operators
> Dynamic input schema, output schema
> Use C++/Java to unravel complex objects into columns
Late-binding flexibility
[Figure: XML, JSON, and weblog objects stored in the Teradata Data Warehouse; the sample weblog record is truncated in the source]
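The late-binding idea (store the raw object, then impose columns only when it is read) can be sketched in Python. This is a conceptual illustration using the Python standard library, not the Teradata built-in functions or table operators. The sample rows, the log layout, and the `bind` helper are assumptions made up for the example.

```python
import json
import re
from urllib.parse import parse_qsl

# Raw objects kept as plain strings, as if stored in a VARCHAR/CLOB column.
# Both payloads are invented samples.
raw_rows = [
    {"id": 1, "kind": "weblog",
     "payload": "2013-01-01 00:25:04 10.0.0.1 GET /cart?sku=42&qty=2"},
    {"id": 2, "kind": "json",
     "payload": '{"user": {"id": 7, "country": "US"}}'},
]

# Assumed log layout: date, time, client IP, method, path, optional query string.
WEBLOG = re.compile(
    r"(?P<date>\S+) (?P<time>\S+) (?P<ip>\S+) (?P<method>\S+) "
    r"(?P<path>[^?\s]+)\??(?P<query>\S*)"
)


def bind(row):
    """Turn one raw object into named columns at read time."""
    if row["kind"] == "weblog":
        match = WEBLOG.match(row["payload"])
        if match is None:
            raise ValueError("unrecognized weblog line")
        cols = match.groupdict()
        # Name-value pairs from the query string become extra columns.
        cols.update(parse_qsl(cols.pop("query")))
        return cols
    if row["kind"] == "json":
        doc = json.loads(row["payload"])
        # Path-style access into the nested document.
        return {"user_id": doc["user"]["id"], "country": doc["user"]["country"]}
    raise ValueError("unsupported kind: " + row["kind"])


columns = [bind(row) for row in raw_rows]
```

Because the schema lives in `bind` rather than in the stored data, a new field can be surfaced later without reloading the raw objects.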
Resource Flexibility: Ad Hoc Projects
The Executive Request
> New inventory supplier
> Urgent marketing campaign
> Sales manager challenges numbers
> Marketing buys sample social media data
> What if projects
Fast reaction
> Fire disrupts supply chain
> Hurricane relief plan
> Major competitor action
Mergers and acquisitions
Resource Flexibility: Peak Workload Assist
Load balance prime time user activity
> Support subset of users
> Common during month end, quarter end, retail Mondays
Help meet batch SLAs
> Daily batch reports
> Month end, quarter end, CFO and sales summaries
Enablers
> Unity Director, Loader, Data Mover, Ecosystem Manager
> Workload Management
Always On: Disaster Recovery
Maintain all or a portion of the production IDW for use in a true disaster
> Unity Director, Unity Loader, Unity Data Mover, Unity Ecosystem Manager
Minimum necessary users and applications
> Keep the core business running
Teradata Unity
Always On: High Availability
Data warehouses are operational, mission-critical systems
> Continuous data access to end users
Planned maintenance of production warehouse
> Software updates
> Hardware upgrades
Unplanned outages
> Hardware or software failures hidden from users
> Reduces pressure on IT for system recovery
Corporate Memory: Archival, Audit, and Compliance
Shared requirements
> 5-10 years of data storage
> Fast report turn around
> Trusted data
> Secure environment
> Self-service queries
Reduce dependency on tape
Audit and compliance
> Financial security and trust
> Equal opportunity employment
> Fair lending practices
> Tax audit (ugh)
Archival reporting
> Marketing: revisit lost customers
> CFO: track fraud back further
> Manufacturing: compare parts cost trends
> Call center: find old warranties, call logs
A/B testing on auction site
> Contextual analytics: join behavior to IDW data
> Digital investment optimization
> Hadoop integration
> Archive reporting and retrieval
> Dual load
> Peak workload assist
[Diagram: three platforms]
> IDW: 10PB, structured analytics
> Singularity: 36PB, weblogs, IDW copy
> Hadoop: 50PB, bot detection, images
> Other labels: Load refine data, Join for image, Analyze & Report, Discover & Explore
More Customers
Large US Credit Card Company
> Deep history queries
> Compliance queries
International Telecom
> In-database mining with SAS
> Aggregation layer
> BAR / DR
> xdr hosting offload
> Subscriber info
Large US Online Retailer
> Behavioral Analytics
> Free up capacity on IDW
Large US Financial Institution
> Backup Copy of IDW
> DR 2nd copy of IDW
> Offload Archiving Activity
When to Use Which? (Hadoop vs. 1700)
> Structured data: Hadoop X; 1700 X
> Multi-structured: Hadoop All; 1700 JSON, XML, weblogs
> Interactive Queries: Hadoop Evolving; 1700 X
> MapReduce: Hadoop X
> Predictive analytics: Hadoop MapReduce; 1700 In-DB
> Interactive Performance: Hadoop Low-med; 1700 Med-high
> Data governance: Hadoop Emerging; 1700 High
> Interactive tools: Hadoop Few; 1700 All
> SQL: Hadoop SQL 92; 1700 SQL 2008+
> Security: Hadoop Emerging; 1700 Extensive
> Service levels consistency: Hadoop Low; 1700 High
Summary: Teradata Data Platforms
Unified Data Architecture
> Matching workloads and cost to platforms
Teradata Hadoop
> Data Lake
> ETL
Teradata 1700
> Teradata Data Warehouse
> Contextual Analytics
> Resource Flexibility
> Always On
> Corporate Memory
THANK YOU TO OUR TUG SPONSOR
Trusted supplier to major OEMs for 30 years
Joint engineering with Teradata
Fully integrated with Teradata nodes and Database
New technology
> Chromium FX RAID controllers which support 5.2 Gb/s SAS 2.0
> Inde EcoStor technology eliminates the need for cache batteries
Q&A
BACKUP SLIDES
Platform ETL or ELT Considerations
> Complexity
> Dependencies
> Latency & SLAs
> Security
> Data quality
> Costs
[Diagram labels: Web Logs, Mobile, Sensors, Files, IDW]
Capture, Refine, Store Clickstream Data
Situation
> Customers interact with PC vendor websites
> Huge volumes of raw Omniture data
> Inconsistent data structure and format
Problem
> File errors, corrupted file compressions, error prone analysis
> Velocity (70 files/hr., 1M files) adds to the complexity
Solution
> Teradata Appliance for Hadoop: landing and staging area
> Hadoop nodes curate the data, check for data consistency, and prepare the data
Impact
> Reduced data inconsistencies and improved performance
> Capture and curate ALL the data
> Perform ad hoc analytics on multi-level interactions
> Improve marketing campaigns and customer support
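One curation step mentioned here, catching corrupted compressed files before they reach analysis, can be sketched in Python. It is a minimal illustration that assumes the landed files are gzip archives. The actual file format and tooling used in the case study are not stated, and `is_readable_gzip` and `triage` are hypothetical helper names.

```python
import gzip
import zlib
from pathlib import Path


def is_readable_gzip(path, chunk_size=1 << 20):
    """Return True if the whole gzip file decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end so truncation and CRC errors surface.
            while handle.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False


def triage(paths):
    """Split landed files into usable and quarantined file names."""
    good, bad = [], []
    for path in paths:
        target = good if is_readable_gzip(path) else bad
        target.append(Path(path).name)
    return good, bad
```

Running `triage` over each hourly batch would let clean files move on to refinement while damaged ones are set aside for re-delivery.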
Introducing the Appliance for Hadoop
Teradata Appliance for Hadoop is enterprise class
> Landing area and data lake for raw files of any type
> Data refining engine: some transformations and simple math at scale
> Archival system for histories of data with low or unknown value
Teradata Enterprise Access for Hadoop
> Enables business users to easily access Hadoop data with standard SQL from within the Teradata Database and BI tools
> SQL-H provides on-the-fly access to data, leveraging HCatalog
> Teradata Studio with Smart Loader for Hadoop: ad hoc data movement
Best-of-breed Technology Partner Value Add
> Hortonworks engineering relationship: SQL-H, Viewpoint integration with Ambari, and high performance Hadoop nodes
> Protegrity, Informatica, Revelytix
Hadoop Enables Another Data Platform
Ad hoc projects
> One-shot complex analytics
> Hurry up, short term efforts
Alternative analytics
> Not SQL-friendly algorithms
> Markov chains, random forest
> JPG, audio analysis
Sandbox: hunting in the dark
> Prototyping
> Data exploration
> Trial and error with new algorithms
[Diagram labels: Web Logs, Mobile, Sensors, Operational files, Teradata Data Warehouse]
Comparing Data Platform Configurations: Teradata Appliance for Hadoop vs. Integrated Big Data Platform 1700
Nodes (full rack)
> Appliance for Hadoop: 18 MPP nodes/cabinet
> 1700: 1+1, 2+1, 3+0 MPP nodes/cabinet
Node CPU
> Appliance for Hadoop: Master (Qty. 2): dual 8-core Intel Xeon @2.60GHz; Data (Qty. 16): dual 6-core Intel Xeon @2.0GHz
> 1700: Dual 8-core Intel Xeon @2.60GHz
Storage
> Appliance for Hadoop: 192 3TB HDDs/cabinet
> 1700: 168 3TB HDDs/cabinet (+6 global hot spares)
Total user data capacity
> Appliance for Hadoop: 152TB/cabinet (9.5 TB/data node uncompressed)
> 1700: 229TB/cabinet (114 TB/node uncompressed)
Memory
> Appliance for Hadoop: 256GB per master node, 128GB per data node
> 1700: Up to 512GB per node
Management, troubleshooting and support
> Appliance for Hadoop: Teradata Vital Infrastructure, Teradata Viewpoint, single source software and hardware support
> 1700: Teradata Vital Infrastructure, Teradata Viewpoint, single source software and hardware support
Availability
> Appliance for Hadoop: Software data replication
> 1700: Hot standby node available, global hot spare drives
Interconnect
> Appliance for Hadoop: 40 Gb InfiniBand
> 1700: 40 Gb InfiniBand
OS
> Appliance for Hadoop: SUSE Linux 11
> 1700: SUSE Linux 11
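The capacity figures on this slide can be cross-checked with a few lines of Python arithmetic. All inputs are taken from the slide. The usable-to-raw ratios are simply computed from those inputs, and the slide does not say how much of the difference comes from replication, RAID, spares, or other overhead.

```python
# Teradata Appliance for Hadoop, per full cabinet (figures from the slide).
hadoop_data_nodes = 16
hadoop_tb_per_data_node = 9.5
hadoop_drives = 192
drive_tb = 3

hadoop_usable_tb = hadoop_data_nodes * hadoop_tb_per_data_node  # quoted as 152TB
hadoop_raw_tb = hadoop_drives * drive_tb
hadoop_ratio = hadoop_usable_tb / hadoop_raw_tb

# Integrated Big Data Platform 1700, per cabinet (figures from the slide).
ibdp_usable_tb = 229
ibdp_drives = 168          # excludes the 6 global hot spares
ibdp_raw_tb = ibdp_drives * drive_tb
ibdp_ratio = ibdp_usable_tb / ibdp_raw_tb

print(f"Hadoop appliance: {hadoop_usable_tb} TB usable of {hadoop_raw_tb} TB raw "
      f"({hadoop_ratio:.1%})")
print(f"1700: {ibdp_usable_tb} TB usable of {ibdp_raw_tb} TB raw ({ibdp_ratio:.1%})")
```

The per-node figure and the per-cabinet figure for the Hadoop appliance agree exactly (16 data nodes at 9.5 TB each).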
Comparing Data Platform Configurations: Commodity Dell HDP Hadoop Stack vs. Integrated Big Data Platform 1700
Nodes (full rack)
> Commodity Dell HDP: 16 MPP nodes/cabinet
> 1700: 1+1, 2+1, 3+0 MPP nodes/cabinet
Node CPU
> Commodity Dell HDP: Master (Qty. 2): dual 8-core Intel Xeon @2.00GHz; Data (Qty. 16): dual 6-core Intel Xeon @2.9GHz
> 1700: Dual 8-core Intel Xeon @2.60GHz
Storage
> Commodity Dell HDP: 384 1TB HDDs/cabinet
> 1700: 168 3TB HDDs/cabinet (+6 global hot spares)
Total user data capacity
> Commodity Dell HDP: 166TB/cabinet (6.8 TB/data node uncompressed)
> 1700: 229TB/cabinet (114 TB/node uncompressed)
Memory
> Commodity Dell HDP: 128GB per master node, 64GB per data node
> 1700: Up to 512GB per node
Management, troubleshooting and support
> Commodity Dell HDP: Ambari, software support
> 1700: Teradata Vital Infrastructure, Teradata Viewpoint, single source software and hardware support
Availability
> Commodity Dell HDP: Software data replication
> 1700: Hot standby node available, global hot spare drives
Interconnect
> Commodity Dell HDP: 10 Gb Ethernet
> 1700: 40 Gb InfiniBand
OS
> Commodity Dell HDP: RHEL Linux 6.4
> 1700: SUSE Linux 11
Comparing Teradata Hadoop Configurations: Commodity Dell Hadoop Stack vs. Teradata Appliance for Hadoop
Nodes (full rack)
> Commodity Dell: (18) MPP nodes per cabinet
> Appliance for Hadoop: (18) MPP nodes per cabinet
Master (Qty. 2)
> Commodity Dell: Dual 8-core CPU Intel Xeon E5-2670 @2.60GHz processors
> Appliance for Hadoop: Dual 8-core CPU Intel Xeon E5-2670 @2.60GHz processors
Data (Qty. 16)
> Commodity Dell: Dual 4-core CPU Intel Xeon E5-2603 @1.8GHz processors
> Appliance for Hadoop: Dual 6-core CPU Intel Xeon E5-2620 @2.00GHz processors
Storage
> Commodity Dell: (192) 3TB internal drives per cabinet
> Appliance for Hadoop: (192) 3TB internal drives per cabinet
Total User Data Capacity
> Commodity Dell: 152 TB per full cabinet (9.5 TB per Hadoop data node uncompressed; 3x compression available)
> Appliance for Hadoop: 152 TB per full cabinet (9.5 TB per Hadoop data node uncompressed; 3x compression available)
Memory
> Commodity Dell: 256GB per master node, 64GB per data node
> Appliance for Hadoop: 256GB per master node, 128GB per data node
Switch
> Commodity Dell: 10 Gb Ethernet
> Appliance for Hadoop: 40 Gb InfiniBand
Availability
> Commodity Dell: Software data replication
> Appliance for Hadoop: Software data replication
Operating System
> Commodity Dell: SUSE Linux 11
> Appliance for Hadoop: SUSE Linux 11
Management, Troubleshooting, and Support
> Commodity Dell: Teradata Viewpoint, software support
> Appliance for Hadoop: Teradata Vital Infrastructure, Teradata Viewpoint, single source support, software support, hardware support
Enterprise Integration
> Commodity Dell: SQL-H (Teradata & Aster > Hortonworks), Teradata Connector for Hadoop, Teradata Studio with Smart Loader
> Appliance for Hadoop: SQL-H (Teradata & Aster > Hortonworks), Teradata Connector for Hadoop, Teradata Studio with Smart Loader
Enormous Volumes of Sensor Data
[Diagram: dual load into two appliances]
> Data Warehouse Appliance: 28TB, 2 months of data
> Extreme Data Appliance: 50TB, 12 months of data
> User labels: Managers, CSRs, Logistics, Manufacturing; New product designers