Improving Mainframe Performance with Hadoop
Dominik Wagenknecht, Accenture
OpenSlava, October 17, 2014
About me
Dominik Wagenknecht – Technology Architect, Emerging Technology, Accenture Vienna
Accenture Open Data Platform
Ask questions or rate the Speaker www.sli.do/openslava
The Scenario: Banking!
Mainframe data lives in DB2 on z/OS – perfect for classic banking-style workloads, a robust and trusted backbone of many banks. But there are limitations: little flexibility in data processing, weak analytic capabilities, no model for stream processing, etc. And the pay-per-use model gets very expensive for exploding mobile (mostly read) use.
So should we replace it?
Replacement is a long-term strategy – it takes a long time until you get new capabilities. The real question is: what is the right architecture for what we're trying to achieve?
The aim of a modern data architecture
- Horizontal scalability
- Low cost per TB of data (to keep all data)
- Processing over all data (explorative)
- Cheap data access (for simple reads)
- Enhanced real-time capabilities for analytics, streams and e.g. search
... so let's look for an appropriate datastore.
OpenSlava 2013 recap: why RDBMSs don't seem to work
- Limited scalability – NOT an option
- Latency – NOT an option
- Rigid schema – NOT an option
- Expensive hardware – probably OK
Latency is a real-world issue
San Francisco – Frankfurt = 9,132 km (air distance)
Speed of light = 299,792,458 meters/second
Light travel time (in vacuum) ≈ 30 milliseconds
Light travel time (in fiber) ≈ 45 milliseconds
Round-trip time (RTT) ≈ 91 milliseconds
Calculation: 9132 / 299792.458 * 1.5 * 2
Latency stays an issue as long as users are distributed world-wide.
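The calculation on the slide can be checked in a few lines; a minimal sketch using the slide's own numbers (distance, speed of light, and the ~1.5x slowdown of light in fiber):

```python
# Round-trip latency estimate San Francisco -> Frankfurt, as on the slide.
AIR_DISTANCE_KM = 9_132          # great-circle distance
C_KM_PER_S = 299_792.458         # speed of light in vacuum
FIBER_FACTOR = 1.5               # light in fiber is ~1.5x slower than in vacuum

one_way_vacuum_ms = AIR_DISTANCE_KM / C_KM_PER_S * 1000   # ~30.5 ms
one_way_fiber_ms = one_way_vacuum_ms * FIBER_FACTOR       # ~45.7 ms
rtt_ms = one_way_fiber_ms * 2                             # ~91.4 ms

print(round(one_way_vacuum_ms, 1), round(one_way_fiber_ms, 1), round(rtt_ms, 1))
```

This is a physical lower bound – real RTTs are worse (routing, switching, server time), which only strengthens the argument for serving reads close to the user.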
OpenSlava 2013 recap: NoSQL!
[CAP triangle: Consistency – Availability – Partition tolerance]
- BigTable sits on the C side: consistency & scalability
- Dynamo sits on the A side: flexible schema (and scalability)
Illustrations: wiki.basho.com
Popular BigTable data stores

|             | HBase                                            | Cassandra                                                      |
|-------------|--------------------------------------------------|----------------------------------------------------------------|
| In short    | Original open-source implementation of BigTable  | Dynamo-based BigTable implementation                           |
| Scalability | Datacenter                                       | Datacenter/Global                                              |
| Replication | Master/Slave                                     | Master/Master                                                  |
| Consistency | Consistent                                       | Tunable                                                        |
| Interfaces  | HTTP, Avro, Thrift, Native                       | Custom binary, Thrift                                          |
| Why cool    | Very large scale, integration in Hadoop (Map/Reduce) | Perfect for high write rates in tabular data, some query ability |

What does that mean? Consistency enables familiar programming and data modeling patterns; full Hadoop ecosystem integration rounds it off for batch and OLTP features.
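Both stores share the BigTable data model: a sparse, sorted map of row key → "column family:qualifier" → value, where the sorted keys enable range scans. A toy pure-Python sketch of that shape (illustrative only – not a client API, and the banking row keys are made up):

```python
# Toy BigTable-style store: rows kept sorted by key, each row a sparse map
# of "family:qualifier" -> value. Real HBase/Cassandra add versioning,
# persistence and distribution on top of this basic shape.
from bisect import insort

class ToyBigTable:
    def __init__(self):
        self._keys = []          # sorted row keys -> enables range scans
        self._rows = {}          # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            insort(self._keys, row_key)      # keep keys sorted on insert
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Range scan over sorted row keys: start <= key < stop."""
        return [(k, self._rows[k]) for k in self._keys if start <= k < stop]

t = ToyBigTable()
t.put("acct#0001", "balance:eur", 250)
t.put("acct#0002", "balance:eur", 910)
t.put("txn#0001", "detail:amount", -40)
print(t.scan("acct#", "acct#~"))   # prefix scan: only the account rows
```

The prefix-scan trick (scanning from `"acct#"` up to `"acct#~"`) is why row-key design matters so much in these stores.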
Enter Hadoop ca. 2010/11 (simplified)
Hadoop core: HDFS (distributed file system), MapReduce (distributed batch processing)
Closely related: HBase (BigTable), Zookeeper (coordination), Pig (dataflow), Hive (SQL), Sqoop (data integration)
Enter Hadoop 2014 (even more simplified)
Core: HDFS (distributed file system), YARN (distributed processing framework)
On top: HBase, Accumulo (BigTable); Zookeeper (coordination); Sqoop, Flume (data integration); Spark, Tez (interactive / in-memory); Storm, Spark Streaming (stream processing); Pig (dataflow scripting); Spark MLlib, Mahout (machine learning); Hive, Drill, Spark SQL (SQL); Solr, Elasticsearch (search)
High-Level Architecture: Mainframe only
Application(s) use a combined read & write path (copybooks) against DB2 on the mainframe.
High-Level Architecture: Introducing Hadoop/HBase
The write path (copybooks) still goes from the application(s) to DB2 on the mainframe. An agent (on z/OS and Linux) replicates changes into a Hadoop cluster (primary and secondary master, multiple nodes, behind a load balancer). The read path is served from the cluster via REST/JSON.
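The agent's core job is turning copybook-defined, fixed-width mainframe records into rows for the Hadoop side. A minimal sketch of that parsing step, with a made-up copybook layout (field names, widths, and the sample record are illustrative, not from the talk):

```python
# Hypothetical fixed-width record layout, as a COBOL copybook would define it:
#   ACCOUNT-ID  PIC X(8)
#   CURRENCY    PIC X(3)
#   BALANCE     PIC 9(9)   (unscaled integer)
LAYOUT = [("account_id", 8), ("currency", 3), ("balance", 9)]

def parse_record(line):
    """Slice one fixed-width record into a dict, ready to map to columns."""
    rec, pos = {}, 0
    for name, width in LAYOUT:
        rec[name] = line[pos:pos + width].strip()
        pos += width
    rec["balance"] = int(rec["balance"])   # numeric field
    return rec

print(parse_record("00012345EUR000025000"))
```

A real agent would additionally handle EBCDIC-to-ASCII conversion, packed-decimal (COMP-3) fields, and change capture, but the record-slicing shape stays the same.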
Introducing Storm
The Linux agent publishes changes to a pub/sub system (e.g. Kafka). The full stream (1st) feeds the operational NoSQL datastore; a subset (2nd) feeds low-latency stream processing, which gets data, updates stats, and triggers e.g. push notifications.
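The "update stats / push notification" branch boils down to a consumer loop over the change stream. A stand-in sketch in plain Python (no Kafka or Storm client – just the shape of the per-event logic, with a hypothetical notification threshold):

```python
from collections import defaultdict

# Simulated change events, as the agent would publish them to the pub/sub.
events = [
    {"account": "A1", "type": "debit",  "amount": 40},
    {"account": "A1", "type": "credit", "amount": 10},
    {"account": "A2", "type": "debit",  "amount": 500},
]

stats = defaultdict(lambda: {"count": 0, "debited": 0})

def handle(event):
    """Roughly what a Storm bolt does per tuple: update running stats and
    decide whether to trigger e.g. a push notification."""
    s = stats[event["account"]]
    s["count"] += 1
    if event["type"] == "debit":
        s["debited"] += event["amount"]
    return s["debited"] > 100   # hypothetical alerting threshold

alerts = [e["account"] for e in events if handle(e)]
print(dict(stats), alerts)
```

In a real topology the events arrive continuously from Kafka and the stats live in the operational datastore rather than in process memory.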
Introducing Analytics SQL
Logs are ingested via Flume into plain HDFS (alongside the HBase data) and queried with Hive (via MapReduce or Tez).
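What Hive adds is plain SQL over those HDFS files. As a stand-in sketch of the query shape, here is the same kind of aggregation over hypothetical ingested log rows, using sqlite3 instead of Hive (in Hive, `access_log` would be an external table over the Flume-ingested files):

```python
import sqlite3

# sqlite3 stands in for Hive here purely to show the query shape; the table
# name and columns are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE access_log (account TEXT, action TEXT)")
db.executemany("INSERT INTO access_log VALUES (?, ?)", [
    ("A1", "read"), ("A1", "read"), ("A2", "read"), ("A1", "write"),
])

rows = db.execute("""
    SELECT account, COUNT(*) AS reads
    FROM access_log
    WHERE action = 'read'
    GROUP BY account
    ORDER BY reads DESC
""").fetchall()
print(rows)   # [('A1', 2), ('A2', 1)]
```

The point of the architecture is that this familiar SQL runs over all the data in HDFS (including the HBase data) without touching the mainframe.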
Is a multi-workload Hadoop ready for the enterprise?
How about support? Enterprise-level support is available (Cloudera, Hortonworks, MapR Technologies, etc.), and thanks to the openness you can switch vendor.
Is it secure? Yes: integration with Kerberos and LDAP; encryption in transit is fully supported in open source; encryption at rest is coming, but easy to add with Linux.
Is it just a hype? No: a huge ecosystem, a steadily increasing adoption rate filling a real gap, and all vendors are major contributors to the open source community – comparable to Linux in uptake.
Should I use it everywhere? No: just like all of NoSQL, it's not for everything. Be thoughtful about what you adopt – the core is very stable, newer things may not be.
Thanks. Ask questions or rate the speaker: www.sli.do/openslava