HIVE + AMAZON EMR + S3 = ELASTIC BIG DATA SQL ANALYTICS PROCESSING IN THE CLOUD
A Real-World Case Study
Jaipaul Agonus, FINRA
Strata Hadoop World, New York, Sep 2015
FINRA - WHAT DO WE DO?
Collect and Create
- Up to 75 billion events per day from 13 national exchanges and 5 reporting facilities.
- Reconstruct the market from trillions of events spanning years.
Detect & Investigate
- Identify market manipulation, insider trading, fraud, and compliance violations.
Enforce & Discipline
- Ensure rule compliance; fine and bar broker-dealers; refer matters to the SEC and other authorities.
FINRA'S SURVEILLANCE ALGORITHMS
Hundreds of surveillance algorithms run against massive amounts of data in multiple products (Equities, Options, etc.) and across multiple exchanges (NASDAQ, NYSE, CBOE, etc.).
FINRA'S SURVEILLANCE ALGORITHMS
Detecting abusive activity and compliance breaches, for example:
- Order Protection
- Front Running
- Spoofing
- Layering
- Best Execution
- Wash Sales
- Short Sale Compliance
FINRA'S SURVEILLANCE ALGORITHMS
Dealing with big data before there was "Big Data": over 430 batch analytics in the surveillance suite.
A Massively Parallel Processing (MPP) methodology was used to solve big-data problems in the legacy world.
PRE-HADOOP DATA ARCHITECTURE
A tiered storage design that struggles to balance Cost, Performance, and Flexibility.
PRE-HADOOP PAIN POINTS
Data Silos
Data distributed physically across MPP appliances, NAS, and tape, hurting accessibility and efficiency.
PRE-HADOOP PAIN POINTS
Cost
Expensive, specialized hardware tuned for CPU, storage, and network performance.
Proprietary software that carries a relatively high cost and vendor lock-in.
PRE-HADOOP PAIN POINTS
Non-Elasticity
Can't grow or shrink easily with data volume; bound by the cost of the appliance hardware and the relatively high cost of the software.
ADDRESSING PAIN POINTS WITH:
- Hive - the de facto standard for SQL-on-Hadoop.
- Amazon EMR (Elastic MapReduce) - a managed Hadoop framework.
- Amazon S3 (Simple Storage Service) - practically infinite storage.
WHY SQL?
Heavily SQL-based legacy application that works on source data that's already available in structured format.
Hundreds of thousands of lines of legacy SQL code, iterated and tested rigorously through the years and readily available for porting to Hive.
Developers, testers, data scientists, and analysts with deep SQL skills and the necessary business acumen are already part of FINRA's culture.
HIVE
Developed at Facebook and now the open-source de facto SQL standard for Hadoop.
Around for seven years, battle-tested at scale, and widely used across industries.
A powerful abstraction over MapReduce: translates HiveQL into MapReduce jobs that run against the distributed dataset.
HIVE EXECUTION ENGINES
MapReduce
- Mature batch-processing platform, proven at petabyte scale.
- Does not perform well enough for small data or iterative calculations with long data pipelines.
Tez
- Aims to translate complex SQL statements into optimized, purpose-built data-processing graphs.
- Strikes a balance between performance, throughput, and scalability.
Spark
- Fast in-memory computing; lets you leverage available memory by fitting in all intermediate data.
AMAZON (AWS) S3
Cost-effective and durable object storage for a variety of content.
Allows separation of storage from compute resources, providing the ability to scale each independently in a pay-as-you-go pricing model.
Meets Hadoop's file system requirements and integrates well with EMR.
EMR - ELASTIC MAPREDUCE
Managed Hadoop framework that makes it easy to deploy and manage Hadoop clusters.
An easy, fast, and cost-effective way to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances.
EMR INSTANCE TYPES
A wide selection of virtual Linux servers with varying combinations of CPU, memory, storage, and networking capacity.
Each instance type comes in one or more instance sizes, allowing you to scale your resources to the target workload.
Available in Spot (your bid price) or On-Demand pricing models. [Source: Amazon]
CLOUD-BASED HADOOP ARCHITECTURE
UNDERLYING DESIGN PATTERN
Transient clusters with S3 in place of HDFS.
The cluster lives just for the duration of the job and is shut down when the job is done.
Persist input and output data in S3.
Run multiple jobs in multiple Amazon EMR clusters over the same dataset (in S3) without overloading your HDFS nodes. [Source: Amazon]
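A minimal HiveQL sketch of the pattern: declaring the table as EXTERNAL with an S3 LOCATION keeps the data outside the cluster, so the cluster can be terminated after the job without losing anything. The bucket name and schema below are hypothetical.

```sql
-- Hypothetical sketch: table data lives in S3, not on the transient cluster.
-- Any EMR cluster pointed at this location sees the same dataset.
CREATE EXTERNAL TABLE orders (
  order_id  BIGINT,
  symbol    STRING,
  order_ts  TIMESTAMP
)
PARTITIONED BY (trade_date STRING)
STORED AS ORC
LOCATION 's3://example-bucket/warehouse/orders/';
```

Because the table is EXTERNAL, dropping it (or the cluster) removes only the metadata; the S3 objects remain for the next transient cluster to reuse.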
UNDERLYING DESIGN PATTERN - BENEFITS
Control your cost: pay only for what you use.
Minimal maintenance: the cluster goes away when the job is done.
Persisting data in S3 enables easy reprocessing if Spot Instances are taken away because you are outbid. [Source: Amazon]
LESSON LEARNED
Design your Hive batch analytics with a focus on enabling direct data access and maximizing resource utilization.
HIVE - DESIGNING FOR PERFORMANCE
Enable Direct Data Access
- Partition data along natural query boundaries (e.g., trade date) and process only the data you need.
- Improve join performance by bucketing and sorting data ahead of time, reducing I/O scans during the join.
- Use broadcast joins when joining small tables.
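The three techniques above can be sketched in HiveQL as follows; all table and column names are hypothetical illustrations, not FINRA's actual schema.

```sql
-- Partition along a natural query boundary (trade date),
-- so queries touch only the partitions they need:
CREATE TABLE trades (trade_id BIGINT, firm_id BIGINT, price DOUBLE)
PARTITIONED BY (trade_date STRING);

-- Bucket and sort on the join key ahead of time,
-- enabling bucketed joins that avoid full shuffles:
CREATE TABLE orders (order_id BIGINT, firm_id BIGINT)
CLUSTERED BY (firm_id) SORTED BY (firm_id) INTO 64 BUCKETS;

-- Broadcast (map-side) join against a small dimension table:
SET hive.auto.convert.join=true;
SELECT /*+ MAPJOIN(f) */ t.trade_id, f.firm_name
FROM trades t
JOIN firm f ON t.firm_id = f.firm_id;
```

With `hive.auto.convert.join` enabled, Hive broadcasts the small side automatically when it fits under the configured size threshold; the MAPJOIN hint is the older explicit form.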
HIVE - DESIGNING FOR PERFORMANCE
Tune Hive Configurations
- Increase parallelism by tuning the MapReduce split size to use all the available map slots in the cluster.
- Increase the replication factor on dimension tables (not for additional resiliency but for better performance).
- Compress intermediate data to reduce data transfer between mappers/reducers.
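A hedged sketch of session settings for the tunings above; exact property names vary across Hadoop/Hive versions, and the values shown are illustrative, not FINRA's production settings.

```sql
-- Smaller max split size => more map tasks => fuller use of map slots.
-- 134217728 bytes = 128 MB (illustrative value):
SET mapreduce.input.fileinputformat.split.maxsize=134217728;

-- Compress intermediate map output to cut shuffle traffic:
SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- (Raising the replication factor on hot dimension files is done
--  outside Hive, e.g. with: hadoop fs -setrep <n> <path>)
```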
LESSON LEARNED
Measure and profile your clusters from the outset, and adjust continuously to keep pace with changing data sizes, processing flows, and execution frameworks.
CLUSTER PROFILING - SAMPLE
Batch Process: Order Audit Trail Report Card Validation
Source Dataset: 1 month of market orders data; over 100 billion rows; around 5 terabytes in size; input stored in S3.
Instance Choices:
- m1.xlarge (General Purpose): 4 CPUs, 15 GB RAM, 4 x 420 GB disk storage, high network performance
- c3.8xlarge (Compute Optimized): 32 CPUs, 60 GB RAM, 2 x 320 GB SSD storage, high network performance
Run Options:

Attribute                Option-1         Option-2           Option-3           Option-4
Instance Type            m1.xlarge        c3.8xlarge         c3.8xlarge         c3.8xlarge
Instance Classification  General Purpose  Compute Optimized  Compute Optimized  Compute Optimized
Cluster Size             100              16                 32                 100
CLUSTER PROFILE COMPARISON RESULTS
100 m1s vs. 16 c3s vs. 32 c3s vs. 100 c3s

Attribute                        Option-1         Option-2           Option-3           Option-4
Instance Type                    m1.xlarge        c3.8xlarge         c3.8xlarge         c3.8xlarge
Instance Classification          General Purpose  Compute Optimized  Compute Optimized  Compute Optimized
Cluster Size                     100              16                 32                 100
Cost                             $57.33           $111.86            $127.05            $312.00
Time                             3 hrs 12 mins    8 hrs              4 hrs 56 mins      3 hrs
Approx. Spot Price per Instance  $0.14            $0.78              $0.66              $0.78
Peak Disk Usage                  0.04%            2.10%              1.20%              0.45%
Peak Memory Usage                56%              43%                45%                41%
Peak CPU Usage                   100%             95%                98%                92%
LESSON LEARNED
Abstract the execution framework (MR, Tez, Spark) from business logic in your application architecture; this makes it possible to switch frameworks and keep pace with the Hive community.
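A minimal sketch of what this abstraction buys: when business logic is kept in plain HiveQL, the engine becomes a one-line configuration choice. The table and column names below are hypothetical.

```sql
-- Switch the execution engine per session without touching business logic.
-- Valid values depend on the Hive build: mr, tez, spark.
SET hive.execution.engine=tez;

-- The same HiveQL then runs unchanged on the chosen engine:
SELECT trade_date, COUNT(*) AS order_cnt
FROM orders
GROUP BY trade_date;
```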
LESSON LEARNED
Hive UDFs (User-Defined Functions) can solve complex problems when SQL falls short.
HIVE UDFS
Used when Hive functionality falls short:
- e.g., window functions with "ignore nulls" are not supported in Hive
- e.g., date formatting functions (Hive 1.2 has better support)
Used when non-procedural SQL can't accomplish the task:
- e.g., de-duping many-to-many time-series pairings: A->B 10AM, A->C 11AM, D->B 11AM, D->C 11AM, E->C 12PM
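For context, registering and calling a custom UDF in HiveQL looks like the sketch below. The jar path, class name, and function are entirely hypothetical; only the ADD JAR / CREATE TEMPORARY FUNCTION syntax is standard Hive.

```sql
-- Hypothetical example: a date-formatting UDF packaged in a jar.
ADD JAR /home/hadoop/jars/example-udfs.jar;
CREATE TEMPORARY FUNCTION fmt_trade_date
  AS 'com.example.hive.udf.FormatTradeDate';

-- Use it like any built-in function:
SELECT order_id, fmt_trade_date(order_ts)
FROM orders;
```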
LESSON LEARNED
Choose an optimal storage format (row or columnar) and compression type (splittable, space-efficient, fast).
FILE STORAGE FORMAT AND COMPRESSION
STORAGE FORMATS
Columnar or row-based storage.
RCFILE, ORC, PARQUET: columnar formats that skip loading unwanted columns.
COMPRESSION TYPES
Reduce the number of bytes written to/read from HDFS.
Some are fast but offer less space reduction, some are space-efficient but slower; some are splittable and some are not.

Compression  Extension  Splittable       Encoding/Decoding Speed (scale 1-4)  Space Savings (scale 1-4)
Gzip         .gz        no               1                                    4
LZO          .lzo       yes, if indexed  2                                    2
bzip2        .bz2       yes              3                                    3
Snappy       .snappy    no               4                                    1

[Source: Amazon]
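One common combination of the two choices above, sketched in HiveQL (table and column names hypothetical): a columnar ORC table compressed with snappy. Although raw snappy streams are not splittable, ORC compresses per internal block, so the files remain splittable in practice.

```sql
-- Hypothetical sketch: columnar storage plus a fast codec.
CREATE TABLE trades_orc (
  trade_id BIGINT,
  symbol   STRING,
  price    DOUBLE
)
PARTITIONED BY (trade_date STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
```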
LESSON LEARNED
Being at the bleeding edge of SQL-on-Hadoop has its risks; you need the right strategy to mitigate the risks and challenges along the way.
RISKS AND MITIGATION STRATEGIES
Hive was released in late 2008, and the Tez/Spark backends were added only recently; traditional platforms like Oracle have been tested and iterated over four decades.
Discovered issues with Hadoop/Hive during migration related to compression, storage types, and memory management.
Performance issues with interactive analytics impact Data Scientists and Business Analysts.
RISKS AND MITIGATION STRATEGIES
Extensive parallel-run comparison (apples-to-apples) against legacy production data to identify issues before production rollout.
Partnered with Hadoop and cloud vendors to address Hadoop/Hive issues and feature requests quickly.
Push-button automated regression test suites to kick the tires and test functionality for any minor software increment.
Analyzing Presto/Tez to solve the interactive analytics problem.
RISKS AND MITIGATION STRATEGIES
Took advantage of cloud elasticity to complete parallel runs against production volumes at scale and produce results swiftly.
END STATE - COST
Cost savings in infrastructure relative to our legacy environment.
Minimal upfront spending with a pay-as-you-go pricing model.
Variety of machine types and sizes to choose from based on cost and performance needs.
END STATE - FLEXIBILITY
Dynamic infrastructure provides great flexibility for faster reprocessing and testing.
Simplified software procurement and license management.
Ease of exploratory runs without affecting the production workload.
END STATE - SCALABILITY
Scale out easily at will on high-volume days.
Cloud elasticity enables running multiple days at scale in parallel.
Reprocessing historic days completes in hours, compared to weeks in our legacy environment.
QUESTIONS?
Jaipaul Agonus, FINRA
Jaipaul.agonus@finra.org
linkedin.com/in/jaipaulagonus