Quantifind's story: Building custom interactive data analytics infrastructure


Quantifind's story: Building custom interactive data analytics infrastructure Ryan LeCompte @ryanlecompte Scala Days 2015

Background Software Engineer at Quantifind @ryanlecompte ryan@quantifind.com http://github.com/ryanlecompte

Outline What does Quantifind do? Technical challenges Motivating use cases Existing solutions that we tried Custom infrastructure Lessons learned

What does Quantifind do? Find intentful conversations correlating to a customer's KPI and surface actionable insights. Input: consumer comments (Twitter, Facebook, etc.) and 3rd-party financial data. Output: top-level actionable insights and interactive exploration.

Technical Challenges Operate on a multi-terabyte annotated data set containing billions of consumer comments from millions of users over thousands of different dimensions. We want to slice and dice and compute over any dimension. We want to perform user-level operations; not everything can be satisfied by a pre-computed aggregation.

Use Cases Flexible time series generation N-gram co-occurrences Cohort analysis

Time series Generate a time series of consumer conversations. Flexible/arbitrary binning (year, month, day, hour, minute, second) with user de-duping. Generate time series over data matching specific criteria (e.g., text/terms, other dimensions).
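A minimal sketch of the binning idea in Scala, assuming a simplified in-memory Comment record (hypothetical; the real data lives in the packed format described later). It bins comments by day and de-dupes users within each bin:

import java.time.Instant
import java.time.temporal.ChronoUnit

// Hypothetical simplified record; the real packed records carry many more fields.
case class Comment(userId: Long, timestamp: Long, text: String)

// Bin comments by day and count distinct users per bin (user de-duping).
def dailyTimeSeries(comments: Seq[Comment]): Map[Instant, Int] =
  comments
    .groupBy(c => Instant.ofEpochMilli(c.timestamp).truncatedTo(ChronoUnit.DAYS))
    .map { case (day, cs) => day -> cs.map(_.userId).distinct.size }

Binning by year, month, hour, minute, or second only changes the grouping key.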

N-gram Co-occurrences N-grams are sliding groups of N adjacent words in a document. We wish to find all n-grams that co-occur with a particular search term or phrase over an arbitrary time period or other dimension. Example: "I am hungry right now" 1-grams: I, am, hungry, right, now; 2-grams: I am, am hungry, hungry right, right now; 3-grams: I am hungry, am hungry right, hungry right now
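Generating the n-grams themselves is a straightforward sliding window; a small sketch (not Quantifind's actual code) using Scala's sliding iterator:

// Build all 1- to maxN-grams of a comment with a sliding window over its words.
def ngrams(text: String, maxN: Int = 3): Seq[String] = {
  val words = text.split("\\s+").toSeq
  (1 to maxN).flatMap { n =>
    words.sliding(n).filter(_.size == n).map(_.mkString(" "))
  }
}

// ngrams("I am hungry right now") yields "I", ..., "I am", ..., "I am hungry", ...

Co-occurrence counting then reduces to intersecting these n-grams with the records that match the search term and time range.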

Cohort Analysis Capture a particular group of users satisfying certain conditions at some point in time. Next, for those same users, analyze their conversations after a certain event happens. Example: capture users commenting that they want to see the movie Gone Girl before it's released, then analyze new conversations for only this cohort after the movie's release date.

Previous Approaches Spark, Postgres, Elasticsearch

Spark Experienced challenges when trying to compute certain n-gram co-occurrences for the entire data set on our 12-node cluster. Tough to reduce the search space of the entire data set without manual work. Challenging developer experience: contention for cluster resources, re-packaging Spark JARs after code changes.

Postgres Tables are partitioned by vertical/time. Query performance suffers greatly when concurrent and disparate requests are being processed. Managing table partitions is painful. Limited to SQL-style operations; tough to bring custom computation close to the data.

Elasticsearch Challenging to do summary statistics without loading columns into memory. Rebuilding indexes can be annoying. Still tough to build custom logic that operates directly on the data.

What do we really want? We really want to access our raw data in memory to create flexible functionality (e.g., time series, cohort analysis, and n-gram co-occurrences). We want the operations to be on-the-fly, optimized, and reasonably fast (no more batch jobs). So, let's take our data and pack it into a very compact binary format so that it fits in memory!

Can't fit everything on a single machine We can shard the compacted data in memory across multiple machines and communicate via Akka Cluster. Make scatter/gather-style requests against the raw data and compute results on the fly.

RTC: Real-time Cluster [Diagram: an HTTP API handler sends requests to the RTC Master, which fans out to multiple RTC Workers, each holding packed data]

RTC Components Pack File Writer, Master, Worker

RTC Pack File Writer Off-line job that consumes the raw annotated data set. Packs the consumer comments and users into the custom binary protocol/format (.pack files). Packs the data such that it's easily distributable across the RTC nodes.

RTC Master Handles incoming RTC requests and routes them to the appropriate scatter/gather handler. Distributes groups of .pack files to RTC workers once they register with the master (via Akka Cluster). Caches frequently accessed permutations/requests.

RTC Worker Discovers and registers with the RTC master (via Akka Cluster). Loads its assigned .pack files from the filesystem and stores the data in memory. Handles delegated requests from the RTC master.

Akka Cluster Makes it super easy to send messages between actors running on different machines. Allowed us to focus on our core use cases without having to worry about the network layer. Facilitates distributed scatter/gather style computations across multiple machines. [Diagram: actors in JVM A exchange messages with actors in JVM B]
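A minimal scatter/gather sketch with classic Akka actors, assuming hypothetical Query/PartialResult messages and a single in-flight query (the real RTC protocol and merge logic are richer):

import akka.actor.{Actor, ActorRef}

// Hypothetical messages for illustration only.
case class Query(term: Long)
case class PartialResult(counts: Map[Long, Int])

// A worker computes a partial result over its own shard of the data.
class WorkerActor(computePartial: Long => Map[Long, Int]) extends Actor {
  def receive = { case Query(term) => sender() ! PartialResult(computePartial(term)) }
}

// The master scatters the query to all workers and gathers/merges their replies.
class MasterActor(workers: Seq[ActorRef]) extends Actor {
  private var pending = 0
  private var merged = Map.empty[Long, Int]
  private var requester: ActorRef = Actor.noSender

  def receive = {
    case q: Query =>
      requester = sender(); pending = workers.size; merged = Map.empty
      workers.foreach(_ ! q)
    case PartialResult(counts) =>
      merged = counts.foldLeft(merged) { case (m, (k, v)) => m.updated(k, m.getOrElse(k, 0) + v) }
      pending -= 1
      if (pending == 0) requester ! merged
  }
}

With Akka Cluster the master and workers simply run in different JVMs and discover each other through cluster membership events; the actor code itself stays the same.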

Compact Data Custom binary protocol for consumer comment and user records. Each record has a fixed header along with a set of known fields and values. User and comment records are packed separately and joined on the fly.

Custom Protocol Field layout per record:
  Header          2 bytes
  User Id         8 bytes
  Timestamp       8 bytes
  Num Text Bytes  2 bytes
  Text Bytes      Num Text Bytes * 1 byte
  Num Terms       2 bytes
  Terms           Num Terms * 8 bytes
Certain fields are 64-bit SIP hashed vs. storing full text. Fields that are one of N values are stored in a smaller type (e.g. Byte / Short).
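An illustrative packer following this layout (a sketch only; the real .pack writer, header contents, and SIP-hashing of terms are not shown in the talk), using java.nio.ByteBuffer:

import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Pack one comment record into the fixed layout above.
def packRecord(header: Short, userId: Long, timestamp: Long,
               text: String, termHashes: Array[Long]): Array[Byte] = {
  val textBytes = text.getBytes(StandardCharsets.UTF_8)
  val size = 2 + 8 + 8 + 2 + textBytes.length + 2 + termHashes.length * 8
  val buf = ByteBuffer.allocate(size)
  buf.putShort(header)
  buf.putLong(userId)
  buf.putLong(timestamp)
  buf.putShort(textBytes.length.toShort)    // Num Text Bytes
  buf.put(textBytes)                        // Text Bytes
  buf.putShort(termHashes.length.toShort)   // Num Terms
  termHashes.foreach(h => buf.putLong(h))   // Terms (64-bit hashes)
  buf.array()
}

Because every field has a fixed size or a length prefix, a reader can jump straight to the field it needs without deserializing the whole record.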

JVM Garbage Collection Not explicitly managing memory and worrying about allocations is fantastic for most JVM development. However, the garbage collector can be your enemy when trying to manage a lot of objects in large JVM heaps. GC pauses greatly affect application performance.

Off-heap Memory The JVM gives us access to a C/C++ style means of allocating memory via sun.misc.Unsafe. The JVM does not analyze memory allocated via Unsafe when performing garbage collection. Memory allocated via Unsafe must be explicitly managed by the developer, including deallocating it when no longer needed.
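A minimal sketch of the allocate/use/free cycle (assumes a JVM that still exposes sun.misc.Unsafe; the instance is obtained via reflection because the field is not public):

import sun.misc.Unsafe

// Obtain the Unsafe instance reflectively.
val unsafe: Unsafe = {
  val f = classOf[Unsafe].getDeclaredField("theUnsafe")
  f.setAccessible(true)
  f.get(null).asInstanceOf[Unsafe]
}

// Allocate 1 KB outside the GC-managed heap, use it, and free it explicitly.
val address = unsafe.allocateMemory(1024)
try {
  unsafe.putLong(address, 42L)         // write a long at the start of the region
  val value = unsafe.getLong(address)  // read it back
} finally {
  unsafe.freeMemory(address)           // the GC never reclaims this; forgetting it leaks memory
}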

RTC: Real-time Cluster [Diagram: the same architecture, HTTP API handler → RTC Master → RTC Workers, each worker now holding its data off-heap]

Working with off-heap data
// skip header and user id, only extract timestamp
def timestamp: Long = unsafe.getLong(2 + 8)
Primary goal: never materialize data that you don't need in order to satisfy the incoming request. We never materialize a full record; instead, we only access the fields that are needed for the given incoming request (e.g. time series, n-grams, etc).

Off-heap Binary Searching
def binarySearch(unsafe: Unsafe,
                 offset: Long,
                 fromIndex: Int,
                 toIndex: Int,
                 searchTerm: Long): Int = {
  var low = fromIndex
  var high = toIndex - 1
  var search = true
  var mid = 0
  while (search && low <= high) {
    mid = (low + high) >>> 1
    // each term is 8 bytes, so the mid-th term lives at offset + mid * 8
    val term = unsafe.getLong(offset + (mid << 3))
    if (term < searchTerm) low = mid + 1
    else if (term > searchTerm) high = mid - 1
    else search = false
  }
  if (search) -(low + 1) else mid  // negative insertion point when not found
}

The Search Space Each worker maintains various arrays of offsets (facet indices) to records stored in off-heap regions. When a request comes in, the worker determines which offsets it needs to visit in order to satisfy the request. There can potentially be hundreds of millions of offsets to visit for a particular request; we need to visit them as quickly as possible.

RTC: Real-time Cluster [Diagram: HTTP API handler → RTC Master → RTC Workers, each holding facet indices plus off-heap data]

Facet Index Example Most of our queries care about a particular time range (e.g., a year, a month, a few weeks). How can we avoid searching over data that occurred outside of the time range that we care about?

Time Facet Index Workers have the opportunity to build custom facet indices while loading .pack files. We can build up a TreeMap[Long, Array[Long]]: keys are timestamps, values are offsets to off-heap records occurring at that time. Range-style operations are O(log n). We only process offsets that satisfy the date range.
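A small sketch of such an index, with illustrative timestamps and offsets:

import scala.collection.immutable.TreeMap

// Hypothetical facet index: timestamp -> offsets of off-heap records at that time.
val timeIndex: TreeMap[Long, Array[Long]] = TreeMap(
  1420070400000L -> Array(0L, 512L),
  1422748800000L -> Array(1024L),
  1425168000000L -> Array(1536L, 2048L)
)

// Range query: only offsets whose timestamp falls in [from, until) get visited.
def offsetsInRange(from: Long, until: Long): Iterator[Long] =
  timeIndex.range(from, until).valuesIterator.flatMap(_.iterator)

The sorted-map range lookup is logarithmic, so narrowing a request to a month or a week costs almost nothing compared to scanning every record.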

Concurrent Visits [Diagram: an Array[Long] of record offsets (0, 500, 1000, 1500, ...) split into ranges, each range visited by its own thread] Each worker divides its off-heap record offsets into ranges that are visited concurrently by multiple threads. Results are partially merged on the worker and then sent back to the master for the final merge.
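A sketch of this fan-out on a single worker, where visit and merge stand in for the real per-request computation and partial merge (names are illustrative):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Split the worker's record offsets into contiguous chunks and visit them in parallel.
def visitConcurrently[R](offsets: Array[Long],
                         threads: Int,
                         visit: Array[Long] => R,
                         merge: (R, R) => R): R = {
  val chunkSize = math.max(1, math.ceil(offsets.length.toDouble / threads).toInt)
  val partials = offsets.grouped(chunkSize).toSeq.map(chunk => Future(visit(chunk)))
  Await.result(Future.sequence(partials), 1.minute).reduce(merge)
}

In practice a dedicated thread pool (rather than the global ExecutionContext) would bound the concurrency per request.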

Lessons Learned Reduce allocations as much as possible per request. Avoid large JVMs with many objects stuck in the tenured generation. Keep expensive-to-compute data around in LRU caches. Protect yourself from GC pressure by wrapping large LRU cache values with scala.ref.SoftReference. Chunk large requests into smaller requests to avoid having workers store too much intermediate per-request state. Use Trove collections / primitive arrays.
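One way the SoftReference idea can look in practice (a sketch, not Quantifind's cache; the class and names are illustrative): an LRU map whose values the GC may reclaim under memory pressure, with misses recomputed on demand.

import scala.ref.SoftReference

// Minimal LRU cache with SoftReference-wrapped values.
class SoftLruCache[K, V <: AnyRef](maxEntries: Int) {
  private val underlying =
    new java.util.LinkedHashMap[K, SoftReference[V]](16, 0.75f, true) {
      override def removeEldestEntry(e: java.util.Map.Entry[K, SoftReference[V]]): Boolean =
        size() > maxEntries
    }

  def getOrElseUpdate(key: K)(compute: => V): V = underlying.synchronized {
    Option(underlying.get(key)).flatMap(_.get) match {
      case Some(v) => v                 // cached and not yet collected
      case None =>                      // missing, evicted, or cleared by the GC
        val v = compute
        underlying.put(key, new SoftReference(v))
        v
    }
  }
}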

Was it worth building a custom solution? Able to create optimized core RTC functionality since we own the entire system and operate on raw data that isn't aggregated. Able to adapt to new challenges and business requirements; no longer fighting existing open source solutions. Future improvements include adding more redundancy/scalability, an even more compact data representation, etc.

Summary Akka Cluster + Non-aggregated off-heap data + Fast Scala Code = WIN

Thanks!

We're hiring! http://www.quantifind.com/careers/