Wisdom from Crowds of Machines



Analytics and Big Data Summit, September 19, 2013
Chetan Conikee, Irfan Ahmad

About Us
CloudPhysics' mission is to discover the underlying principles that govern systems behavior and use this knowledge to transform computing.
Chetan Conikee, Chief Data Officer
- Founder, SeekVence
- Data Architect, Intuit
- Founding Engineer: Business Signatures (Entrust), CashEdge (Fiserv), Smartpipes (Sophos)
Irfan Ahmad, Chief Technology Officer
- VMware tech lead, DRS (Distributed Resource Scheduling) team
- Transmeta (microprocessors)

Collective Intelligence
[Architecture diagram. Components shown: Collector vApp, sending high-resolution performance, config, fault, change, event, and task data over HTTPS; Web/App Servers (HTTPS); Ingest (Parsers, Analyzers, Schema Normalization); Perf Stat Datastore, Inventory Datastore, and Event, Task Datastore in a secure multi-tenant architecture; Scale-Out Query Engine; Scalable Cache; Predictive Analytics, Apps, Benchmarks, Modeling, and Rules Engines; Cards.]
www.cloudphysics.com CloudPhysics, Inc. Copyright 2013. All rights reserved.

We collect 100 billion+ data points/day
- Avg 1.3 million properties/datacenter
- Continuous time-series data stream
- VMs, servers, networks, storage
- Configuration, state, tasks, events

Our dialect is
- Productive
- Fast
- Functional
- Statically typed
- Actors: an elegant way to handle concurrency
- ... and inter-op Java with Scala

Our Software Toolset
- Apache Kafka

Data Pipeline Criteria
- Secure data retention
- No-data-loss guarantee
- Scale out the consumption process on demand
- Query near-to-real-time
- Query historical data on demand (replay)

Secure Data Retention
- Data organized and partitioned by time
- Stored directly in highly scalable, distributed object storage
- Encryption for data at rest: AES-256
- Processing pipeline starts from object storage
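One way to picture "organized and partitioned by time" is an object key whose prefix encodes the hour in which the data arrived. The sketch below is only an illustration written in Python for brevity: the key layout, the tenant and stream names, and both function names are assumptions, not the layout CloudPhysics used. Encryption at rest is left to the object store and is not shown.

```python
from datetime import datetime, timedelta, timezone

def hour_prefix(tenant: str, stream: str, ts: datetime) -> str:
    """Hypothetical key prefix: one 'directory' per tenant, stream, and UTC hour."""
    ts = ts.astimezone(timezone.utc)
    return f"{tenant}/{stream}/{ts:%Y/%m/%d/%H}/"

def object_key(tenant: str, stream: str, ts: datetime, name: str) -> str:
    """Full key for one stored object (the name is chosen by the caller)."""
    return hour_prefix(tenant, stream, ts) + name

def prefixes_between(tenant: str, stream: str, start: datetime, end: datetime):
    """List the hourly prefixes covering [start, end], e.g. to replay history."""
    cursor = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    end = end.astimezone(timezone.utc)
    prefixes = []
    while cursor <= end:
        prefixes.append(hour_prefix(tenant, stream, cursor))
        cursor += timedelta(hours=1)
    return prefixes
```

With a layout like this, replaying a time window reduces to listing a handful of prefixes, which is one reason a pipeline can start from object storage.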

How do we scale? courtesy: http://www.capricemotorinn.com/dam1_.html

Circa 2012
- A cluster of Play! + Akka app servers
- TechCrunch moment at VMworld 2012
- Exponential customer growth

DOJO Data Collection/Processing Pipeline, Generation-1
[Pipeline diagram with numbered steps 1-5. Components shown: CloudPhysics Observer(s); HTTP Cluster (Message Router) and HTTP Cluster (Message Processor), each an HTTP Play!/Akka cluster; Object Storage (AWS S3 object store); Configuration Store (MongoDB sharded cluster); Performance Metrics Store (KairosDB / Cassandra cluster); Event Metrics Store (KairosDB / Cassandra cluster).]

HTTP Cluster
HTTP Endpoints: Play!, Akka and Scala Actor(s)
[Diagram: Request, Router, MessageBox, Dispatcher, with numbered steps:]
1. Enqueue
2. Dequeue
3. Dispatch
4. Process
5. Complete
6. Next (repeat from 1)
- Lightweight objects with universal concurrency primitives
- Message-passing concurrency and no shared state
- Location transparent and distributable by design
- Scale UP and scale OUT on demand
You get a PERFECT FABRIC for the CLOUD
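The enqueue/dequeue/dispatch/process loop above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not Akka and not the CloudPhysics code; the class name `Actor` and its methods `tell` and `stop` are hypothetical. It shows the key property: callers only ever put messages into the mailbox, and a single thread owned by the actor processes them one at a time, so the handler needs no locks.

```python
import queue
import threading

_STOP = object()  # private sentinel telling the actor loop to exit

class Actor:
    """One mailbox, one processing thread, no state shared with callers."""

    def __init__(self, handler):
        self._mailbox = queue.Queue()  # unbounded, like a default mailbox
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        self._mailbox.put(message)           # 1. enqueue

    def _run(self):
        while True:
            message = self._mailbox.get()    # 2. dequeue
            if message is _STOP:
                break
            self._handler(message)           # 3-4. dispatch and process
            # 5-6. complete, then loop back for the next message

    def stop(self):
        """Process everything already enqueued, then shut down."""
        self._mailbox.put(_STOP)
        self._thread.join()

    def backlog(self):
        return self._mailbox.qsize()
```

Because the mailbox here is unbounded, a producer that outruns the handler simply grows `backlog()` without limit.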

Actor Model Design Issues
- MessageBox and MessageProcessing units are tightly coupled
- An increased backlog of streamed messages in the MessageBox can overload an Actor's memory footprint
- This negatively impacts responsiveness and leads to application crashes

courtesy: http://www.earthyreport.com

Let's discuss design patterns
- Effective Akka, Jamie Allen (https://twitter.com/jamie_allen)
- Akka Concurrency, Derek Wyatt (https://twitter.com/derekwyatt)
A must read for serious Akka hakkers
Work-Stealing or Work-Pull Pattern

Re-Design
- Use an Akka durable unbounded mailbox (filesystem or Redis based)
- Use the Akka work-stealing pattern and distribute load based on each worker's processing queue depth (ZooKeeper deployed for cluster coordination and service discovery)
- Akka Cluster was early in development (http://doc.akka.io/docs/akka/snapshot/common/cluster.html#cluster)
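As a rough illustration of distributing load by queue depth, the sketch below routes each message to the worker with the shallowest queue and lets an idle worker steal from the deepest one. It is a single-threaded Python model, not Akka code: `Worker`, `route`, and `steal` are hypothetical names, and durability and ZooKeeper-based discovery are out of scope.

```python
from collections import deque

class Worker:
    def __init__(self, name):
        self.name = name
        self.queue = deque()    # this worker's pending messages
        self.processed = []

    @property
    def depth(self):
        return len(self.queue)

    def process_one(self):
        """Handle the oldest pending message; return False when idle."""
        if not self.queue:
            return False
        self.processed.append(self.queue.popleft())
        return True

def route(workers, message):
    """Send the message to the worker with the shallowest queue.
    On ties, min() picks the first such worker in the list."""
    target = min(workers, key=lambda w: w.depth)
    target.queue.append(message)
    return target

def steal(idle, workers):
    """Let an idle worker take the newest message from the busiest worker.
    Nothing is stolen unless the victim has at least two pending messages."""
    victim = max(workers, key=lambda w: w.depth)
    if victim is idle or victim.depth < 2:
        return False
    idle.queue.append(victim.queue.pop())
    return True
```

Stealing from the tail of the victim's queue leaves the victim's oldest work in place, which is a common choice in work-stealing schedulers.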

Decoupling Messaging from Processing
[Diagram: Producer, Queue, and Consumer(s), connected by push arrows.]
Why?
- Redundancy
- Scalability
- Resiliency
- Ordering guarantee
- Asynchronous communication
- Replay
Choices
- JMS
- AWS SQS
- ZeroMQ
- RabbitMQ
- Kafka

We chose Apache Kafka
Open sourced by LinkedIn SNA (2011) and written in Scala
Kafka has:
- Broker(s)
- Topic(s)
- Partition(s)
- Replication
- Consumer grouping
- Buffering of messages in producer and consumer
[Diagram: (1) producers PUSH to brokers; Broker-1, Broker-2, and Broker-3 each hold Topic-1/Part-1, Topic-1/Part-2, and Topic-2/Part-1; (2) consumer groups (Consumer 1 ... n) PULL from brokers; a ZooKeeper ensemble sits alongside the brokers and consumers.]
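The push/pull model with topics, partitions, and consumer groups can be illustrated with a small in-memory model. This is a sketch in Python, not the Kafka client API: `Topic`, `ConsumerGroup`, and their methods are hypothetical, the CRC32 key hash is an assumption made for determinism, and replication and ZooKeeper are omitted.

```python
import zlib

class Topic:
    """A topic is a set of partitions; each partition is an ordered list."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumed partitioner: stable hash of the key modulo partition count.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def push(self, key, value):
        """Producer side: append to one partition, return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1

class ConsumerGroup:
    """Each group keeps its own offset per partition and pulls at its own pace."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def pull(self, partition, max_messages=10):
        start = self.offsets[partition]
        batch = self.topic.partitions[partition][start:start + max_messages]
        self.offsets[partition] = start + len(batch)
        return batch

    def rewind(self, partition, offset=0):
        """Replay: move the group's offset back and pull the data again."""
        self.offsets[partition] = offset
```

Since the offsets live with the consumer group rather than with the data, two groups can read the same partition independently, and either one can rewind to replay history.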

Why Kafka?
- Simple storage: sequential file I/O on a commit log, plus an index
- Batch writes and reads
- Zero-copy transfer from files to socket
- Compression (batched): GZIP and Snappy
- Message caching in the file system page cache, which avoids double buffering and GC overhead
[Diagram, disk layout of logs in a broker: Topic-1:Partition-1 and Topic-1:Partition-2 each consist of Segment-1 ... Segment-n; zooming into a segment shows Message-1, Message-2, ..., Message-n, with append() and read() operating on the log.]
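The append-only, segmented layout can be modeled compactly. In the Python sketch below, in-memory lists stand in for segment files and a sorted list of base offsets stands in for the index; `SegmentedLog` and its parameters are hypothetical, and page-cache behavior, zero-copy transfer, and compression are not modeled.

```python
import bisect

class SegmentedLog:
    """Append-only log split into fixed-size segments, addressed by offset."""

    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.segments = [[]]       # each inner list models one segment file
        self.base_offsets = [0]    # index: first offset stored in each segment
        self.next_offset = 0

    def append(self, message):
        """Write at the tail only; roll to a new segment when the last is full."""
        if len(self.segments[-1]) == self.segment_size:
            self.segments.append([])
            self.base_offsets.append(self.next_offset)
        self.segments[-1].append(message)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def read(self, offset, max_messages):
        """Sequential batch read starting at offset, crossing segments if needed."""
        if not 0 <= offset < self.next_offset:
            return []
        # Binary search the index for the segment that contains the offset.
        i = bisect.bisect_right(self.base_offsets, offset) - 1
        out = []
        while i < len(self.segments) and len(out) < max_messages:
            start = offset - self.base_offsets[i]
            take = self.segments[i][start:start + max_messages - len(out)]
            out.extend(take)
            offset += len(take)
            i += 1
        return out
```

Old data can be retired by dropping whole segments from the front, which keeps both writes and deletes sequential.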

DOJO Data Collection/Processing Pipeline, Generation-2
[Pipeline diagram with numbered steps 1-6. Components shown: CloudPhysics Observer(s); HTTP Cluster (Producers) and HTTP Cluster (Consumers), each an HTTP Play!/Akka cluster in an auto scaling group; Kafka broker with topics T1, T2, and T3, from which consumers pull; Object Storage (AWS S3 object store); Configuration Store (MongoDB sharded cluster); Performance Metrics Store (KairosDB / Cassandra cluster); Event Metrics Store (KairosDB / Cassandra cluster); Kafka HDFS Consumer; Storage (HDFS); Compute (Spark cluster, ALPHA); Collective Intelligence.]

Kafka Tuning Considerations
Increased partition configuration (for high-volume topics)
Advantages
- Increases I/O parallelism for writes
- Increases the degree of parallelism for consumers
Overheads
- More open file handles
- More offsets to be checkpointed, increasing load on the ZooKeeper ensemble
Tune memory of consumer based on (number of threads in consumer group) * (queuedchunks.max) * (fetch.size)
- Increase heap size
- Reduce consumer thread count
- Lower fetch size or max queue size
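The consumer memory rule of thumb above is easy to work through numerically. In the Python sketch below, the formula is the one on the slide; the thread count, queue length, and heap size are made-up illustration values, and the 1 MB fetch size is borrowed from the benchmark settings on the next slide.

```python
MB = 1024 * 1024

def consumer_buffer_bytes(threads, queued_chunks_max, fetch_size_bytes):
    """Slide formula: threads * queuedchunks.max * fetch.size."""
    return threads * queued_chunks_max * fetch_size_bytes

def fits_in_heap(buffer_bytes, heap_bytes, headroom=0.5):
    """Assumed policy: fetch buffers may use at most `headroom` of the heap."""
    return buffer_bytes <= heap_bytes * headroom

# Hypothetical settings: 4 consumer threads, 10 queued chunks, 1 MB fetch size.
estimate = consumer_buffer_bytes(threads=4, queued_chunks_max=10, fetch_size_bytes=1 * MB)
print(estimate // MB, "MB of fetch buffers")                     # 40 MB
print("fits in a 64 MB heap:", fits_in_heap(estimate, 64 * MB))  # False

# Each remedy from the slide shrinks one factor or grows the budget.
fewer_threads = consumer_buffer_bytes(2, 10, 1 * MB)
print("after halving threads:", fits_in_heap(fewer_threads, 64 * MB))  # True
```

Since the three factors multiply, halving any one of them halves the buffer requirement, which is why the slide lists them as interchangeable levers next to raising the heap.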

Kafka Performance Benchmarks: Consumer Throughput
courtesy: http://kafka.apache.org
- Message size: 200 bytes
- Batch size: 200 messages
- Fetch size: 1 MB
- Flush interval: 600 messages
- No. of consumers: 50
- Data consumed: 100 MB
- No. of messages consumed: 6000000

ENSO Data Analysis Pipeline
[Pipeline diagram with numbered steps 1-8. Components shown: Play/Akka based Web Application; Spray/Akka Cluster; HTTP Cluster (Producers); Query Engine (Time Machine, Query Planner, Aggregation); Meta Data and Authorization / Access Control; In-Memory Cache with TTL (Redis); Configuration Store (MongoDB sharded cluster); Performance Metrics Store (KairosDB / Cassandra cluster); Event Metrics Store (KairosDB / Cassandra cluster).]

Query Engine (Enso)
- A type-safe Scala API/DSL (domain-specific language) library that auto-generates MongoDB and Cassandra query operators
- Allows querying complex, deeply nested JSON structures
- Supports grouping, aggregation and filtering
- Supports on-demand querying of historical data
- Supports resource-management query priorities
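To give a feel for a DSL that generates query operators, here is a toy filter builder that emits MongoDB filter documents, using dot notation to reach nested JSON fields. It is written in Python and is not Enso's API: `Field` and `Cond` are hypothetical, the field names are invented, and Python cannot reproduce the compile-time type safety of the Scala original. Only the `$gt`, `$lt`, `$eq`, `$and`, and `$or` operators it emits are real MongoDB operators.

```python
class Cond:
    """A finished condition wrapping a MongoDB filter document."""

    def __init__(self, doc):
        self.doc = doc

    def __and__(self, other):
        return Cond({"$and": [self.doc, other.doc]})

    def __or__(self, other):
        return Cond({"$or": [self.doc, other.doc]})

class Field:
    """A (possibly nested) field, addressed with MongoDB dot notation."""

    def __init__(self, path):
        self.path = path

    def _op(self, operator, value):
        return Cond({self.path: {operator: value}})

    def __gt__(self, value):
        return self._op("$gt", value)

    def __lt__(self, value):
        return self._op("$lt", value)

    def __eq__(self, value):
        return self._op("$eq", value)

# Parentheses are required: in Python, & binds tighter than > and ==.
query = (Field("vm.cpu.usage") > 90) & (Field("vm.host.cluster") == "prod")
print(query.doc)
```

A second backend could walk the same `Cond` tree and emit Cassandra queries instead, which is the appeal of generating operators from one DSL.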

IT/DevOps Orchestration
- The Dojo Cluster is part of an auto scaling group
- Compute nodes are added/deleted based on throughput characteristics, e.g. lag in consumption from Kafka brokers
- We are currently leveraging the AnsibleWorks IT orchestration engine
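Lag-driven scaling can be sketched as two small functions: one that sums consumer lag across partitions, and one that turns lag into a node count. This Python sketch is an assumption-laden illustration, not the CloudPhysics or Ansible logic: the function names, the per-node capacity, and the node limits are all hypothetical.

```python
import math

def total_lag(log_end_offsets, committed_offsets):
    """Lag = messages written to each partition but not yet consumed, summed."""
    return sum(
        max(0, end - committed_offsets.get(partition, 0))
        for partition, end in log_end_offsets.items()
    )

def desired_nodes(lag, per_node_capacity, min_nodes, max_nodes):
    """Nodes needed to drain the lag, clamped to the group's size limits."""
    needed = math.ceil(lag / per_node_capacity)
    return max(min_nodes, min(max_nodes, needed))

# Hypothetical snapshot: partition 0 is 600 messages behind, partition 1 is caught up.
lag = total_lag({0: 1000, 1: 500}, {0: 400, 1: 500})
print(lag, desired_nodes(lag, per_node_capacity=250, min_nodes=1, max_nodes=10))
```

A real controller would likely smooth lag over a time window before acting, so that a short burst does not add and remove nodes in quick succession.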