Wisdom from Crowds of Machines

1 Wisdom from Crowds of Machines
Analytics and Big Data Summit, September 19, 2013
Chetan Conikee, Irfan Ahmad

2 About Us
CloudPhysics' mission is to discover the underlying principles that govern systems behavior and use this knowledge to transform computing.

Chetan Conikee, Chief Data Officer
- Founder, SeekVence
- Data Architect - Intuit
- Founding Engineer - Business Signatures (Entrust), CashEdge (Fiserv), Smartpipes (Sophos)

Irfan Ahmad, Chief Technology Officer
- VMware tech lead - DRS (Distributed Resource Scheduling) Team
- Transmeta (microprocessors)

3 Collective Intelligence
[Architecture diagram. Labels recovered from the figure:]
- Perf Stat Datastore; Inventory Datastore; Event, Task Datastore
- Secure multi-tenant architecture
- Scale Out Query Engine
- Parsers, Analyzers
- Ingest: Parsers, Schema Normalization
- Predictive Analytics, Apps, Benchmarks, Modeling, Rules Engines
- Scalable Cache
- Cards
- Web/App Servers (HTTPS)
- Collector vApp (HTTPS): high-resolution performance, config, fault, change, event, task data
CloudPhysics, Inc. Copyright All rights reserved.

4 We collect 100 billion+ data points/day
- Avg 1.3 million properties/datacenter
- Continuous time-series data stream
- VMs, servers, networks, storage
- Configuration, state, tasks, events

5 Our dialect is Scala
- Productive
- Fast
- Functional
- Statically typed
- Actors - an elegant way to handle concurrency
- ... and inter-op Java with Scala

6 Our Software ToolSet Apache Kafka

7 Data Pipeline Criteria
- Secure data retention
- Guarantee of NO data loss
- Scale out the consumption process on demand
- Query near-to-real-time
- Query historical data on demand (replay)

8 Secure Data Retention
- Data organized and partitioned by time
- Stored directly in highly scalable, distributed object storage
- Encryption for data at rest - AES-256
- Processing pipeline starts from object storage
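
As a rough illustration of "organized and partitioned by time", the sketch below derives an object-store key from a timestamp so that a replay can list one hour or one day by prefix. The key layout, the hourly granularity, and the `tenant` / `stream` names are assumptions made for this example; the slides do not describe the actual layout, and encryption at rest is left out.

```python
from datetime import datetime, timezone


def hour_prefix(tenant: str, stream: str, ts: datetime) -> str:
    """Prefix shared by all objects written in the same UTC hour (hypothetical layout)."""
    ts = ts.astimezone(timezone.utc)
    return f"{tenant}/{stream}/{ts:%Y/%m/%d/%H}/"


def object_key(tenant: str, stream: str, ts: datetime) -> str:
    """Full key for one object: hourly prefix plus a sortable timestamp."""
    ts = ts.astimezone(timezone.utc)
    return hour_prefix(tenant, stream, ts) + f"{ts:%Y%m%dT%H%M%SZ}.json"


ts = datetime(2013, 9, 19, 14, 30, 5, tzinfo=timezone.utc)
key = object_key("tenant-a", "perf", ts)
prefix = hour_prefix("tenant-a", "perf", ts)
print(key)     # tenant-a/perf/2013/09/19/14/20130919T143005Z.json
print(prefix)  # tenant-a/perf/2013/09/19/14/
```

Because the time components lead the key, a downstream job can start "from object storage" by listing a prefix instead of scanning every object.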

9 How do we scale? courtesy:

10 Circa 2012
- A cluster of Play! + Akka app servers
- TechCrunch moment at VMworld 2012
- Exponential customer growth

11 DOJO Data Collection/Processing Pipeline, Generation-1
[Pipeline diagram with numbered steps. Labels recovered from the figure:]
- CloudPhysics Observer(s)
- HTTP Cluster (Message Router)
- HTTP Cluster (Message Processor)
- HTTP Play! Akka Cluster (shown twice)
- Object Storage: AWS S3 Object Store
- Configuration Store: MongoDB Sharded Cluster
- Performance Metrics Store: KairosDB Cassandra Cluster
- Event Metrics Store: KairosDB Cassandra Cluster

12 HTTP Cluster
HTTP Endpoints - Play!, Akka and Scala Actor(s)
[Diagram: Router, MessageBox, Dispatcher. Steps in order: Enqueue, Dequeue, Request Dispatch, Process, Complete, Next (repeat from Enqueue)]
- Lightweight objects with universal concurrency primitives
- Message-passing concurrency and no shared state
- Location transparent and distributable by design
- Scale UP and scale OUT on demand
You get a PERFECT FABRIC for the CLOUD
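
The enqueue / dequeue / process loop can be sketched without Akka. The following is a minimal, thread-based stand-in written in Python, not the Scala/Akka code behind the slides: each actor owns a private mailbox and a single thread that drains it, so state is only ever touched by one thread. The `Actor` and `Counter` classes are hypothetical names for this example.

```python
import queue
import threading


class Actor:
    """A tiny actor: private state, a mailbox, and one thread that drains it."""

    _STOP = object()  # sentinel telling the actor to shut down

    def __init__(self):
        self._mailbox = queue.Queue()  # unbounded mailbox
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Enqueue a message; the sender never touches the actor's state."""
        self._mailbox.put(message)

    def stop(self):
        """Ask the actor to stop after it has processed everything queued so far."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()  # dequeue
            if message is self._STOP:
                break
            self.receive(message)          # process, then loop for the next one

    def receive(self, message):
        raise NotImplementedError


class Counter(Actor):
    def __init__(self):
        self.total = 0       # set before the mailbox thread starts
        super().__init__()

    def receive(self, message):
        self.total += message  # no lock needed: only the actor thread runs this


counter = Counter()
for n in range(1, 101):
    counter.tell(n)
counter.stop()
print(counter.total)  # 5050
```

Note that the mailbox here is unbounded and lives in the same process as the processing loop, which is exactly the coupling the next slide calls out.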

13 Actor Model Design Issues
- MessageBox and MessageProcessing units are tightly coupled
- An increased backlog of streamed messages in the MessageBox can overload an Actor's memory footprint
- This negatively impacts responsiveness and leads to application crashes

15 Let's discuss design patterns
- Effective Akka - Jamie Allen
- Akka Concurrency - Derek Wyatt
A must read for serious Akka hakkers
Work-Stealing or Work-Pull Pattern

16 Re-Design
- Use Akka durable unbounded mailbox (filesystem or Redis based)
- Use Akka work-stealing pattern and distribute load based on worker's processing queue depth (deployed ZooKeeper for cluster co-ordination and service discovery)
- Akka Cluster was early in development (
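
The idea behind the work-pull pattern is that idle workers ask for work instead of having it pushed at them, so a slow worker never builds up a private backlog. The sketch below approximates this in Python with threads and one shared bounded queue; it is an illustration under those assumptions, not the Akka implementation, and it omits the durable mailbox and ZooKeeper coordination mentioned above. `run_work_pull`, `num_workers`, and `max_pending` are hypothetical names.

```python
import queue
import threading


def run_work_pull(items, num_workers=3, max_pending=4):
    """Process items with workers that pull from a shared, bounded queue."""
    work = queue.Queue(maxsize=max_pending)  # bounded: caps the in-memory backlog
    results = []
    lock = threading.Lock()

    def worker(name):
        while True:
            item = work.get()        # pull only when this worker is free
            if item is None:         # poison pill: no more work
                break
            with lock:
                results.append((name, item * item))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()

    for item in items:
        work.put(item)               # blocks when the queue is full (backpressure)
    for _ in threads:
        work.put(None)               # one poison pill per worker
    for t in threads:
        t.join()
    return results


results = run_work_pull(range(20))
print(sorted(value for _, value in results)[:5])  # [0, 1, 4, 9, 16]
```

Which worker handles which item varies from run to run; only the set of results is deterministic.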

17 Decoupling Messaging from Processing - Why?
[Diagram: Producer, Push, Queue, Push, Consumer(s)]
- Redundancy
- Scalability
- Resiliency
- Ordering Guarantee
- Asynchronous Communication
- Replay
Choices
- JMS
- AWS SQS
- ZeroMQ
- RabbitMQ
- Kafka

18 We chose Apache Kafka
Open sourced by LinkedIn SNA in 2011 and written in Scala
Kafka has:
- Producer
- Broker(s)
- Topic(s)
- Partition(s)
- Replication
- Consumer Grouping
- Buffering of Messages in Producer & Consumer
[Diagram: Producers (1 PUSH) write to Broker-1, Broker-2 and Broker-3, each holding Topic-1/Part-1, Topic-1/Part-2 and Topic-2/Part-1; Consumer-Groups (Consumer 1...n) read from the brokers (2 PULL); Zookeeper Ensemble]

19 Why Kafka?
Disk layout
- Sequential file I/O on commit log
- Index
- Simple storage
[Diagram: Logs in Broker. Topic-1:Partition-1 and Topic-1:Partition-2 each consist of Segment-1 ... Segment-n; zooming into a segment shows Message-1, Message-2, ... Message-n, with read() and append() operations]
- Batch writes and reads
- Zero-copy transfer from files to socket
- Compression (batched) - GZIP and Snappy
- Message caching in file system page cache - avoids double buffering and GC overhead
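
To make the append()/read() picture concrete, here is a toy single-segment commit log in Python. It is a sketch of the general idea only (append-only file, offset-to-position index, sequential reads from a chosen offset), not Kafka's on-disk format; the 4-byte length prefix, the in-memory index, and the `SegmentLog` name are assumptions for this example.

```python
import os
import struct
import tempfile


class SegmentLog:
    """Append-only log file with an in-memory index: message offset -> byte position."""

    def __init__(self, path):
        self._path = path
        self._index = []
        open(path, "ab").close()  # create the segment file if it does not exist

    def append(self, payload: bytes) -> int:
        """Write one length-prefixed message at the end; return its offset."""
        with open(self._path, "ab") as f:
            position = f.seek(0, os.SEEK_END)
            f.write(struct.pack(">I", len(payload)) + payload)
        self._index.append(position)
        return len(self._index) - 1

    def read(self, offset: int, max_messages: int):
        """Read up to max_messages sequentially, starting at a message offset."""
        messages = []
        if offset >= len(self._index):
            return messages
        count = min(max_messages, len(self._index) - offset)
        with open(self._path, "rb") as f:
            f.seek(self._index[offset])
            for _ in range(count):
                (size,) = struct.unpack(">I", f.read(4))
                messages.append(f.read(size))
        return messages


with tempfile.TemporaryDirectory() as directory:
    log = SegmentLog(os.path.join(directory, "segment-1.log"))
    offsets = [log.append(m) for m in (b"m1", b"m2", b"m3")]
    first_batch = log.read(0, 2)   # a consumer reading from the start
    replay = log.read(1, 10)       # another consumer replaying from offset 1

print(offsets)      # [0, 1, 2]
print(first_batch)  # [b'm1', b'm2']
print(replay)       # [b'm2', b'm3']
```

Since the log never mutates earlier messages, each consumer only needs to remember an offset, which is what makes replay cheap.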

20 DOJO Data Collection/Processing Pipeline, Generation-2
[Pipeline diagram with numbered steps. Labels recovered from the figure:]
- CloudPhysics Observer(s)
- HTTP Cluster (Producers)
- Kafka Broker with topics T1, T2, T3; consumers Pull
- HTTP Cluster (Consumers)
- HTTP Play! Akka Cluster, Auto Scaling Group (shown twice)
- Kafka HDFS Consumer
- Object Storage: AWS S3 Object Store
- Configuration Store: MongoDB Sharded Cluster
- Performance Metrics Store: KairosDB Cassandra Cluster
- Event Metrics Store: KairosDB Cassandra Cluster
- Collective Intelligence (ALPHA): Spark Cluster (Compute), HDFS (Storage)

21 Kafka Tuning Considerations
Increased partition configuration (for high-volume topics)
Advantages
- Increases I/O parallelism for writes
- Increases degree of parallelism for consumers
Tune memory of consumer based on (number of threads in consumer group) * (queuedchunks.max) * (fetch.size)
- Increase heap size
- Reduce consumer thread count
- Lower fetch size or max queue size
Overheads
- More open file handles
- More offsets to be checkpointed, increasing load on the ZooKeeper ensemble
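
The memory rule of thumb above is a simple product, and a quick calculation shows how each knob moves it. The numbers below (4 threads, 10 queued chunks, 1 MiB fetch size) are made-up inputs for illustration, not recommended settings or values from the talk.

```python
def consumer_buffer_bytes(threads: int, queued_chunks_max: int, fetch_size_bytes: int) -> int:
    """Worst-case bytes buffered by a consumer: threads * queuedchunks.max * fetch.size."""
    return threads * queued_chunks_max * fetch_size_bytes


MIB = 1024 * 1024

# Hypothetical starting point.
estimate = consumer_buffer_bytes(threads=4, queued_chunks_max=10, fetch_size_bytes=1 * MIB)

# Two of the remedies from the slide: fewer threads, or a smaller fetch size.
fewer_threads = consumer_buffer_bytes(threads=2, queued_chunks_max=10, fetch_size_bytes=1 * MIB)
smaller_fetch = consumer_buffer_bytes(threads=4, queued_chunks_max=10, fetch_size_bytes=MIB // 2)

print(f"{estimate / MIB:.0f} MiB")       # 40 MiB
print(f"{fewer_threads / MIB:.0f} MiB")  # 20 MiB
print(f"{smaller_fetch / MIB:.0f} MiB")  # 20 MiB
```

If the estimate approaches the JVM heap available to the consumer, the options are the three listed above: more heap, fewer threads, or smaller fetch and queue sizes.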

22 Kafka Performance Benchmarks
Consumer Throughput
courtesy:
- Message size: 200 bytes
- Batch size: 200 messages
- Fetch size: 1 MB
- Flush interval: 600 messages
- No. of consumers: 50
- Data consumed: 100 MB
- No. of messages consumed:

23 ENSO Data Analysis Pipeline
[Pipeline diagram with numbered steps. Labels recovered from the figure:]
- Play/Akka based Web Application
- Spray / Akka Cluster
- HTTP Cluster (Producers)
- Query Engine
- Time Machine, Query Planner, Aggregation
- Meta Data, Authorization / Access Control
- In-Memory Cache with TTL: Redis
- Configuration Store: MongoDB Sharded Cluster
- Performance Metrics Store: KairosDB Cassandra Cluster
- Event Metrics Store: KairosDB Cassandra Cluster

24 Query Engine (Enso)
- A type-safe Scala API/DSL (domain-specific language) library that auto-generates MongoDB and Cassandra query operators
- Allows querying complex, deeply nested JSON structures
- Supports grouping, aggregation and filtering
- Supports on-demand querying of historical data
- Supports resource-management query priorities
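
The slides do not show Enso's API, so the following is only a guess at the general shape of a DSL that emits MongoDB query operators, written in Python rather than Scala and without the type safety the real library claims. `Field`, `all_of`, and the field paths are hypothetical; the emitted `$gt`, `$lt`, `$eq`, and `$and` operators and the dotted paths for nested documents are standard MongoDB query syntax.

```python
class Field:
    """A (possibly nested) document field, addressed with MongoDB dot notation."""

    def __init__(self, path: str):
        self.path = path

    def __gt__(self, value):
        return {self.path: {"$gt": value}}

    def __lt__(self, value):
        return {self.path: {"$lt": value}}

    def eq(self, value):
        return {self.path: {"$eq": value}}


def all_of(*clauses):
    """Combine filter clauses so that every one of them must match."""
    return {"$and": list(clauses)}


# Hypothetical field paths into a deeply nested JSON document.
query = all_of(
    Field("vm.config.memoryMB") > 4096,
    Field("vm.host.cluster").eq("prod"),
)
print(query)
```

The resulting dictionary is an ordinary MongoDB filter document; a real DSL would also need a second backend that renders the same expression tree for Cassandra.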

25 IT/DevOps Orchestration
- The Dojo cluster is part of an auto scaling group
- Compute nodes are added/deleted based on throughput characteristics, e.g. lag in consumption from Kafka brokers
- We are currently leveraging the AnsibleWorks IT Orchestration Engine
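
A lag-driven scaling rule of the kind described here can be as small as one function. The sketch below is an assumption about what such a rule might look like; the thresholds, the one-node step, and the node limits are invented for the example and are not taken from the talk.

```python
def desired_nodes(current_nodes: int, lag_messages: int,
                  scale_out_lag: int, scale_in_lag: int,
                  min_nodes: int = 1, max_nodes: int = 10) -> int:
    """Return the node count to aim for, given consumer lag behind the brokers."""
    if lag_messages > scale_out_lag:
        target = current_nodes + 1   # consumers are falling behind: add a node
    elif lag_messages < scale_in_lag:
        target = current_nodes - 1   # consumers are keeping up easily: remove a node
    else:
        target = current_nodes       # inside the dead band: do nothing
    return max(min_nodes, min(max_nodes, target))


# Hypothetical thresholds: scale out above 10,000 messages of lag, in below 1,000.
print(desired_nodes(3, 50_000, scale_out_lag=10_000, scale_in_lag=1_000))  # 4
print(desired_nodes(3, 500, scale_out_lag=10_000, scale_in_lag=1_000))     # 2
print(desired_nodes(3, 5_000, scale_out_lag=10_000, scale_in_lag=1_000))   # 3
```

Keeping a gap between the two thresholds avoids adding and removing nodes on every small fluctuation in lag.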
