Building a Scalable News Feed Web Service in Clojure


by Glenn Engstrand, 2014

This is a good time to be in software. The Internet has made communications between computers and people extremely affordable, even at scale. Cloud computing has opened up technology possibilities and capabilities to people and organizations once thought too small to even dream of affording them. The open source software movement and component based architectures have made it incredibly easy for individual engineers, entrepreneurs, or small technology startups to create great products with little or no funding.

Most experts agree that the commodifying influence of the Internet, the cloud, and open source has lowered cost based barriers to entry, but what about quality? Do these technologies deliver reliable and accurate capability at scale? That is what I wanted to discover when I created this basic news feed service using recent and popular open source projects and tested its load capabilities running on Amazon's public cloud.

Clojure News Feed Web Service, Copyright 2014 by Glenn Engstrand

Here are the open source projects that I put to the test. Clojure is a functional programming language built for the Java Virtual Machine. Cassandra is a column oriented big data NoSql data store well suited to storing large volumes of time series data. PostGreSql is a fully ACID compliant relational database management system. Solr is a web service with the popular search engine Lucene embedded within it. Kafka is an asynchronous messaging service well suited to aggregating event logs. Redis is an in memory key value data store most often used as a write behind cache.

What I did to test this assertion was to write a web service that combined all of these technologies. Then I wrote a load testing tool and some reporting utilities to put the service under load and to measure and evaluate how it performed. It is all open source, which you can find here: https://github.com/gengstrand/clojure-news-feed

Clojure is a functional programming language, which means that it is not easy to modify the state of objects once they have been created. One very important capability any web service needs is the ability to report its own performance metrics when queried. On Java based services, that is done via a technology called JMX. This is not easy for Clojure to do, so I wrote some Java code to do it for the proposed news feed web service. The support project is where that code is organized. Clojure and Java inter operate very easily, so it is no big deal to call Java code from Clojure. The support project also wraps access to the SolrJ API. I wanted the outbound activity content to be keyword searchable, which is where Solr comes in. The solr sub project contains all the configuration files needed to index the outbound feed.

If you have ever used Facebook, then you should already be familiar with what a news feed is. This news feed is very basic. Participants are linked to each other through what is known as a social graph. A participant's activity is captured in his outbound feed. When a participant posts an activity, his friends (i.e. the participants he is directly connected to on his social graph) receive a copy of this activity in their inbound feeds.
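To illustrate how painless the Clojure-to-Java interop mentioned above really is, here is a minimal sketch (not code from the actual support project) that reads the JVM's own heap usage through the standard JMX-backed `java.lang.management` beans, using only dot notation:

```clojure
;; Minimal sketch of Clojure calling Java directly, in the spirit of the
;; support project's JMX integration. Uses only the standard JDK.
(ns interop.sketch
  (:import (java.lang.management ManagementFactory)))

(defn heap-used-bytes
  "Return the number of heap bytes currently in use, via the MemoryMXBean."
  []
  (.. (ManagementFactory/getMemoryMXBean) getHeapMemoryUsage getUsed))

(heap-used-bytes)  ;; => a positive long
```

The `..` macro chains Java method calls, so wrapping a JMX bean for use from Clojure is a one-liner.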

The feed project is where all the Clojure code that implements the web service is organized. Here are the main components of the feed service. The cache, redis, cassandra, messaging, kafka, mysql, postgres, and search name spaces all provide bindings to their respective technologies. The rdbms name space provides routing logic to either mysql or postgres depending on the configuration, which is loaded by the settings name space code. This service does use connection pooling to the RDBMS, which is currently set to a 50 maximum pool size. The metrics name space provides bindings to the JMX part of the support project. The core component brings it all together with the ability to either load or store data from the underlying technology. It also logs performance data, both to JMX and to the Kafka feed topic. The handler name space maps HTTP requests to the proper functions in the core name space. For saves, I would use Clojure's support for polymorphism and the multi method dispatch mechanism. For loads, I would use Clojure's support for higher order functions.

The news feed performance project is where all the Java code that implements a Hadoop map reduce job for aggregating the performance data from the news feed web service is organized.
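A hedged sketch of the multi method dispatch idea described above (the function and setting names here are illustrative, not the repo's actual identifiers): a save is dispatched to either mysql or postgres based on a configured keyword.

```clojure
;; Illustrative sketch of routing saves with a Clojure multimethod,
;; in the spirit of the rdbms name space. Names are hypothetical.
(ns rdbms.sketch)

(def settings {:rdbms :postgres})   ;; stand-in for the settings name space

(defmulti save-participant
  "Dispatch a save on the configured relational database."
  (fn [settings participant] (:rdbms settings)))

(defmethod save-participant :postgres [settings participant]
  (str "insert via postgres: " (:name participant)))

(defmethod save-participant :mysql [settings participant]
  (str "insert via mysql: " (:name participant)))

(save-participant settings {:name "Glenn"})
;; => "insert via postgres: Glenn"
```

Adding support for another RDBMS is then just another `defmethod`; no existing code has to change.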

The etl project contains a command line utility that takes the per minute aggregated output from the news feed performance map reduce job and populates a mysql database ready for access via the open source OLAP project Mondrian. The load project contains the Clojure code that implements the load application. This command line tool simulates a specified number of user sessions that create accounts, connect these accounts to each other socially, then post outbound activity. This is all done by calling the various end points of the web service.

Once the code was written, it was time to set up the environment in order to perform the load test. I used 5 m1.medium instances from EC2 and 1 db.m1.medium instance from RDS. What got installed on the EC2 instances was Cassandra, Solr, Kafka, Redis, and Jetty. Each service got its own server, but I installed ZooKeeper on the same instance as Kafka because Kafka depends on ZooKeeper. Solr can be set up as either single core or multi core. I chose multi core for its increased flexibility. The news feed service can run with either MySql or PostGreSql, but I chose RDS running PostGreSql for the load test.

The news feed web service handles HTTP requests with Compojure, which is a routing library for Ring, which is a web application library that integrates with Jetty. You can embed Jetty inside the JAR file which serves as the build artifact for the project. That is how I chose to build and release this experiment.

I ended up using the defaults for most of the configurations of these services. I already explained what you need to do to the Solr configuration. Cassandra is the other service that needs a tweak or two in its cassandra.yaml file. The rpc_address and listen_address will need to be set to the external IP of the host where Cassandra is running. Because the news feed service is using the CQL driver and not the Hector client, you will also need to set start_native_transport to true. If you do decide to deploy this service to the cloud, then you will need to extensively change the feed/etc/config.clj file, which contains all the connection settings to the various servers. The defaults in that file are all localhost, which is fine if you are just exploring the service for yourself. See the various README and docs/intro markdown files for much more detailed information.

Now that the service is all set up and running, how do we test its capacity under load? The first thing to do is to run the load test application, which is a command line tool, written in Clojure, that simulates a number of concurrent sessions. Some percentage of those sessions are just performing keyword search.
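For readers unfamiliar with the Compojure/Ring/Jetty arrangement, here is a minimal sketch of that stack. The routes are hypothetical stand-ins, not the feed project's actual end points; it assumes the compojure and ring-jetty-adapter dependencies are on the classpath.

```clojure
;; Illustrative sketch of Compojure routing on Ring with embedded Jetty.
;; Route paths here are hypothetical; see the feed project for the real ones.
(ns handler.sketch
  (:require [compojure.core :refer [defroutes GET POST]]
            [compojure.route :as route]
            [ring.adapter.jetty :refer [run-jetty]]))

(defroutes app
  (GET "/participant/:id" [id] (str "participant " id))
  (POST "/outbound" request "activity accepted")
  (route/not-found "not found"))

(defn -main []
  ;; Embedding Jetty like this is what lets the uberjar serve as the
  ;; single build-and-release artifact.
  (run-jetty app {:port 8080}))
```

Compojure maps each HTTP route to a plain Clojure function, which is exactly the role the handler name space plays in the feed project.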

Configuration parameters for the load test application include host, port, participant batch size, minimum number of friends, maximum number of friends, how many words for the subject line, how many words in the activity content, how many activity posts per user, and how many searches per user. I ran this test with 100 concurrent users and 10% searches. The load application got to run on its own m1.large EC2 instance.

Now that the load test is running, how do we collect the performance data? As mentioned previously, the service logs this data to Kafka, which provides command line tools to read the data and redirect it to a text file. Once the test concluded, I transferred that file to Hadoop, where I ran the map reduce job from the news feed performance project, which input that raw performance information and output a text file containing the per minute collection of performance metrics (throughput, mode, and 95th percentile) per entity (Participant, Friends, Inbound, Outbound) and activity (load, store, post, and search).

The load test has completed and we have a text file containing all the performance metrics. How do we access and analyze this data? I used the etl Clojure project (i.e. Extract, Transform, and Load) to input the summarized performance metrics file into a MySql database ready for access by a Mondrian OLAP cube. You have to create the database and set up appropriate credentials. Run the etl/etc/starschema.sql script to create the tables and stored procedures that this etl program needs. Modify the etl/src/etl/core.clj program to connect to the database with the proper credentials. See the etl/doc/intro.md file for details on how to set up Mondrian with the news feed performance cube data.
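To make the per minute metrics concrete, here is a small sketch (not the actual map reduce job, which is written in Java) of how throughput, mode, and 95th percentile could be computed in Clojure for one minute's worth of latencies for a single entity and activity:

```clojure
;; Illustrative per-minute aggregation: throughput (count), mode, and 95th
;; percentile of a seq of latency samples in milliseconds.
(ns perf.sketch)

(defn aggregate [latencies]
  (let [n      (count latencies)
        sorted (vec (sort latencies))
        ;; index of the 95th percentile sample, clamped to the last element
        p95    (nth sorted (min (dec n) (int (Math/floor (* 0.95 n)))))
        ;; most frequently observed latency
        mode   (key (apply max-key val (frequencies latencies)))]
    {:throughput n :mode mode :p95 p95}))

(aggregate [10 10 12 10 50 11 10 90 10 13])
;; => {:throughput 10, :mode 10, :p95 90}
```

Mode and 95th percentile together give a feel for both the typical request and the tail, which is why the findings below quote both numbers.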

Now that the data has been collected and analyzed, what were the findings? Performance at the beginning was not so good. That was because the Redis cache was cold. Once the cache warmed up, performance was great. Near the end of the first hour, performance really bottomed out briefly. I suspect that the cause of the big spike can be attributed to the very nature of cloud computing itself. I provisioned and ran the server side software on five EC2 instances, but is that really five different servers? No. Cloud computing uses virtualization. I sshed to each 64 bit Ubuntu LTS box, but that is actually just a guest OS running alongside other guest OS images on some host server that I never see. On the cloud, spikes in performance are more likely to be attributed to increases in activity in the images that reside alongside yours on the real hardware. You can purchase instances with provisioned guarantees of input/output operations per second, but that costs more. I did not do that for this test.

At first glance, it looks like Cassandra writes and Postgres writes have about the same latency. The steady state numbers for mode are 10 ms for Postgres and 8 ms for Cassandra. The 95th percentiles are 50 ms for Postgres and 47 ms for Cassandra. What you have to take into account is that Cassandra was handling 14 times more writes than Postgres. Solr performance is very consistent. It was handling 33 requests per second with a very steady 135 ms per request for the mode and 213 ms for the 95th percentile. Compare that with the 5X difference in the other data stores.

The second hour of the load test had consistent performance metrics, so I will base most of my findings on that information alone. Overall transaction throughput for posting outbound activity is a very respectable 58 to 83 transactions per second, with a latency whose mode is 393 ms and whose 95th percentile is about 1 second. That is not bad at all considering that this is with one web server, which you could easily scale out.

Does the Internet, cloud computing, and open source really disrupt entrepreneurism by providing lean startups with the capability to serve large markets on a budget? Well, we have some mixed results here. The Internet is clearly disruptive in that consumers will gladly pay to benefit from its network effects. Open source is a big win for small business in that these 1400 lines of code, which one guy wrote over the holidays, could easily deliver non trivial news feed capability at a scale that the Fortune 500 would gladly have paid a premium for merely a decade ago. Cloud computing, however, is a different story. The good news is that it cost me only $10 to run this test. That's about as commodity as it gets. The bad news is that it is all about up sell in the cloud. You would need three times as many servers running in at least two availability zones for a credible high availability story. Increase the RAM and CPUs and start provisioning IOPs and you will quickly find that the cloud is not about saving money, especially once you have achieved some success and have a customer base to protect.