
Openbus Documentation
Release 1
Produban
February 17, 2014


An open source architecture able to process the massive amount of events that occur in a banking IT infrastructure.


CHAPTER 1 Introduction

The objective of Openbus is to define an architecture able to process the massive amount of events that occur in a banking IT infrastructure. Those events are of different types, come from a variety of sources and have different formats. Depending on the nature of the events, we process them in a batch-oriented or near-realtime fashion.

To achieve this flexibility and processing capacity, we have defined Openbus as a concrete implementation of the so-called Lambda Architecture for Big Data systems [Marz]. The Lambda Architecture defines three main layers for the processing of data streams: the batch layer, the speed layer and the serving layer. In our case, Openbus comprises a set of technologies that interact with each other to implement these layers:

Apache Kafka: this is our data stream. Different systems publish messages into Kafka topics.
HDFS: this is where our master dataset is stored.
MapReduce: this is how our batch layer recomputes batch views. MapReduce is also used for a batch ETL process that dumps data from Kafka to HDFS.
Apache Storm: this is what we use as our speed layer. Events are consumed from Kafka and processed into realtime views.
HBase: this is where we store the realtime views generated by Storm topologies.

1.1 Use Cases

Some use cases where Openbus could be applied are:

Web analytics
Social network analysis
Security Information and Event Management


CHAPTER 2 Architecture

2.1 High level architecture

2.2 Data stream: Apache Kafka

We use Apache Kafka as a central hub for collecting different types of events. Multiple systems publish events into Kafka topics. At the moment we are using the Avro format for all published events.

2.2.1 Introduction

Kafka is designed as a unified platform for handling all the real-time data feeds a large company might have. Kafka is run as a cluster of broker servers. Messages are stored in categories called topics. Those messages are published by producers and consumed and further processed by consumers. Kafka uses a custom TCP protocol for communication between clients and servers.

A Kafka topic is split into one or more partitions. Partitions are distributed over the servers in the Kafka cluster, and each partition is replicated across a configurable number of servers.

2.2.2 Producing

Kafka uses a custom TCP based protocol to expose its API. Apart from the JVM client maintained in its own codebase, there are client libraries for the following languages and systems:

Python
Go
C
C++
Clojure
Ruby
NodeJS
Storm

Scala DSL
JRuby

Becoming a publisher in Kafka is not very difficult. You will need a partial list of your Kafka brokers (it doesn't have to be exhaustive, since the client uses those endpoints to query for the topic leaders) and a topic name. This is an example of a very simple Kafka producer in Java:

import java.util.Date;
import java.util.Properties;
import java.util.Random;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class BasicProducer {

    Producer<String, String> producer;

    public BasicProducer(String brokerList, boolean requiredAcks) {
        Properties props = new Properties();
        props.put("metadata.broker.list", brokerList);
        props.put("request.required.acks", requiredAcks ? "1" : "0");
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        producer = new Producer<String, String>(new ProducerConfig(props));
    }

    public void sendMessage(String topic, String key, String value) {
        KeyedMessage<String, String> data = new KeyedMessage<String, String>(topic, key, value);
        producer.send(data);
    }

    /**
     * Creates a simulated random log event and sends it to a Kafka topic.
     *
     * @param topic topic where the message will be sent
     */
    public void sendRandomLogEvent(String topic) {
        // Build a random IP message
        Random rnd = new Random();
        long runtime = new Date().getTime();
        String ip = "192.168.2." + rnd.nextInt(255);
        String msg = runtime + ", www.example.com, " + ip;

        // Send the message to the broker
        this.sendMessage(topic, ip, msg);
    }
}
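As a usage sketch (not part of the Openbus codebase), the producer above could be driven as follows; the broker address and the wslog topic name are assumptions:

public class BasicProducerExample {
    public static void main(String[] args) throws InterruptedException {
        // Assumed values: a single local broker and a hypothetical "wslog" topic.
        BasicProducer producer = new BasicProducer("localhost:9092", true);

        // Send ten simulated log events, one per second.
        for (int i = 0; i < 10; i++) {
            producer.sendRandomLogEvent("wslog");
            Thread.sleep(1000);
        }
    }
}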

2.2.3 Avro

As previously said, we are using the Avro data format to serialize all the data events we produce into our Kafka data stream. This means that before sending a message to Kafka, we serialize it in Avro format using a concrete Avro schema. This schema is embedded in the data we send to Kafka, so every future consumer of the message will be able to deserialize it. In the Openbus code you can find the AvroSerializer and AvroDeserializer Java classes, which can be of great help when producing or consuming Avro messages from Kafka.

This is the current Avro schema we are using for log messages:

{
    "type": "record",
    "name": "ApacheLog",
    "namespace": "openbus.schema",
    "doc": "Apache Log Event",
    "fields": [
        {"name": "host", "type": "string"},
        {"name": "log", "type": "string"},
        {"name": "user", "type": "string"},
        {"name": "datetime", "type": "string"},
        {"name": "request", "type": "string"},
        {"name": "status", "type": "string"},
        {"name": "size", "type": "string"},
        {"name": "referer", "type": "string"},
        {"name": "useragent", "type": "string"},
        {"name": "session", "type": "string"},
        {"name": "responsetime", "type": "string"}
    ]
}

An example of producing Avro messages into Kafka is our AvroProducer class:

public class AvroProducer {

    private Producer<byte[], byte[]> producer;
    private AvroSerializer serializer;
    private String topic;

    public AvroProducer(String brokerList, String topic, String avroSchemaPath, String[] fields) {
        this.topic = topic;
        this.serializer = new AvroSerializer(ClassLoader.class.getResourceAsStream(avroSchemaPath));

        Properties props = new Properties();
        props.put("metadata.broker.list", brokerList);
        this.producer = new kafka.javaapi.producer.Producer<>(new ProducerConfig(props));
    }

    /**
     * Sends a message.
     *
     * @param values array of Avro field values to be sent to Kafka
     */
    public void send(Object[] values) {
        Message message = new Message(serializer.serialize(values));
        //producer.send(new KeyedMessage<byte[], byte[]>(topic, message.buffer().array()));
        producer.send(new KeyedMessage<byte[], byte[]>(topic, serializer.serialize(values)));
    }

    /**
     * Closes the producer.
     */
    public void close() {
        producer.close();
    }
}
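A usage sketch for this class could look like the following; the broker address, topic name, schema resource path and field values are assumptions, and the field order simply follows the ApacheLog schema shown above:

public class AvroProducerExample {
    public static void main(String[] args) {
        String[] fields = {"host", "log", "user", "datetime", "request", "status", "size",
                "referer", "useragent", "session", "responsetime"};

        // Assumed values: local broker, hypothetical "wslog" topic and schema resource path.
        AvroProducer producer = new AvroProducer("localhost:9092", "wslog", "/apachelog.avsc", fields);

        // One value per schema field, in the same order as the schema.
        producer.send(new Object[] {
                "192.168.2.10", "-", "user1", "05/Nov/2013:10:00:00 +0100",
                "GET /index.html HTTP/1.1", "200", "512", "-", "Mozilla/5.0", "session1", "12"});

        producer.close();
    }
}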

2.2.4 Consuming

2.3 Batch Layer: Hadoop

2.3.1 All Data: HDFS

2.3.2 Generating Batch views: MapReduce

2.4 Speed Layer: Storm

2.4.1 Introduction

Storm is a distributed, reliable, fault-tolerant system for processing streams of data. The work is delegated to different types of components, each responsible for a simple, specific processing task. The input stream of a Storm cluster is handled by components called spouts, and processing is delegated to components called bolts. In Storm the execution units are spouts and bolts, which are assembled into topologies: spouts are the data sources and bolts are the processors. The connections between spouts and bolts have settings that indicate how the stream is divided among the threads of execution. In Storm terminology, each such thread is a separate parallel instance of a bolt.

2.4.2 Storm Use Cases

Stream processing
Distributed RPC (Remote Procedure Call)
Continuous computation

2.4.3 Concepts

Tuple: an ordered list of elements
Stream: a stream of tuples
Spout: a producer of streams
Bolt: a processor of streams and creator of new streams
Topology: a graph connecting spouts and bolts
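To make these concepts concrete, here is a minimal, self-contained topology sketch. It is not taken from the Openbus codebase; it assumes Storm 0.9.x and the backtype.storm API, and all class and component names are made up. The spout emits random host addresses and the bolt prints them; the fieldsGrouping wiring anticipates the grouping options described in section 2.4.5.

import java.util.Map;
import java.util.Random;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class MinimalTopology {

    // Spout: the data source. Emits one-field tuples containing a random host address.
    public static class HostSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random rnd = new Random();

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("192.168.2." + rnd.nextInt(255)));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("host"));
        }
    }

    // Bolt: the processor. Here it just prints every tuple it receives.
    public static class PrinterBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("host = " + tuple.getStringByField("host"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: it emits no new stream.
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("host-spout", new HostSpout(), 1);
        // fieldsGrouping: tuples with the same "host" value always go to the same bolt task.
        builder.setBolt("printer", new PrinterBolt(), 2)
               .fieldsGrouping("host-spout", new Fields("host"));

        // Run the topology in an in-process local cluster.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("minimal-topology", new Config(), builder.createTopology());
    }
}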

2.4.4 Physical diagram

2.4.5 Components of a Storm topology

Shuffle Grouping: tuples are sent to the bolt's tasks randomly and uniformly. Valid for atomic operations.
Local or Shuffle Grouping: if the destination bolt has one or more tasks in the same worker process, tuples are preferably sent to those tasks.
Fields Grouping: the stream is partitioned by the specified fields across the cluster. For example, if the tuples have a user field, all tuples with the same user always go to the same task.

All Grouping: tuples are sent to all tasks of the bolt.
Direct Grouping: the producer of a tuple decides which task of the consumer receives it.
Global Grouping: all tuples are sent to a single destination instance.
Custom Grouping: lets you implement a custom grouping.

2.4.6 Trident

Trident is a high-level abstraction for doing realtime computing on top of Storm. Trident has joins, aggregations, grouping, functions, and filters. In addition to these, Trident adds primitives for doing stateful, incremental processing on top of any database or persistence store. Trident has consistent, exactly-once semantics, so it is easy to reason about Trident topologies.

2.4.7 Example Topology

2.4.8 Generating Realtime views: Storm Topologies

Base topology: Kafka -> Storm -> HBase

stream = topology.newStream("spout", openbusBrokerSpout.getPartitionedTridentSpout());
stream = stream.each(new Fields("bytes"), new AvroLogDecoder(), new Fields(fieldsWebLog));
stream = stream.each(new Fields(fieldsWebLog), new WebServerLogFilter());

stream.each(new Fields("request", "datetime"), new DatePartition(), new Fields("cq", "cf"))
      .groupBy(new Fields("request", "cq", "cf"))
      .persistentAggregate(stateRequest, new Count(), new Fields("count"))
      .newValuesStream()
      .each(new Fields("request", "cq", "cf", "count"), new LogFilter());

stream.each(new Fields("user", "datetime"), new DatePartition(), new Fields("cq", "cf"))
      .groupBy(new Fields("user", "cq", "cf"))
      .persistentAggregate(stateUser, new Count(), new Fields("count"))
      .newValuesStream()
      .each(new Fields("user", "cq", "cf", "count"), new LogFilter());

stream.each(new Fields("session", "datetime"), new DatePartition(), new Fields("cq", "cf"))
      .groupBy(new Fields("session", "cq", "cf"))
      .persistentAggregate(stateSession, new Count(), new Fields("count"))
      .newValuesStream()
      .each(new Fields("session", "cq", "cf", "count"), new LogFilter());

return topology.build();

Optional HDFS and OpenTSDB:

if (Constant.YES.equals(conf.get(Conf.PROP_OPENTSDB_USE))) {
    LOG.info("OpenTSDB: " + conf.get(Conf.PROP_OPENTSDB_USE));
    stream.groupBy(new Fields(fieldsWebLog)).aggregate(new Fields(fieldsWebLog), new WebServerLog
}

if (Constant.YES.equals(conf.get(Conf.PROP_HDFS_USE))) {
    LOG.info("HDFS: " + conf.get(Conf.PROP_HDFS_USE));
    stream.each(new Fields(fieldsWebLog), new HDFSPersistence(), new Fields("result"));
}

2.5 Serving Layer

2.5.1 HBase

HBase is the default NoSQL database supplied with Hadoop. These are its main features:

Distributed
Column-oriented
High availability
High performance
Data volume on the order of terabytes and petabytes
Horizontal scalability by adding nodes to the cluster
Random read/write

Persistent state in HBase with Trident:

@SuppressWarnings("rawtypes")
TridentConfig configRequest = new TridentConfig(
        (String) conf.get(Conf.PROP_HBASE_TABLE_REQUEST),
        (String) conf.get(Conf.PROP_HBASE_ROWID_REQUEST));
@SuppressWarnings("unchecked")
StateFactory stateRequest = HBaseAggregateState.transactional(configRequest);

@SuppressWarnings("rawtypes")
TridentConfig configUser = new TridentConfig(
        (String) conf.get(Conf.PROP_HBASE_TABLE_USER),
        (String) conf.get(Conf.PROP_HBASE_ROWID_REQUEST));
@SuppressWarnings("unchecked")
StateFactory stateUser = HBaseAggregateState.transactional(configUser);

@SuppressWarnings("rawtypes")
TridentConfig configSession = new TridentConfig(
        (String) conf.get(Conf.PROP_HBASE_TABLE_SESSION),
        (String) conf.get(Conf.PROP_HBASE_ROWID_SESSION));
@SuppressWarnings("unchecked")
StateFactory stateSession = HBaseAggregateState.transactional(configSession);

Queries in HBase in Openbus:

#!/bin/bash
# ./hbasereqrows.sh wslog_request daily:20131105

TABLE=$1
DATE=$2

exec hbase shell << EOF
scan '${TABLE}', {COLUMNS => ['${DATE}']}
EOF
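The same scan can also be issued from Java with the standard HBase client. This is only a sketch: the table name and column are taken from the shell example above, and how HBaseAggregateState encodes the counter values is not shown here, so the raw bytes are printed as-is.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanRealtimeView {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "wslog_request");   // table used by the shell example above

        // Restrict the scan to the "daily" column family and the "20131105" qualifier.
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("daily"), Bytes.toBytes("20131105"));

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                byte[] value = result.getValue(Bytes.toBytes("daily"), Bytes.toBytes("20131105"));
                // The row key is the aggregation key; the value encoding depends on the state factory.
                System.out.println(Bytes.toString(result.getRow()) + " -> " + Bytes.toStringBinary(value));
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}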


CHAPTER 3 Installation

Deploying the Openbus architecture in your environment involves the following steps:

Install dependencies
Build the Openbus code
Run the examples

We have tested Openbus in a Red Hat Enterprise Linux 6.4 environment.

3.1 Installing dependencies

Install Hadoop
Install Kafka
Install Storm
Install Camus

3.2 Building Openbus

Clone the project from GitHub:

#> git clone https://github.com/produban/openbus.git

Build the project using Maven:

#> cd openbus
#> mvn compile


CHAPTER 4 Running examples

4.1 Submitting events to the Kafka broker

Launch the javaavrokafka sample:

#> cd $javaavrokafkahome
#> java -jar avrokafka-1.0-SNAPSHOT-shaded.jar wslog 50 2 3 3 -90

Arguments are: Kafka topic, number of requests, number of users, number of user sessions, number of session requests, and date simulation offset (0 for today).

4.2 Running batch ETL processes from Kafka to Hadoop

Launch the Camus ETL:

#> cd $camushome
#> hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P <camus.properties>

where <camus.properties> is a file path pointing to the Camus configuration, as described in the configuration section of https://github.com/produban/camus.

4.3 Running real time analysis with Storm topologies

Launch the Openbus topology:

#> cd $openbusrealtimehome
#> storm jar target/openbus-realtime-0.0.1-shaded.jar com.produban.openbus.processor.topology.openbus

Arguments are: topology name, Kafka topic, ZooKeeper host and Kafka broker list.

4.4 Visualizing data

View hits per Day/Month/Week:

#> cd $openbusrealtimehome/hbase/queryscripts
#> ./hitsPer<Period>.sh <date> [requestId]

where <Period> can be Day, Month or Week. The first argument is the date, in format yyyymmdd for a day of a year, yyyymm for a month of a year and yyyyww for a week of a year. The second argument is optional, for filtering on a specific request.

View users per Day/Month/Week:

#> cd $openbusrealtimehome/hbase/queryscripts
#> ./usersPer<Period>.sh <date> [userId]

where <Period> can be Day, Month or Week. The first argument is the date, in format yyyymmdd for a day of a year, yyyymm for a month of a year and yyyyww for a week of a year. The second argument is optional, for filtering on a specific user.

View sessions per Day/Month/Week:

#> cd $openbusrealtimehome/hbase/queryscripts
#> ./sessionsPer<Period>.sh <date> [sessionId]

where <Period> can be Day, Month or Week. The first argument is the date, in format yyyymmdd for a day of a year, yyyymm for a month of a year and yyyyww for a week of a year. The second argument is optional, for filtering on a specific session.
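As a concrete illustration of the date formats (the dates below are only examples):

#> ./hitsPerDay.sh 20131105
#> ./hitsPerMonth.sh 201311
#> ./hitsPerWeek.sh 201345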

Bibliography

[Marz] Nathan Marz, James Warren. Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning (MEAP).