Openbus Documentation
Openbus Documentation
Release 1
Produban
February 17, 2014
An open source architecture able to process the massive amount of events that occur in a banking IT infrastructure.
CHAPTER 1 Introduction

The objective of Openbus is to define an architecture able to process the massive amount of events that occur in a banking IT infrastructure. Those events are of different types, come from a variety of sources and arrive in different formats. Depending on the nature of the events, we process them in a batch-oriented or near-realtime fashion.

To achieve this flexibility and capacity, we have defined Openbus as a concrete implementation of the so-called Lambda Architecture for Big Data systems [Marz]. The Lambda Architecture defines three main layers for the processing of data streams: the batch layer, the speed layer and the serving layer. Openbus is comprised of a set of technologies that interact with each other to implement these layers:

- Apache Kafka: our data stream. Different systems generate messages in Kafka topics.
- HDFS: where our master dataset is stored.
- MapReduce: how our batch layer recomputes batch views. MapReduce is also used for a batch ETL process that dumps data from Kafka to HDFS.
- Apache Storm: our speed layer. Events are consumed from Kafka and processed into realtime views.
- HBase: where we store the realtime views generated by Storm topologies.

1.1 Use Cases

Some use cases where Openbus could be applied are:

- Web analytics
- Social network analysis
- Security information and event management
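The essence of the three layers is that a query answer is produced by merging the batch view (complete but stale) with the realtime view (covering only events since the last batch run). A minimal plain-Java sketch of that merge step, with a hypothetical class name that is not part of the Openbus codebase:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Openbus code): answering a Lambda Architecture
// query by merging per-key counts from the batch view and the realtime view.
public class LambdaQuerySketch {

    public static Map<String, Long> merge(Map<String, Long> batchView,
                                          Map<String, Long> realtimeView) {
        Map<String, Long> result = new HashMap<>(batchView);
        // Counts from the speed layer are added on top of the batch counts.
        realtimeView.forEach((key, count) -> result.merge(key, count, Long::sum));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> batch = new HashMap<>();
        batch.put("/index.html", 100L);
        Map<String, Long> realtime = new HashMap<>();
        realtime.put("/index.html", 3L);
        realtime.put("/login", 1L);
        System.out.println(merge(batch, realtime).get("/index.html")); // 103
    }
}
```

In Openbus the two sides of this merge are the MapReduce-generated batch views on HDFS and the Storm-generated realtime views in HBase.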
CHAPTER 2 Architecture

2.1 High level architecture

2.2 Data stream: Apache Kafka

We use Apache Kafka as a central hub for collecting different types of events. Multiple systems publish events into Kafka topics. At the moment we use the Avro format for all published events.

Introduction

Kafka is designed as a unified platform for handling all the real-time data feeds a large company might have. Kafka runs as a cluster of broker servers. Messages are stored in categories called topics; they are published by producers and consumed and further processed by consumers. Kafka uses a custom TCP protocol for communication between clients and servers.

A Kafka topic is split into one or more partitions. Partitions are distributed over the servers in the Kafka cluster, and each partition is replicated across a configurable number of servers.

Producing

Kafka exposes its API over a custom TCP-based protocol. Apart from the JVM client maintained in its own codebase, there are client libraries for the following languages: Python, Go, C, C++, Clojure, Ruby, NodeJS, Storm, Scala DSL and JRuby.

Becoming a publisher in Kafka is not very difficult. You will need a partial list of your Kafka brokers (it doesn't have to be exhaustive, since the client uses those endpoints to query for the topic leaders) and a topic name. This is an example of a very simple Kafka producer in Java:

import java.util.Date;
import java.util.Properties;
import java.util.Random;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class BasicProducer {

    Producer<String, String> producer;

    public BasicProducer(String brokerList, boolean requiredAcks) {
        Properties props = new Properties();
        props.put("metadata.broker.list", brokerList);
        props.put("request.required.acks", requiredAcks ? "1" : "0");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        producer = new Producer<String, String>(new ProducerConfig(props));
    }

    public void sendMessage(String topic, String key, String value) {
        KeyedMessage<String, String> data = new KeyedMessage<String, String>(topic, key, value);
        producer.send(data);
    }

    /**
     * Creates a simulated random log event and sends it to a Kafka topic.
     * @param topic topic where the message will be sent
     */
    public void sendRandomLogEvent(String topic) {
        // Build random IP message
        Random rnd = new Random();
        long runtime = new Date().getTime();
        String ip = " " + rnd.nextInt(255);
        String msg = runtime + ", " + ip;
        // Send the message to the broker
        this.sendMessage(topic, ip, msg);
    }
}

Avro

As previously said, we use the Avro data format to serialize all the data events we produce into our Kafka data stream. This means that before sending a message to Kafka, we serialize it in Avro format using a concrete Avro schema. The schema is embedded in the data we send to Kafka, so every future consumer of the message will be able to deserialize it. In the Openbus code you can find the AvroSerializer and AvroDeserializer Java classes, which can be of great help when producing or consuming Avro messages from Kafka.

This is the current Avro schema we use for log messages:

{
    "type": "record",
    "name": "ApacheLog",
    "namespace": "openbus.schema",
    "doc": "Apache Log Event",
    "fields": [
        {"name": "host", "type": "string"},
        {"name": "log", "type": "string"},
        {"name": "user", "type": "string"},
        {"name": "datetime", "type": "string"},
        {"name": "request", "type": "string"},
        {"name": "status", "type": "string"},
        {"name": "size", "type": "string"},
        {"name": "referer", "type": "string"},
        {"name": "useragent", "type": "string"},
        {"name": "session", "type": "string"},
        {"name": "responsetime", "type": "string"}
    ]
}

An example of producing Avro messages into Kafka is our AvroProducer class:

public class AvroProducer {

    private Producer<byte[], byte[]> producer;
    private AvroSerializer serializer;
    private String topic;

    public AvroProducer(String brokerList, String topic, String avroSchemaPath, String[] fields) {
        this.topic = topic;
        this.serializer = new AvroSerializer(ClassLoader.class.getResourceAsStream(avroSchemaPath));
        Properties props = new Properties();
        props.put("metadata.broker.list", brokerList);
        this.producer = new kafka.javaapi.producer.Producer<>(new ProducerConfig(props));
    }

    /**
     * Sends a message.
     * @param values array of Avro field values to be sent to Kafka
     */
    public void send(Object[] values) {
        producer.send(new KeyedMessage<byte[], byte[]>(topic, serializer.serialize(values)));
    }

    /**
     * Closes the producer.
     */
    public void close() {
        producer.close();
    }
}
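The Object[] passed to AvroProducer.send must line up positionally with the schema's field order. A plain-Java sketch of that correspondence (a hypothetical helper, not the actual AvroSerializer, which works against the parsed Avro schema):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: pairing positional values with the ApacheLog schema's
// field names. The real Openbus AvroSerializer derives the order from the
// Avro schema itself.
public class AvroRecordSketch {

    static final String[] APACHE_LOG_FIELDS = {
        "host", "log", "user", "datetime", "request", "status",
        "size", "referer", "useragent", "session", "responsetime"
    };

    public static Map<String, Object> toRecord(String[] fields, Object[] values) {
        if (fields.length != values.length) {
            throw new IllegalArgumentException("value count must match schema field count");
        }
        Map<String, Object> record = new LinkedHashMap<>();
        for (int i = 0; i < fields.length; i++) {
            record.put(fields[i], values[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> rec = toRecord(APACHE_LOG_FIELDS, new Object[]{
            "10.0.0.1", "-", "alice", "17/Feb/2014:10:00:00", "GET /index.html",
            "200", "1024", "-", "Mozilla", "sess-1", "12"});
        System.out.println(rec.get("status")); // 200
    }
}
```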
Consuming

2.3 Batch Layer: Hadoop

All data: HDFS

Generating batch views: MapReduce

2.4 Speed Layer: Storm

Introduction

Storm is a distributed, reliable, fault-tolerant system for processing streams of data. The work is delegated to different types of components, each responsible for a simple, specific processing task. The input stream of a Storm cluster is handled by components called spouts and bolts. In Storm the execution units are spouts and bolts, which are arranged into topologies: spouts are the data sources, and bolts are the processors. The connections between spouts and bolts have settings that indicate how the stream is divided between the threads of execution. In Storm terminology, each thread is a separate instance of a parallel bolt.

Use cases for Storm:

- Stream processing
- Distributed RPC (Remote Procedure Call)
- Continuous computation
2.4.3 Concepts

- Tuple: an ordered list of elements
- Stream: a stream of tuples
- Spout: a producer of streams
- Bolt: a processor and creator of new streams
- Topology: a map of spouts and bolts
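These concepts can be illustrated with a plain-Java simulation (no Storm dependency; class and method names here are hypothetical): a spout emits a stream of tuples, a bolt transforms it into a new stream, and the topology is simply the wiring between them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java simulation of the Storm concepts above:
// a spout produces tuples, a bolt processes each tuple into a new stream.
public class MiniTopology {

    // "Spout": source of tuples (here, raw request log lines).
    static List<String> spout() {
        return Arrays.asList("GET /index.html 200", "GET /missing 404");
    }

    // "Bolt": transforms each tuple, emitting a new stream (the status code).
    static List<String> statusBolt(List<String> stream) {
        List<String> out = new ArrayList<>();
        for (String tuple : stream) {
            String[] parts = tuple.split(" ");
            out.add(parts[parts.length - 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // "Topology": wiring the spout's output into the bolt.
        System.out.println(statusBolt(spout())); // [200, 404]
    }
}
```

In a real Storm topology the wiring is declared once and Storm runs the spout and bolt instances in parallel across the cluster.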
2.4.4 Physical diagram

Components of a Storm topology

- Shuffle grouping: tuples are sent to the bolt's tasks randomly and uniformly. Valid for atomic operations.
- Local or shuffle grouping: if the destination bolt has one or more tasks in the same worker process, tuples are preferably sent to those tasks.
- Fields grouping: the stream is partitioned by the specified fields across the cluster. For example, if the tuples have a user field, all tuples with the same user always go to the same task.
- All grouping: tuples are sent to all tasks of the bolt.
- Direct grouping: the producer decides which task of the consumer receives each tuple.
- Global grouping: all tuples are sent to a single destination instance.
- Custom grouping: lets you implement a custom grouping.

Trident

Trident is a high-level abstraction for doing realtime computing on top of Storm. Trident has joins, aggregations, grouping, functions, and filters. In addition to these, Trident adds primitives for doing stateful, incremental processing on top of any database or persistence store. Trident has consistent, exactly-once semantics, so it is easy to reason about Trident topologies.

Example Topology

Generating realtime views with Storm topologies. Base topology: Kafka -> Storm -> HBase
stream = topology.newStream("spout", openBusBrokerSpout.getPartitionedTridentSpout());
stream = stream.each(new Fields("bytes"), new AvroLogDecoder(), new Fields(fieldsWebLog));
stream = stream.each(new Fields(fieldsWebLog), new WebServerLogFilter());

stream.each(new Fields("request", "datetime"), new DatePartition(), new Fields("cq", "cf"))
      .groupBy(new Fields("request", "cq", "cf"))
      .persistentAggregate(stateRequest, new Count(), new Fields("count"))
      .newValuesStream()
      .each(new Fields("request", "cq", "cf", "count"), new LogFilter());

stream.each(new Fields("user", "datetime"), new DatePartition(), new Fields("cq", "cf"))
      .groupBy(new Fields("user", "cq", "cf"))
      .persistentAggregate(stateUser, new Count(), new Fields("count"))
      .newValuesStream()
      .each(new Fields("user", "cq", "cf", "count"), new LogFilter());

stream.each(new Fields("session", "datetime"), new DatePartition(), new Fields("cq", "cf"))
      .groupBy(new Fields("session", "cq", "cf"))
      .persistentAggregate(stateSession, new Count(), new Fields("count"))
      .newValuesStream()
      .each(new Fields("session", "cq", "cf", "count"), new LogFilter());

return topology.build();

Optional HDFS and OpenTSDB persistence:

if (Constant.YES.equals(conf.get(Conf.PROP_OPENTSDB_USE))) {
    LOG.info("OpenTSDB: " + conf.get(Conf.PROP_OPENTSDB_USE));
    stream.groupBy(new Fields(fieldsWebLog)).aggregate(new Fields(fieldsWebLog), new WebServerLog
}

if (Constant.YES.equals(conf.get(Conf.PROP_HDFS_USE))) {
    LOG.info("HDFS: " + conf.get(Conf.PROP_HDFS_USE));
    stream.each(new Fields(fieldsWebLog), new HDFSPersistence(), new Fields("result"));
}

2.5 Serving Layer

HBase

HBase is the default NoSQL database supplied with Hadoop. These are its main features:

- Distributed
- Column-oriented
- High availability
- High performance
- Data volumes on the order of terabytes and petabytes
- Horizontal scalability by adding nodes to the cluster
- Random read/write access

Persistent states in HBase with Trident
@SuppressWarnings("rawtypes")
TridentConfig configRequest = new TridentConfig((String) conf.get(Conf.PROP_HBASE_TABLE_REQUEST), ...);
StateFactory stateRequest = HBaseAggregateState.transactional(configRequest);

TridentConfig configUser = new TridentConfig((String) conf.get(Conf.PROP_HBASE_TABLE_USER), ...);
StateFactory stateUser = HBaseAggregateState.transactional(configUser);

TridentConfig configSession = new TridentConfig((String) conf.get(Conf.PROP_HBASE_TABLE_SESSION), ...);
StateFactory stateSession = HBaseAggregateState.transactional(configSession);

Queries in HBase in Openbus:

#!/bin/bash
# ./hbasereqrows.sh wslog_request daily:
TABLE=$1
DATE=$2
exec hbase shell << EOF
scan '${TABLE}', {COLUMNS => ['${DATE}']}
EOF
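The DatePartition function in the topology emits a column family and column qualifier ("cf" and "cq") derived from the event's datetime, which is what lets the shell script above scan a table by a column such as daily:. A plain-Java sketch of that kind of date partitioning (the exact Openbus row layout is an assumption here):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of the date partitioning performed by DatePartition:
// deriving an HBase column ("family:qualifier") from an event date.
// The actual Openbus column layout may differ.
public class DatePartitionSketch {

    public static String dailyQualifier(LocalDate date) {
        // Column family names the granularity; qualifier is the formatted date.
        return "daily:" + date.format(DateTimeFormatter.ofPattern("yyyyMMdd"));
    }

    public static void main(String[] args) {
        System.out.println(dailyQualifier(LocalDate.of(2014, 2, 17))); // daily:20140217
    }
}
```

The persistentAggregate calls then accumulate one counter per (key, cq, cf) group, giving the daily hit counts that the query scripts read back.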
CHAPTER 3 Installation

Deploying the Openbus architecture in your environment involves the following steps:

- Install dependencies
- Build the Openbus code
- Run the examples

We have tested Openbus in a Red Hat Enterprise Linux 6.4 environment.

3.1 Installing dependencies

- Install Hadoop
- Install Kafka
- Install Storm
- Install Camus

3.2 Building Openbus

Clone the project from GitHub:

#> git clone

Build the project using Maven:

#> cd openbus
#> mvn compile
CHAPTER 4 Running examples

4.1 Submitting events to the Kafka broker

Launch the javaavrokafka sample:

#> cd $javaavrokafkahome
#> java -jar avrokafka-1.0-SNAPSHOT-shaded.jar wslog

Arguments are the Kafka topic, number of requests, number of users, number of user sessions, number of session requests, and a date simulation offset (0 for today).

4.2 Running batch ETL processes from Kafka to Hadoop

Launch the Camus ETL:

#> cd $camushome
#> hadoop jar camus-example snapshot-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P <camus.properties>

where <camus.properties> is a file path pointing to the Camus configuration, as described in the configuration section.

4.3 Running real time analysis with Storm topologies

Launch the Openbus topology:

#> cd $openbusrealtimehome
#> storm jar target/openbus-realtime shaded.jar com.produban.openbus.processor.topology.openbus

Arguments are the topology name, the Kafka topic, the ZooKeeper host and the Kafka broker list.

4.4 Visualizing data

View hits per day/month/week:

#> cd $openbusrealtimehome/hbase/queryscripts
#> ./hitsPer<Period>.sh <date> [requestId]
where <Period> can be Day, Month or Week. The first argument is the date, in the format yyyymmdd for a day of a year, yyyymm for a month of a year and yyyyww for a week of a year. The second argument is optional, for filtering on a specific request.

View users per day/month/week:

#> cd $openbusrealtimehome/hbase/queryscripts
#> ./usersPer<Period>.sh <date> [userId]

where <Period> can be Day, Month or Week. The first argument is the date, in the same formats as above. The second argument is optional, for filtering on a specific user.

View sessions per day/month/week:

#> cd $openbusrealtimehome/hbase/queryscripts
#> ./sessionsPer<Period>.sh <date> [sessionId]

where <Period> can be Day, Month or Week. The first argument is the date, in the same formats as above. The second argument is optional, for filtering on a specific session.
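A small plain-Java helper (hypothetical, not part of Openbus) showing how the <date> argument for these scripts can be built in each of the three formats; note that week-of-year numbering depends on the convention, and this sketch assumes ISO weeks:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.WeekFields;

// Hypothetical helper for building the <date> argument of the query scripts:
// yyyymmdd for a day, yyyymm for a month, yyyyww for a week of the year.
public class QueryDateSketch {

    public static String dayKey(LocalDate d) {
        return d.format(DateTimeFormatter.ofPattern("yyyyMMdd"));
    }

    public static String monthKey(LocalDate d) {
        return d.format(DateTimeFormatter.ofPattern("yyyyMM"));
    }

    // Assumption: ISO week numbering; the scripts' convention may differ.
    public static String weekKey(LocalDate d) {
        int week = d.get(WeekFields.ISO.weekOfWeekBasedYear());
        return String.format("%d%02d", d.getYear(), week);
    }

    public static void main(String[] args) {
        LocalDate d = LocalDate.of(2014, 2, 17);
        System.out.println(dayKey(d));   // 20140217
        System.out.println(monthKey(d)); // 201402
        System.out.println(weekKey(d));
    }
}
```

For example, ./hitsPerDay.sh would take dayKey's output and ./hitsPerMonth.sh would take monthKey's output as the <date> argument.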
Bibliography

[Marz] Nathan Marz and James Warren. Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning (MEAP).
More informationKeywords Big Data, NoSQL, Relational Databases, Decision Making using Big Data, Hadoop
Volume 4, Issue 1, January 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Transitioning
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationHBase Schema Design. NoSQL Ma4ers, Cologne, April 2013. Lars George Director EMEA Services
HBase Schema Design NoSQL Ma4ers, Cologne, April 2013 Lars George Director EMEA Services About Me Director EMEA Services @ Cloudera ConsulFng on Hadoop projects (everywhere) Apache Commi4er HBase and Whirr
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationNear Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya
Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming by Dibyendu Bhattacharya Pearson : What We Do? We are building a scalable, reliable cloud-based learning platform providing services
More informationReal Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA
Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,
More informationBIG DATA TOOLS. Top 10 open source technologies for Big Data
BIG DATA TOOLS Top 10 open source technologies for Big Data We are in an ever expanding marketplace!!! With shorter product lifecycles, evolving customer behavior and an economy that travels at the speed
More informationHadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
More informationProcessing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:
More informationDistributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
More informationA Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
More informationBIG DATA FOR MEDIA SIGMA DATA SCIENCE GROUP MARCH 2ND, OSLO
BIG DATA FOR MEDIA SIGMA DATA SCIENCE GROUP MARCH 2ND, OSLO ANTHONY A. KALINDE SIGMA DATA SCIENCE GROUP ASSOCIATE "REALTIME BEHAVIOURAL DATA COLLECTION CLICKSTREAM EXAMPLE" WHAT IS CLICKSTREAM ANALYTICS?
More informationBIG DATA TECHNOLOGY. Hadoop Ecosystem
BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big
More informationLog management with Logstash and Elasticsearch. Matteo Dessalvi
Log management with Logstash and Elasticsearch Matteo Dessalvi HEPiX 2013 Outline Centralized logging. Logstash: what you can do with it. Logstash + Redis + Elasticsearch. Grok filtering. Elasticsearch
More informationHadoop Job Oriented Training Agenda
1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module
More informationData-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti
More informationSearch and Real-Time Analytics on Big Data
Search and Real-Time Analytics on Big Data Sewook Wee, Ryan Tabora, Jason Rutherglen Accenture & Think Big Analytics Strata New York October, 2012 Big Data: data becomes your core asset. It realizes its
More informationBig Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
More informationPro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah
Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationMoving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
More informationQsoft Inc www.qsoft-inc.com
Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:
More informationApache Hadoop: Past, Present, and Future
The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer aaa@cloudera.com, twitter: @awadallah Hadoop Past
More informationElephants and Storms - using Big Data techniques for Analysis of Large and Changing Datasets
Paper DH07 Elephants and Storms - using Big Data techniques for Analysis of Large and Changing Datasets Geoff Low, Medidata Solutions, London, United Kingdom ABSTRACT As an industry we are data-led. We
More informationA framework for easy development of Big Data applications
A framework for easy development of Big Data applications Rubén Casado ruben.casado@treelogic.com @ruben_casado Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies
More informationLeveraging the Power of SOLR with SPARK. Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015
Leveraging the Power of SOLR with SPARK Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015 Welcome Johannes Weigend - CTO QAware GmbH - Software architect / developer - 25 years
More informationBig Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel
Big Data and Analytics: A Conceptual Overview Mike Park Erik Hoel In this technical workshop This presentation is for anyone that uses ArcGIS and is interested in analyzing large amounts of data We will
More informationThe Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
More informationHadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics
In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning
More informationSector vs. Hadoop. A Brief Comparison Between the Two Systems
Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector
More information