CS535 Big Data: Lambda Architecture, Scaling with a Queue. 8/27/2015, Sangmi Pallickara


CS535 Big Data - Fall 2015 (W1.B)

FAQs
- Wait list
- Term project topics

PART 0. INTRODUCTION
2. A PARADIGM FOR BIG DATA
Sangmi Lee Pallickara, Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

This material is based on: Nathan Marz and James Warren, Big Data: Principles and Best Practices of Scalable Real-Time Data Systems, 2015, Manning Publications, ISBN 9781617290343.

Lambda Architecture

Typical problems for scaling traditional databases
- Suppose the application must track the number of pageviews for any URL a customer wishes to track.
- The customer's web page pings the application's web server with its URL every time a pageview is received.
- The application tells you the top 100 URLs by number of pageviews.
- Schema: id (integer), user_id (integer), url (varchar(255)), pageviews (bigint)

Scaling with a queue
- Direct access from the web server to the backend DB cannot handle the large volume of frequent write requests; timeout errors result.
- Instead, the web server appends pageview events to a queue, and a worker reads events in batches (100 at a time) and applies many increments to the DB in a single request.
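The queue-and-worker pattern described above can be sketched as follows. This is a minimal in-memory illustration; the class and method names (`PageviewQueue`, `drainBatch`) are invented for this sketch, and a real deployment would use a durable queue and a real database rather than in-memory maps.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Minimal sketch of "scaling with a queue": the web server enqueues pageview
// events, and a worker drains up to 100 at a time and applies the increments
// in one batched update instead of one DB write per pageview.
class PageviewQueue {
    private final Queue<String> events = new ArrayDeque<>();  // one URL per pageview
    final Map<String, Long> counts = new HashMap<>();         // stands in for the DB

    // Called by the web server on every pageview.
    void enqueue(String url) {
        events.add(url);
    }

    // Worker loop body: drain up to batchSize events, then write once per URL.
    void drainBatch(int batchSize) {
        Map<String, Long> increments = new HashMap<>();
        for (int i = 0; i < batchSize && !events.isEmpty(); i++) {
            increments.merge(events.poll(), 1L, Long::sum);
        }
        // One batched update per URL instead of one write per event.
        increments.forEach((url, n) -> counts.merge(url, n, Long::sum));
    }
}
```

Batching amortizes the per-request overhead, which is exactly what lets the worker keep up where direct per-pageview writes timed out.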

Scaling by sharding the database
- What if your data volume increases even more? Your worker cannot keep up with the writes.
- What if you add more workers? Again, the database will be overloaded.
- Horizontal partitioning, or sharding, of the database:
  - Use multiple database servers and spread the table across all the servers.
  - Choose the shard for each key by taking the hash of the key modded by the number of shards.

Other issues
- Fault tolerance: what if one of the database machines is down? A portion of the data is unavailable.
- Corruption: what if your worker code accidentally contained a bug and stored the wrong number for some portions of the data?
- Resharding: what if your current number of shards cannot handle your data? Your mapping script must cope with the new set of shards, and the application and data must be reorganized.

How will Big Data techniques help?
- The databases and computation systems used in Big Data applications are aware of their distributed nature.
  - Sharding and replication are considered fundamental components in the design of Big Data systems.
- Data is dealt with as immutable.
  - Users mutate data continuously, but the raw pageview information is not modified.
- Applications will be designed in different ways.

Lambda Architecture
- Big Data systems as a series of layers: batch layer, serving layer, speed layer.
- Selecting the target data set: query = function(all data)
- Precomputed query functions give quick access to the values you need:
    batch view = function(all data)
    query = function(batch view)

Batch layer
- Generating batch views over all data is often a high-latency operation.
- The batch layer stores the master copy of the dataset and precomputes batch views.
- It is the component that performs batch processing:
  - Stores an immutable, constantly growing master dataset.
  - Computes arbitrary functions on that dataset.
- Batch-processing systems, e.g., Hadoop.
- Example: a function on all the pageviews precomputes an index from a key of [url, day] to the number of pageviews for that URL on that day. This index can be used to sum up the counts to get the results.

    Api.execute(
        Api.hfsSeqfile("/tmp/pageview-counts"),
        new Subquery("?url", "?count")
            .predicate(Api.hfsSeqfile("/data/pageviews"), "?url", "?user", "?timestamp")
            .predicate(new Count(), "?count"));
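The shard-selection rule described earlier (hash of the key modded by the number of shards) can be sketched in a few lines. The class and method names are illustrative, not from any library:

```java
// Sketch of hash-mod shard selection for a sharded database.
class Sharding {
    // Returns which of numShards servers holds the row for this key.
    static int shardFor(String key, int numShards) {
        // floorMod keeps the result in [0, numShards) even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numShards);
    }
}
```

Note the weakness the slides point out: because the shard depends on `numShards`, changing the number of shards remaps almost every key, which is why resharding forces the application and data to be reorganized.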

Serving layer
- The batch layer emits batch views as the result of its functions.
- These views must be loaded somewhere and queried.
- The serving layer is a specialized distributed database that loads in a batch view and makes it possible to do random reads on it.
- e.g., BigQuery, ElephantDB, Dynamo, MongoDB

Speed layer
- Is there any data not represented in the batch views? Yes: data that arrived while the precomputation was running.
- A fully real-time data system must support updates and random reads.
- The speed layer looks only at recent data, whereas the batch layer looks at all the data at once.
    realtime view = function(realtime view, new data)

Lambda architecture
- New data flows to both the speed layer (realtime views) and the batch layer (master dataset), which feeds the serving layer (batch views). A query ("How many?") reads from both.

How long should the realtime views be maintained?
- Once the data arrives at the serving layer, the corresponding results in the realtime views are no longer needed.
- You can discard pieces of the realtime views.

Extended example with the Lambda architecture
- A web analytics application tracks the number of pageviews over a range of days.
- The speed layer keeps its own separate view of [url, day], and updates its views by incrementing the count whenever it receives new data.
- The batch layer recomputes its views by counting the pageviews.

What if the algorithm is not incremental?
- The batch/speed split divides your data: the exact algorithm runs on the batch layer, and an approximate algorithm runs on the speed layer.
- The batch layer repeatedly overrides the speed layer, so the approximation gets corrected: eventual accuracy.
- To resolve the query, you query both the batch and realtime views with the satisfying ranges and sum up the results.
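Resolving a pageview query by combining both layers can be sketched as below. The `"url|day"` key format and the class name are invented for this sketch; a real system would read the batch view from the serving layer and the realtime view from the speed layer's store.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a Lambda-architecture query: sum the precomputed batch view with
// the recent realtime view over the same [url, day] keys.
class PageviewQuery {
    final Map<String, Long> batchView = new HashMap<>();     // from the serving layer
    final Map<String, Long> realtimeView = new HashMap<>();  // from the speed layer

    // Pageviews for a URL over an inclusive range of day indices.
    long pageviews(String url, int fromDay, int toDay) {
        long total = 0;
        for (int day = fromDay; day <= toDay; day++) {
            String key = url + "|" + day;
            total += batchView.getOrDefault(key, 0L);     // all data up to the last batch run
            total += realtimeView.getOrDefault(key, 0L);  // data that arrived since
        }
        return total;
    }
}
```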

Example of a non-incremental algorithm: cardinality estimation
- The count-distinct problem.
- Count exact unique counts in the batch layer.
- Use a HyperLogLog sketch as an approximation in the speed layer.
- The batch layer corrects what was computed in the speed layer: eventual accuracy.

Recent trends in technology (1/3)
- Physical limits on how fast a single CPU can go: parallelize computation to scale to more data (scale-out solutions).
- Elastic clouds: Infrastructure as a Service (IaaS).
  - Rent hardware on demand rather than owning your hardware.
  - Increase and decrease the size of your cluster nearly instantaneously.
  - Simplifies system administration.

Recent trends in technology (2/3)
- Open source ecosystem for Big Data.
- Batch computation systems: Hadoop, HDFS.
- Serialization frameworks: serialize an object into a byte array from any language, and deserialize that byte array into an object in any language. Thrift, Protocol Buffers, and Avro.

Recent trends in technology (3/3)
- Open source ecosystem for Big Data, continued.
- Random-access NoSQL databases: sacrifice the full expressiveness of SQL and specialize in certain kinds of operations. Cassandra, HBase, MongoDB, etc.
- Messaging/queuing systems: send and consume messages between processes in a fault-tolerant manner. Apache Kafka.
- Real-time computation systems: high-throughput, low-latency, stream-processing systems. Apache Storm.

Mapping course components
- Course parts mapped onto the Lambda architecture diagram: Part 1 (PA1) covers the speed layer; Part 2 (PA2) covers the batch layer, master dataset, and serving layer.

PART 1. INTRODUCTION
3. DATA MODEL FOR BIG DATA: APACHE THRIFT
Sangmi Lee Pallickara, Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535
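The exact-versus-approximate split for count-distinct can be illustrated with a sketch. The slide names HyperLogLog for the speed layer; here a simpler probabilistic estimator (linear counting) stands in for it, since it shows the same trade-off in a few lines. All names are invented for this sketch.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Exact distinct count (batch-layer role) vs. a probabilistic estimate
// (speed-layer role). The estimate is cheap and mergeable but approximate;
// the batch layer's exact recomputation later corrects it.
class Cardinality {
    // Batch layer: exact, but requires holding every distinct element.
    static int exactDistinct(Iterable<String> items) {
        Set<String> seen = new HashSet<>();
        for (String s : items) seen.add(s);
        return seen.size();
    }

    // Speed layer stand-in (linear counting): hash each item to one of m
    // buckets and estimate cardinality from the fraction of empty buckets.
    static double estimateDistinct(Iterable<String> items, int m) {
        BitSet buckets = new BitSet(m);
        for (String s : items) buckets.set(Math.floorMod(s.hashCode(), m));
        int empty = m - buckets.cardinality();
        if (empty == 0) return m;                 // sketch saturated; estimate breaks down
        return m * Math.log((double) m / empty);  // -m * ln(empty/m)
    }
}
```

The fixed-size bit array is what makes the approximate path viable under real-time update rates, at the cost of accuracy that the batch layer later restores.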

This material is based on:
- Mark Slee, Aditya Agarwal, and Marc Kwiatkowski, "Thrift: Scalable Cross-Language Services Implementation," https://thrift.apache.org/static/files/thrift-20070401.pdf
- Nathan Marz and James Warren, Big Data: Principles and Best Practices of Scalable Real-Time Data Systems, 2015, Manning Publications, ISBN 9781617290343

Why we are looking at Thrift
- Applications and services involve multiple, distributed components, possibly developed in different languages, that communicate very intensively with each other.
- The wire formats used for data interchange may evolve over time.
- Goal: support this interoperability and evolution of wire formats without compromising on performance (i.e., speed).

Apache Thrift
- A framework for creating interoperable and scalable services; a data serialization framework.
- Originally developed at Facebook; now an Apache project.
- Users create their services via a simple interface definition language (IDL), consumable and serviceable by numerous languages.
- Code for clients and servers is automatically generated.
- Binary communication protocol with compact size.
- You can change the protocol and transport without regenerating code.

Thrift architecture
- Source: http://jnb.ociweb.com/jnb/jnbjun2009.html

Types: base types (https://thrift.apache.org/docs/types)
- bool: a boolean value, true or false
- byte: a signed byte
- i16: a 16-bit signed integer
- i32: a 32-bit signed integer
- i64: a 64-bit signed integer
- double: a 64-bit floating point number
- string: a text string encoded using UTF-8 encoding
- Note the absence of an unsigned integer type: many programming languages have no native unsigned integer types.
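As a sketch of how these base types appear in practice, here is a hypothetical IDL definition in the same style as the course's later examples. The `PageView` struct and its field names are invented for illustration:

```thrift
struct PageView {
  1: string url,       // UTF-8 text
  2: i64 timestamp,    // 64-bit signed integer
  3: i32 userId,       // 32-bit signed integer
  4: bool isUnique     // boolean
}
```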

Special types
- binary: a sequence of unencoded bytes; added to provide better interoperability with Java.

Structs
- Define a common object to be used across languages.
- Equivalent to a class in object-oriented programming languages, but without inheritance.
- Contain a set of strongly typed fields, each with a unique name identifier within the struct.

    struct Example {
      1: i32 number = 10,
      2: i64 bignumber,
      3: double decimals,
      4: string name = "thrifty"
    }

Containers (1/2)
- Strongly typed data types that map to commonly used and commonly available container types in most programming languages.
- list: an ordered list of elements, e.g., translated to a Java ArrayList.
- set: an unordered set of unique elements, e.g., translated to a Java HashSet, a set in Python, etc.
- map: a map of strictly unique keys to values, e.g., translated to a Java HashMap, a Python/Ruby dictionary.

Containers (2/2)
- Container elements may be any valid Thrift type.
- The key type for a map should be a basic type rather than a struct or container type: some languages do not support more complex key types in their native map types.

Exceptions
- Functionally equivalent to structs, except that they inherit from the native exception base class as appropriate in each target programming language.

Services
- Defined using Thrift types; consist of a set of named functions, each with a list of parameters and a return type.

    service <name> {
      <returntype> <name>(<arguments>)
        [throws (<exceptions>)]
    }

    service StringCache {
      void set(1:i32 key, 2:string value),
      string get(1:i32 key) throws (1:KeyNotFound knf),
      void delete(1:i32 key)
    }
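The container types above can be combined in one struct, in the same IDL style as the `Example` struct. The `UserActivity` struct and its field names are invented for illustration:

```thrift
struct UserActivity {
  1: list<string> recentUrls,          // ordered, e.g., Java ArrayList
  2: set<i32> visitedDays,             // unordered, unique elements, e.g., Java HashSet
  3: map<string, i64> pageviewsByUrl   // note the basic-type (string) key
}
```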

Transport
- Thrift decouples the transport layer from the code generation layer.
- The transport needs to know how to read and write data; the origin and destination of the data are irrelevant.
- It may be a socket, a segment of shared memory, or a file on the local disk.

TTransport interface (1/2)
- Describes how the data is transmitted.

TTransport interface (2/2)
- open: opens the transport
- close: closes the transport
- isOpen: indicates whether the transport is open
- read: reads from the transport
- write: writes to the transport
- flush: forces any pending writes

TServerTransport interface
- Accepts or creates primitive transport objects.
- open: opens the transport
- listen: begins listening for connections
- accept: returns a new client transport
- close: closes the transport

TSocket
- An implementation of the TTransport interface.
- Provides a common, simple interface for a TCP/IP stream socket.
- Implemented across all target languages.

TFileTransport
- An abstraction of an on-disk file as a data stream.
- It can be used to write out a set of incoming Thrift requests to a file on disk.
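The TTransport contract (open/close/isOpen/read/write/flush) can be sketched with an in-memory implementation. This is an illustration of the interface shape only, not Thrift's actual Java classes; `SimpleTransport` and `MemoryTransport` are invented names.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Sketch of a TTransport-style interface: the caller reads and writes bytes
// without caring whether the other end is a socket, shared memory, or a file.
interface SimpleTransport {
    void open();
    void close();
    boolean isOpen();
    int read(byte[] buf, int off, int len);   // returns bytes actually read
    void write(byte[] buf, int off, int len);
    void flush();                             // force any pending writes
}

// In-memory implementation: writes are buffered until flush() makes them
// visible to read(), mimicking a transport with pending writes.
class MemoryTransport implements SimpleTransport {
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private byte[] readable = new byte[0];
    private int pos = 0;
    private boolean open = false;

    public void open() { open = true; }
    public void close() { open = false; }
    public boolean isOpen() { return open; }

    public void write(byte[] buf, int off, int len) {
        pending.write(buf, off, len);
    }

    public void flush() {
        byte[] flushed = pending.toByteArray();
        byte[] merged = Arrays.copyOf(readable, readable.length + flushed.length);
        System.arraycopy(flushed, 0, merged, readable.length, flushed.length);
        readable = merged;
        pending.reset();
    }

    public int read(byte[] buf, int off, int len) {
        int n = Math.min(len, readable.length - pos);
        System.arraycopy(readable, pos, buf, off, n);
        pos += n;
        return n;
    }
}
```

Because the interface hides the endpoint, swapping this in-memory transport for a TCP socket or an on-disk file (as TSocket and TFileTransport do in real Thrift) requires no change to the calling code.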